Network Monitoring Design Philosophy

1. Overview:

Monitoring helps network and systems administrators identify possible issues before they affect business continuity and find the root cause of problems when something goes wrong in the network. Whether it is a small business with fewer than 50 nodes or a large enterprise with more than 1000 nodes, continuous monitoring helps to develop and maintain a high-performing network with little downtime.

For network monitoring to add value to a network, the monitoring design should adopt a few basic principles. First, a monitoring system should be comprehensive and cover every aspect of an enterprise: the network and connectivity, systems, and security. Preferably, the system also provides a single-pane-of-glass view into everything about the network and includes reporting, problem detection, resolution, and network maintenance. Further, every monitoring system should provide reports that cater to different audiences, from network and systems admins to management such as the CEO, CIO, and CTO. Most importantly, a monitoring system should not be too complex to understand and use, nor should it lack basic reporting and drill-down functionality.

2. FCAPS:

Network management is an extensive field that includes various functions. The objectives of network management are grouped into five categories, namely Fault management (F), Configuration management (C), Accounting management (A), Performance management (P), and Security management (S), together known as FCAPS. In networks where billing is not needed, accounting is replaced with administration.

Fault management deals with the process of recognizing, isolating, and resolving a fault that occurs in the network. Identification of potential network issues also falls under Fault management.

Configuration management involves collecting and storing configurations from various network devices, and includes tracking changes to device configurations. Because many network issues are due to configuration changes gone wrong, this can be considered an important contribution to proactive network management and monitoring.

Accounting applies to service-provider networks where network resource utilization is tracked and then the information is used for billing or charge-back. In networks where billing does not apply, accounting is replaced with administration, which refers to administering end-users in the network with passwords, permissions, etc.

Performance management involves managing overall network performance. Data for parameters associated with performance, such as throughput, packet loss, response times, utilization, etc., are collected mostly using SNMP.
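
As an illustration of how a performance parameter such as utilization can be derived from SNMP data, the sketch below computes interface utilization from two successive readings of an octet counter. It assumes the counter values (for example, ifInOctets from the standard interface MIB) and the interface speed have already been collected by a poller. The function name and its parameters are hypothetical and are not taken from any particular NMS.

```python
def interface_utilization(prev_octets, curr_octets, interval_seconds,
                          if_speed_bps, counter_bits=32):
    """Return utilization (percent) between two polls of an octet counter.

    prev_octets / curr_octets: counter readings from two successive polls.
    interval_seconds: time between the two polls.
    if_speed_bps: interface speed in bits per second.
    counter_bits: counter width; 32-bit counters wrap around to zero.
    """
    if interval_seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval and interface speed must be positive")

    # The modulo handles a single counter wrap between the two polls.
    delta_octets = (curr_octets - prev_octets) % (2 ** counter_bits)
    delta_bits = delta_octets * 8
    return 100.0 * delta_bits / (interval_seconds * if_speed_bps)


# Example: 3,750,000 octets in 300 seconds on a 1 Mbps link.
print(interface_utilization(1_000_000, 4_750_000, 300, 1_000_000))
```

The modulo arithmetic only compensates for one wrap per polling interval, so on fast links a short polling interval or a 64-bit counter would be the safer choice.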

Security is another important area of network management. Security management in FCAPS covers the process of controlling access to resources in the network, including data and configurations, and protecting user information from unauthorized users.

3. Reporting and Alerts:

The basic components of network monitoring are the collection of data from network elements and the processing and presentation of the collected data in a format users can understand. This process itself can be referred to as reporting. Reporting helps the network admin understand the performance of network nodes, the current status of the network, and what is normal in the network. With data from reports, an administrator can make informed decisions for capacity planning, network maintenance, troubleshooting, and network security.

Reporting alone does not help an admin maintain a high-performance network. Another important requirement is the ability to identify what can go wrong within the network. While reports help the admin understand what is normal and the current status of the network, alerts based on thresholds and trigger points help a network administrator identify possible performance and security issues before they bring down the network. Alerts and reports complement each other: alerts let the administrator know of potential problems, and reports provide the data needed to identify the root cause of network issues.

4. Alerting:

Every network has a baseline that describes what is normal as far as network performance and network behavior are concerned. The baseline differs from one network to another. When the value of a parameter deviates from its established baseline, it has the potential to become an issue that can affect network uptime. In such scenarios, alerting based on the deviation from the mean value can help with early detection and resolution of issues, which in turn contributes to the smooth functioning of the network with little or no downtime. Alerting helps administrators find what can possibly go wrong in the network in relation to performance and security. An alert can be generated based on various options. Here are a few terms associated with alerts:

  1. Triggers:
    A trigger refers to the event that causes an alert to be generated. An event here can refer to a change in the state of a node or a value related to the node, deviation from the mean value of a parameter, crossing the threshold value of a parameter, and so on.
  2. Thresholds, repeat-count, and time delays:
    Most alerts are set to be generated based on thresholds. When the value of a network parameter crosses its configured threshold, a threshold violation occurs, and this can be set to trigger an alert. Alerts can also be set to be generated only when the threshold is violated a given number of times within a time period (e.g., 2 times in 10 minutes).
  3. Reset:
    An alert that is generated based on a threshold violation will reset when the value of the parameter that triggered the alert returns to its baseline value.
  4. Suppression and de-duplication:
    Certain threshold violations are expected. In such cases, alerts are suppressed. In other cases, a single event may cause several threshold violations, which in turn will trigger multiple alerts. To prevent such duplicate alerts, monitoring systems support de-duplication or even consolidation of alerts based on the event that triggered them.
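
The terms above can be tied together in a small sketch. The class below is a hypothetical illustration, not the logic of any specific monitoring product: it fires an alert when a threshold is violated a given number of times within a time window, stays silent while the alert is already active (a simple form of de-duplication), and resets when the value returns to normal. For simplicity it assumes that "normal" means at or below the threshold, which is a stand-in for a return to the baseline value.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ThresholdAlert:
    threshold: float                # a sample above this is a violation
    repeat_count: int = 2           # violations needed inside the window
    window_seconds: float = 600.0   # e.g. 2 times in 10 minutes
    active: bool = False
    _violations: deque = field(default_factory=deque, repr=False)

    def observe(self, timestamp: float, value: float) -> Optional[str]:
        """Process one sample; return "TRIGGER", "RESET", or None."""
        if value > self.threshold:
            self._violations.append(timestamp)
            # Forget violations that fall outside the time window.
            while timestamp - self._violations[0] > self.window_seconds:
                self._violations.popleft()
            if not self.active and len(self._violations) >= self.repeat_count:
                self.active = True
                return "TRIGGER"
            # Not enough violations yet, or the alert is already active
            # and the repeat is de-duplicated.
            return None
        if self.active:
            # Value is back to normal: reset the alert and its history.
            self.active = False
            self._violations.clear()
            return "RESET"
        return None


alert = ThresholdAlert(threshold=90.0)
for ts, cpu in [(0, 95), (60, 50), (120, 96), (180, 97), (240, 40)]:
    print(ts, cpu, alert.observe(ts, cpu))
```

In this run the second violation (at 120 seconds) triggers the alert, the third is de-duplicated, and the drop to 40 resets it. Suppression of expected violations, such as those during a maintenance window, is left out and would need an additional check before `observe` is called.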

5. Data storage aggregation:

Monitoring systems collect and use data from network elements for various monitoring related functions. Networks also need continuous monitoring to ensure that problems are detected before they cause network downtime. Continuous collection for monitoring leads to an accumulation of large volumes of data. This can lead to:

  • A slow-down in the performance of the monitoring solution as the tool has to analyze more data to generate required reports
  • Impact on the storage space required to store monitoring data, which in turn increases the Total Cost of Ownership of the monitoring system
  • Slower troubleshooting due to the larger volume of data to be analyzed

Monitoring systems make use of data aggregation to avoid the above-mentioned scenarios. Data aggregation is the process in which information gathered over time is summarized and rolled up into less granular data, which is then used for quicker generation of historical reports. The granularity of a report generated from aggregated data depends on the aggregation pattern of the monitoring system. Many monitoring systems start by storing data at 1-minute granularity. Over time, the data is averaged and rolled up into less granular tables, such as 10-minute, hourly, or weekly tables. This allows a monitoring system to generate reports about a node that go far back in time or span a large time period without performance issues or strain on storage space.
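
A minimal sketch of such a roll-up is shown below. It averages 1-minute samples into 10-minute buckets. The function name, the sample format of (timestamp, value) pairs, and the choice of a plain average are assumptions made for illustration; real monitoring systems may also keep minimum, maximum, or percentile values per bucket.

```python
from collections import defaultdict
from statistics import mean


def roll_up(samples, bucket_seconds=600):
    """Average (timestamp, value) samples into fixed-size time buckets.

    Returns a list of (bucket_start, average) pairs sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the bucket it belongs to.
        bucket_start = int(timestamp) - int(timestamp) % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Twenty 1-minute samples collapse into two 10-minute rows.
one_minute_samples = [(minute * 60, float(minute)) for minute in range(20)]
print(roll_up(one_minute_samples))
```

The same function could be applied again with a larger `bucket_seconds` to produce hourly rows from the 10-minute rows, at the cost of losing short spikes that averaging smooths out.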

6. Overview of agent based monitoring:

Network and systems monitoring tools are either agent based, agentless, or a combination of both. An agent is a piece of software on a monitored device that has access to the performance data of the device. This data is then sent to an NMS based on requests triggered from the NMS or, in some cases, based on policies defined within the agent. The presence of an agent on the monitored device provides access to granular data, which in turn helps with better monitoring, reporting, and troubleshooting of issues.

The most common approach for an agent based monitoring system is to provide data to the NMS at set intervals. The presence of an agent allows the monitoring station to perform specific actions on the client that aid with better management and monitoring.

Agent based monitoring provides advantages, such as more granular data, the capability to monitor even non-standard metrics on the device, and the ability to perform actions on the monitored device. But an agent based approach can also be time consuming as it requires agents to be installed on each device that has to be monitored, as well as additional tasks related to update and maintenance of all agents that are deployed in the network.
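
The interval-based reporting described above can be sketched roughly as follows. This is a hypothetical, stripped-down agent: `collect_metrics`, `run_agent`, and the `send` callback are illustrative names, disk usage is used only because it is available from the Python standard library, and how the payload actually reaches the NMS (HTTP, a message queue, a proprietary protocol) is deliberately left to the caller.

```python
import json
import shutil
import socket
import time


def collect_metrics(path="/"):
    """Gather a small set of local metrics that only an agent can see."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "disk_used_percent": round(used_percent, 2),
    }


def run_agent(send, interval_seconds=60.0, cycles=None):
    """Send a JSON payload every interval; run forever if cycles is None."""
    sent = 0
    while cycles is None or sent < cycles:
        send(json.dumps(collect_metrics()))
        sent += 1
        # Sleep only if another cycle will follow.
        if cycles is None or sent < cycles:
            time.sleep(interval_seconds)


# Stand-in for the NMS: print two payloads one second apart.
run_agent(print, interval_seconds=1.0, cycles=2)
```

Passing the transport in as a callback keeps the collection logic separate from delivery, which also makes it easier to update agents without changing how the NMS receives data.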

7. Overview of agentless monitoring:

Agentless monitoring, as the name suggests, does not deploy an agent on the monitored device. Instead, it makes use of remote APIs that are exposed by the service that needs to be monitored, or it analyzes data packets being transferred to and from the monitored device. SNMP is the most common agentless method used to monitor network elements, while WMI (Windows Management Instrumentation) is used to monitor Windows systems.

Agentless monitoring provides advantages, such as no need to deploy agents on each monitored device, lower deployment and maintenance costs, and almost zero impact on the client due to the absence of agent software running on it. But agentless monitoring has its own set of disadvantages. The most important one is the lack of in-depth reports compared to what agent based monitoring can provide. Agentless monitoring is also limited in the support it can provide for custom-built devices or servers whose MIBs or data are not exposed via APIs for agentless data collection methods.
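
To illustrate the agentless idea without depending on an SNMP or WMI library, the sketch below uses a simpler stand-in technique: the monitoring station opens a TCP connection to a service port and records whether it is reachable and how long the connection took. Nothing is installed on the monitored device. The function name and the returned dictionary layout are assumptions made for this example.

```python
import socket
import time


def check_tcp_service(host, port, timeout=2.0):
    """Probe a TCP service remotely and measure connection time."""
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed_ms = (time.monotonic() - start) * 1000.0
            return {"reachable": True, "response_ms": elapsed_ms}
    except OSError:
        return {"reachable": False, "response_ms": None}


# Example against a local listener created only for this demonstration.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
print(check_tcp_service("127.0.0.1", demo_port))
listener.close()
```

A probe like this shows the trade-off described above: it needs no deployment on the target, but it only reveals availability and response time, not the detailed internal metrics an agent could report.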

8. Tips / Resources:

  1. The Myth of the 5 9's
  2. SNMP VS WMI