Having a network management and monitoring strategy in place is as important as network design and implementation. Without monitoring, even a well-planned and well-designed network can be brought down by the smallest of issues.
Organizations and network administrators follow a few common practices when implementing a network monitoring solution. These practices define a basic strategy for getting started, including which nodes and parameters need to be monitored.
This does not mean that network monitoring is limited to these common practices alone. The common practices define the basics that are a part of network monitoring. In addition to the common practices specified here, the network admin has to understand the design and requirements of the network they own and be able to implement additional monitoring strategies to bring all metrics and elements in the network under their purview.
2. Availability monitoring:
Availability monitoring is the monitoring of all resources in the IT infrastructure to ensure they are available to meet the requirements of the organization and its users. Today's businesses expect their IT infrastructure to deliver as close to 100% uptime as possible. The network and the services offered on it need to be available at all times to ensure business continuity. This is where availability monitoring can help. Continuous monitoring of resources and services confirms that each node or service is up, running, and able to meet requirements. Some examples of availability monitoring include monitoring devices to ensure the network is trouble free, monitoring bandwidth availability to ensure data delivery, monitoring storage space to ensure organizational data can be stored, and monitoring system-level services to ensure enterprise-critical applications are functioning smoothly.
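To make the uptime expectation concrete, here is a minimal sketch of the arithmetic that usually sits behind an availability figure. The function names and the 30-day reporting period are illustrative assumptions, not part of any particular monitoring product.

```python
def availability_percent(successful_checks: int, total_checks: int) -> float:
    """Share of availability checks (for example, ping polls) that succeeded."""
    if total_checks <= 0:
        raise ValueError("total_checks must be positive")
    return 100.0 * successful_checks / total_checks


def allowed_downtime_minutes(target_percent: float, period_days: float = 30) -> float:
    """Downtime budget, in minutes, implied by an availability target.

    period_days is an assumed reporting period; adjust it to your own.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - target_percent / 100.0)
```

For example, a 99.9% target over a 30-day period leaves roughly 43 minutes of downtime, which shows how little room there is between a high availability target and a missed one.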
Some commonly used technologies for availability monitoring are:
- Ping: The most widely used method. ICMP pings are sent to a monitored device and based on the replies, the availability of a device or service is measured
- Telnet: Used to check device availability, by testing whether a TCP port on the device accepts connections, in networks where ping is blocked
- SNMP: Used to measure availability or current status of a service on a device
- WMI: Used to check the availability of services running on Windows systems
- IP SLA: Cisco feature that can measure the availability of WAN links and their capacity to carry specific services
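As an illustration of the Telnet-style check in the list above, the following sketch tests whether a TCP port accepts connections, using only Python's standard library. The function name, the default timeout, and the host and port used in the usage note are assumptions made for the example; a real monitoring system would add retries, scheduling, and alerting around a check like this.

```python
import socket


def tcp_port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A refused connection, a timeout, or a name resolution failure all
    count as "not available" in this simplified check.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A call such as `tcp_port_is_open("192.0.2.10", 22)` (a documentation-range address used here as a placeholder) would report whether the SSH port on that device is reachable, which can stand in for ping when ICMP is filtered.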
3. Interface monitoring:
There are multiple types of interfaces used in a network, from Fast Ethernet and Gigabit Ethernet to very high-speed Fibre Channel interfaces. An interface on a device is the entry and exit point for the packets that deliver a service to the organization. If an interface experiences errors or packet loss, or goes down altogether, the result can be a poor quality of experience.
Interface monitoring involves monitoring the interfaces on a device for errors, packet loss, discards, utilization limits, etc. The information from interface monitoring will help identify possible network issues that are the cause of poor application or service performance.
Network monitoring systems use ping or SNMP to collect interface statistics from network devices. Ping, which uses ICMP packets, reports reachability statistics such as packet loss and round-trip time, while SNMP-based data collection helps monitor interface bandwidth utilization, traffic rate, errors, discards, etc. Together, this information helps identify application performance issues in the network.
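SNMP exposes interface traffic as ever-increasing octet counters (for example, `ifInOctets` in the standard IF-MIB), so utilization has to be derived from two samples taken some seconds apart. The sketch below shows one common way to do that calculation. The function names are illustrative, the counter values are assumed to have been collected already, and a counter that goes backwards is assumed to have wrapped once rather than to reflect a device restart.

```python
def counter_delta(previous: int, current: int, counter_bits: int = 32) -> int:
    """Difference between two counter samples, allowing for a single wrap.

    counter_bits is 32 for ifInOctets/ifOutOctets and 64 for the
    high-capacity ifHCInOctets/ifHCOutOctets counters.
    """
    if current >= previous:
        return current - previous
    # Assumption: the counter wrapped exactly once between the samples.
    return (2 ** counter_bits - previous) + current


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_seconds: float, if_speed_bps: float,
                        counter_bits: int = 32) -> float:
    """Average utilization of an interface, in one direction, over the interval."""
    if interval_seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("interval_seconds and if_speed_bps must be positive")
    bits_transferred = counter_delta(prev_octets, curr_octets, counter_bits) * 8
    return 100.0 * bits_transferred / (interval_seconds * if_speed_bps)
```

For instance, 37,500,000 octets received in 60 seconds on a 100 Mbps interface works out to 5% inbound utilization. On fast links, 32-bit counters can wrap more than once between polls, which is one reason the 64-bit counters are usually preferred where the device supports them.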
4. Disk monitoring:
Data is one of the most important resources for an organization. Organizations need data for business planning, as well as for smooth day-to-day functioning. That data also has to be stored, both as a record and for later use. In enterprises, data is collected and stored on storage arrays that have multiple disks. Any issues that arise on the disks or storage arrays that hold business data can have serious consequences for business continuity.
Disk monitoring includes managing disk space for effective utilization and monitoring disks for errors, large file stats, free space, changes in disk space usage, I/O performance, etc. Monitoring allows admins to plan system and capacity upgrades in advance, detect storage-related problems, and reduce downtime if an issue occurs.
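A simple free-space check of the kind described above can be sketched with Python's standard `shutil.disk_usage`. The function names and the 80% and 90% thresholds are assumptions chosen for illustration; suitable values depend on the storage system and how quickly it fills.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def classify_usage(percent_used: float,
                   warn_at: float = 80.0,
                   crit_at: float = 90.0) -> str:
    """Map a usage percentage to a status using assumed example thresholds."""
    if percent_used >= crit_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return "ok"
```

Calling `classify_usage(disk_usage_percent("/"))` on a schedule and recording the results over time also gives the trend data needed to plan capacity upgrades before space runs out.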
5. Hardware monitoring:
A network involves many hardware devices, such as those used for routing and switching, storage, connectivity, and application servers. The hardware forms the backbone of the entire IT infrastructure. If a piece of hardware critical to the day-to-day operations of the network goes down, the result is network downtime. For example, a faulty power supply on the core switch or overheating of the edge router can cause a network outage. To ensure the smooth functioning of the network, it is important to monitor the health and performance of hardware devices in the network.
To understand details about hardware health, there are multiple metrics that need to be monitored. Here are a few important metrics and why they should be monitored:
- CPU: Tasks for a device are handled by its CPU. If the CPU utilization reaches its maximum value, the device performance can take a hit
- Temperature: As a device performs more tasks, its CPU usage increases, which in turn can raise its temperature. Temperature spikes can cause a device to malfunction, thus bringing down the network
- Fan speed and status: Temperature and fan performance go hand-in-hand. Fan speed monitoring helps ensure the fans are working and cooling evenly, keeping the device temperature within its optimal range
- Power supply state: A faulty power supply or a spike in power to a device can cause it to malfunction, ultimately leading to downtime. Monitoring with alerts based on thresholds helps an admin find potential issues
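Threshold-based alerting on the metrics listed above can be sketched as follows. The metric names, the threshold values, and the `Threshold` and `evaluate` names are all hypothetical and chosen only for illustration; real limits should come from the hardware vendor's specifications. Note that fan speed is treated as a metric where lower readings are worse, unlike CPU and temperature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float
    higher_is_worse: bool = True  # False for metrics such as fan speed


# Hypothetical example limits, not vendor recommendations.
EXAMPLE_THRESHOLDS = {
    "cpu_percent": Threshold(warning=80.0, critical=90.0),
    "temperature_c": Threshold(warning=65.0, critical=80.0),
    "fan_rpm": Threshold(warning=2000.0, critical=1000.0, higher_is_worse=False),
}


def evaluate(readings: dict, thresholds: dict) -> dict:
    """Return a status ("ok", "warning", "critical") for each known metric."""
    statuses = {}
    for name, value in readings.items():
        limit = thresholds.get(name)
        if limit is None:
            continue  # no threshold defined for this metric
        if limit.higher_is_worse:
            critical = value >= limit.critical
            warning = value >= limit.warning
        else:
            critical = value <= limit.critical
            warning = value <= limit.warning
        statuses[name] = "critical" if critical else "warning" if warning else "ok"
    return statuses
```

With these example limits, a device reporting 95% CPU, 60 degrees Celsius, and a 1500 RPM fan would be flagged as critical for CPU, ok for temperature, and warning for fan speed, giving the admin a chance to act before the device fails.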