A Best Practice Guide to Monitoring and Observability
Introduction
Traditional monitoring cannot diagnose complex performance issues across today's hybrid environments, which span on-premises hardware, cloud services, and software-defined WAN (SD-WAN) links.
Network observability becomes essential for correlating metrics, logs, flows, and events to identify the root cause of incidents. Incorporating observability into traditional monitoring efforts enables teams to resolve issues before users become aware of them.
End-user experience monitoring is another benefit of advanced network monitoring. Device metrics alone don't reveal what users experience. Performance must be measured by how well the network delivers applications, not just uptime statistics.
This article outlines best practices for implementing network observability across modern hybrid infrastructures.
Summary of key network monitoring best practices
| Best Practices | Description |
|---|---|
| Design your monitoring architecture | Balance security needs with analytics capabilities when choosing between self-hosted, cloud, or hybrid deployments. Establish baselines to set meaningful alerts, detect gradual performance degradation, and distinguish normal traffic from potential DDoS attacks. |
| Collect, correlate, and analyze network data | Combine metrics, flow data, and logs for complete visibility. Correlate events on unified timelines to identify patterns, such as flapping routes causing application timeouts. Flow analysis reveals whether bandwidth spikes come from backups or data exfiltration. |
| Implement intelligent operations | Artificial intelligence for IT operations (AIOps)-powered capabilities such as anomaly-based alerting, alert correlation, and stacking require 30 – 90 days to learn your patterns and reduce hundreds of alerts to the few that matter. Combine with infrastructure-as-code (IaC) automation to detect, fix, and verify issues automatically. Start with read-only automation before enabling closed-loop remediation. |
| Monitor hybrid infrastructure | Track performance across on-premises, cloud, containers, and SD-WAN from a single platform. Focus on what users experience. Perfect infrastructure metrics are less meaningful if the user experience is degraded. |
| Manage configuration and compliance | Version-control every change and correlate configuration modifications with outages to speed troubleshooting. Implement zero-trust access with automated auditing for compliance. Most outages can be traced back to configuration changes or excessive access rights. |
| Plan for growth | Track utilization trends to identify when resources will hit critical thresholds based on current growth rates. Factor in business changes, such as cloud migrations, which can shift traffic patterns overnight. Plan 18 months ahead to ensure the network enables, rather than constrains, growth. |
Design your monitoring architecture
Choose your deployment model
The first step is determining the appropriate monitoring system deployment model based on your network environment and your organization's technical and business requirements.
A self-hosted solution, deployed within the corporate environment, monitors the network from within the firewall and offers deep visibility into private infrastructure without requiring inbound connections that compromise security. It's ideal for organizations with stringent compliance requirements or those that prefer to maintain sensitive data in-house.
Cloud-based SaaS monitoring platforms eliminate infrastructure costs and maintenance overhead. The scalability challenges that come with on-premises deployments, such as storage limitations, processing bottlenecks, and the cost of high-availability infrastructure, are handled by the provider. SaaS solutions can process millions of flow records or correlate events across thousands of devices far more economically; building equivalent on-premises infrastructure would require significant capital investment and specialized expertise.
On-premises monitoring solutions can perform predictive analytics and anomaly detection locally, even when air-gapped; when cloud-connected, they can also leverage broader data sets and more sophisticated models for enhanced insights. Many organizations combine the depth of internal monitoring for sensitive systems with cloud solutions for advanced analytics.
Establish performance baselines
Understanding what's "normal" in your network is essential for detecting anomalies. Without baselines, you can't distinguish between expected behavior and potential problems. Consider these scenarios:
- An admin receives 10 alerts in one minute about high interface utilization. Is this an incident or normal backup traffic?
- Management wants to run stress tests "when the network is quiet," but when exactly is that?
- A bandwidth spike occurs every Monday at 1 a.m. Is this planned maintenance or a misconfiguration?
A baseline defines healthy performance for your environment by tracking metrics such as latency, bandwidth utilization, error rates, and device CPU/memory usage over time. The baseline should reflect normal patterns across daily, weekly, and even seasonal cycles.
Key metrics to baseline include:
| Metric | Description |
|---|---|
| Bandwidth utilization | Average and peak usage, especially on WAN/Internet links |
| Latency | Delay in data transmission from source to destination |
| Packet loss | Percentage of packets failing to reach the destination |
| Jitter | Variation in packet delay |
| Device performance | CPU and memory utilization patterns |
Platforms such as SolarWinds® Observability, with its Network Performance Monitoring capabilities, automatically collect and store historical data for baseline analysis. For example, you can display average traffic per hour and weekday to identify traffic patterns. This reveals patterns such as: "daily backup jobs at 21:00," "weekend cloud replication at 01:00 Sunday," or "unexpected traffic spikes that need investigation."

Traffic baseline showing daily patterns. Notice the regular 21:00 backup spike and the unexpected Monday 01:00 spike that revealed a misconfiguration (source).
The chart above illustrates how baselines reveal both expected patterns (daily VM backups at 21:00 and Sunday cloud backups at 01:00) and anomalies (an unexpected Monday spike indicating a misconfiguration). Once established, baselines serve multiple purposes:
- Intelligent alerting: Combine baseline deviations with absolute thresholds and duration requirements (e.g., alert when traffic exceeds baseline AND hits 80% for 5+ minutes).
- Capacity planning: Track growth trends over time.
- Troubleshooting: Quickly identify when metrics deviate from normal.
- Change validation: Verify that changes don't negatively impact performance.
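The intelligent-alerting rule above (alert only when traffic deviates from the learned baseline AND stays above an absolute threshold for a sustained period) can be sketched in a few lines. The thresholds and deviation factor below are illustrative assumptions, not product defaults:

```python
from collections import deque

def should_alert(samples, baseline_pct, *, abs_threshold=80.0,
                 deviation_factor=1.5, min_minutes=5):
    """Alert only when utilization both exceeds the learned baseline
    by deviation_factor AND stays above abs_threshold for min_minutes.

    samples: most recent per-minute utilization percentages (newest last).
    baseline_pct: expected utilization for this hour/weekday from history.
    """
    if len(samples) < min_minutes:
        return False
    recent = list(samples)[-min_minutes:]
    sustained = all(s >= abs_threshold for s in recent)
    deviates = all(s >= baseline_pct * deviation_factor for s in recent)
    return sustained and deviates

# High utilization that matches the 21:00 backup baseline -> no alert
window = deque([85, 88, 86, 87, 90], maxlen=60)
print(should_alert(window, baseline_pct=80))   # False: expected pattern

# The same traffic against a quiet-hour baseline -> alert
print(should_alert(window, baseline_pct=20))   # True: sustained deviation
```

Requiring both conditions is what suppresses the "normal backup traffic" false positives described in the scenarios above while still catching genuine anomalies.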
Collect, correlate, and analyze network data
Consume a variety of data sources
Networks are made of various devices, including switches, routers, firewalls, load balancers, and cloud services, and each one speaks its own language. Some provide event logs, others expose metrics through APIs, and many still rely on classic protocols, such as SNMP. Many network devices provide flow-based exports that reveal traffic patterns across the network.
No single protocol provides complete visibility. For example, SNMP shows device status and interface counters, while flow data indicates bandwidth usage and traffic sources, and logs reveal security incidents or configuration changes. Since devices expose different types of data, monitoring solutions must use their native protocols and correlate them.
The table below illustrates some of the protocols and methods employed by monitoring systems to collect data from various network devices, tailored to the capabilities of these devices.
| Protocol/Method | What It Provides | Strengths | Limitations |
|---|---|---|---|
| SNMP | Device and interface metrics (CPU, memory, bandwidth, and errors) | Widely supported, standardized, low overhead | Limited granularity, polling-based, may miss transient events |
| Telemetry | Streaming metrics and state (device pushes data) | Real-time, efficient at scale, flexible | Requires newer device support, more complex setup |
| SSH | Command-line access to device metrics and configurations | Secure, flexible, deep visibility | Higher overhead, needs credentials, not scalable for high-frequency polling |
| Audit trails | Historical record of configuration and access changes | Provides accountability, supports compliance, links cause to effect | Needs integration with other monitoring sources to show the full impact |
| API | Metrics from modern systems and cloud services | Real-time, flexible, integrates with SaaS and virtualization | Vendor-specific implementations require maintenance |
| Syslog | Event messages from devices (errors, security events, and configuration changes) | Real-time, detailed, widely supported | High volume, requires parsing, noisy without filtering |
| Traffic flows | Detailed view of traffic conversations and patterns | Visibility into who/what uses bandwidth, supports analytics | Potentially resource-intensive; sampled data (sFlow) may miss details |
Correlate across data sources
Correlating information from multiple sources into a single unified dashboard or user view requires a system that:
- Understands the content of each protocol or method.
- Parses the content or message.
- Presents it on the same timeline with a unified scale.
The system should enable users to create their own widgets and dashboards tailored to the specific needs of each team.
One powerful way to correlate different sources during troubleshooting is by aligning them on the same timeline. When events from logs, audit trails, flows, and SNMP metrics are compared chronologically, patterns emerge. A bandwidth spike might coincide with an interface error or routing change. This reveals cause-and-effect relationships invisible in isolated data.
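At its core, aligning heterogeneous events on a common timeline means merging timestamped records from each source into one chronologically sorted stream. A minimal sketch, with event shapes invented for illustration rather than taken from any product's schema:

```python
from datetime import datetime

# Events from different sources, each tagged with its origin
syslog = [("2024-03-04T14:01:12", "syslog", "OSPF neighbor down on ge-0/0/1")]
snmp   = [("2024-03-04T14:01:15", "snmp",   "ifOutErrors spike on ge-0/0/1"),
          ("2024-03-04T14:05:00", "snmp",   "CPU 91% on core-rtr-1")]
flows  = [("2024-03-04T14:01:20", "flow",   "bandwidth spike toward 10.2.0.0/16")]

def unified_timeline(*sources):
    """Merge per-source event lists into one chronologically sorted view."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: datetime.fromisoformat(e[0]))

for ts, origin, msg in unified_timeline(syslog, snmp, flows):
    print(f"{ts}  [{origin:6}] {msg}")
```

Reading the merged stream makes the cause-and-effect chain obvious: the routing event precedes the interface errors, which precede the bandwidth spike.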
The diagram below shows the PerfStack™ feature of SolarWinds Observability in action. It combines metrics from different sources and aligns them by timestamp to the same timeline. Administrators can drag and drop metrics from the left menu and add them to the common timeline, creating a visual analysis of multiple readings simultaneously. This way, they can see how metrics are related and quickly identify the root cause of an issue.

PerfStack analysis from SolarWinds provides cross-domain correlation of data (source).
Analyze network traffic
Network traffic analysis is conducted for both performance management and security purposes. Network devices export flow data, or inline appliances capture it directly. By collecting these records, IT teams can answer questions such as, "Who consumes the most bandwidth?" or "Which application triggered a sudden traffic spike?"
Devices export traffic data in various ways. NetFlow, sFlow, IPFIX, and J-Flow are among the most common standards, each offering different levels of detail, scalability, and vendor support:
| Protocol | Origin | Method | Key Advantages | Limitations |
|---|---|---|---|---|
| NetFlow | Cisco | Records flows (Versions 5 and 9) | Widely supported, detailed flow information, good for traffic analysis | Vendor-specific (though widely adopted), higher overhead at scale |
| sFlow | Multi-vendor | Uses packet sampling | Scales extremely well on high-speed links, low overhead | Provides sampled vs. full data (less precise), may miss short flows |
| IPFIX | IETF | Records flows with flexible templates | Open standard, extensible, vendor-neutral | More complex to configure, not all devices fully support extensions |
| J-Flow | Juniper Networks (acquired by HPE in 2025) | Records flows | Integrated into Juniper devices, similar to NetFlow | Limited outside the Juniper ecosystem |
| Cloud / virtual private cloud (VPC) flow logs | Amazon Web Services (AWS) (VPC), Azure (NSG), Google Cloud Platform (VPC) | Offers native cloud logging | Easy SaaS integration, scalable, no device overhead | Cloud-specific formats, vendor lock-in, limited on-premises visibility |
Inline monitoring appliances inspect packets directly. This approach provides detailed visibility but comes with a few challenges:
- High traffic volumes can overwhelm collectors.
- Encryption can block payload inspection.
- Inline devices must be carefully deployed to avoid creating single points of failure.
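Answering "who consumes the most bandwidth?" from flow exports is essentially an aggregation of bytes per source address. A sketch using simplified, made-up flow tuples (real NetFlow/IPFIX records carry many more fields, such as timestamps, protocol, and interface indexes):

```python
from collections import Counter

# Simplified flow records: (src_ip, dst_ip, dst_port, bytes)
flows = [
    ("10.0.1.15", "203.0.113.9",  443, 1_200_000),
    ("10.0.1.15", "203.0.113.9",  443,   900_000),
    ("10.0.2.40", "10.0.9.5",     445, 4_500_000),  # SMB: backup or exfiltration?
    ("10.0.3.77", "198.51.100.2",  22,   300_000),
]

def top_talkers(records, n=3):
    """Sum bytes per source address and return the heaviest senders."""
    usage = Counter()
    for src, _dst, _port, nbytes in records:
        usage[src] += nbytes
    return usage.most_common(n)

for ip, nbytes in top_talkers(flows):
    print(f"{ip:12}  {nbytes / 1e6:.1f} MB")
```

Grouping by destination port or subnet instead of source address answers the complementary questions, such as which application triggered a traffic spike.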
Implement intelligent operations
As networks become increasingly complex, traditional alerting often generates more noise than value. IT teams can be overwhelmed by redundant or low-priority notifications, making it more difficult to identify the real issues. This is where AIOps can be particularly beneficial.
Deploy AIOps for smarter monitoring
An AIOps pipeline typically works in four stages:
- Observe: The AIOps platform ingests data from multiple sources, including events, metrics, logs, and traces.
- Visualize: After ingestion, AIOps analyzes the data to find patterns, separate critical signals from "noise," and present insights through dashboards and reports.
- Remediate: The platform identifies issues and generates alerts or tickets to notify teams of problems requiring attention.
- Automate: Based on analysis, the system triggers automated responses using scripts and runbooks to resolve known issues without human intervention.

AIOps platforms typically apply four types of analytics:
| Analytics Type | Description |
|---|---|
| Descriptive | "What happened on the network?" This analyzes collected data to show historical and real-time status, including visual dashboards, periodic reports, and alerts/events generated from the collected data. |
| Diagnostic | "Why did it happen?" This helps with root-cause analysis. |
| Predictive | "What is likely to happen on the network?" This helps with predictive maintenance and forecasting potential issues before they occur. For example, it uses historical data from network devices to identify when a device is likely to fail. |
| Prescriptive | "What should be done to fix or optimize the network?" This type of analytics can help recommend or automate corrective actions. |
AIOps transforms monitoring from reactive alerting to proactive intelligence. Instead of static thresholds that generate alert storms, AIOps learns your network's normal behavior patterns—traffic flows during business hours, backup spikes at night, and seasonal peaks—then detects meaningful deviations. Tools such as AlertStack correlate hundreds of related alerts into a single incident, showing you the root cause instead of flooding your inbox.
AIOps also spots subtle anomalies humans may miss: gradual latency creep, unusual protocol distributions, or off-hours data movements, which can indicate potential security issues. It also enables predictive management and selective automation. It forecasts when WAN links will reach capacity based on growth trends, identifies devices showing pre-failure patterns (rising cyclic redundancy check errors or memory leaks), and can safely automate routine fixes, such as bouncing high-error interfaces after hours or rebalancing traffic during congestion.
However, implementation requires patience. Most platforms need 30 – 90 days to establish accurate baselines, initial deployments require tuning to reduce false positives, and teams need training to effectively use AI-driven insights. The value comes from transforming noise into intelligence, allowing engineers to focus on strategic work instead of chasing false alarms.
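The alert-stacking idea, collapsing many related alerts into one incident, can be approximated by grouping alerts that share a resource within a short time window. This sketch is a deliberate simplification; real AIOps platforms use learned topology and correlation models rather than a fixed key and window:

```python
from collections import defaultdict
from datetime import datetime, timedelta

def stack_alerts(alerts, window=timedelta(minutes=5)):
    """Group alerts on the same resource within a time window into incidents."""
    incidents = defaultdict(list)
    for ts, resource, message in sorted(alerts):
        matched = None
        for (res, start) in incidents:
            if res == resource and ts - start <= window:
                matched = (res, start)
                break
        incidents[matched or (resource, ts)].append(message)
    return incidents

alerts = [
    (datetime(2024, 3, 4, 14, 1), "core-rtr-1", "BGP session flap"),
    (datetime(2024, 3, 4, 14, 2), "core-rtr-1", "interface errors"),
    (datetime(2024, 3, 4, 14, 3), "core-rtr-1", "high CPU"),
    (datetime(2024, 3, 4, 14, 2), "app-lb-2",   "health check failed"),
]
stacked = stack_alerts(alerts)
print(f"{len(alerts)} alerts -> {len(stacked)} incidents")  # 4 alerts -> 2 incidents
```

Even this naive grouping shows the payoff: an on-call engineer sees two incidents with a probable root resource each, not four independent pages.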
Automate network changes
Most network outages are caused by manual configuration errors. Modern automation combines infrastructure as code (IaC), configuration as code, and monitoring-driven responses to eliminate human error while maintaining consistency across environments.
Capabilities such as Network Configuration Management (NCM) from SolarWinds bridge the gap between monitoring and automation. When integrated with alerting systems, NCM can automatically respond to network issues without human intervention. For example, when monitoring detects a device problem, NCM can:
- Back up configurations immediately: NCM captures the current running configuration before any troubleshooting begins, creating a restore point if changes make it worse.
- Execute remediation scripts: NCM runs predefined command sequences to address common issues. If high CPU usage is detected, NCM can execute scripts to clear the Address Resolution Protocol cache, disable unused services, or adjust process limits.
- Validate recent changes: NCM can compare current configurations against baselines or previous versions to identify what changed before the problem started.
For example, if unusual Border Gateway Protocol (BGP) behavior on a border router is detected, the system triggers an NCM alert action that:
- Backs up the current configuration.
- Compares recent changes against the baseline.
- Executes a diagnostic script (e.g., `show bgp summary` and `show ip route`).
- Notifies the team of both the issue and the configuration differences.
If the comparison reveals an incorrect route-map applied the previous day, you immediately know the root cause. NCM can also execute a remediation script to revert the problematic change, though most organizations require human approval for such actions. When integrated with change management systems, every configuration modification is tracked, approved, and auditable.
Implementation should follow a graduated approach. Start with read-only actions, such as automated backups and change detection. Progress to diagnostic scripts that gather information but don't modify configurations. Only after proving reliability should you enable automated remediation, and always with appropriate approval workflows and rollback capabilities. Every automated action must log results, support rollback, and, for multi-device changes, require human approval.
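The graduated approach can be made explicit by assigning every automation action a tier and gating remediation behind human approval. The action names, tiers, and approval mechanism below are hypothetical illustrations, not NCM's API:

```python
from enum import Enum

class Tier(Enum):
    READ_ONLY = 1     # backups, change detection
    DIAGNOSTIC = 2    # gathers information, no config changes
    REMEDIATION = 3   # modifies device state

# Hypothetical action catalog mapping each action to its risk tier
ACTIONS = {
    "backup_config":    Tier.READ_ONLY,
    "show_bgp_summary": Tier.DIAGNOSTIC,
    "revert_route_map": Tier.REMEDIATION,
}

def run_action(name, *, approved=False, audit_log=None):
    """Execute an automation action; remediation requires human approval.
    Every executed action is recorded for rollback and audit purposes."""
    tier = ACTIONS[name]
    if tier is Tier.REMEDIATION and not approved:
        raise PermissionError(f"{name} needs human approval")
    if audit_log is not None:
        audit_log.append((name, tier.name, approved))
    return f"executed {name}"

log = []
run_action("backup_config", audit_log=log)           # fine: read-only
run_action("show_bgp_summary", audit_log=log)        # fine: diagnostic
try:
    run_action("revert_route_map", audit_log=log)    # blocked without approval
except PermissionError as e:
    print(e)
run_action("revert_route_map", approved=True, audit_log=log)
print(len(log), "actions audited")                   # 3 actions audited
```

Encoding the tiers in code (rather than in policy documents) means the safety boundary is enforced on every invocation, not just remembered.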
Monitor hybrid infrastructure
Networks are a complex mesh of on-premises hardware, multiple public clouds, container orchestration platforms, and SD-WAN overlays. A single application might run in Kubernetes pods on AWS, connect to databases in Azure, serve users through Cloudflare Content Delivery Network (CDN), and integrate with on-premises legacy systems via SD-WAN. Effective monitoring must span all these domains simultaneously.
Unify multi-domain visibility
Modern network monitoring platforms ingest data from traditional sources (SNMP and NetFlow) alongside cloud-native metrics (Prometheus, CloudWatch, and Azure Monitor) and container insights (Kubernetes metrics-server and service mesh telemetry). Without this unified view, teams can waste hours correlating between disconnected tools, often missing critical dependencies.
For example, a slowdown blamed on the WAN might stem from pod autoscaling delays in Kubernetes or API rate limiting in a cloud service.
Container networking adds increased complexity. Pods appear and disappear, services load balance dynamically, and network policies create microsegmentation. Monitoring must track the underlying host network and overlay networks, service mesh traffic (e.g., Istio and Linkerd), and east-west traffic between microservices. This requires native Kubernetes integration alongside traditional VM-level monitoring.
Focus on end-user experience
Infrastructure metrics reveal what your network is doing, but user experience metrics indicate whether it's actually working.
Traditional monitoring tracks device health, but users don't care if your core switch is at 5% CPU. They care that their video call is stuttering, the ERP system feels sluggish, or file uploads take forever. A network can exhibit perfect traditional metrics, while users experience poor application performance due to TCP window scaling issues, DNS resolution delays, or asymmetric routing problems.
Digital experience monitoring (DEM) bridges this gap by measuring what users are actually experiencing. It usually combines multiple data sources:
- Endpoint agents that measure application response from the user's device.
- Network path analysis that traces every hop between the user and the service.
- JavaScript injected into web applications that captures real browser rendering times.
| Aspect for Measurement | Explanation | Example |
|---|---|---|
| Application performance | Measure response times for critical business applications, not just ping times to servers. | Track how long it takes to load a dashboard, complete a transaction, or retrieve search results. |
| Service dependencies | Map the full path from user to application, including CDNs, load balancers, API gateways, and third-party services. | When performance degrades, quickly identify whether the issue is with your network, the application, or an external dependency. |
| Synthetic monitoring | Continuously test critical user journeys from multiple locations. | Detect problems before users report them by simulating logins, transactions, and workflows to identify potential issues. |
| Real user monitoring (RUM) | Collect performance data from actual user sessions. | Understand how network conditions affect different user populations, such as remote workers on VPN, branch offices using SD-WAN, or customers accessing public services. |
Implementing DEM requires strategic sensor placement. Here are a few best practices to keep in mind:
- Deploy monitoring from branch offices, not just data centers, since user experience varies by location.
- Monitor during actual business hours in each time zone, not simply when the IT team is awake.
- Track the full transaction path. For example, a slow database query looks identical to network latency from the user's perspective but requires completely different remediation.
- Set experience-based thresholds that reflect business impact. For example, a 200 millisecond delay might be acceptable for emails but catastrophic for high-frequency trading.
- Define service level agreements based on what users need to be productive, not what the network can theoretically deliver.
This user-centric approach reveals issues that traditional infrastructure monitoring often overlooks. By monitoring from the user's perspective, you can catch problems that matter to the business, not just those that trigger alerts.
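Experience-based thresholds are usually evaluated on percentiles per user population, because a healthy average can hide a badly degraded tail. A sketch with invented RUM samples, population names, and an assumed 95th-percentile SLA:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the sample at or below which pct% of values fall."""
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

# Page-load times (ms) from real user sessions, split by population
rum = {
    "hq_lan":       [120, 125, 128, 130, 133, 135, 138, 140, 142, 150],
    "branch_sdwan": [180, 185, 188, 190, 195, 200, 205, 210, 215, 220],
    "remote_vpn":   [250, 255, 258, 260, 265, 270, 275, 850, 900, 920],
}

SLA_P95_MS = 500  # assumed experience-based target

for population, samples in rum.items():
    p95 = percentile(samples, 95)
    mean = sum(samples) / len(samples)
    status = "OK" if p95 <= SLA_P95_MS else "BREACH"
    print(f"{population:13} p95={p95:4d} ms  mean={mean:5.1f} ms  {status}")
# remote_vpn breaches on p95 despite a mean of ~450 ms, under the SLA
```

This is exactly the pattern the remote-worker population tends to show: most sessions are fine, but the slow tail is what generates tickets.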
Manage configuration and compliance
The 2023 Uptime Resiliency Survey highlights that configuration failures (45%) and third-party provider issues (39%) are the leading causes of outages. A single routing update or access control list (ACL) modification can trigger cascading failures hours later, with no obvious connection to the original change.
Audit configuration changes
Modern configuration management combines policy as code (PaC) with continuous auditing. Instead of tracking changes after the fact, IaC ensures all changes are version-controlled, peer-reviewed, and tested before deployment. Git repositories become the single source of truth for network state, making every change traceable to a specific commit, ticket, and approver.
When troubleshooting, configuration audit trails prove invaluable. Monitoring platforms should correlate performance degradation with recent changes, highlighting potential causes. For example, if latency spikes at 2 p.m., the system should immediately show the BGP policy changed at 1:45 p.m. This correlation transforms troubleshooting from guesswork to targeted investigation.

Auditing configuration changes on devices while troubleshooting service issues (source).
The visualization above demonstrates this principle: a configuration change on Router 9 directly caused NetSuite connectivity issues. Proper auditing makes the correlation immediately visible, enabling rapid rollback to restore service.
Effective configuration management requires:
- Version control: Every configuration is stored in Git with a commit history.
- Automated backups: Configurations are captured before and after each change.
- Compliance validation: PaC engines (Open Policy Agent and HashiCorp Sentinel) enable the rejection of noncompliant configurations before deployment.
- Drift detection: Continuous comparison is made between running configurations and intended state, alerting on unauthorized changes.
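Drift detection from the list above reduces to diffing the intended configuration (from version control) against what the device is actually running. A minimal sketch using Python's standard difflib; the configuration snippets are illustrative:

```python
import difflib

intended = """\
interface GigabitEthernet0/1
 description uplink-to-core
 ip address 10.0.0.1 255.255.255.252
 no shutdown
"""

running = """\
interface GigabitEthernet0/1
 description uplink-to-core
 ip address 10.0.0.1 255.255.255.252
 ip access-group TEMP-DEBUG in
 no shutdown
"""

def detect_drift(intended_cfg, running_cfg):
    """Return unified-diff lines if the running config drifted, else []."""
    return list(difflib.unified_diff(
        intended_cfg.splitlines(), running_cfg.splitlines(),
        fromfile="git/intended", tofile="device/running", lineterm=""))

drift = detect_drift(intended, running)
if drift:
    print("Unauthorized change detected:")
    print("\n".join(drift))
```

Run on a schedule against every managed device, a diff like this one (a leftover debug ACL) becomes an alert instead of a surprise during the next outage.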
Monitor access rights
Your network is only as secure as your weakest access control. While firewalls guard the perimeter, excessive internal permissions create highways for lateral movement once an attacker gains initial access. Zero-trust principles must extend to network management itself: instead of granting broad visibility by default, ask why a given operator should be able to see financial server configurations or security appliance settings.
Implement role-based access control with surgical precision: network operators view specific device groups, senior engineers modify approved equipment classes, and only architects access core infrastructure. During maintenance windows, temporarily elevate privileges rather than maintaining standing administrative access that attackers could exploit.
Compliance elevates access monitoring from a nice-to-have to a legal requirement. For instance, HIPAA mandates tracking who accessed patient data systems, PCI DSS requires quarterly access reviews, and SOC 2 demands continuous authorization monitoring. Manually tracking permissions across hundreds of devices and thousands of users is impossible at scale.
Access rights management platforms, such as SolarWinds Access Rights Manager, automate permission auditing across Active Directory, network devices, and cloud platforms. They generate compliance reports that satisfy auditors while alerting administrators to excessive permissions or unusual access patterns. Integration with configuration management creates defense in depth.
Plan for growth
| Resource | Upgrade Threshold | Warning Signs | Considerations |
|---|---|---|---|
| WAN / Internet links | 70% sustained utilization (business hours) or 90% daily peaks | TCP retransmissions, queue drops, increasing jitter | Pay extra attention to cloud interconnects (ExpressRoute, Direct Connect) due to rapid growth |
| Switch port density | <20% of ports available per closet | Power over Ethernet capacity limits | Plan for three to five devices per employee, plus expansion for IoT devices |
| Device CPU/memory | 50% for edge devices (Network Address Translation / encryption), 70% for core switches | Microbursts, process crashes, slow command-line interface response | Monitor at one-second intervals, not just five-minute averages |
| Cloud/container resources | 80% of quotas/limits | API rate limit errors, pod scheduling failures, elastic IP exhaustion | Track soft limits (namespace quotas and API rates) that cause hidden failures |
These thresholds provide starting points, but context matters. A core switch at 70% CPU utilization during planned backup windows is acceptable, while an edge router at 50% CPU utilization during normal operations signals trouble.
Planning requires more than raw metrics. Consider the business context, including upcoming acquisitions, new applications, seasonal patterns, and digital transformation initiatives. A new video conferencing rollout might triple WAN requirements, and a cloud migration might shift traffic patterns entirely. Build expansion plans with 18-month horizons, reviewing quarterly as conditions change. The goal is to make sure the network never constrains business growth.
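A first-order capacity forecast extrapolates the utilization trend linearly to the upgrade threshold. The sketch below fits a least-squares line to invented monthly WAN utilization averages and estimates how long until the 70% threshold from the table is crossed; real planning should layer business context on top of this:

```python
def months_until_threshold(history, threshold_pct):
    """Fit a simple linear trend (least squares) to monthly utilization
    averages and estimate months until the threshold is crossed."""
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # flat or shrinking usage: no projected crossing
    return (threshold_pct - history[-1]) / slope

# Monthly average utilization of a WAN link (%)
wan_history = [42, 44, 47, 49, 52, 55]

remaining = months_until_threshold(wan_history, threshold_pct=70)
print(f"~{remaining:.0f} months until 70% at current growth")  # ~6 months
```

A six-month runway against an 18-month planning horizon is precisely the signal that the procurement conversation should start now, not when the link saturates.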
Final Thoughts
Network observability requires correlating multiple data sources (flows, metrics, logs, and traces) across increasingly complex hybrid infrastructures. No single data point tells the complete story. When an application slows, the root cause might be a misconfigured cloud security group, an oversubscribed WAN link, or a container hitting memory limits. By correlating these disparate signals, teams can quickly identify and resolve issues that span traditional boundaries.
Success depends on striking a balance between comprehensive visibility and practical action. Collect everything, but use AIOps and automation to surface what matters. Build baselines to understand the normal range, then detect meaningful deviations.
Monitor infrastructure health while prioritizing user experience. Implement automation gradually, starting with safe, read-only operations before progressing to closed-loop remediation. Most importantly, treat observability as an ongoing practice, not a one-time implementation.