A Best Practice Guide to Monitoring and Observability
Introduction
Your application is where code meets user intent: where button clicks become database queries, where APIs coordinate workflows, and where slow checkouts cost you customers. When any component degrades, whether through a database slowdown, an API rate limit, or a memory leak, symptoms typically appear here first.
Application performance monitoring (APM) is your early warning system. Yet teams often drown in traces while missing critical issues. They chase phantom problems while real errors hide behind misleading "200 OK" responses. Effective APM requires measuring what matters.
The RED metrics framework—Rate (throughput), Errors (failures), and Duration (latency)—provides that foundation. For example, login errors spiking from 0.1% to 5% or checkout times jumping from 2 to 8 seconds (s) signal user-facing issues in real time, not after the customers have left.
The following six field-tested practices help you build an application performance monitoring system that catches issues before customers complain.
Summary of application performance monitoring best practices
| Best Practice | Description |
|---|---|
| Focus on user impact | Focus monitoring efforts on time-series metrics that directly reflect the user's journey and the business impact. |
| Sample where possible | Implement intelligent sampling to preserve visibility into critical transactions while keeping costs sustainable. |
| Propagate context | Distributed tracing requires all services to share request identifiers uniformly. |
| Consider semantics | Ensure semantic correctness; it enables automation, machine-learning (ML) detection, and accurate service-level objective (SLO) calculations. |
| Fail fast but gracefully | Protect your system from cascading failures with timeouts and circuit breakers applied in a controlled, nondisruptive way. |
| Monitor in depth and breadth | Achieve holistic monitoring by combining extensive service coverage with the ability to diagnose performance at the code level. |
Focus on user impact
Not all transactions are equal. Login and checkout failures cost revenue, while a slow "About Us" page merely annoys users. Prioritize monitoring accordingly. Critical paths tend to be complex, involving databases, message queues, payment gateways, and third-party APIs, where failures cascade into user frustration.
Both health state, a feature of SolarWinds Observability Application Performance Monitoring, and Application Performance Index (Apdex) provide valuable ways to assess the health and performance of your entities or services.
Health state for service health
Health state offers a holistic evaluation by considering an entity’s typical performance baseline, detected anomalies, and triggered alerts. When performance deviates from expected norms, such as through anomaly detection or key metric alerts, the health state reflects this shift in real time. It’s presented in categories—Good (Green), Moderate (Yellow), Bad (Red), or Unknown (Gray)—providing clear visual cues for immediate attention. This method is especially effective for real-time insight into overall service health and quick triage of critical issues.
Apdex
Apdex scores translate performance into user satisfaction ratios. The formula is:
Apdex = (Satisfied + Tolerating / 2) / Total Requests
Satisfied requests complete within your target threshold (e.g., under 500 milliseconds [ms] for a consumer application). Tolerating requests take up to four times your threshold (500ms – 2s). Frustrated requests exceed four times the threshold or result in errors. For example, with a 500ms target, a 400ms response counts as Satisfied (score: 1), a 1.5-second response as Tolerating (score: 0.5), and a 3-second response as Frustrated (score: 0).
Set your threshold based on your application's context. Industry benchmarks suggest maintaining an Apdex score above 0.85 for a satisfactory user experience, although your specific threshold depends on user expectations and the competitive landscape. Internal tools might tolerate 2 seconds, while consumer applications demand sub-second response.
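The Apdex calculation above can be sketched directly from response times. This is a minimal illustration, and the 500 ms threshold is just the example value used earlier, not a universal default:

```python
def apdex(durations_ms, threshold_ms=500):
    """Compute an Apdex score: (satisfied + tolerating / 2) / total.

    Satisfied:  duration <= threshold
    Tolerating: threshold < duration <= 4 * threshold
    Frustrated: anything slower (errored requests would also count here)
    """
    satisfied = sum(1 for d in durations_ms if d <= threshold_ms)
    tolerating = sum(1 for d in durations_ms if threshold_ms < d <= 4 * threshold_ms)
    return (satisfied + tolerating / 2) / len(durations_ms)

# 400 ms -> Satisfied, 1500 ms -> Tolerating, 3000 ms -> Frustrated
print(apdex([400, 1500, 3000]))  # 0.5
```

Against the 0.85 benchmark mentioned below, a score of 0.5 would signal a clearly degraded experience.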
Real user monitoring
Real user monitoring (RUM) captures the actual browser experience. Google Core Web Vitals provide research-backed targets:
- Interaction to Next Paint (INP): Keep under 200ms for responsive interactions.
- Largest Contentful Paint (LCP): Keep main content load under 2.5 seconds.
- Cumulative Layout Shift (CLS): Target under 0.1 for visual stability.
Additional performance indicators:
- First Contentful Paint (FCP): Under 1.8 seconds signals initial response.
- Time to Interactive (TTI): Under 3.8 seconds ensures full interactivity.
While secondary to INP, LCP, and CLS, these metrics help diagnose specific performance bottlenecks.
Every 100ms of latency could cost up to 1% in sales, and delays in search results could cause a significant traffic drop. Track your own correlations between latency and conversion rates to build your business case.

The SolarWinds® Observability SaaS APM function displays ML-based health indicators with RED metrics, including request rate, error ratio, and average response time.
Modern APM platforms visualize these user impact signals, often referred to as RED (rate, errors, and duration) metrics, in unified dashboards. The SolarWinds APM dashboard, for example, displays RED metrics at a glance alongside ML-powered health indicators, enabling faster triage and decision-making. SolarWinds APM leverages ML to baseline normal behavior for your specific application and then alerts on deviations, a far more actionable approach than arbitrary static thresholds.
Synthetic monitoring
Synthetic monitoring proactively tests critical transactions before users encounter problems. Beyond simple uptime checks, transaction monitoring validates complete user workflows—from login through checkout to payment confirmation.
Synthetic tests provide a consistent way to reproduce issues for developers debugging web applications. When a user reports "checkout sometimes fails," synthetic transaction tests isolate whether the problem stems from specific payment methods, inventory states, or third-party service failures.
Run transaction tests every 5 minutes from multiple geographic locations:
- Multi-step user journeys: signup → email verification → profile setup
- Critical business transactions: add to cart → apply discount → process payment
- API workflow sequences: authenticate → fetch data → update records
- Third-party integration flows: payment gateway → fraud check → inventory update
Transaction monitoring catches complex failures that simple endpoint checks miss. A /checkout endpoint might return 200 OK responses while the actual purchase flow fails due to a misconfigured payment gateway. Only full transaction testing reveals these hidden failures.
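The "200 OK but the flow is broken" failure mode can be made concrete with a small sketch of a multi-step synthetic check. The step names and lambda stand-ins below are hypothetical; a real monitor would drive a headless browser or API client at each step:

```python
def run_synthetic_flow(steps):
    """Run an ordered list of (name, check) steps; stop at the first failure.

    Each check is a zero-argument callable returning True/False. In a real
    synthetic monitor these would execute browser or API actions.
    """
    for name, check in steps:
        if not check():
            return f"FAILED at step: {name}"
    return "OK"

# Hypothetical checkout flow: the endpoint "is up," but payment is broken
flow = [
    ("load /checkout", lambda: True),    # a simple uptime check passes
    ("add to cart", lambda: True),
    ("process payment", lambda: False),  # misconfigured payment gateway
]
print(run_synthetic_flow(flow))  # FAILED at step: process payment
```

A plain endpoint check would only run the first step and report healthy; the full-flow test surfaces the broken payment step.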

The SWOPPER (example website) real user monitoring page shows that the Largest Contentful Paint (the slowest-loading object) takes up to 8 seconds on some loads.
The SWOPPER demonstration above shows real impact: an 8-second LCP is more than three times the 2.5-second threshold Google recommends. Research shows 53% of mobile users abandon sites that take over 3 seconds to load.
Remember that users don't experience averages. Your p50 latency might be 200ms, but if p99 is 10 seconds, 1% of users suffer, potentially thousands per hour. Monitor percentiles (p50, p90, p95, and p99), not just means. Set alerts based on user tolerance and not arbitrary numbers.
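A quick sketch shows why means and medians hide tail pain. This nearest-rank percentile helper uses only the standard library; the sample latencies are illustrative:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    s = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]

# 95 fast requests and 5 ten-second outliers
latencies = [200] * 95 + [10_000] * 5
print(percentile(latencies, 50))  # 200
print(percentile(latencies, 95))  # 200
print(percentile(latencies, 99))  # 10000
```

Here p50 and even p95 look healthy at 200 ms, while p99 reveals the 10-second experience that 5% of users actually get.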
Sample where possible
Observability costs scale rapidly with data volume. A single transaction can generate hundreds of spans across dozens of services. Multiply that by millions of requests, and you’re quickly overwhelmed with data and costs. Sampling becomes a crucial technique to preserve visibility while keeping budgets sustainable.
Rather than recording every trace, apply intelligent sampling that aligns with your system's needs and priorities. Use sampling rules to maintain complete visibility for critical transactions, such as errors, high-latency requests, or revenue-impacting endpoints, while reducing data from routine background operations or low-value endpoints.
"agent.transactionSettings": [
{
"regex": "CLIENT:GET",
"tracing": "disabled"
},
{
"regex": "INTERNAL:Task\\.run",
"tracing": "disabled"
},
{
"regex": "http://localhost.*",
"tracing": "disabled"
},
{
"regex": ".*/ping",
"tracing": "disabled"
}
]
This configuration filters out noncritical traces, such as health checks and internal tasks, allowing your observability system to focus on the transactions that matter most to your business.
Head-based sampling works by making the sampling decision at the request entry point, based on rules such as the sampling rate or request attributes. This approach is more straightforward and predictable than tail-based sampling, where the decision is made after collecting the full trace.
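The head-based decision can be sketched in a few lines. This simplified stand-in mirrors the idea behind OpenTelemetry's trace-ID-ratio sampler (it is not SolarWinds configuration): the keep/drop decision is derived deterministically from the trace ID, so every service in the trace agrees:

```python
def head_sample(trace_id_hex, rate):
    """Head-based sampling: decide once at the entry point from the trace ID,
    so downstream services can reproduce the same keep/drop decision."""
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random value
    bound = int(rate * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound

tid = "4bf92f3577b34da6a3ce929d0e0e4736"
print(head_sample(tid, 1.0))  # True: rate 1.0 keeps every trace
print(head_sample(tid, 0.0))  # False: rate 0.0 drops every trace
```

Because the decision is a pure function of the trace ID, repeated calls for the same trace always agree, which is exactly what keeps a sampled trace complete across services.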
Apply the same discipline to logging
Production systems should filter aggressively. INFO and DEBUG logs that show "everything is fine" provide minimal value while consuming significant storage. A typical production hierarchy is as follows:
- ERROR: Always log for user-impacting failures, lost data, and security events.
- WARN: Always log for degraded performance, successful retries, and approaching limits.
- INFO: Sample or disable for successful operations and routine events.
- DEBUG: Disable for production systems and enable selectively when troubleshooting issues.
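The hierarchy above maps naturally onto a logging filter. This sketch uses Python's standard `logging.Filter`; the 1% INFO sample rate is an assumption you would tune to your own volume:

```python
import logging
import random

class ProductionLogFilter(logging.Filter):
    """Keep every WARN/ERROR, sample INFO, and drop DEBUG in production."""

    def __init__(self, info_rate=0.01):
        super().__init__()
        self.info_rate = info_rate  # assumed sample rate for routine events

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # always keep user-impacting and degraded-state events
        if record.levelno == logging.INFO:
            return random.random() < self.info_rate  # sample "all is fine" noise
        return False  # drop DEBUG entirely in production
```

Attached to a handler with `handler.addFilter(ProductionLogFilter())`, this keeps the signal (errors, warnings) at full fidelity while cutting storage for routine chatter.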
Avoid logging activity without context. A log entry that merely says "Request processed" provides little actionable information. Instead, structure logs to capture RED metrics:
```json
{
  "level": "ERROR",
  "transaction_id": "abc123",
  "endpoint": "/api/checkout",
  "duration_ms": 8432,
  "status_code": 500,
  "user_id": "user_789",
  "error": "Payment gateway timeout after 5000ms"
}
```
This single log entry provides actionable information, covering which user was impacted, what failed, how long it took, and most importantly: why it failed.
Balance visibility with cost
Calculate your sampling budget. If you process 10 million requests daily and your APM vendor charges $0.00015 per trace:
- 100% sampling: $1,500/day ($45,000/month)
- 10% sampling: $150/day ($4,500/month)
- Smart sampling (1% normal, 100% errors): ~$30/day ($900/month)
That budget difference can fund additional engineering headcount to fix problems rather than simply observe them. Sample intelligently, keeping what helps you diagnose issues. However, keep in mind that some regulated industries and scenarios require 100% tracing for compliance, including:
- Financial transactions and payment processing (PCI DSS compliance)
- Healthcare operations (HIPAA audit requirements)
- Security events (forensic analysis)
- User authentication (security auditing)
For these cases, consider separate data streams: 100% logging for compliance and sampled tracing for performance analysis.
Propagate context
Context enables the correlation of signals (traces, logs, and metrics). Propagate context between services to gain a full-stack view of how a request flows through your system, along with the logs and metrics generated by that request.
Implement the W3C trace context standard
The W3C standard defines HTTP headers that carry trace information across service boundaries. Every request gets a traceparent header with four components. For example:
00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
Each component serves a purpose:
- 00: Version
- 4bf92f3577b34da6a3ce929d0e0e4736: 128-bit trace ID
- 00f067aa0ba902b7: 64-bit parent span ID
- 01: Trace flags (sampled or not)
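A small parser makes the layout concrete. This sketch handles only the well-formed case from the W3C example above; a production implementation would also validate field lengths and reject all-zero IDs:

```python
def parse_traceparent(header):
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,          # 128-bit, hex-encoded
        "parent_span_id": parent_id,   # 64-bit, hex-encoded
        "sampled": int(flags, 16) & 0x01 == 1,  # low bit is the sampled flag
    }

ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx["trace_id"])  # 4bf92f3577b34da6a3ce929d0e0e4736
print(ctx["sampled"])   # True
```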
Here's how to implement trace propagation:
Service A (initiating request):

```python
import requests
from opentelemetry import trace
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Set up once at startup
RequestsInstrumentor().instrument()

# Tracing happens automatically
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process_order") as span:
    # The traceparent header is automatically injected
    response = requests.post("http://inventory-service/check", json=order_data)
```

Service B (receiving request):

```python
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor

# Set up once at startup
FlaskInstrumentor().instrument_app(app)

@app.route('/check', methods=['POST'])
def check_inventory():
    # Context is automatically extracted from incoming headers
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("inventory_check") as span:
        # Use parameterized queries
        cursor.execute(
            "SELECT stock FROM inventory WHERE item_id = ?",
            (request.json['item_id'],)
        )
        return result
```
Extend context to logs and metrics
Structure your logs to include trace context automatically.
```json
{
  "timestamp": "2024-03-15T10:30:00Z",
  "level": "ERROR",
  "message": "Payment gateway timeout",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "service": "payment-service",
  "duration_ms": 5000
}
```
The correlation between traces and logs enables powerful debugging. When a user reports an error, search for their trace ID to see:
- The complete request path across all services.
- Logs from each service in the call chain.
- Performance metrics for each span.
- Exact failure point and error messages.

A span graph showing durations of component calls.
SolarWinds natively supports W3C trace context, automatically generating flame graphs and waterfall visualizations. The visualizations show exactly where time is spent: 50ms in the API gateway, 200ms in business logic, and 2 seconds waiting for the database, revealing your bottlenecks immediately.
Connect trace context to full-stack observability
Trace context bridges application monitoring with infrastructure metrics. When your trace shows a slow database query, correlate it with CPU spikes on your database server. When spans show increased latency, check if it corresponds with memory pressure or network congestion. Correlating these signals transforms isolated data points into complete incident narratives.
With proper context propagation, you gain end-to-end visibility. Without it, you might see errors in Service C without knowing they originated from Service A's malformed request, or optimize the wrong service because you can't see the full request flow. Proper propagation leaves a breadcrumb trail you can follow from symptom to root cause.
Consider semantics
Semantic correctness in your application's responses enables accurate monitoring, ML detection, and SLO calculations. This is primarily your responsibility as the application owner. APM tools capture what your services emit, following OTEL standards.
SolarWinds follows OTEL patterns where trace status is "UNSET," "ERROR," or "OK." It captures whatever HTTP status codes your application returns (200, 201, 204, 302, 304, 308, 500, etc.). It records these responses but doesn't create custom response code rules or modify what your application sends.
Your application design determines monitoring accuracy:
- Return accurate HTTP status codes (429 for rate limits, not 400 or 200).
- Structure error responses with both machine-readable codes and human-readable messages.
- Use appropriate status codes for different scenarios (503 for temporary unavailability, not 500).
- Include trace context in error responses for correlation.
When your APIs emit semantically correct responses, APM tools can properly categorize issues, calculate error rates, and trigger appropriate alerts. A 200 OK response with an error message in the body appears successful to monitoring tools, hiding real problems from your dashboards.
When a service produces an error, honoring HTTP semantics provides a shared language to discuss the state of the request, reducing confusion. A service returning a 200 header with a body of {"status": "error"} is hard to decipher out of context. Status codes should indicate the state of the request, while the error body provides insight into why the state is in error.
| Scenario | Wrong Response | Correct Response | Why It Matters |
|---|---|---|---|
| Invalid API key | 500 Internal Server Error | 401 Unauthorized | Distinguishes authentication issues from server problems |
| Rate limit exceeded | 400 Bad Request | 429 Too Many Requests | Enables proper backoff and retry logic |
| Customer not found | 404 Not Found | 204 No Content | Differentiates missing resources from empty results |
| Payment declined | 200 OK with error body | 402 Payment Required | Allows proper error handling without parsing the body |
| Validation failed | 500 Internal Server Error | 422 Unprocessable Entity | Separates client errors from server failures |
Error responses must be structured consistently to provide machines with parsable codes, humans with readable messages, trace IDs for debugging, and documentation links for resolution. For example:
```json
{
  "error": {
    "code": "INSUFFICIENT_INVENTORY",
    "message": "Cannot fulfill order: only 3 items in stock, 5 requested",
    "details": {
      "available": 3,
      "requested": 5,
      "item_id": "SKU-12345"
    },
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "documentation": "https://api.example.com/docs/errors#insufficient-inventory"
  }
}
```
Consider nuanced cases carefully
An API querying a database for a customer presents a semantic choice. If the customer doesn't exist:
- 404 Not Found implies "this customer ID is invalid, no need for additional queries."
- 204 No Content means "valid query, no results currently."
The distinction between 404 and 204 matters. Use 404 when the resource identifier is wrong (e.g., an invalid customer ID format). Use 204 when the query is valid but returns no results (i.e., no customers match your filter). Your choice affects how clients handle retries, how monitoring interprets errors, and how developers debug issues.
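The 404-versus-204 decision can be captured in a small helper. Everything here is hypothetical for illustration: the `CUST-` followed by six digits ID format and the function name are assumptions, not a real API:

```python
import re

def customer_query_status(customer_id, matches):
    """Hypothetical status-code choice for a customer lookup.

    404: the identifier itself is invalid -- clients shouldn't retry as-is.
    204: the query is valid but currently has no results.
    200: results found.
    """
    if not re.fullmatch(r"CUST-\d{6}", customer_id):  # assumed ID format
        return 404
    return 200 if matches else 204

print(customer_query_status("not-an-id", []))       # 404
print(customer_query_status("CUST-000042", []))     # 204
print(customer_query_status("CUST-000042", ["x"]))  # 200
```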
Operational consequences of poor semantics
Poor semantics can waste debugging time and break automation. Teams chase phantom server issues when malformed client input returns a 500 error instead of a 400 error. Self-healing systems that restart on 5xx errors won't help if you return 200 for failures. Circuit breakers won't trip if errors hide in response bodies.
ML models trained to recognize success patterns can't detect anomalies when failures return 200 status codes, missing problems that should trigger alerts.
Incorrect status codes also distort SLO calculations and error budgets. Marking client errors as server errors can make your service appear unreliable.
Accurate semantics form the contract between your services. When Service A receives 429 from Service B, it knows to back off. When it receives 503, it knows Service B is temporarily unavailable. When it receives 401, it knows to refresh credentials. Keep the semantics in line with the intended contract to maintain automation, monitoring, and debugging capabilities across your entire system.
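The inter-service contract described above can be encoded as a simple dispatch on the caller's side. This is a sketch of the decision logic only, not a full retry client:

```python
def reaction(status_code):
    """Map a downstream status code to the caller's next action."""
    actions = {
        429: "back off and retry after the indicated delay",
        503: "retry later; the service is temporarily unavailable",
        401: "refresh credentials before retrying",
    }
    if status_code in actions:
        return actions[status_code]
    if 500 <= status_code < 600:
        return "count against the error budget and alert"
    return "treat as a client-side problem; do not retry blindly"

print(reaction(429))  # back off and retry after the indicated delay
```

Note that none of this works if failures hide behind 200 responses: the dispatch never fires, which is exactly the operational cost of poor semantics.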
Fail fast but gracefully
Services that fail silently are difficult to debug. A request that times out but returns a 200 OK status looks healthy to monitoring, while users see broken pages. The disconnect between monitoring and reality often hides issues until they escalate into incidents.
An example implementation for failing fast
To implement fast-failing patterns, set up timeout chains that prevent cascading failures. In the example below, each layer times out before its caller does, preventing hung requests from indefinitely consuming resources.
```yaml
service_timeouts:
  # Fast user-facing operations
  frontend: 500ms
  backend: 300ms
  database: 200ms
  # For comparison: slower operations
  payment_processing: 10s
  report_generation: 30s
```
Implement circuit breakers to stop hammering failing services.
```python
from pybreaker import CircuitBreaker

db_breaker = CircuitBreaker(fail_max=5, reset_timeout=60)

@db_breaker
def get_user(user_id):
    return database.query("SELECT * FROM users WHERE id=?", (user_id,))

# After 5 failures, the circuit opens for 60 seconds and
# fast-fails instead of waiting for timeouts
```
Structure errors so they’re readable by humans and parsable by computers.
```json
{
  "error": {
    "code": "INVENTORY_TIMEOUT",
    "message": "Inventory service did not respond within 8 seconds",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "retry_after": 30,
    "fallback": "showing cached inventory counts"
  }
}
```
Finally, implement graceful degradation.
```javascript
async function getProductDetails(productId) {
  try {
    return await inventoryService.get(productId, { timeout: 2000 });
  } catch (error) {
    logger.warn(`Inventory service timeout: ${error.message}`);
    // Degrade gracefully instead of showing an error page
    return {
      ...basicProductInfo,
      stock: "Check availability",
      delivery: "2-5 business days"
    };
  }
}
```

What failure patterns look like in practice
The screenshot below illustrates how latency issues manifest differently from hard failures. The service overview reveals synchronized latency spikes across multiple services, yet no increase in error rate.

SolarWinds service overview page showing service health degradation.
Such synchronized patterns typically indicate resource contention or blocking operations rather than failures. The transaction trace view shows the same spikes in latency, with some traces running almost 10 seconds.

SolarWinds transaction trace view.
Drilling into specific traces reveals the bottleneck: inventorycontroller.getproducts is consuming 99% of the request time. The trace shows application code execution, not database slowness. The investigation revealed that Chaos Monkey endpoints were being triggered, injecting artificial latency.

Span Details of the server call to GET product/list.
The example illustrates a crucial principle: feature flags and configuration changes can introduce latency without causing errors. Because the requests technically succeed, basic circuit breakers miss latency issues; use latency-aware circuit breakers that trip on slow responses. Your error monitoring stays green while users abandon slow pages, which is why you must monitor all RED metrics, not errors alone.
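A latency-aware breaker differs from the pybreaker example earlier in one key way: slow successes count as failures. This is a minimal sketch of that idea, with assumed thresholds, not a production implementation:

```python
import time

class LatencyBreaker:
    """Sketch of a latency-aware circuit breaker: responses slower than
    `slow_ms` count as failures even when they technically succeed."""

    def __init__(self, slow_ms=2000, fail_max=5, reset_timeout=60):
        self.slow_ms = slow_ms
        self.fail_max = fail_max
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def record(self, duration_ms, ok=True):
        if ok and duration_ms <= self.slow_ms:
            self.failures = 0  # a fast, healthy response resets the count
        else:
            self.failures += 1  # slow *or* failed responses both count
            if self.failures >= self.fail_max:
                self.opened_at = time.monotonic()  # open the circuit

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            self.opened_at, self.failures = None, 0  # half-open: try again
            return True
        return False
```

In the Chaos Monkey scenario above, 10-second responses with 200 status codes would trip this breaker even though a failure-counting breaker would never open.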
Without proper fail-fast patterns, these slow requests consume resources for 10 seconds each, potentially causing cascade failures under load. With timeouts and circuit breakers, the system degrades gracefully, serving cached data or default responses rather than hanging.
Code profiling capabilities in modern APM tools immediately show which code path executes slowly, down to the specific method and line number. Instead of hunting through services, you can observe where time is spent.
Monitor in depth and breadth
Surface-level observations tell you something is wrong, but not why. Effective monitoring requires both depth and breadth.
Monitoring in depth
Depth means tracing requests through every component they touch. Proper trace propagation and context-aware logging are required to follow issues from symptoms to root cause.
What to capture at each layer:
- Application code: Method execution times, memory allocations, CPU profiling
- Database queries: Query execution plans, lock wait times, connection pool status
- External API calls: Response times, retry counts, circuit breaker states
- Message queues: Queue depth, processing lag, poison message counts
Modern APM tools provide code-level profiling that shows which methods consume time, which database queries run slowly, and which external calls fail. Granular visibility transforms "the app is slow" into "the getUserPreferences() method takes 3 seconds due to an N+1 query pattern."
Monitoring in breadth
Breadth covers all services, dependencies, and infrastructure that could impact users.
- All service endpoints: Admin interfaces, internal APIs, not just critical paths
- Third-party dependencies: Payment processors, email services, content delivery networks, authentication providers
- Infrastructure health: Container restarts, memory pressure, disk I/O, network latency
- User experience metrics: Page load times, JavaScript errors, mobile app crashes
- Business metrics: Conversion rates, cart values, API usage against quotas
Combining depth and breadth with ML-powered insights
The intersection of depth and breadth reveals complex issues. A memory leak in one service (depth) might only manifest under specific traffic patterns across multiple services (breadth).
SolarWinds and similar platforms leverage ML to correlate patterns across both dimensions. The ML models learn your application's normal behavior, then detect anomalies that human-defined thresholds would miss, such as:
- Gradual performance degradation over days.
- Unusual correlation between seemingly unrelated services.
- Traffic pattern changes that precede failures.
- Resource consumption trending toward limits.
Practical implementation
Start with breadth to detect problems quickly, then use depth to diagnose root causes. A typical investigation flow could be:
1. Alert triggers: Overall error rate exceeds baseline (breadth signal).
2. Service identification: Dashboard shows payment service errors spiking (breadth narrowing).
3. Trace analysis: Distributed trace reveals a timeout calling the fraud detection API (depth investigation).
4. Code profiling: Method-level timing shows the new fraud rule taking five times longer (depth diagnosis).
5. Resolution: The rule is optimized or the timeout is increased (targeted fix).
Without breadth, you miss the alert; without depth, you can't fix it. Together, they transform monitoring from reactive firefighting to proactive problem-solving.
RED metrics give you the leading indicators. Depth and breadth monitoring reveals whether issues are isolated to a single service or cascading across your entire ecosystem. Combined, they provide the complete context needed to resolve issues before they impact users on a large scale.
Final thoughts
Sound APM prevents outages rather than simply detecting them. The practices in this guide help transform reactive firefighting into proactive problem-solving.
AI-powered platforms enhance fundamentals by detecting anomalies that static thresholds miss. However, tools without methodology lead to alert fatigue and missed incidents. Implement RED metrics, add trace context, and configure intelligent sampling. Layer in advanced capabilities as your observability matures.
The goal is to identify issues, determine their root causes, and resolve problems before customers can find them. When done right, APM turns midnight emergencies into business-hour fixes and angry customers into prevented incidents.