A Best Practice Guide to Monitoring and Observability
Introduction
Effective database monitoring delivers several key benefits:
- Performance optimization: Retrieving performance insights allows data professionals to optimize their databases, ultimately resulting in a better-functioning application and improved user experience.
- Fast troubleshooting: Well-monitored databases facilitate easier and quicker root cause analysis (RCA) to find poorly functioning system components.
- Effortless scalability: Insights into resource allocation and usage allow engineers to understand how to scale database infrastructure without unforeseen complications.
On the other hand, a lack of efficient database monitoring can lead to slow query response times, system downtime, inefficient resource usage, and scaling issues. These consequences can lead to a poor user experience and negatively affect business, productivity, and even revenue.
This article examines database monitoring in detail, covering best practices for database infrastructure monitoring, the four “golden signals” for effective database monitoring, and how to use database monitoring to drive performance improvements.
Summary of database monitoring best practices
| Best practice | Description |
|---|---|
| Use the right metrics to monitor database performance | Track key indicators such as the four golden signals (latency, traffic, errors, and saturation) to monitor database performance. |
| Ensure monitoring of both database health and availability | Track the general condition of a database system, e.g., deadlocks, downtime, and connectivity. |
| Implement non-metric database monitoring | Monitor non-metric elements of a database workload, such as changes to database schema objects. |
| Perform query analysis to identify bottlenecks | Monitor the efficiency of individual queries to improve overall performance. |
| Analyze trends to establish baselines and improve performance over time | Use historical data to establish performance baselines and benchmarks, investigate database performance over time, identify patterns, and support long-term planning and scalability. |
| Alert intelligently | Set acceptable and critical metric thresholds that alert administrators when breached. |
| Adopt a continuous improvement approach to database monitoring | Monitoring and alerting should not be static; after every major incident or service outage, run a post-mortem to incorporate new learnings into the monitoring system. |
Use the right metrics to monitor database performance
Metrics are the key indicators used to examine the internal state of a system. They are essential to implementing observability measures into an application’s infrastructure, as they provide us with functional data necessary for performance monitoring and improving performance. Engineering teams use this data to quickly identify bottlenecks and potential performance issues within their databases.
The specific metrics monitored for a given database system can vary depending on business requirements. However, the four golden signals of monitoring distributed systems, as defined by Google's Site Reliability Engineering (SRE) organization, are staples of database monitoring that are essential to ensuring observability and reliable performance. The four golden signals (latency, traffic, errors, and saturation) provide engineers with a robust understanding of overall database system health and performance.

The four golden signals of monitoring (source)
These metrics are most valuable for providing a quick yet comprehensive overview of database performance. If these metrics show a problem, then a further root cause analysis is typically required.
Latency
Latency is the total amount of time required to complete a database operation. This includes:
- Query execution time
- Connection establishment time
- Any additional waiting periods
High-latency periods typically indicate inefficient queries, resource overutilization, or network issues. How you measure latency matters. For example, suppose we measure latency across 1,000 database queries: 990 execute in 5 ms each, while 10 take 1,000 ms each. The average latency works out to approximately 15 ms.
While the average seems low, the latency distribution in reality is skewed, as some queries are taking longer and potentially disrupting the user experience. For this reason, using percentiles to describe latencies provides users with a clearer picture of performance.
Organizations usually set performance objectives in terms of percentile objectives, also known as Service Level Objectives (SLO). A sample SLO could be that the 99th percentile of queries should meet a 10ms latency threshold, implying that only 1% of users would experience query response times slower than that.
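The skew in this example is easy to demonstrate with a short calculation. The sketch below (plain Python, using the hypothetical 1,000-query sample from above) contrasts the mean with nearest-rank percentiles:

```python
import math

# Hypothetical sample from the example above: 990 queries at 5 ms, 10 at 1,000 ms
latencies_ms = [5.0] * 990 + [1000.0] * 10

def percentile(values, p):
    """Nearest-rank percentile: the smallest value covering p% of observations."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-indexed rank
    return ordered[rank - 1]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                          # 14.95 -- the "healthy" average
print(percentile(latencies_ms, 99))     # 5.0 -- 99% of queries are fast
print(percentile(latencies_ms, 99.9))   # 1000.0 -- the tail the average hides
```

Note that the p99 here would still meet a 10 ms SLO even though 1% of queries take a full second, which is why teams often track several percentiles (p95, p99, p99.9) rather than a single number.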
Traffic
Traffic refers to the volume of activity a database handles. Monitoring traffic enables engineers to identify when and which database components require additional management and scaling. Databases not optimized to handle high traffic levels can experience rapid performance degradation, particularly when the number of users increases unexpectedly. Traffic is typically measured using transactions per second (TPS) or queries per second (QPS), also known as throughput.
Typically, enterprises design their database systems to support a maximum level of traffic by defining the amount of available server resources (such as RAM, number of CPUs, the speed of the I/O subsystem, etc.) and through the architecture of the database itself. The process of determining what level of traffic a given database system can efficiently support is known as 'benchmarking.' Once the available server resources and architecture are set, they are seldom changed unless a performance problem arises.
While monitoring traffic on a database, the "normal" day-to-day range of TPS/QPS is known as the baseline, say 500 QPS. The least busy and most busy times of day are known as the low "watermark" and the high "watermark", respectively. So, on a database application where a normal workload averages around 500 QPS, it might briefly peak at a high watermark of 1500 QPS, then drop to only 50 QPS during the off-hours.
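Deriving the baseline and watermarks from collected throughput samples can be sketched in a few lines; the QPS values below are hypothetical:

```python
import statistics

# Hypothetical QPS samples collected over one day
qps_samples = [480, 510, 495, 1500, 505, 50, 490, 520]

baseline = statistics.median(qps_samples)   # typical day-to-day load
high_watermark = max(qps_samples)           # busiest observed moment
low_watermark = min(qps_samples)            # quietest observed moment

print(baseline, high_watermark, low_watermark)   # 500.0 1500 50
```

Using the median rather than the mean keeps the baseline from being pulled upward by brief traffic spikes.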
Errors
Errors measure the number of failed activities incurred during database operations and can provide critical insights about an application’s health. The error might occur inside the database or indicate an error condition outside the database that prevents it from further processing an operation, such as a DNS error. Errors within a database typically include the following failures:
- Failed queries
- Connection failures
- Inadequate security credentials
- Timeouts
- Unexpected responses
Every database application can expect errors as part of routine operations. However, the absolute number of errors, the error severity, and the error rate (number of errors per second) can help engineers better understand the health of their database. Sometimes, response times and throughput appear healthy, but errors could be high due to a serious database problem like deadlocks. In these situations, errors allow the engineers to better understand where databases are functioning poorly, what transactions are involved, and where to start remediation.
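Error rate is straightforward to derive from raw counters. A minimal sketch, with hypothetical counter values:

```python
# Hypothetical counters scraped from a database over a 60-second window
window_seconds = 60
total_operations = 12_000
failed_operations = 18

error_rate_per_sec = failed_operations / window_seconds          # errors/second
error_percentage = failed_operations / total_operations * 100    # % of all operations

print(error_rate_per_sec)   # 0.3
print(error_percentage)     # 0.15
```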
Saturation
Saturation defines how the resources of a database system (CPU utilization, memory usage, disk I/O) are utilized. The greater the saturation, the more database resources are consumed at any given moment. Database performance slows down when saturation limits are reached.
Understanding saturation levels can help engineers adjust resource allocation as needed and prepare for scaling when databases grow in traffic and complexity. At a minimum, engineers need to monitor CPU, Memory, IOPs, and Network utilization. CPU utilization is typically one of the most monitored saturation metrics. It is measured in percentages, and the optimal CPU utilization for most systems is typically 35-70%. However, ideal CPU utilization depends on the type of workload, hardware, and latency requirements. For example, high-throughput applications that are latency sensitive might have a CPU saturation of 70-90% under a normal workload. For these types of applications, it’s important to ensure sufficient unutilized capacity (also known as “headroom”) to handle peak usage and high traffic. Performance degradation and slow response times should be expected whenever any of the key resources, such as CPU, Memory, IOPs, or Network, exceed ideal limits for prolonged periods.
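A simple headroom check can flag resources running past their ceilings. The readings and per-resource ceilings below are illustrative assumptions, not universal limits:

```python
# Hypothetical saturation readings (percent utilized)
readings = {"cpu": 82.0, "memory": 64.0, "iops": 40.0, "network": 55.0}

# Assumed per-resource ceilings; tune these for your workload and latency goals
ceilings = {"cpu": 70.0, "memory": 85.0, "iops": 90.0, "network": 80.0}

def saturated_resources(readings, ceilings):
    """Return the resources with no remaining headroom (utilization above ceiling)."""
    return [name for name, used in readings.items() if used > ceilings[name]]

print(saturated_resources(readings, ceilings))   # ['cpu']
```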
Database monitoring tools
Manual collection of metric data can be a tedious and mechanical process. Database administrators typically set up monitoring tools within their database systems to automate the collection and analysis of metric data. For instance, SolarWinds® Database Performance Analyzer (DPA) goes one step beyond collecting, correlating, and displaying metrics to visualize performance anomalies and identify tuning opportunities.

Ensure monitoring of both database health and availability
Health and availability monitoring primarily involves assessing the integrity of a database system and evaluating its operational reliability. Monitoring these aspects of system performance helps engineers make the databases they manage more reliable, available, and fault-tolerant. If the database application should ever crash or otherwise become unavailable, monitoring ensures that the engineers can respond quickly to remediate the issue.
Database health monitoring
Health monitoring focuses on a database's ability to function normally and operate as expected. Observing the health of a database enables engineers to prevent outages, downtime, and errors before they occur by identifying warning signs. When monitoring is coupled with effective alerting, many IT organizations are able to detect, diagnose, and resolve issues before end-users ever notice.
While there are many important aspects to database health, two key areas of database health monitoring include:
- Deadlocks, which occur when multiple transactions compete for resources held by each other, causing all transactions in the deadlock to wait indefinitely for another transaction to release its locks.
- Background processes, which vary greatly across database platforms, are needed by the databases to maintain normal operations. Examples include the autovacuum process in PostgreSQL, which automates disk space recovery and keeps planner statistics up to date, and regularly scheduled DBCC CHECKDB runs in Microsoft SQL Server, which verify the logical and physical integrity of database objects.
For example, a sudden rise in deadlocks might imply that the newest release of the application has poorly designed SQL transactions or that a new table is improperly indexed. The database engine background processes usually perform deadlock detection. For example, SQL Server utilizes a lock monitor thread to periodically analyze locking and identify deadlocks. The time interval between searches is auto-calculated by the engine, without the need for engineers to manually intervene.
Similarly, ensuring that vacuuming processes operate as expected is an important aspect of database health monitoring. PostgreSQL engineers typically monitor VACUUM metrics when autovacuuming is configured. To verify that autovacuum runs on time and completes correctly, they track metrics such as dead rows (rows marked for deletion but not yet purged from the database), table disk usage (the amount of space a table uses; unexpectedly high numbers can signify vacuuming issues), and the time of the last vacuum run.
Modern monitoring solutions such as SolarWinds Observability SaaS can help automate the health monitoring process and immediately alert users when database health checks fail. The observability tool also provides a root cause assist feature to provide insights into why a health degradation occurred, coupled with an AI-powered summary of underlying issues and potential solutions.

Database availability monitoring
In database systems, availability essentially means redundancy: there is no single point of failure in the system. In business-critical database systems, it's not unusual to have a primary database server with one or more secondary database servers available for immediate failover should the primary ever crash. Availability should not be confused with recoverability, which ensures the system can be restored from backups after a database crash but does not by itself guarantee uptime. While recovery from backups is usually measured in hours, highly available database systems can recover from a crash in seconds to minutes. Key availability metrics include:
- Uptime: A percentage-based statistic that measures the accessibility of a database over a specified period. Some industry-based standards of yearly uptime (the nines of availability) are as follows:
| Availability | Downtime (per year) |
|---|---|
| 99% (2 nines) | 3.65 days |
| 99.9% (3 nines) | 8.76 hours |
| 99.99% (4 nines) | 52.56 minutes |
| 99.999% (5 nines) | 5.26 minutes |
- TCP Port availability: Whether specific database-related ports are available and open to connection (3306 for MySQL, 5432 for PostgreSQL, 1433 for SQL Server).
- Synchronization State: Indicating the synchronization state of the secondary replicas, which can impact the availability and performance of the database.
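The downtime figures in the uptime table above follow directly from the availability percentage; here is the arithmetic as a small Python sketch:

```python
# Downtime per year implied by an availability percentage
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 (non-leap year)

def downtime_minutes_per_year(availability_pct):
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

print(round(downtime_minutes_per_year(99.99), 2))    # 52.56 minutes (4 nines)
print(round(downtime_minutes_per_year(99.999), 2))   # 5.26 minutes (5 nines)
```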
Implement non-metric database monitoring
While monitoring metrics tends to be the most straightforward method of quantifying database performance and efficiency, integrating non-metric monitoring also provides data professionals with valuable insights into their database systems. Non-metric monitoring focuses mainly on qualitative changes and events rather than numerical figures. These monitoring practices can be beneficial when conducting RCA.
For instance, if a recent schema change is found to cause an unexpected shift in production behavior, being able to quickly locate the engineer who made this change, using a schema tracking mechanism, usually provides enough context to investigate further almost immediately. The following sections explore two key areas of non-metric database monitoring.
Schema change observability
Schemas define the architecture (tables and indexes, for example), logic (such as stored procedures and SQL queries), and constraints (like the primary keys and foreign keys on the tables) of a database system. Monitoring Data Definition Language (DDL) SQL transactions made to these components is vital to the integrity of a database, as it gives administrators better visibility into the respective downstream impacts. DDL transactions in SQL are the CREATE, ALTER, and DROP statements. For instance, if an engineer changes the name of a column in a table and causes an unexpected production failure, senior management can quickly pinpoint the source and reasoning behind the change, but only when schema change observability is integrated into their application. Let's look at a few examples of what you can track when monitoring database schemas:
- Table, index, or column creation, deletion, or modification (statements using CREATE, ALTER, and DROP)
- Addition, deletion, or modification of indexes
- Modification of views or stored procedures
- Modifications made to primary or foreign keys
There are several methods for integrating DDL tracking into a database system. For instance, SQL Server allows users to create DDL triggers that are fired when certain events occur.
Let’s assume you want to create an audit trail for all changes made to stored procedures in your database. As an example, you can first create the following database table to record all actions taken on pre-existing stored procedures:
CREATE TABLE dbo.DDLEvents
(
EventDate DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
EventType NVARCHAR(64),
EventDDL NVARCHAR(MAX),
EventXML XML,
DatabaseName NVARCHAR(255),
SchemaName NVARCHAR(255),
ObjectName NVARCHAR(255),
HostName VARCHAR(64),
IPAddress VARCHAR(48),
ProgramName NVARCHAR(255),
LoginName NVARCHAR(255)
);

CREATE TRIGGER DDLTrigger_Sample
ON DATABASE
FOR CREATE_PROCEDURE, ALTER_PROCEDURE, DROP_PROCEDURE
AS
BEGIN
SET NOCOUNT ON;
DECLARE
@EventData XML = EVENTDATA();
DECLARE @ip varchar(48) = CONVERT(varchar(48),
CONNECTIONPROPERTY('client_net_address'));
INSERT dbo.DDLEvents
(
EventType,
EventDDL,
EventXML,
DatabaseName,
SchemaName,
ObjectName,
HostName,
IPAddress,
ProgramName,
LoginName
)
SELECT
@EventData.value('(/EVENT_INSTANCE/EventType)[1]', 'NVARCHAR(100)'),
@EventData.value('(/EVENT_INSTANCE/TSQLCommand)[1]', 'NVARCHAR(MAX)'),
@EventData,
DB_NAME(),
@EventData.value('(/EVENT_INSTANCE/SchemaName)[1]', 'NVARCHAR(255)'),
@EventData.value('(/EVENT_INSTANCE/ObjectName)[1]', 'NVARCHAR(255)'),
HOST_NAME(),
@ip,
PROGRAM_NAME(),
SUSER_SNAME();
END

Backup and replication monitoring
Like any other critical component of an application performance, databases need to be backed up regularly in case of data loss when unforeseen events like hardware failures or malicious attacks occur. The IT industry, broadly speaking, refers to this discipline as “disaster recovery.” Engineering teams usually set up automated backup jobs to ensure data is still recoverable in the event of production data becoming corrupted or unavailable.
How often these backup jobs run depends on the activity level of the database and the risk of data loss. Backup monitoring mainly involves tracking the status of these backup jobs by looking at factors such as backup success or failure, backup completion percentage, backup frequency, and backup error messages. Service-level agreements (SLAs) and service-level objectives (SLOs) are goals set between users and data teams to establish expected levels of performance and availability. SLAs are contracts between customers and service providers that define what service levels are expected, and what the consequences of not meeting these service levels are (penalties or payment negotiations). SLOs are the specific target levels that engineering teams must meet to comply with SLAs. For example, one of the SLOs of an SLA could be 99.9% database uptime. Efficient backup processes allow database teams to meet SLO requirements more easily and ultimately prevent revenue losses.
Similarly, recovery time objectives (RTOs) and recovery point objectives (RPOs) set expectations between customer and database teams about backup and recovery implications. An RTO sets the maximum amount of time between a service disruption of a database and the time at which it is satisfactorily operational again. The lesser the time value of an RTO, the higher the recovery cost usually is. RPOs, on the other hand, specify the amount of time between when data is backed up and when it is needed again, following an outage. Again, a lower RPO implies a higher cost due to the necessity of high-frequency backups. RPOs and RTOs should be set while keeping in mind both costs and the reasonableness of backup recovery. Ultimately, accurate backup monitoring allows database teams to more easily comply with SLAs, SLOs, RTOs, and RPOs.
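Checking a backup schedule against an RPO reduces to measuring the largest gap between consecutive backups. A minimal sketch with hypothetical timestamps and an assumed 15-minute RPO:

```python
from datetime import datetime, timedelta

# Hypothetical RPO and backup completion times
rpo = timedelta(minutes=15)
backup_times = [
    datetime(2024, 6, 1, 10, 0),
    datetime(2024, 6, 1, 10, 10),
    datetime(2024, 6, 1, 10, 30),   # 20-minute gap -- violates the RPO
]

def max_data_loss_window(backup_times):
    """Worst-case data loss if a failure occurs just before the next backup completes."""
    gaps = [b - a for a, b in zip(backup_times, backup_times[1:])]
    return max(gaps)

worst = max_data_loss_window(backup_times)
print(worst > rpo)   # True -- this schedule cannot meet the 15-minute RPO
```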
Large organizations sometimes maintain multiple, synchronized copies of a database server to be accessible in production for various reasons, such as replacement in case of a failure and load balancing, such as moving the processing of big, monthly reports from the primary server to a secondary server. This is called database replication. Replication is a form of high availability (databases designed to have minimal or no interruptions), although it is slower and cheaper than maintaining multiple primary and secondary database replicas. The key aspect of replication to be monitored is whether database replicas are synchronized with the primary server. Monitoring systems would typically watch for replication lag (the period of delay between the time a transaction is committed on the primary and when it is synchronized on the replica server), replication throughput, replication error rates, and error messages.
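Replication lag itself is just the delay between a commit on the primary and its application on the replica. A small illustrative sketch (the timestamps and threshold are assumptions):

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for the same transaction on primary and replica
committed_on_primary = datetime(2024, 6, 1, 12, 0, 0)
applied_on_replica = datetime(2024, 6, 1, 12, 0, 4)

replication_lag = applied_on_replica - committed_on_primary

# Assumed alert threshold; acceptable lag depends on the workload
lag_threshold = timedelta(seconds=2)
print(replication_lag > lag_threshold)   # True -- the replica is falling behind
```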
Perform query analysis to identify bottlenecks
While query execution time is a latency metric we monitor, query analysis goes deeper to understand why queries are inefficient and what is needed to optimize them. Queries constitute a significant part of a database system’s total workload (usually 80% or more), as they are the building blocks of adding, updating, and removing data. Large applications that consume terabytes of data can benefit from even marginal improvements in query performance. Looking at a basic example, let’s assume an e-commerce organization uses the following SQL Server query to find the total sales for orders placed in 2024, in a database containing millions of rows of data:
SELECT SUM(OrderTotal) AS TotalSales
FROM Orders
WHERE YEAR(OrderDate) = 2024;

Assuming a nonclustered index exists on OrderDate, the above query prevents the database engine from using it: applying the YEAR() function to the column makes the predicate non-sargable, forcing a full table scan instead of the desired index seek.
On the other hand, if the organization framed the query as:
SELECT SUM(OrderTotal) AS TotalSales
FROM Orders
WHERE OrderDate >= '2024-01-01'
AND OrderDate < '2025-01-01';
The presence of search arguments, often called “sargs”, enables the engine to use the index on OrderDate and retrieve the required data much faster, thereby significantly reducing response time. In this situation, a query analysis of the above script would immediately reveal that it is performing inefficiently.
Your query performance can be evaluated by metrics similar to general performance metrics, such as query execution time, CPU utilization, and query error rates. These database monitoring metrics are collected during execution.
Query execution plans
The first step to improving query performance is understanding how the query currently executes. Because relational databases use a cost-based query optimizer, engineers cannot tell from a query's text alone how the engine will run it; they need to inspect the plan the optimizer actually chooses. To facilitate this, database engines offer execution plans, which lay out a query's execution roadmap. Using our previously mentioned inefficient query to compute total sales, let's look at how to view an execution plan with PostgreSQL's EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT SUM(order_total) AS total_sales
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2024;

This produces output similar to the following:
Seq Scan on orders (cost=0.00..4589.35 rows=200000 width=8)
(actual time=0.015..92.634 rows=200000 loops=1)
Filter: (date_part('year'::text, order_date) = '2024'::double precision)
Rows Removed by Filter: 800000
Planning Time: 0.120 ms
Execution Time: 94.218 ms
Immediately, we notice that the engine conducts a sequential scan (Seq Scan) of all rows. Execution plans like these can provide engineers with a quick and general overview of their queries, encouraging them to consider optimization strategies. However, PostgreSQL does not differentiate between scans and seeks in its execution plans.
Optimizing queries
The manual analysis and optimization of queries can become tedious and confusing, especially for organizations with numerous databases and large volumes of data. Monitoring tools reduce the burden on engineers to scrutinize their scripts and individual queries by automatically collecting data, comparing metrics, and providing recommendations. For example, SolarWinds Database Performance Analyzer provides users with an AI Query Assist (AIQA) feature, which helps you rewrite suboptimal queries in a matter of seconds using generative AI (GenAI), keeping in mind both performance and readability:

While such tools significantly simplify the query optimization process, it’s good, as a database engineer, to know a few best practices involved when it comes to improving query performance:
- Indexing opportunities: When strategically implemented, indexes can drastically speed up query response times. However, they come with a storage overhead, and the cost-benefit tradeoff should be considered.
- Caching mechanisms: Query-result caching and database-level caching can reduce repeated query executions for frequently accessed data.
- Materialized views: Stored view results that can reduce response times for read-heavy queries
- Query management: Periodic refactoring of queries to remove inefficient practices such as SELECT *, JOINs on large tables, etc.
Analyze trends to establish baselines and improve performance over time
Trend analysis goes one step further, collecting and analyzing metric data over time to give database administrators a deeper understanding of how their databases are evolving in terms of performance and stability. Trend analysis is typically a long-term monitoring practice, rather than a quick examination of metrics to resolve a bug or production issue. For example, let's assume you manage the database for an online store.
A price-retrieval feature has begun performing slowly over the last six months. If you have been collecting trend data, you can examine how response times have changed for all queries involved over that period and then investigate further to determine the cause of the deterioration. You could also monitor recurring patterns (e.g., CPU spikes), retroactively correlate database structural changes with performance, and predict future database behavior and resource requirements. Trend analysis can be enabled simply by ensuring that the performance and health metrics you monitor are retained instead of discarded.
How to analyze trends
While different organizations have different trend analysis needs depending on their database sizes, business requirements, and application nature, analyzing trends involves a few common steps:
- Collect metric data and store it in a database for chronological analysis. Typically, a minimum of three months of performance data is needed for a normal trend analysis. More data offers better long-term trending, but the amount of storage needed to store all of that data should be closely watched. Query Store is a popular SQL Server tool that stores query performance data over time, allowing engineers to analyze trends conveniently.
- Set baselines and automate alerts when metrics start to deviate from historical averages.
- Periodically view and investigate metric trends to identify any severe or concerning changes. Compare this data week by week, month by month, and so on, to see how performance has evolved in both the short term and long term. In general, look for metrics that significantly deviate from the norm.
- Visualize trends using UI tools, such as Grafana or PowerBI, to help easily identify spikes or unusual patterns in trends.
- Closely monitor data when database structure changes are made (e.g., schema changes) to see if metrics react differently, in a historical context.
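The baseline-deviation step above can be as simple as a z-score check against recent history. A sketch with hypothetical weekly p95 latency data:

```python
import statistics

# Hypothetical weekly p95 latency history (ms) and this week's reading
history_ms = [102, 98, 105, 101, 99, 103, 100]
current_ms = 126

baseline = statistics.mean(history_ms)
stdev = statistics.stdev(history_ms)

# Flag readings more than 3 standard deviations above the historical baseline
z_score = (current_ms - baseline) / stdev
print(z_score > 3)   # deviation flagged for investigation
```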
Predictive modelling
Retained trend data also supports predictive modelling, for example:
- Using historical growth data to understand when to scale up CPU resources
- Predicting query response times when introducing new features
- Predicting how an increase or decrease in database downtime will affect yearly user traffic
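As a toy illustration of the first point, a naive linear projection of resource growth (all numbers hypothetical):

```python
# Hypothetical monthly database size samples (GB); steady 10 GB/month growth
sizes_gb = [100, 110, 120, 130, 140, 150]

monthly_growth = (sizes_gb[-1] - sizes_gb[0]) / (len(sizes_gb) - 1)

def projected_size(months_ahead):
    """Naive linear forecast; real capacity planning should fit trends more carefully."""
    return sizes_gb[-1] + monthly_growth * months_ahead

print(projected_size(6))   # 210.0 -- helps decide when to plan the next upgrade
```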

Alert intelligently
Alerting is arguably the most useful aspect of monitoring databases, as it forms the key link between the detection of an issue and the database teams proactively firefighting or remediating production issues. Alerting releases pressure on engineers to constantly monitor database performance and availability while they work on other tasks. Alerts are typically set up to notify teams when metrics exceed threshold levels or exhibit other anomalous behavior. The benefits of having a comprehensive alerting system for your database systems are:
- Address issues and bugs before they impact production environments
- Maintain expected performance levels
- Keep downtime to a minimum
- Meet service level agreement (SLA) objectives
- Decrease the average mean time to resolution (MTTR)
Typical examples of threshold-based alerts include:
- CPU Utilization: Send an alert when CPU utilization exceeds 80% for over 4 minutes
- Query Response times: Send an alert when the response time of a critical query exceeds 2 times the historical average
- Traffic: Send an alert when traffic spikes 50% over the normal pattern for a particular period of the day or week
- Availability: Send a critical alert when a database becomes unresponsive for more than 3 seconds
For example, let’s assume an e-commerce company has an alerting mechanism set up within its database monitoring system. During holiday periods, they want to ensure their checkout databases are reliable and that they are informed immediately of any potential issues, to avoid any revenue loss due to purchase failures. The database team decides to set up alerts for all queries related to business-critical functions, such as checkouts, purchases, user authentications, and product detail retrievals.
To avoid false positives and alert fatigue, non-relevant reporting and analytical queries are not configured to deliver these specialized alerts. These alerts are also only required during holiday periods (such as Christmas and Easter) and are enabled two weeks before the official holidays, on weekday evenings and weekends. To do this, engineers create multi-level threshold query latency alerts using the following reasoning:
| Incident | Alert |
|---|---|
| Average query latency exceeds 200 ms for more than 3 minutes | Send a warning alert by email and Slack |
| Average query latency exceeds 400 ms for more than 1 minute | Send a warning alert by email and SMS, and create an incident ticket |
| Average query latency exceeds 600 ms for more than 30 seconds | Place an automated phone call to the on-call DBA/senior engineers to resolve the wait times immediately |
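The escalation logic above can be sketched as a small function; the thresholds mirror the example, while the return labels are illustrative assumptions:

```python
# Multi-level alert escalation based on average latency and how long it is sustained
def alert_level(avg_latency_ms, sustained_seconds):
    if avg_latency_ms > 600 and sustained_seconds > 30:
        return "page-oncall"   # automated phone call to the on-call DBA
    if avg_latency_ms > 400 and sustained_seconds > 60:
        return "ticket"        # email, SMS, and incident ticket
    if avg_latency_ms > 200 and sustained_seconds > 180:
        return "warning"       # email and Slack
    return "ok"

print(alert_level(250, 200))   # warning
print(alert_level(650, 45))    # page-oncall
```

Ordering the checks from most to least severe ensures a sustained 650 ms spike pages the on-call engineer rather than merely opening a ticket.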

Adopt a continuous improvement approach to database monitoring

A continuous improvement approach rests on a few key elements:
- Clear baselines: Normal performance, availability, and health levels should be defined and reviewed after major database changes and improvements.
- Documentation: Documenting monitoring practices, metric baselines, and trend data establishes guidelines for continuous improvement and helps new engineers get up to speed smoothly.
- Post-mortem analysis: Major incidents and outages can discourage database teams, but should be seen as learning opportunities to improve monitoring processes. A root cause analysis of metrics, alerts, and logs can enable alerts to detect or even prevent the same issue from occurring in the future, as well as point to where the monitoring process failed and how it can be improved.
- Alerting: Just like metrics, alerts should be reviewed often. Performance impacts can change in terms of how severe they are considered over time, and alerts need to be adjusted accordingly to reflect the perceived urgency. Multi-threshold alerting can also be very useful with critical processes.
- Automation: Focus on automating most monitoring practices. Automating performance and health checks is quicker for identifying underlying issues and anomalies, but also eliminates the potential for human error, which can potentially have business-critical impacts. In addition, many alerts enable automated responses. For example, when a user process causes a deadlock, an alert might simply cancel the deadlocked transaction and automatically restart it without the user even knowing an automated remediation was invoked.