A Best Practice Guide to Monitoring and Observability
Introduction
Effective database monitoring delivers several key benefits:
- Performance optimization: Retrieving performance insights allows data professionals to optimize their databases, ultimately resulting in a better-functioning application and improved user experience.
- Fast troubleshooting: Well-monitored databases facilitate easier and quicker root cause analysis (RCA) to find poorly functioning system components.
- Effortless scalability: Insights into resource allocation and usage allow engineers to understand how to scale database infrastructure without unforeseen complications.
On the other hand, a lack of efficient database monitoring can lead to slow query response times, system downtime, inefficient resource usage, and scaling issues. These consequences can lead to a poor user experience and negatively affect business, productivity, and even revenue.
This article examines database monitoring in detail, covering best practices for database infrastructure monitoring, the four “golden signals” for effective database monitoring, and how to use database monitoring to drive performance improvements.
Summary of database monitoring best practices
| Best practice | Description |
|---|---|
| Use the right metrics to monitor database performance | Track key indicators such as the four golden signals (latency, traffic, errors, and saturation) to monitor database performance. |
| Ensure monitoring of both database health and availability | Track the general condition of a database system, e.g., deadlocks, downtime, and connectivity. |
| Implement non-metric database monitoring | Monitor non-metric elements of a database workload, such as changes to database schema objects. |
| Perform query analysis to identify bottlenecks | Monitor the efficiency of individual queries to improve overall performance. |
| Analyze trends to establish baselines and improve performance over time | Use historical data to establish performance baselines and benchmarks, investigate database performance over time, identify patterns, and support long-term planning and scalability. |
| Alert intelligently | Set acceptable and critical metric thresholds that alert administrators when breached. |
| Adopt a continuous improvement approach to database monitoring | Monitoring and alerting should not be static; after every major incident or service outage, run a post-mortem to incorporate new learnings into the monitoring system. |
Use the right metrics to monitor database performance
Metrics are the key indicators used to examine the internal state of a system. They are essential to implementing observability measures into an application’s infrastructure, as they provide us with functional data necessary for performance monitoring and improving performance. Engineering teams use this data to quickly identify bottlenecks and potential performance issues within their databases.
The specific metrics monitored for a given database system can vary depending on business requirements. However, the four golden signals of monitoring distributed systems, as defined by Google's Site Reliability Engineering (SRE) organization, are staples of database monitoring that are essential to ensuring observability and reliable performance. The four golden signals (latency, traffic, errors, and saturation) provide engineers with a robust understanding of overall database system health and performance.

The four golden signals of monitoring (source)
These metrics are most valuable for providing a quick yet comprehensive overview of database performance. If these metrics show a problem, then a further root cause analysis is typically required.
Latency
Latency is the total amount of time required to complete a database operation. This includes:
- Query execution time
- Connection establishment time
- Any additional waiting periods
High-latency periods typically indicate inefficient queries, resource overutilization, or network issues. How you measure latency matters. For example, suppose we measure latency across 1,000 database queries: 990 execute in 5 ms each, while 10 take 1,000 ms each. The average latency works out to approximately 15 ms.
While the average seems low, the latency distribution in reality is skewed, as some queries are taking longer and potentially disrupting the user experience. For this reason, using percentiles to describe latencies provides users with a clearer picture of performance.
Organizations usually set performance objectives in terms of percentile objectives, also known as Service Level Objectives (SLO). A sample SLO could be that the 99th percentile of queries should meet a 10ms latency threshold, implying that only 1% of users would experience query response times slower than that.
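The skew in this example is easy to demonstrate with a short calculation. The sketch below (plain Python, using the hypothetical 1,000-query sample from above) contrasts the mean with nearest-rank percentiles:

```python
import math

# Hypothetical sample from the example above: 990 queries at 5 ms, 10 at 1,000 ms
latencies_ms = [5.0] * 990 + [1000.0] * 10

def percentile(values, p):
    """Nearest-rank percentile: the smallest value covering p% of observations."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-indexed rank
    return ordered[rank - 1]

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                          # 14.95 -- the "healthy" average
print(percentile(latencies_ms, 99))     # 5.0 -- 99% of queries are fast
print(percentile(latencies_ms, 99.9))   # 1000.0 -- the tail the average hides
```

Note that the p99 here would still meet a 10 ms SLO even though 1% of queries take a full second, which is why teams often track several percentiles (p95, p99, p99.9) rather than a single number.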
Traffic
Traffic refers to the volume of activity a database handles. Monitoring traffic enables engineers to identify when and which database components require additional management and scaling. Databases not optimized to handle high traffic levels can experience rapid performance degradation, particularly when the number of users increases unexpectedly. Traffic is typically measured using transactions per second (TPS) or queries per second (QPS), also known as throughput.
Typically, enterprises design their database systems to support a maximum level of traffic by defining the amount of available server resources (such as RAM, number of CPUs, the speed of the I/O subsystem, etc.) and through the architecture of the database itself. The process of determining what level of traffic a given database system can efficiently support is known as 'benchmarking.' Once the available server resources and architecture are set, they are seldom changed unless a performance problem arises.
While monitoring traffic on a database, the "normal" day-to-day range of TPS/QPS is known as the baseline, say 500 QPS. The least busy and most busy times of day are known as the low "watermark" and the high "watermark", respectively. So, on a database application where a normal workload averages around 500 QPS, it might briefly peak at a high watermark of 1500 QPS, then drop to only 50 QPS during the off-hours.
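Deriving the baseline and watermarks from collected throughput samples can be sketched in a few lines; the QPS values below are hypothetical:

```python
import statistics

# Hypothetical QPS samples collected over one day
qps_samples = [480, 510, 495, 1500, 505, 50, 490, 520]

baseline = statistics.median(qps_samples)   # typical day-to-day load
high_watermark = max(qps_samples)           # busiest observed moment
low_watermark = min(qps_samples)            # quietest observed moment

print(baseline, high_watermark, low_watermark)   # 500.0 1500 50
```

Using the median rather than the mean keeps the baseline from being pulled upward by brief traffic spikes.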
Errors
Errors measure the number of failed activities incurred during database operations and can provide critical insights about an application’s health. The error might occur inside the database or indicate an error condition outside the database that prevents it from further processing an operation, such as a DNS error. Errors within a database typically include the following failures:
- Failed queries
- Connection failures
- Inadequate security credentials
- Timeouts
- Unexpected responses
Every database application can expect errors as part of routine operations. However, the absolute number of errors, the error severity, and the error rate (number of errors per second) can help engineers better understand the health of their database. Sometimes, response times and throughput appear healthy, but errors could be high due to a serious database problem like deadlocks. In these situations, errors allow the engineers to better understand where databases are functioning poorly, what transactions are involved, and where to start remediation.
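Error rate is straightforward to derive from raw counters. A minimal sketch, with hypothetical counter values:

```python
# Hypothetical counters scraped from a database over a 60-second window
window_seconds = 60
total_operations = 12_000
failed_operations = 18

error_rate_per_sec = failed_operations / window_seconds          # errors/second
error_percentage = failed_operations / total_operations * 100    # % of all operations

print(error_rate_per_sec)   # 0.3
print(error_percentage)     # 0.15
```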
Saturation
Saturation defines how the resources of a database system (CPU utilization, memory usage, disk I/O) are utilized. The greater the saturation, the more database resources are consumed at any given moment. Database performance slows down when saturation limits are reached.
Understanding saturation levels can help engineers adjust resource allocation as needed and prepare for scaling when databases grow in traffic and complexity. At a minimum, engineers need to monitor CPU, Memory, IOPs, and Network utilization. CPU utilization is typically one of the most monitored saturation metrics. It is measured in percentages, and the optimal CPU utilization for most systems is typically 35-70%. However, ideal CPU utilization depends on the type of workload, hardware, and latency requirements. For example, high-throughput applications that are latency sensitive might have a CPU saturation of 70-90% under a normal workload. For these types of applications, it’s important to ensure sufficient unutilized capacity (also known as “headroom”) to handle peak usage and high traffic. Performance degradation and slow response times should be expected whenever any of the key resources, such as CPU, Memory, IOPs, or Network, exceed ideal limits for prolonged periods.
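A simple headroom check can flag resources running past their ceilings. The readings and per-resource ceilings below are illustrative assumptions, not universal limits:

```python
# Hypothetical saturation readings (percent utilized)
readings = {"cpu": 82.0, "memory": 64.0, "iops": 40.0, "network": 55.0}

# Assumed per-resource ceilings; tune these for your workload and latency goals
ceilings = {"cpu": 70.0, "memory": 85.0, "iops": 90.0, "network": 80.0}

def saturated_resources(readings, ceilings):
    """Return the resources with no remaining headroom (utilization above ceiling)."""
    return [name for name, used in readings.items() if used > ceilings[name]]

print(saturated_resources(readings, ceilings))   # ['cpu']
```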
Database monitoring tools
Manual collection of metric data can be a tedious and mechanical process. Database administrators typically set up monitoring tools within their database systems to automate the collection and analysis of metric data. For instance, SolarWinds® Database Performance Analyzer (DPA) goes one step beyond collecting, correlating, and displaying metrics to visualize performance anomalies and identify tuning opportunities.

Ensure monitoring of both database health and availability
Health and availability monitoring primarily involves assessing the integrity of a database system and evaluating its operational reliability. Monitoring these aspects of system performance helps engineers make the databases they manage more reliable, available, and fault-tolerant. If the database application should ever crash or otherwise become unavailable, monitoring ensures that the engineers can respond quickly to remediate the issue.
Database health monitoring
Health monitoring focuses on a database's ability to function normally and operate as expected. Observing the health of a database enables engineers to prevent outages, downtime, and errors before they occur by identifying warning signs. When monitoring is coupled with effective alerting, many IT organizations are able to detect, diagnose, and resolve issues before end-users ever notice.
While there are many important aspects to database health, two key areas of database health monitoring include:
- Deadlocks, which occur when multiple transactions compete for resources held by each other, causing all transactions in the deadlock to wait indefinitely for another transaction to release its locks.
- Background processes, which vary greatly across database platforms, are needed by the databases to maintain normal operations. Examples include the autovacuum process in PostgreSQL, which automates disk space recovery and keeps planner statistics up to date, and regularly scheduled DBCC CHECKDB runs in Microsoft SQL Server, which verify the logical and physical integrity of database objects.
For example, a sudden rise in deadlocks might imply that the newest release of the application has poorly designed SQL transactions or that a new table is improperly indexed. The database engine background processes usually perform deadlock detection. For example, SQL Server utilizes a lock monitor thread to periodically analyze locking and identify deadlocks. The time interval between searches is auto-calculated by the engine, without the need for engineers to manually intervene.
Similarly, ensuring that vacuuming processes operate as expected is an important aspect of database health monitoring. PostgreSQL engineers typically monitor VACUUM metrics when autovacuuming is configured. To verify that autovacuum runs on time and completes correctly, they track metrics such as dead rows (rows marked for deletion but not yet purged from the database), table disk usage (the amount of space a table uses; unexpectedly high numbers can signify vacuuming issues), and the time of the last vacuum run.
Modern monitoring solutions such as SolarWinds Observability SaaS can help automate the health monitoring process and immediately alert users when database health checks fail. The observability tool also provides a root cause assist feature to provide insights into why a health degradation occurred, coupled with an AI-powered summary of underlying issues and potential solutions.

Database availability monitoring
In database systems, availability essentially means redundancy: there is no single point of failure in the system. In business-critical database systems, it's not unusual to have a primary database server with one or more secondary database servers available for immediate failover should the primary ever crash. Availability should not be confused with recoverability, which ensures the system can be restored from backups after a database crash but does not by itself guarantee uptime. While recovery from backups is usually measured in hours, highly available database systems can recover from a crash in seconds to minutes. Key availability metrics include:
- Uptime: A percentage-based statistic that measures the accessibility of a database over a specified period. Some industry-based standards of yearly uptime (the nines of availability) are as follows:
| Availability | Downtime (per year) |
|---|---|
| 99% (2 nines) | 3.65 days |
| 99.9% (3 nines) | 8.76 hours |
| 99.99% (4 nines) | 52.56 minutes |
| 99.999% (5 nines) | 5.26 minutes |
- TCP Port availability: Whether specific database-related ports are available and open to connection (3306 for MySQL, 5432 for PostgreSQL, 1433 for SQL Server).
- Synchronization State: Indicating the synchronization state of the secondary replicas, which can impact the availability and performance of the database.
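The downtime figures in the uptime table above follow directly from the availability percentage; here is the arithmetic as a small Python sketch:

```python
# Downtime per year implied by an availability percentage
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 (non-leap year)

def downtime_minutes_per_year(availability_pct):
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

print(round(downtime_minutes_per_year(99.99), 2))    # 52.56 minutes (4 nines)
print(round(downtime_minutes_per_year(99.999), 2))   # 5.26 minutes (5 nines)
```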
Implement non-metric database monitoring
While monitoring metrics tends to be the most straightforward method of quantifying database performance and efficiency, integrating non-metric monitoring also provides data professionals with valuable insights into their database systems. Non-metric monitoring focuses mainly on qualitative changes and events rather than numerical figures. These monitoring practices can be beneficial when conducting RCA.
For instance, if a recent schema change is found to cause an unexpected shift in production behavior, being able to quickly locate the engineer who made this change, using a schema tracking mechanism, usually provides enough context to investigate further almost immediately. The following sections explore two key areas of non-metric database monitoring.
Schema change observability
Schemas define the architecture (tables and indexes, for example), logic (such as stored procedures and SQL queries), and constraints (like the primary keys and foreign keys on the tables) of a database system. Monitoring Data Definition Language (DDL) SQL transactions made to these components is vital to the integrity of a database, as it gives administrators better visibility into the respective downstream impacts. DDL transactions in SQL are the CREATE, ALTER, and DROP statements. For instance, if an engineer changes the name of a column in a table and causes an unexpected production failure, senior management can quickly pinpoint the source and reasoning behind the change, but only when schema change observability is integrated into their application. Let's look at a few examples of what you can track when monitoring database schemas:
- Table, index, or column creation, deletion, or modification (statements using CREATE, ALTER, and DROP)
- Addition, deletion, or modification of indexes
- Modification of views or stored procedures
- Modifications made to primary or foreign keys
There are several methods for integrating DDL tracking into a database system. For instance, SQL Server allows users to create DDL triggers that are fired when certain events occur.
Let’s assume you want to create an audit trail for all changes made to stored procedures in your database. As an example, you can first create the following database table to record all actions taken on pre-existing stored procedures:
CREATE TABLE dbo.DDLEvents
(
EventDate DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
EventType NVARCHAR(64),
EventDDL NVARCHAR(MAX),
EventXML XML,
DatabaseName NVARCHAR(255),
SchemaName NVARCHAR(255),
ObjectName NVARCHAR(255),
HostName VARCHAR(64),
IPAddress VARCHAR(48),
ProgramName NVARCHAR(255),
LoginName NVARCHAR(255)
);

CREATE TRIGGER DDLTrigger_Sample
ON DATABASE
FOR CREATE_PROCEDURE, ALTER_PROCEDURE, DROP_PROCEDURE
AS
BEGIN
SET NOCOUNT ON;
DECLARE
@EventData XML = EVENTDATA();
DECLARE @ip varchar(48) = CONVERT(varchar(48),
CONNECTIONPROPERTY('client_net_address'));
INSERT dbo.DDLEvents
(
EventType,
EventDDL,
EventXML,
DatabaseName,
SchemaName,
ObjectName,
HostName,
IPAddress,
ProgramName,
LoginName
)
SELECT
@EventData.value('(/EVENT_INSTANCE/EventType)[1]', 'NVARCHAR(100)'),
@EventData.value('(/EVENT_INSTANCE/TSQLCommand)[1]', 'NVARCHAR(MAX)'),
@EventData,
DB_NAME(),
@EventData.value('(/EVENT_INSTANCE/SchemaName)[1]', 'NVARCHAR(255)'),
@EventData.value('(/EVENT_INSTANCE/ObjectName)[1]', 'NVARCHAR(255)'),
HOST_NAME(),
@ip,
PROGRAM_NAME(),
SUSER_SNAME();
END

Backup and replication monitoring
Like any other critical component of an application performance, databases need to be backed up regularly in case of data loss when unforeseen events like hardware failures or malicious attacks occur. The IT industry, broadly speaking, refers to this discipline as “disaster recovery.” Engineering teams usually set up automated backup jobs to ensure data is still recoverable in the event of production data becoming corrupted or unavailable.
How often these backup jobs run depends on the activity level of the database and the risk of data loss. Backup monitoring mainly involves tracking the status of these backup jobs by looking at factors such as backup success or failure, backup completion percentage, backup frequency, and backup error messages. Service-level agreements (SLAs) and service-level objectives (SLOs) are goals set between users and data teams to establish expected levels of performance and availability. SLAs are contracts between customers and service providers that define what service levels are expected, and what the consequences of not meeting these service levels are (penalties or payment negotiations). SLOs are the specific target levels that engineering teams must meet to comply with SLAs. For example, one of the SLOs of an SLA could be 99.9% database uptime. Efficient backup processes allow database teams to meet SLO requirements more easily and ultimately prevent revenue losses.
Similarly, recovery time objectives (RTOs) and recovery point objectives (RPOs) set expectations between customer and database teams about backup and recovery implications. An RTO sets the maximum amount of time between a service disruption of a database and the time at which it is satisfactorily operational again. The lesser the time value of an RTO, the higher the recovery cost usually is. RPOs, on the other hand, specify the amount of time between when data is backed up and when it is needed again, following an outage. Again, a lower RPO implies a higher cost due to the necessity of high-frequency backups. RPOs and RTOs should be set while keeping in mind both costs and the reasonableness of backup recovery. Ultimately, accurate backup monitoring allows database teams to more easily comply with SLAs, SLOs, RTOs, and RPOs.
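Checking a backup schedule against an RPO reduces to measuring the largest gap between consecutive backups. A minimal sketch with hypothetical timestamps and an assumed 15-minute RPO:

```python
from datetime import datetime, timedelta

# Hypothetical RPO and backup completion times
rpo = timedelta(minutes=15)
backup_times = [
    datetime(2024, 6, 1, 10, 0),
    datetime(2024, 6, 1, 10, 10),
    datetime(2024, 6, 1, 10, 30),   # 20-minute gap -- violates the RPO
]

def max_data_loss_window(backup_times):
    """Worst-case data loss if a failure occurs just before the next backup completes."""
    gaps = [b - a for a, b in zip(backup_times, backup_times[1:])]
    return max(gaps)

worst = max_data_loss_window(backup_times)
print(worst > rpo)   # True -- this schedule cannot meet the 15-minute RPO
```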
Large organizations sometimes maintain multiple, synchronized copies of a database server to be accessible in production for various reasons, such as replacement in case of a failure and load balancing, such as moving the processing of big, monthly reports from the primary server to a secondary server. This is called database replication. Replication is a form of high availability (databases designed to have minimal or no interruptions), although it is slower and cheaper than maintaining multiple primary and secondary database replicas. The key aspect of replication to be monitored is whether database replicas are synchronized with the primary server. Monitoring systems would typically watch for replication lag (the period of delay between the time a transaction is committed on the primary and when it is synchronized on the replica server), replication throughput, replication error rates, and error messages.
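Replication lag itself is just the delay between a commit on the primary and its application on the replica. A small illustrative sketch (the timestamps and threshold are assumptions):

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for the same transaction on primary and replica
committed_on_primary = datetime(2024, 6, 1, 12, 0, 0)
applied_on_replica = datetime(2024, 6, 1, 12, 0, 4)

replication_lag = applied_on_replica - committed_on_primary

# Assumed alert threshold; acceptable lag depends on the workload
lag_threshold = timedelta(seconds=2)
print(replication_lag > lag_threshold)   # True -- the replica is falling behind
```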
Perform query analysis to identify bottlenecks
While query execution time is a latency metric we monitor, query analysis goes deeper to understand why queries are inefficient and what is needed to optimize them. Queries constitute a significant part of a database system’s total workload (usually 80% or more), as they are the building blocks of adding, updating, and removing data. Large applications that consume terabytes of data can benefit from even marginal improvements in query performance. Looking at a basic example, let’s assume an e-commerce organization uses the following SQL Server query to find the total sales for orders placed in 2024, in a database containing millions of rows of data:
SELECT SUM(OrderTotal) AS TotalSales
FROM Orders
WHERE YEAR(OrderDate) = 2024;

Assuming a nonclustered index exists on OrderDate, the above query prevents the database engine from using it: applying the YEAR() function to the column makes the predicate non-sargable, forcing a full table scan instead of the desired index seek.
On the other hand, if the organization framed the query as:
SELECT SUM(OrderTotal) AS TotalSales
FROM Orders
WHERE OrderDate >= '2024-01-01'
AND OrderDate < '2025-01-01';
The presence of search arguments, often called “sargs”, enables the engine to use the index on OrderDate and retrieve the required data much faster, thereby significantly reducing response time. In this situation, a query analysis of the above script would immediately reveal that it is performing inefficiently.
Your query performance can be evaluated by metrics similar to general performance metrics, such as query execution time, CPU utilization, and query error rates. These database monitoring metrics are collected during execution.
Query execution plans
The first step to improving query performance is understanding how the query currently executes. Because relational databases use a cost-based query optimizer, engineers cannot tell from a query's text alone how the engine will run it; they need to inspect the plan the optimizer actually chooses. To facilitate this, database engines offer execution plans, which lay out a query's execution roadmap. Using our previously mentioned inefficient query to compute total sales, let's look at how to view an execution plan with PostgreSQL's EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT SUM(order_total) AS total_sales
FROM orders
WHERE EXTRACT(YEAR FROM order_date) = 2024;

This produces output similar to the following:
Seq Scan on orders (cost=0.00..4589.35 rows=200000 width=8)
(actual time=0.015..92.634 rows=200000 loops=1)
Filter: (date_part('year'::text, order_date) = '2024'::double precision)
Rows Removed by Filter: 800000
Planning Time: 0.120 ms
Execution Time: 94.218 ms
Immediately, we notice that the engine conducts a sequential scan (Seq Scan) of all rows. Execution plans like these can provide engineers with a quick and general overview of their queries, encouraging them to consider optimization strategies. However, PostgreSQL does not differentiate between scans and seeks in its execution plans.
Optimizing queries
The manual analysis and optimization of queries can become tedious and confusing, especially for organizations with numerous databases and large volumes of data. Monitoring tools reduce the burden on engineers to scrutinize their scripts and individual queries by automatically collecting data, comparing metrics, and providing recommendations. For example, SolarWinds Database Performance Analyzer provides users with an AI Query Assist (AIQA) feature, which helps you rewrite suboptimal queries in a matter of seconds using generative AI (GenAI), keeping in mind both performance and readability:

While such tools significantly simplify the query optimization process, it’s good, as a database engineer, to know a few best practices involved when it comes to improving query performance:
- Indexing opportunities: When strategically implemented, indexes can drastically speed up query response times. However, they come with a storage overhead, and the cost-benefit tradeoff should be considered.
- Caching mechanisms: Query-result caching and database-level caching can reduce repeated query executions for frequently accessed data.
- Materialized views: Stored view results that can reduce response times for read-heavy queries
- Query management: Periodic refactoring of queries to remove inefficient practices such as SELECT *, JOINs on large tables, etc.
Analyze trends to establish baselines and improve performance over time
Trend analysis goes one step further, collecting and analyzing metric data over time to give database administrators a deeper understanding of how their databases are evolving in terms of performance and stability. Trend analysis is typically a long-term monitoring practice, rather than a quick examination of metrics to resolve a bug or production issue. For example, let's assume you manage the database for an online store.
A price-retrieval feature has begun performing slowly over the last six months. If you have been collecting trend data, you can examine how response times have changed for all queries involved over that period and then investigate further to determine the cause of the deterioration. You could also monitor recurring patterns (e.g., CPU spikes), retroactively correlate database structural changes with performance, and predict future database behavior and resource requirements. Trend analysis can be enabled simply by ensuring that the performance and health metrics you monitor are retained instead of discarded.
How to analyze trends
While different organizations have different trend analysis needs depending on their database sizes, business requirements, and application nature, analyzing trends involves a few common steps:
- Collect metric data and store it in a database for chronological analysis. Typically, a minimum of three months of performance data is needed for a normal trend analysis. More data offers better long-term trending, but the amount of storage needed to store all of that data should be closely watched. Query Store is a popular SQL Server tool that stores query performance data over time, allowing engineers to analyze trends conveniently.
- Set baselines and automate alerts when metrics start to deviate from historical averages.
- Periodically view and investigate metric trends to identify any severe or concerning changes. Compare this data week by week, month by month, and so on, to see how performance has evolved in both the short term and long term. In general, look for metrics that significantly deviate from the norm.
- Visualize trends using UI tools, such as Grafana or PowerBI, to help easily identify spikes or unusual patterns in trends.
- Closely monitor data when database structure changes are made (e.g., schema changes) to see if metrics react differently, in a historical context.
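The baseline-deviation step above can be as simple as a z-score check against recent history. A sketch with hypothetical weekly p95 latency data:

```python
import statistics

# Hypothetical weekly p95 latency history (ms) and this week's reading
history_ms = [102, 98, 105, 101, 99, 103, 100]
current_ms = 126

baseline = statistics.mean(history_ms)
stdev = statistics.stdev(history_ms)

# Flag readings more than 3 standard deviations above the historical baseline
z_score = (current_ms - baseline) / stdev
print(z_score > 3)   # deviation flagged for investigation
```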
Predictive modelling
Retained trend data also supports predictive modelling, for example:
- Using historical growth data to understand when to scale up CPU resources
- Predicting query response times when introducing new features
- Predicting how an increase or decrease in database downtime will affect yearly user traffic
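As a toy illustration of the first point, a naive linear projection of resource growth (all numbers hypothetical):

```python
# Hypothetical monthly database size samples (GB); steady 10 GB/month growth
sizes_gb = [100, 110, 120, 130, 140, 150]

monthly_growth = (sizes_gb[-1] - sizes_gb[0]) / (len(sizes_gb) - 1)

def projected_size(months_ahead):
    """Naive linear forecast; real capacity planning should fit trends more carefully."""
    return sizes_gb[-1] + monthly_growth * months_ahead

print(projected_size(6))   # 210.0 -- helps decide when to plan the next upgrade
```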

Alert intelligently
Alerting is arguably the most useful aspect of monitoring databases, as it forms the key link between the detection of an issue and the database teams proactively firefighting or remediating production issues. Alerting releases pressure on engineers to constantly monitor database performance and availability while they work on other tasks. Alerts are typically set up to notify teams when metrics exceed threshold levels or exhibit other anomalous behavior. The benefits of having a comprehensive alerting system for your database systems are:
- Address issues and bugs before they impact production environments
- Maintain expected performance levels
- Keep downtime to a minimum
- Meet service level agreement (SLA) objectives
- Decrease the average mean time to resolution (MTTR)
Typical examples of threshold-based alerts include:
- CPU Utilization: Send an alert when CPU utilization exceeds 80% for over 4 minutes
- Query Response times: Send an alert when the response time of a critical query exceeds 2 times the historical average
- Traffic: Send an alert when traffic spikes 50% over the normal pattern for a particular period of the day or week
- Availability: Send a critical alert when a database becomes unresponsive for more than 3 seconds
For example, let’s assume an e-commerce company has an alerting mechanism set up within its database monitoring system. During holiday periods, they want to ensure their checkout databases are reliable and that they are informed immediately of any potential issues, to avoid any revenue loss due to purchase failures. The database team decides to set up alerts for all queries related to business-critical functions, such as checkouts, purchases, user authentications, and product detail retrievals.
To avoid false positives and alert fatigue, non-relevant reporting and analytical queries are not configured to deliver these specialized alerts. These alerts are also only required during holiday periods (such as Christmas and Easter) and are enabled two weeks before the official holidays, on weekday evenings and weekends. To do this, engineers create multi-level threshold query latency alerts using the following reasoning:
| Incident | Alert |
|---|---|
| Average query latency exceeds 200 ms for more than 3 minutes | Send a warning alert by email and Slack |
| Average query latency exceeds 400 ms for more than 1 minute | Send a warning alert by email and SMS, and create an incident ticket |
| Average query latency exceeds 600 ms for more than 30 seconds | Place an automated phone call to the on-call DBA/senior engineers to resolve the wait times immediately |
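The escalation logic above can be sketched as a small function; the thresholds mirror the example, while the return labels are illustrative assumptions:

```python
# Multi-level alert escalation based on average latency and how long it is sustained
def alert_level(avg_latency_ms, sustained_seconds):
    if avg_latency_ms > 600 and sustained_seconds > 30:
        return "page-oncall"   # automated phone call to the on-call DBA
    if avg_latency_ms > 400 and sustained_seconds > 60:
        return "ticket"        # email, SMS, and incident ticket
    if avg_latency_ms > 200 and sustained_seconds > 180:
        return "warning"       # email and Slack
    return "ok"

print(alert_level(250, 200))   # warning
print(alert_level(650, 45))    # page-oncall
```

Ordering the checks from most to least severe ensures a sustained 650 ms spike pages the on-call engineer rather than merely opening a ticket.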

Adopt a continuous improvement approach to database monitoring

A continuous improvement approach rests on a few key elements:
- Clear baselines: Normal performance, availability, and health levels should be defined and reviewed after major database changes and improvements.
- Documentation: Documenting monitoring practices, metric baselines, and trend data establishes guidelines for continuous improvement and helps new engineers get up to speed smoothly.
- Post-mortem analysis: Major incidents and outages can discourage database teams, but should be seen as learning opportunities to improve monitoring processes. A root cause analysis of metrics, alerts, and logs can enable alerts to detect or even prevent the same issue from occurring in the future, as well as point to where the monitoring process failed and how it can be improved.
- Alerting: Just like metrics, alerts should be reviewed often. Performance impacts can change in terms of how severe they are considered over time, and alerts need to be adjusted accordingly to reflect the perceived urgency. Multi-threshold alerting can also be very useful with critical processes.
- Automation: Focus on automating most monitoring practices. Automating performance and health checks is quicker for identifying underlying issues and anomalies, but also eliminates the potential for human error, which can potentially have business-critical impacts. In addition, many alerts enable automated responses. For example, when a user process causes a deadlock, an alert might simply cancel the deadlocked transaction and automatically restart it without the user even knowing an automated remediation was invoked.