Service Level Agreements Vs. Service Level Objectives: Tutorial and Examples


Introduction

Service level agreements (SLAs) and service level objectives (SLOs) are increasingly popular site reliability engineering (SRE) concepts. Modern applications rely on a complex web of sub-services, such as public cloud services and third-party application programming interfaces (APIs), which makes measuring service quality an operational necessity for serving a demanding market.

This article focuses on the similarities and differences between SLOs and SLAs in SRE. It describes the intricacies of implementation, presents a case study, and recommends best practices.

 

Summary of Key Concepts

The term SLA has taken on different meanings over time. Some companies define SLA as the service quality clause in a contractual agreement and refer to SLOs as the measurable objectives that substantiate the SLA. In this article, we adhere to the Google definition in the context of SRE practices, summarized in the following table.

Concept | Definition
Service-Level Indicator (SLI) | SLIs are metrics such as latency and error rate used to measure service quality.
Service-Level Objective (SLO) | An SLO is a target value (or a value range) for service quality as measured by an SLI.
Composite SLO | The resulting SLO when combining sub-services with varying levels of SLO.
Error Budget | The amount of time a system can fail or the number of errors it can sustain before causing an SLO breach.
Service-Level Agreement (SLA) | An SLA is an agreement between a service provider and its users that stipulates SLO guarantees and the penalties payable for non-compliance.

Industry Context for Service Level Agreements versus Service Level Objectives

Before the proliferation of software as a service (SaaS) and hosted applications, SLOs and SLAs were mostly used in the IT industry by telecom carriers offering data services such as internet access. Carriers committed to service quality metrics such as 99.99% availability or a minimum bandwidth of 50 MB per second.

The principles haven’t changed for software and infrastructure service providers, but the metrics focus instead on indicators such as latency and error rates. For example, an API may have an internal service quality objective to process a minimum of 100 requests per second 99.99% of the time, with an error rate of less than 0.5% and a query response time of less than 200 milliseconds.

The contractual SLA commitment may be based on fewer indicators for the sake of simplicity and use more conservative values, such as guaranteeing that the average response time calculated over an hour won’t exceed 300 milliseconds, as conceptually illustrated in the diagram below.

[Figure: Industry context for SLA vs. SLO]

The software industry has come a long way in defining and implementing SLOs. For instance, the OpenSLO project helps companies configure vendor-agnostic SLOs using YAML files, in the same way that DevOps teams use YAML files as part of their continuous code delivery processes. In another example, Squadcast has open-sourced its internal SLO tool, known as the SLO Tracker, to help the SaaS industry improve software stability. In another sign of industry maturity and cooperation, SLOConf is a community resource for learning about vendors developing new tools and services for implementing SLOs.

 

Service Level Agreements versus Service Level Objectives: Concepts Explained

 

Service-Level Indicator

SLI metrics measure service performance, accuracy, and availability. The core SLI metrics for mobile and web applications are uptime, latency, error rate, and throughput. One service can have multiple service level indicators. The table below provides a list of common SLIs.

Common Service Level Indicators | Definition
Availability (or uptime) | The percentage of time the service has been fully functioning and available to users over a time interval (e.g., 99.95% of the time over a 24-hour period).
Latency | The time it takes for a web page or an application programming interface to return a response to a request (e.g., 200 milliseconds).
Error Rate | The percentage of requests resulting in an error over a period of time (e.g., 0.1%). An example of an HTTP error is a 404 code, meaning a page was not found.
Throughput | The capacity of an API to support requests, typically expressed in terms of requests per minute (RPM). In networking, throughput is measured in MB per second.
Mean time between failures (MTBF) | The average amount of time separating two consecutive failures (e.g., a little over five days).
Mean time to repair (MTTR) | The average time it takes the service provider to remedy a service failure (e.g., one hour).
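As a rough illustration of how some of these indicators could be derived from raw data, the sketch below computes an error rate, a throughput, and a latency percentile from a small, made-up list of request records. The `Request` record, the `compute_slis` helper, the choice of the 95th percentile, and the decision to count only HTTP 5xx responses as errors are all assumptions made for this example; a real monitoring tool may define each indicator differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to return a response
    status: int        # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compute_slis(requests, window_minutes):
    """Summarize a window of requests into a few common SLIs."""
    total = len(requests)
    # Assumption: only server-side (5xx) responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "error_rate_pct": 100 * errors / total,
        "throughput_rpm": total / window_minutes,
        "p95_latency_ms": percentile([r.latency_ms for r in requests], 95),
    }


# Hypothetical sample: four requests observed over a two-minute window.
sample = [Request(120, 200), Request(180, 200), Request(95, 200), Request(450, 500)]
slis = compute_slis(sample, window_minutes=2)
print(slis)
```

Each value in the resulting dictionary is an SLI that an SLO can then set a target for.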

Service-Level Objectives

SLOs are targets set by DevOps teams for measuring service quality based on an SLI. For example, a service may aspire to be available 99.99% of the time, or to limit errors (such as an HTTP 500 error) to less than 0.5% of requests.

SLOs are increasing in popularity because they provide multiple benefits. They:

  • Define measurable customer-centric criteria for service quality
  • Help teams collaborate better by having a common understanding
  • Rally an organization to improve by establishing stretch goals
  • Avoid disputes with clients arising from subjective expectations
  • Define the reliability of sub-services used as an application's building blocks
  • Set an application’s reliability expectations from infrastructure resources

 

Service providers often target a more aggressive SLO value internally compared to the value published for end-users. For example, a service provider may require its SRE team to deliver a service availability of 99.99% while only advertising an SLO of 99.9% to its end-users. The difference between the two SLO values is viewed as a safety buffer for execution.

Composite Service Level Objectives

Modern applications rely on a multitude of independent services to operate. For example, a web application requires its front-end web server farm to run in conjunction with back-end services, including a database service. In addition, a web application often won’t function properly unless the content delivery network (CDN) and domain name service (DNS) are also fully operational.

Before a service provider contractually commits to an SLO, it must consider the SLOs from all its constituent services and calculate a composite SLO.

The value of a composite SLO is calculated by multiplying the SLOs of its sub-services, which may not seem intuitive at first. The formula comes from probability theory: the probability that two independent events both occur is the product of their individual probabilities.

In the example below, the application’s composite SLO is 99.899%, based on the following multiplication: 0.999 (SLO of service A) x 0.99999 (SLO of service B) = 0.99899001 (SLO of the application service).

A composite SLO is calculated based on the SLOs of its supporting sub-services.

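The multiplication described in this section can be sketched in a few lines of Python. The `composite_slo` helper is a hypothetical name introduced here, and the sketch assumes that the application needs every sub-service to be up and that the sub-services fail independently of one another.

```python
from math import prod


def composite_slo(sub_service_slos):
    """Multiply the SLOs (expressed as fractions) of all required sub-services."""
    return prod(sub_service_slos)


# Service A at 99.9% and service B at 99.999%, as in the example above.
value = composite_slo([0.999, 0.99999])
print(f"{value:.8f}")        # 0.99899001
print(f"{value * 100:.3f}%")  # 99.899%
```

Because every factor is below 1, adding another required sub-service can only lower the composite SLO, which is why the composite value is always worse than the weakest dependency.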

 

Error Budget

An error budget is the amount of downtime or errors a service can absorb before an SLO is breached. For example, an uptime commitment of 99.9% per month means that a service can be down 43.83 minutes in a month without breaching the SLO. Suppose a service suffers 30 minutes of downtime during the first 15 days of a month. This leaves 13.83 minutes of error budget that the operations team can afford to spend before it fails to meet its objectives.
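The arithmetic above can be reproduced with a short sketch. The 43.83-minute figure matches an average month of 365.25 / 12 days, so the code assumes that month length; the `error_budget_minutes` helper is a hypothetical name used only for illustration.

```python
# Average month length in minutes, assuming a 365.25-day year.
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # 43,830 minutes


def error_budget_minutes(slo, period_minutes=MINUTES_PER_MONTH):
    """Downtime allowed in the period before the SLO (a fraction) is breached."""
    return (1 - slo) * period_minutes


budget = error_budget_minutes(0.999)
downtime_so_far = 30  # minutes of downtime in the first 15 days
remaining = budget - downtime_so_far

print(f"Monthly budget: {budget:.2f} minutes")     # 43.83
print(f"Remaining:      {remaining:.2f} minutes")  # 13.83
```

A negative `remaining` value would mean the SLO has already been breached for the month.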

Error budgets are traded off against the pace of innovation. In other words, a high velocity of code releases in a production software environment supports innovation but can cause instability. Companies that measure error budgets can course-correct their strategies mid-month. For example, if they have sustained outages in the early part of the month, they can focus their efforts on testing and documentation to reduce consumption of the error budget during the latter half of the month.

Service-Level Agreements

SLAs represent an agreement between a service provider and its end users that establishes service performance, accuracy, and availability standards (we refer to them collectively as service quality in this article) based on SLOs. SLAs by definition involve a contractual obligation to the customer upon breach of the committed SLO values.

Service Level Agreements versus Service Level Objectives Case Study: An Internet Service Provider

Let’s consider an internet access service to describe, in practical terms, how SLAs and SLOs are offered and implemented by a service provider.

Take, for example, a dedicated internet access service. Dedicated access is a significant investment that requires the installation of on-prem equipment and usually a multi-year contract.

In return, the Internet Service Provider (ISP) promises higher availability and throughput compared to a shared internet service. If the ISP violates this agreement, the contract's SLA will involve a penalty resulting in service credits or a refund.

The Contractual Terms and Service Level Agreement

The list below shows the terms and conditions of a typical contract. The first few points establish the rules of engagement, followed by an SLA clause that makes quantitative guarantees:

  • The ISP shall work closely with the customer to coordinate any outage or maintenance requests initiated by either party to ensure minimal network downtime
  • The ISP will provide a minimum notification of five days before any scheduled maintenance window involving planned service downtime
  • All scheduled maintenance windows will occur between 12 a.m. – 6 a.m. or at a time agreed on in advance
  • The ISP shall provide customer support 24 hours a day, seven days a week
  • The ISP shall provide services that meet the SLA conditions presented in the following table. The remedy for a violation of these service level standards will be a credit equal to 1/10th of the Monthly Recurring Charge (MRC) for each month the service provider has not satisfied the following service level standards.

 

Service Level Standard | Guaranteed Value
Service Availability | 99.99% (or a maximum downtime of 4.38 minutes per month)
Throughput as measured by https://www.speedtest.net/ | > 50 MB per second
MTTR upon an outage | Two hours
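To make the remedy clause concrete, the sketch below checks one month of measurements against the three service level standards and applies the credit of 1/10th of the MRC. The `MonthlyReport` record and both helper functions are hypothetical, the measurements and the MRC of 2,000 are invented, and reading "two hours" as a maximum MTTR is an assumption about how the contract would be interpreted.

```python
from dataclasses import dataclass

# Thresholds from the SLA terms above. Treating "two hours" as a maximum
# MTTR is an assumption made for this example.
MIN_AVAILABILITY_PCT = 99.99
MIN_THROUGHPUT_MB_PER_S = 50.0
MAX_MTTR_HOURS = 2.0


@dataclass
class MonthlyReport:
    availability_pct: float
    throughput_mb_per_s: float
    mttr_hours: float


def violated_standards(report):
    """Return the names of the service level standards missed this month."""
    violations = []
    if report.availability_pct < MIN_AVAILABILITY_PCT:
        violations.append("availability")
    if report.throughput_mb_per_s <= MIN_THROUGHPUT_MB_PER_S:
        violations.append("throughput")
    if report.mttr_hours > MAX_MTTR_HOURS:
        violations.append("mttr")
    return violations


def monthly_credit(report, mrc):
    """Credit of 1/10th of the MRC for a month with at least one violation."""
    return mrc / 10 if violated_standards(report) else 0.0


good_month = MonthlyReport(availability_pct=99.995, throughput_mb_per_s=62.0, mttr_hours=1.5)
bad_month = MonthlyReport(availability_pct=99.95, throughput_mb_per_s=62.0, mttr_hours=1.5)

print(monthly_credit(good_month, mrc=2000.0))  # 0.0
print(monthly_credit(bad_month, mrc=2000.0), violated_standards(bad_month))
```

The credit is the same whether one standard or all three are missed, which mirrors the flat per-month remedy in the contract terms.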

Implications of Committing to a Service Level Agreement

An ISP must invest in an infrastructure architecture designed for high availability to sustain its service through standard equipment and infrastructure failures. High availability requires redundant fiber-optic and networking equipment, as well as redundant power supplies and switches to handle hardware failures.

Even with all of this planning, service interruptions will still inevitably occur. Outages could take the form of failed maintenance or an underground cable accidentally cut during construction work.

A typical ISP would establish an internal SLO, leaving a margin of error for its engineering and operations teams. In our example, supporting an SLA of 99.99% (4.38 minutes of allowed downtime per month) may require an SLO of 99.999% (26.30 seconds per month). The reasoning is that strict internal SLOs give the provider the best chance to catch and mitigate issues before they result in SLA violations.

Ultimately, the engineering team may decide that offering such a high-level SLA requires excessive capital investment, and the team may convince the legal department to consider a lower level of contractual commitment, such as 99.95%. This translates to 21.92 minutes of acceptable downtime per month instead of 4.38 minutes.

The table below shows how each ‘nine’ places a significant operational burden on the service provider’s engineering and operations teams.

Availability % | Downtime per Year | Downtime per Month
99.90% | 8.77 hours | 43.83 minutes
99.95% | 4.38 hours | 21.92 minutes
99.99% | 52.60 minutes | 4.38 minutes
99.995% | 26.30 minutes | 2.19 minutes
99.999% (“five nines”) | 5.26 minutes | 26.30 seconds

Best Practices for Service Level Objectives versus Service Level Agreements

Consider the following recommendations when planning to introduce a new SLO or SLA:

Plan Ahead

Introducing SLAs typically requires months of planning, testing, and upgrading tools and processes. Business stakeholders like sales and legal departments should collaborate with stakeholders from engineering, support, and operations organizations to create a well-defined SLA support plan and practice responding to incidents using internal SLOs.

Be Transparent

SLAs often remain buried as a clause in a legal contract in the hope that customers forget about them and don’t request a refund upon breach. SLAs displayed on a public service status page help align a provider’s operations with the expectations of its clients.

Some providers go as far as displaying the SLOs (used in the SLAs with clients) on physical monitors in their offices to embed them in the company’s culture.

Keep it Simple

Choose SLOs that are as simple as possible with clear service level indicators that can be easily monitored and calculated. It’s best to start with only one SLI.

In practice, even simple SLO calculations can get complicated. For example, suppose an application is performing well (less than 500 milliseconds of access time for most of the web pages that make up the application’s user interface), but one of its reports generates slowly (taking two minutes because of the large volume of data it covers combined with a poorly optimized database query). Does this scenario constitute a breach of SLA? The service provider would say no, but a user of that particular report would disagree.

Measure Objectively

SLAs should be measured using a third-party testing tool outside of the company’s network to simulate the behavior of end-users who reach the platform from a remote location. An example might be a ping test conducted by a third-party testing provider with globally distributed locations.

Mind the Timing

SLOs and SLAs are typically based on measurements averaged over an hour, a day, or a month. However, the timing of the outages, slowness, and errors contributing to SLO degradation is equally important. For example, two services may both meet an SLO of 99.9% uptime by having no more than 43 minutes of downtime in a month; however, one had its outages late at night on weekends and the other mid-morning on weekdays, resulting in different customer satisfaction outcomes. Some DevOps teams avoid releasing code or making certain types of configuration changes during peak business hours to reduce the risk to service quality.

Require Detailed Support Tickets

Customers expect rapid service restoration but may not provide enough information about the problems they are experiencing. For example, an application may be slow in certain regions, and only from mobile devices, while all other locations operate normally from desktops. It’s important that customers file support tickets and report problems that could result in SLA penalties. Support tickets should include mandatory fields for information such as the OS version and browser version of the platform where the problem was experienced, along with screenshots or browser logs. The more information a service provider has, the more likely it is to shorten its mean time to repair (MTTR) and meet its SLA obligations.

Start With a Low Service Level Agreement Commitment

It’s best to start with a lower level of commitment, even if it’s not the industry standard. This approach gives your teams time to adjust. For example, if your competitor offers a 99.99% commitment, start with 99.9% for the first few months.

Make sure your internal processes and architecture support the SLA before increasing your commitment to 99.99%. Your operations team will appreciate the difference between 43 minutes and four minutes of permitted downtime per month while they get used to regularly meeting the SLA.

 

Conclusion

Establishing SLOs helps organizations drive towards a common measurable goal and reach the level of client satisfaction needed for a company to prosper. It’s best to start measuring and privately sharing SLOs inside a company for months, or even years, before contractually committing to customers. Start simple to give your company time to evolve the processes, tools, and service architecture necessary to honor legally binding commitments.
