SRE (Site Reliability Engineering) Best Practices


Introduction

Site Reliability Engineering (SRE) is a practice that emerged at Google because of the need for highly reliable and scalable systems. SRE unifies operations and development teams and implements DevOps principles for system reliability, scalability, and performance.

There’s plenty of documentation on tactics for adopting automation and implementing infrastructure as code, but practical, ops-focused SRE best practices based on real-world experience are harder to find. This article explores six SRE best practices based on feedback from SREs and technical subject matter experts. Here is a list of the topics we will cover.

| SRE Best Practice | Benefit |
| --- | --- |
| Define the role of the SRE | Removes role ambiguity and clarifies responsibilities |
| Automate toil and make time for strategic tasks | Emphasizes automation of simple tasks and enables humans to focus on more complex work |
| Monitor using Service Level Indicators (SLIs) and Service Level Objectives (SLOs) | Improves visibility and helps determine if Service Level Agreements (SLAs) are met |
| Maintain a transparent status page | Summarizes infrastructure performance and availability for all stakeholders |
| Categorize incident severities | Helps quantify incident impact and prioritize incident management tasks |
| Conduct post-mortems and share them publicly | Encourages transparency and continuous learning |

 

Define the Role of the Site Reliability Engineer

The SRE has several responsibilities:

  • Designing systems to monitor, automate, and achieve the highest uptime with the lowest operational effort
  • Enabling developers to iterate and move fast
  • Managing incidents
  • Performing root cause analysis (RCA)
  • Conducting post-mortems (more on these later in the article)
  • Creating documentation to minimize tribal knowledge

SREs should spend most of their time automating tasks so that they do not have to constantly ‘toil.’ Toil is a catchall term for operational tasks that involve repetitive manual configuration or lack long-term strategic value. Without automation, toil consumes the engineering team’s time. With automation, engineers can focus on more complex tasks.

This diagram represents the optimal time allocation for an SRE.

Figure: The two categories of SRE work.

 

Automate Toil and Leave Time for Strategic Tasks

To avoid wasting valuable engineering time, SREs should work on automating every repetitive task, so teams focus less on toil and more on innovation. SREs use scripts, programs, and frameworks to automate and monitor those tasks.

Within high-performing teams, eliminating toil is a core SRE function. From a tactical perspective, there are many ways to implement this best practice. The key is to limit the human time spent on simple tasks that automation can handle.
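As one illustration of turning a repetitive manual chore into a script, here is a minimal Python sketch that finds and removes stale files in a directory. The task itself, the function names (find_stale_files, clean_up), and the seven-day retention period in the usage note are assumptions chosen for the example, not something prescribed by this article.

```python
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86_400


def find_stale_files(directory: Path, max_age_days: float,
                     now: Optional[float] = None) -> List[Path]:
    """Return files directly inside `directory` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        path for path in directory.iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def clean_up(directory: Path, max_age_days: float, dry_run: bool = True) -> List[Path]:
    """Delete stale files and return them; with `dry_run=True`, only report them."""
    stale = find_stale_files(directory, max_age_days)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Calling clean_up(Path("/var/log/myapp"), 7) (a hypothetical path) would list files older than seven days without deleting anything. Defaulting to a dry run is a deliberate choice here, so that a scheduled job cannot delete files until someone has reviewed what it would remove.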

 

Monitor Using Service Level Indicators and Service Level Objectives

Effective monitoring is a crucial part of SRE. Metrics should be measured as close to the user as possible, since user experience is what most businesses care about most. Organizations should define their most important metrics, which SREs then use to build the three key indicators: SLIs, SLOs, and SLAs.

Service Level Indicators Versus Service Level Objectives Versus Service Level Agreements

| Key Indicator | Goal | Stakeholders |
| --- | --- | --- |
| SLIs | Collect metrics in a standardized way to gain insights into the system's performance | Development and product team |
| SLOs | Set the uptime objective for the company | Development team, product team, and company executives |
| SLAs | Set expectations for the general public about the reliability of your services | Clients, consumers, and the general public |

SLIs are used to collect metrics in standardized ways. Here is a breakdown of common SLI types.

Service Level Indicators

Common SLI Types

| Type of SLI | Description |
| --- | --- |
| Availability | Percentage of requests that resulted in a successful response |
| Latency | Percentage of requests that returned faster than a defined latency threshold |
| Quality | Percentage of requests that were served in a non-optimal manner due to service degradation |
| Freshness | Percentage of data that was updated more recently than a defined age threshold |
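To make the availability and latency definitions concrete, the following is a minimal Python sketch that computes both SLIs from a batch of request records. The Request type, its field names, and the rule that any non-5xx status counts as a successful response are assumptions made for illustration; real systems usually derive these figures from their monitoring stack.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status_code: int   # HTTP status returned to the caller
    latency_ms: float  # time taken to serve the request


def _percentage(good: int, total: int) -> float:
    if total == 0:
        raise ValueError("cannot compute an SLI over zero requests")
    return 100.0 * good / total


def availability_sli(requests: Sequence[Request]) -> float:
    """Percentage of requests with a successful response (assumed here: status below 500)."""
    good = sum(1 for r in requests if r.status_code < 500)
    return _percentage(good, len(requests))


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> float:
    """Percentage of requests that returned faster than `threshold_ms`."""
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return _percentage(good, len(requests))


sample = [Request(200, 120), Request(200, 480), Request(500, 90), Request(200, 310)]
print(availability_sli(sample))  # 75.0 (three of four requests succeeded)
print(latency_sli(sample, 300))  # 50.0 (two of four requests beat 300 ms)
```

Both indicators are expressed as "good events over total events", which keeps them comparable and makes it straightforward to set an objective against them.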

Service Level Objectives

SLOs are the goals the organization must accomplish and are formulated using the SLIs explained in the previous section. These should be published internally in a place easily accessible for technical and non-technical stakeholders.
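As a worked example of how an objective is expressed against an SLI, the sketch below checks a measured availability figure against a target and converts an uptime objective into the downtime it permits over a window. The 99.9% target and the 30-day window are illustrative assumptions, and the function names are hypothetical.

```python
def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli_percent >= slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an uptime objective tolerates over `window_days`."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


print(meets_slo(99.95, 99.9))                      # True
print(round(allowed_downtime_minutes(99.9), 1))    # 43.2
```

Under these assumptions, a 99.9% uptime objective leaves roughly 43 minutes of downtime in a 30-day window, which gives technical and non-technical stakeholders the same concrete number to reason about.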

Service Level Agreements

SLAs are contracts with clients and consumers that set out what to expect from the service. They are usually legally binding and carry financial implications if they are not met.

 

Maintain a Transparent Status Page

Customers need to understand the status of your systems at all times. If there's an outage, the customer has to know about it as soon as possible. This helps build trust and prevents customers from troubleshooting an issue that is beyond their control.

Status pages reflect the status of services in real-time. They should be clear and concise and have a color-coded indicator for each service exposed to customers. In case of failure, status pages should immediately report which services are failing and why. It is always good to accompany these reports with an email or RSS notification.

Figure: The Squadcast status page at https://status.squadcast.com.
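The color-coded, per-service view described above can be sketched in a few lines of Python. This is only an illustration of the idea: the three status levels, the color mapping, and the service names are assumptions, and it does not reflect how any particular status page product works.

```python
from enum import IntEnum
from typing import Dict, List


class Status(IntEnum):
    # Ordered so that a higher value means a worse state.
    OPERATIONAL = 0
    DEGRADED = 1
    OUTAGE = 2


COLORS = {
    Status.OPERATIONAL: "green",
    Status.DEGRADED: "yellow",
    Status.OUTAGE: "red",
}


def overall_status(services: Dict[str, Status]) -> Status:
    """The page-level status is the worst status of any customer-facing service."""
    return max(services.values(), default=Status.OPERATIONAL)


def render(services: Dict[str, Status]) -> List[str]:
    """One headline line followed by one color-coded line per service."""
    lines = [f"Overall: {overall_status(services).name.lower()}"]
    for name, status in sorted(services.items()):
        lines.append(f"[{COLORS[status]}] {name}: {status.name.lower()}")
    return lines


for line in render({"api": Status.OPERATIONAL, "checkout": Status.OUTAGE}):
    print(line)
```

Taking the worst per-service status as the headline keeps the page honest: a single failing service is never hidden behind an otherwise green summary.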

 

Categorize Incident Severities

With enough time and complexity, errors happen. When they do, they must be addressed in an organized manner.

Incidents have different severities: P0, P1, P2, and P3. Severity determines the action to be taken and response time.

Severity Levels

| Severity | Examples | Action | Response time |
| --- | --- | --- | --- |
| P0 (Critical) | The site is unavailable for one or several reasons: DDoS attack, wrong configuration, bad deployment, or third-party incident. It can also be related to a security issue, such as PII exposure. | Page (push to on-call, call to action, email, Slack, War Room). Most of the time, several teams are involved, with multiple stakeholders. Engineers do RCA in real-time. | Immediate (within five minutes) |
| P1 (Major) | The site is partially affected due to one or more services failing or a provider incident. This issue could also be intermittent. | Page (push to on-call, email). Usually involves fewer teams than a P0, but it has to be solved relatively fast to prevent a deteriorated user experience. | Fast (within 20-30 minutes) |
| P2 (Minor) | Some of the site’s non-critical functionalities are affected, like recommendations not loading correctly, images not showing up, or loading too slowly. | Slack, email, and notify a single team. See if there is easy remediation and if there is a fix to be applied. If not, it could be delayed until the next working day. An item should be placed in the backlog and prioritized accordingly. | Standard (within a few days) |
| P3 (Irrelevant/Bug) | The incident is not affecting users directly, or users may not even be aware of it, like an elevated error rate that client applications retry on. | Notification channels may include email or Slack, but do not require an immediate response. SREs should review during working hours. | Slow (within a few days or weeks) |
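The severity table can also be captured as data, so that alert routing follows the same rules every time. The Python sketch below is one possible encoding under stated assumptions: the ResponsePolicy type, the classify function, and its boolean inputs are hypothetical, while the channels and response times are taken from the table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Severity(Enum):
    P0 = "Critical"
    P1 = "Major"
    P2 = "Minor"
    P3 = "Irrelevant/Bug"


@dataclass(frozen=True)
class ResponsePolicy:
    page_on_call: bool
    channels: Tuple[str, ...]
    response_time: str


POLICIES = {
    Severity.P0: ResponsePolicy(True, ("push", "call", "email", "slack", "war room"),
                                "within five minutes"),
    Severity.P1: ResponsePolicy(True, ("push", "email"), "within 20-30 minutes"),
    Severity.P2: ResponsePolicy(False, ("slack", "email"), "within a few days"),
    Severity.P3: ResponsePolicy(False, ("email", "slack"), "within a few days or weeks"),
}


def classify(site_unavailable: bool, security_issue: bool,
             core_service_failing: bool, users_affected: bool) -> Severity:
    """Map a few hypothetical incident flags to a severity, worst case first."""
    if site_unavailable or security_issue:
        return Severity.P0
    if core_service_failing:
        return Severity.P1
    if users_affected:
        return Severity.P2
    return Severity.P3


severity = classify(False, False, True, True)
print(severity.name, POLICIES[severity].response_time)  # P1 within 20-30 minutes
```

Real incidents rarely reduce to four booleans, so a function like classify is best treated as a starting suggestion that the on-call engineer can override.
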

Conduct Post-Mortems and Share Publicly

Shortly after an incident, SREs should do two things:

  • Address the issues: Critical errors are often patched or hot-fixed in an improvised way, which is rarely a permanent solution. When that happens, the underlying issue should be placed in the backlog to be revisited by the development teams. SREs should also use working hours to review any issues that were not fixed while on call.
  • Draft a post-mortem: A post-mortem is a briefing on what happened during the incident. It collects the key information from the incident: what happened, why, and how to prevent future issues. Every post-mortem should have clear documentation and action items that are placed in a backlog and prioritized according to severity.
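The post-mortem contents listed above (what happened, why, how to prevent it, and prioritized action items) map naturally onto a small record type. The following Python sketch is one hypothetical structure; the class and field names are assumptions, and teams commonly keep this information in a document template rather than in code.

```python
from dataclasses import dataclass, field
from typing import List

# Lower number means higher priority; mirrors the P0-P3 severity scale.
SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


@dataclass
class ActionItem:
    description: str
    severity: str  # one of "P0", "P1", "P2", "P3"


@dataclass
class PostMortem:
    title: str
    what_happened: str
    why: str
    prevention: str
    action_items: List[ActionItem] = field(default_factory=list)

    def prioritized_backlog(self) -> List[ActionItem]:
        """Action items ordered by severity, most severe first."""
        return sorted(self.action_items, key=lambda item: SEVERITY_ORDER[item.severity])
```

Because sorted is stable, items with the same severity keep the order in which they were recorded, which preserves the author's original sequencing inside each priority level.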

It is also a great idea to share post-mortems with the public since it helps bring transparency and strengthen trust.
