Runbook Template: Basics, Best Practices and an Example


Fundamentally, a runbook is a set of instructions that, when followed precisely, result in a system producing a specific outcome. For example, a runbook can define a process to restore a network device to a working state.

As modern IT infrastructures continue to grow in complexity and scale, triaging potential incidents becomes more and more time-consuming. Runbooks help reduce MTTR by providing engineers with a proven recovery path, and automation helps scale the benefits. A platform-agnostic runbook template provides process stability and reliability, and an automation strategy can provide the confidence and repeatability needed to recover quickly.

This article will explore runbook templates and provide ways to avoid potential issues.

 

Runbook Template Basics

The goal of a well-structured runbook template is to be concise, providing readers with the details and context needed to complete a task without overloading the document with unnecessary details.

The table below details the key components of a quality runbook template.

Runbook Template Components

| Runbook Component | Description | Example |
| --- | --- | --- |
| Task ID | Usually a reference and link to a ticket in the organization's project management system or incident board (Jira, Asana, Trello), telling readers where to find more information and where to log details about the runbook's execution. | INC-101 |
| Task Name | A quick description of the task (two to three words). | Employee Offboarding |
| Task Description | A longer description of the task. This doesn’t need to go into detail and should not specify how the task should be performed at a technical level. | Employee has been dismissed for misconduct and needs to be removed from all relevant systems. |
| Task Details | Steps required to execute this task. This is the core of the runbook. Each step should be outlined in a simple format: describe the required action, the reason for the action, and, if required, how to validate and/or troubleshoot the step. | Step 1. Power on the machine. Step 2. Input credentials. Step 3. Power off the machine. |
| Team executing this task | Team responsible for this task. | DevOps |
| Task Owner | Team member responsible for executing the task or coordinating the team. | Alice@example.com |
| Time to complete this task | Particularly useful for tasks that will affect production systems. An expected value should be provided up front, with the actual value recorded once the action has been completed. | Estimated time: 10 - 20 minutes. Started: 11/11/22 11:00:00. Completed: 11/11/22 11:11:00 |
| Status | A status provides all stakeholders with insight into the issue or task in question. | ASSIGNED, IN_PROGRESS, BLOCKED, or COMPLETE |
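
The components above map naturally onto a small data structure. The following is only a sketch of one possible representation in Python; the class and field names (`Runbook`, `Step`, `estimated_minutes`, and so on) are illustrative choices, not part of any standard runbook format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional, Tuple


class Status(Enum):
    ASSIGNED = "ASSIGNED"
    IN_PROGRESS = "IN_PROGRESS"
    BLOCKED = "BLOCKED"
    COMPLETE = "COMPLETE"


@dataclass
class Step:
    action: str           # what to do
    reason: str = ""      # why the step is needed
    validation: str = ""  # how to confirm or troubleshoot it (optional)


@dataclass
class Runbook:
    task_id: str                       # ticket reference, e.g. "INC-101"
    task_name: str
    description: str
    team: str
    owner: str
    estimated_minutes: Tuple[int, int]  # (low, high) estimate
    steps: List[Step] = field(default_factory=list)
    status: Status = Status.ASSIGNED
    started: Optional[datetime] = None
    completed: Optional[datetime] = None

    def actual_minutes(self) -> Optional[float]:
        """Elapsed time in minutes, or None while the task is unfinished."""
        if self.started is None or self.completed is None:
            return None
        return (self.completed - self.started).total_seconds() / 60
```

Filling it in with the example values from the table (started 11:00:00, completed 11:11:00) makes `actual_minutes()` return 11.0, which can then be compared against the 10 to 20 minute estimate.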

Triggering a Runbook

The first iteration of a runbook, or the first few iterations, will likely be triggered manually. For example, tasks such as recovering a crashed website or offboarding an employee for HR should be performed by a human before being automated.

As the process improves, the runbook may be triggered via an application programming interface (API) or ticketing system. Cloud monitoring services such as Amazon CloudWatch can detect issues in a production system, highlight them using interactive graphs, and even trigger an automated response. As the runbook evolves, automated responses can take over some of the responsibilities from the engineer in charge of executing it, potentially automating the entire process.

A monitoring solution does not have to be tied to a particular technology or provider. Custom solutions require more effort but can be just as effective. Examples range from a basic out-of-the-box graphing tool such as Grafana connected to a MySQL database to a complex Python script that builds an entire secondary region architecture and tweets when complete.
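
To make the progression from manual to automated triggering more concrete, here is a minimal sketch of a dispatcher that maps an incoming alert to a runbook. Everything in it is hypothetical: the registry, the alert payload shape (a dict with a `name` key), and the returned strings are assumptions for illustration, not the interface of any particular monitoring product.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: alert name -> runbook ID and optional automated handler.
RUNBOOKS: Dict[str, dict] = {}


def register(alert_name: str, runbook_id: str,
             handler: Optional[Callable[[dict], None]] = None) -> None:
    """Associate an alert with a runbook; handler=None means 'still manual'."""
    RUNBOOKS[alert_name] = {"runbook_id": runbook_id, "handler": handler}


def handle_alert(alert: dict) -> str:
    """Decide what happens when a monitoring system raises an alert."""
    entry = RUNBOOKS.get(alert["name"])
    if entry is None:
        # No runbook exists yet: a human has to triage.
        return "unmapped: escalate to on-call"
    if entry["handler"] is None:
        # Runbook exists but is executed by hand.
        return f"manual: {entry['runbook_id']}"
    # Runbook has been automated: run it and report which one ran.
    entry["handler"](alert)
    return f"automated: {entry['runbook_id']}"
```

A new runbook would typically be registered without a handler at first, and a handler added only once the manual process has been exercised a few times.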

 

A Runbook Example

As an example, we’ll use the case of an employee whose contract has been terminated for misconduct. The company has outlined the steps IT should take once they receive the email notifying them of the termination. This set of steps is essentially a runbook. The job of the IT team is to document this process and provide instructions clear enough to produce a repeatable and reliable result.

| Runbook Component | Value |
| --- | --- |
| Task ID | ACME-INC-108 |
| Task Name | Employee Offboarding: John Doe |
| Task Description | The employee has been dismissed for misconduct. Active credentials need to be revoked, users need to be offboarded from all internal systems, and recent activity needs to be reviewed. |
| Task Details | For full details, see the instructions below: disable the user’s account from the Acme management portal; remove from GitHub; revoke AWS keys; download activity logs from the Acme management portal; download activity logs from AWS; store activity logs; audit activity log. |
| Team executing this task | DevSecOps |
| Task Owner | joe.bloggs@acme.com |
| Time to complete this task | Estimated time: 40 - 60 minutes. Started: 01/11/22 14:20:00. Completed: TBD |
| Status | IN_PROGRESS |

  1. Disable the user account for the internal system: The former employee has access to the internal sales and marketing system. Their credentials should be expired and/or their account deleted so they can no longer access confidential information.
  2. Disable their GitHub account: The former employee is part of the company's GitHub organization. They should be removed from the organization as soon as possible so they can no longer access intellectual property like source code.
  3. Disable their Amazon Web Services (AWS) keys: The former employee has access to the AWS system as they require database access from time to time. Their AWS keys should be revoked so that they can no longer access the AWS infrastructure.
  4. Download their usage logs from both AWS and the internal system: The company would like insight into what actions the former employee took in their final days to check that no malicious actions were carried out. This includes AWS CloudTrail logs to gain insight into activity on AWS and the activity logs from the internal system to gain insight into what data the former employee accessed or modified before leaving.
  5. Store their usage logs in S3: The data gathered in the previous step should be stored in S3 so it can be easily reviewed. Findings from the review can be validated later.
  6. Investigate/audit usage logs: Once the data has been stored in S3, it should be reviewed for malicious or suspicious activity. This could include accessing or modifying resources not usually associated with the employee’s role, or even unusual log-in times, which could indicate suspicious activity.

Given these requirements, the IT team is charged with running through the process in detail and documenting the actions required to accomplish each objective. Their deliverables are a well-documented, minimal set of easily reproducible steps to be added to the Task Details section of the runbook.
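
As a rough illustration of what the audit in step 6 looks for, the sketch below flags log entries that fall outside working hours or touch resources not associated with the employee's role. The event format (a dict with `time`, `resource`, and `action` keys), the working-hours window, and the resource names are all assumptions made for this example; real CloudTrail or internal logs would need to be mapped into this shape first.

```python
from datetime import datetime
from typing import Dict, List, Set


def flag_suspicious(events: List[Dict[str, str]],
                    expected_resources: Set[str],
                    work_start: int = 8,
                    work_end: int = 18) -> List[dict]:
    """Return the events worth a closer look, each with the reasons it was flagged."""
    findings = []
    for event in events:
        reasons = []
        # Unusual log-in or activity times.
        hour = datetime.fromisoformat(event["time"]).hour
        if not (work_start <= hour < work_end):
            reasons.append("outside working hours")
        # Resources not usually associated with the employee's role.
        if event["resource"] not in expected_resources:
            reasons.append("unexpected resource")
        if reasons:
            findings.append({**event, "reasons": reasons})
    return findings
```

A check like this only narrows down what a reviewer should read first; it does not replace the review itself.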

 

Automating the Runbook

This simple example of a runbook requirement may seem trivial, but even a small mistake in executing the actions could lead to disastrous results. Humans make mistakes, and that inevitability should be taken into account when creating runbook steps. Screenshots or diagrams to accompany complex instructions can help, but automating the task is ultimately the most reliable way to ensure a predictable result.

Let's break it down step-by-step, using the example above, and see how to automate the process using a script or set of scripts. The task details would then become simpler: they point the reader to the script(s) to run, explain how to run them, and advise how to validate their success.

  1. Disable their user account for the internal system: Most modern web applications expose a REST API that can be invoked programmatically via simple scripts. These scripts can trigger most of the actions available through the front-end user interface, and potentially more. The start of our automated solution would involve a call to an API to disable the user’s account.
  2. Disable their GitHub account: GitHub is an example of a web application with such an API. Similar to step one, our script can make a call to the GitHub API to remove the user from the organization.
  3. Disable their AWS keys: Automated solutions are a large part of the AWS ecosystem. To empower its users, AWS provides API access through software development kits (SDKs) in multiple languages, as well as a command line interface (CLI) that can perform almost any action offered by the various AWS services. We can use an SDK or the CLI in our script to revoke the user’s keys programmatically.
  4. Download their usage logs from AWS and the internal system: This step comprises two more API calls. First, we can invoke AWS CloudTrail to download the AWS user logs. Then we can invoke our internal system's API to download any relevant user activity.
  5. Store their usage logs in S3: Again, a simple API call. The AWS SDK and AWS CLI allow you to copy files to and from Amazon Simple Storage Service (S3).
  6. Investigate/audit usage logs: This can be done in different ways, but a simple script that searches the logs for certain words or patterns can quickly detect unusual activity. As the runbook evolves, this script may also evolve and link into custom machine learning services that can learn and detect suspicious patterns.

Figure: A script that can automate our example runbook.
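
A minimal sketch of how steps 2 through 5 might look in Python follows. It assumes the AWS clients are boto3 clients (`boto3.client("iam")`, `boto3.client("cloudtrail")`, `boto3.client("s3")`) created elsewhere and passed in, and that a GitHub token with organization admin rights is available. The function names, bucket, and key are hypothetical. The call to the internal Acme system is left out because its API is not described here.

```python
import json
import urllib.request


def github_removal_request(org, username, token):
    """Build (but do not send) the GitHub REST call that removes an org member."""
    return urllib.request.Request(
        url=f"https://api.github.com/orgs/{org}/members/{username}",
        method="DELETE",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def deactivate_access_keys(iam, user_name):
    """Mark every access key of an IAM user as Inactive; return the key IDs."""
    key_ids = []
    for key in iam.list_access_keys(UserName=user_name)["AccessKeyMetadata"]:
        iam.update_access_key(
            UserName=user_name, AccessKeyId=key["AccessKeyId"], Status="Inactive"
        )
        key_ids.append(key["AccessKeyId"])
    return key_ids


def fetch_cloudtrail_events(cloudtrail, user_name, start, end):
    """Collect CloudTrail events for one user, following NextToken pagination."""
    events, token = [], None
    while True:
        kwargs = {
            "LookupAttributes": [
                {"AttributeKey": "Username", "AttributeValue": user_name}
            ],
            "StartTime": start,
            "EndTime": end,
        }
        if token:
            kwargs["NextToken"] = token
        page = cloudtrail.lookup_events(**kwargs)
        events.extend(page["Events"])
        token = page.get("NextToken")
        if not token:
            return events


def store_logs(s3, bucket, key, events):
    """Write the collected events to S3 as JSON; return the object's S3 URI."""
    # default=str serializes datetime values such as CloudTrail's EventTime.
    body = json.dumps(events, default=str).encode("utf-8")
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return f"s3://{bucket}/{key}"
```

Sending the GitHub request is then a matter of passing it to `urllib.request.urlopen`. Passing the clients in as arguments, rather than creating them inside each function, keeps the steps easy to test with stand-in objects before they are pointed at a real account.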

 

Recommendations for Designing a Runbook Template

A runbook is seldom perfect and may take time to reach an ideal state. In fact, it may never reach a final state and simply continue to adapt and evolve. Below are some runbook template recommendations that can help you get the most out of your runbooks.

Don’t Try to Automate Everything on Day One

Attempting to script every step from the outset can lead to confusion and mistakes. It’s important to perform the task manually at least once to fully understand and explain the process being automated.

Document Clearly

Use screenshots and diagrams to illustrate each step so that a reader can follow the process, confident that everything is executing as expected.

Remember to Validate

Once the runbook has been followed, you should validate that the system is in the desired state. In some cases, this can be a single check. In other instances, validation may be necessary on a step-by-step basis. Validation steps should be included with the runbook steps.
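
Step-by-step validation can be sketched as a small runner that pairs each action with a check and stops at the first check that fails. The `Step` tuple and `run_steps` function below are hypothetical names for illustration; the actions and checks themselves would be whatever the runbook's Task Details call for.

```python
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    name: str
    action: Callable[[], None]    # performs the step
    validate: Callable[[], bool]  # returns True if the system is in the desired state


def run_steps(steps: List[Step]) -> List[str]:
    """Run steps in order, validating after each one; return the names completed."""
    completed = []
    for step in steps:
        step.action()
        if not step.validate():
            # Stop immediately so later steps never run against a bad state.
            raise RuntimeError(f"Validation failed after step: {step.name}")
        completed.append(step.name)
    return completed
```

Stopping on the first failed check keeps the failure close to its cause, which is usually easier to troubleshoot than a single check at the very end.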

Know How Much Automation Is Too Much

Think about the consequences of automating the runbook. Sometimes manual intervention might be necessary to trigger the automation process.

For example, a temporary network blip may trigger a response to spin up a production infrastructure in a secondary region and switch all production traffic to this region. In a costly and time-consuming case like this, first determine whether the brief issue is a genuine cause for concern or whether the customer is satisfied with the recovery time.
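
One way to keep a brief blip from triggering an expensive failover is to require a sustained failure plus an explicit human approval before acting. The sketch below assumes health checks arrive as a list of booleans (True meaning healthy, most recent last); the threshold of three consecutive failures and the returned strings are arbitrary choices for illustration.

```python
from typing import List


def decide(health_checks: List[bool],
           required_failures: int = 3,
           approved: bool = False) -> str:
    """Decide whether to fail over, based on recent checks and manual approval."""
    recent = health_checks[-required_failures:]
    # Sustained outage: enough samples exist and every one of them failed.
    sustained = len(recent) == required_failures and not any(recent)
    if not sustained:
        return "wait"
    return "fail over" if approved else "ask for approval"
```

With these defaults, a single failed check between two healthy ones returns "wait", while three consecutive failures return "ask for approval" until a person signs off.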

 

Conclusion

Runbooks are invaluable to a growing enterprise. As a solution grows, it’s inevitable that things will go wrong. Using a quality runbook template can bring order to the chaos of solution engineering. By following a familiar structure, the runbook reader can put aside the stress of reinventing the wheel, overengineering a solution, or preparing a business-ready document. With a runbook in place, all they need to do is follow the steps.

A structured format opens up the possibility of process automation. Recommended steps are easier to automate, either via third-party solutions or custom scripts, when they are more refined and reliable. Some companies have begun to realize the importance of establishing this structure quickly and have built runbook solutions to provide out-of-the-box runbook templates that can save months of trial and error.
