Latest In SRE

10 Signs Your Organization Needs an Incident Management Tool
October 11, 2024
Vishal Padghan
In a world where digital infrastructure forms the backbone of operations, incidents (disruptions to service, system downtime, security breaches, or technical failures) are inevitable. For any organization that depends on technology, the…
The Guide to SRE Principles
March 23, 2023
Squadcast Community
Site reliability engineering (SRE) is a discipline in which automated software systems are built to manage the development operations (DevOps) of a product or service. In other words, SRE automates…
The Evolution of Incident Management from On-Call to SRE
March 7, 2023
Vardhan NS
Importance of Reliability: While the number of active internet users and people consuming digital products has been on the rise for a while, it is actually the combination of increased…
The Critical Role of Observability in SRE
December 3, 2021
Ricardo Castro
Observability is the practice of assessing a system’s internal state by observing its external outputs. Through instrumentation, systems can provide telemetry such as metrics, traces, and logs that help organizations…
How to improve your influence as an SRE
November 10, 2021
Ricardo Castro
Balancing fast-paced business requirements with the demands of keeping production services stable is not an easy task. SRE is an opinionated implementation of DevOps and is defined by Ben Sloss,…
Going from Zero to SRE
September 14, 2021
Ricardo Castro
Traditionally, developing applications and running them in production were seen as completely separate worlds, usually the focus and concern of different teams. This kind of separation gives birth to…
Demystifying DevOps and SRE
August 4, 2021
James Samuel
Two terms that people often confuse are SRE and DevOps. People often ask, should I hire a DevOps Engineer or a Site Reliability Engineer? What is the…
Reduce Toil with Better Alerting Systems
April 7, 2021
Biju Chacko
Are you an SRE or on-call engineer struggling to manage toil? Toil is any repetitive or monotonous activity that can lead to frustration within an incident management team. Also at…
Overview of Incident Lifecycle in SRE
February 17, 2021
Biju Chacko
Service disruptions are inevitable, but each incident offers a chance to learn and improve. This blog delves into best practices for managing incidents throughout their lifecycle, aiding teams in building…
Error Budgets and their Dependencies
February 3, 2021
Adam Hammond
In our last few articles, we’ve discussed SLOs and how picking them correctly can make or break your application’s performance. Today we’re going to cover error budgets, which are…