Leading Incident Management Best Practices
Introduction
Key incident management best practices
Best practice | Description |
|---|---|
Have the right people assigned to the incident | An established incident management team of top talent from various departments is essential. |
Establish proper communication channels | Have a communication plan for internal and external communications and decrease noise by only including the necessary people. |
Disseminate information on incident status to a predefined list of people | Use tools to detect and report incidents. |
Define what is considered an incident for your organization | Agree on what counts as an incident for your organization. Typically, a server outage is an incident while high CPU usage is not, and data loss is an incident while delayed backups are not. Any security breach may also qualify. |
Identify an incident manager | Assigning someone to coordinate and facilitate communications during an incident is vital for the task force’s effectiveness. |
Build a solid knowledge base and extend it as required with each incident | Easily navigable, up-to-date knowledge bases are important. |
Know your SLOs and keep an eye on your SLAs | Ensure alignment between the expectations set by SLOs and the commitments made in the SLAs. |
Automate everything possible, and have runbooks where human intervention is needed | Automate time-consuming tasks to save valuable time for your incident team. As you gain experience, promptly identify and implement more automation opportunities. |
Note actions taken and conclusions made as they happen | Collecting information when the incident occurs and creating documentation as the lifecycle progresses simplifies the post-mortem. |
Keep actions blameless | A blameless culture reduces anxiety in teams and individuals, improves collaboration, and retains talent. In a transparent environment, people tend to be more accountable. |
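
The SLO and SLA row in the table above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming availability targets expressed as percentages over a 30-day window; the function names and the example figures (a 99.9% SLA and a 99.95% SLO) are illustrative assumptions, not values from this article.

```python
# Hypothetical helpers: names, window length, and targets are assumptions.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def allowed_downtime_minutes(availability_pct: float,
                             window_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Downtime permitted in the window for a given availability target."""
    return window_minutes * (1 - availability_pct / 100)


def slo_protects_sla(slo_pct: float, sla_pct: float) -> bool:
    """An internal SLO is aligned when it is at least as strict as the SLA."""
    return slo_pct >= sla_pct


def safety_margin_minutes(slo_pct: float, sla_pct: float) -> float:
    """Extra downtime between breaching the SLO and breaching the SLA."""
    return allowed_downtime_minutes(sla_pct) - allowed_downtime_minutes(slo_pct)


if __name__ == "__main__":
    sla, slo = 99.9, 99.95  # assumed example targets
    print(round(allowed_downtime_minutes(sla), 1))    # about 43.2 minutes
    print(round(allowed_downtime_minutes(slo), 1))    # about 21.6 minutes
    print(slo_protects_sla(slo, sla))                 # True
    print(round(safety_margin_minutes(slo, sla), 1))  # about 21.6 minutes
```

Under these assumed targets, the team would notice an SLO breach with roughly 21.6 minutes of downtime to spare before the SLA commitment is at risk. An SLO looser than the SLA would leave no such buffer.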
The lifecycle of an incident
Top ten incident management best practices in detail
Incident | Problem |
|---|---|
Server outage | Unusually high CPU usage |
Data loss and restoration failure | Delayed backups |
Security breach | Expiration of credentials in a non-production environment |
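
The incident versus problem distinction in the table above could be encoded so that alerting tools apply it consistently. This is a rough sketch only: the event-type keys, the `Category` enum, and the `classify` function are hypothetical names chosen for illustration, and each organization would substitute its own definitions.

```python
from enum import Enum
from typing import Optional


class Category(Enum):
    INCIDENT = "incident"
    PROBLEM = "problem"


# Hypothetical event-type keys mirroring the rows of the table above.
CLASSIFICATION = {
    "server_outage": Category.INCIDENT,
    "data_loss_and_restoration_failure": Category.INCIDENT,
    "security_breach": Category.INCIDENT,
    "unusually_high_cpu_usage": Category.PROBLEM,
    "delayed_backups": Category.PROBLEM,
    "nonprod_credential_expiration": Category.PROBLEM,
}


def classify(event_type: str) -> Optional[Category]:
    """Return the agreed category, or None when the event type is not yet defined."""
    return CLASSIFICATION.get(event_type)


def should_page_incident_team(event_type: str) -> bool:
    """Only events classified as incidents trigger the incident process."""
    return classify(event_type) is Category.INCIDENT
```

For example, `should_page_incident_team("server_outage")` returns `True`, while `should_page_incident_team("delayed_backups")` returns `False`. Returning `None` for unknown event types keeps undefined cases visible so they can be reviewed and added to the definition.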