Mastering ITSM Incident Management: A Comprehensive Guide

On Friday, 19 July 2024, more than 8 million Windows devices worldwide displayed the blue screen of death (BSOD) after a problematic CrowdStrike configuration update. The outages disrupted flights, banking, media, hospital operations, and many other critical services. This high-profile example shows how far the impact of a single IT incident can reach, and why robust incident management matters.

The incident management practice is a fundamental element of IT service management. It aims to minimize the negative impact of incidents (unplanned interruptions to a service or reductions in its quality) by restoring normal service as quickly as possible. According to ITIL 4 service management best practice, quick restoration is a key factor in user and customer satisfaction, the credibility of the service provider, and the value an organization creates in its service relationships.

This article will walk through the key stages of the incident management practice and highlight some best practices for implementing an effective process.

Summary of key ITSM incident management stages

The table below summarizes six key ITSM incident management stages this article will explore in detail.

	Stage		Description
	Incident registration		Incidents can be raised manually by users or service desk agents, or automatically via monitoring software.
	Incident categorization		Proper categorization and prioritization enable effective routing and assignment.
	Incident response		Timely and concise notifications keep stakeholders informed of incident status.
	Incident investigation		Incident symptoms and sources are analyzed to determine causes and inform resolution efforts.
	Incident resolution		Focus on speedy restoration through the use of workarounds and swarming techniques.
	Incident review		Regularly reviewing incident volume, SLA adherence, and SOP effectiveness improves performance.

Six stages of ITSM incident management

ITSM incident management practices help teams structure and streamline their incident response processes. The sections below explore each stage — from registration to review — with information on what teams should do at each stage.

Incident registration

IT service failure or degradation can quickly impact user productivity and create business risks such as reputational damage or lost revenue. Early incident detection can significantly reduce the negative impacts created by a service failure. Teams typically detect incidents with these two approaches:

System monitoring: An event is detected by an IT monitoring tool and identified as an incident based on a predefined classification.
User reporting: The user detects an IT service malfunction and reports it to the service provider through agreed channels such as telephone, email, or logging a ticket on a service management portal.

Apart from these standard approaches, organizations use techniques such as tracking negative social media mentions and analyzing abnormal patterns of service interactions (e.g., failed checkouts in an e-commerce app) to detect service incidents.
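
As a rough illustration of the pattern-analysis approach, the sketch below flags a possible incident when the share of failed checkouts in a recent window exceeds a threshold. The function name, the 20% threshold, and the minimum sample size are illustrative assumptions, not values taken from any particular monitoring product.

```python
from typing import Sequence


def checkout_failure_alert(
    outcomes: Sequence[bool],
    threshold: float = 0.20,  # assumed: alert when over 20% of checkouts fail
    min_samples: int = 10,    # assumed: ignore windows with too little traffic
) -> bool:
    """Return True when the failure rate in the window suggests an incident.

    `outcomes` holds one entry per checkout attempt: True for a successful
    checkout, False for a failed one.
    """
    if len(outcomes) < min_samples:
        return False
    failures = sum(1 for ok in outcomes if not ok)
    return failures / len(outcomes) > threshold
```

In practice, a result of True would raise an event for the service desk to triage rather than open an incident unconditionally, since a spike in failures can also have benign causes.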

Accurate, fast incident detection is an indicator of service excellence. A service provider that informs customers about a service issue before users have to report it is generally seen as effective.

However, user-reported incidents are still common. When a user reports an issue, the first point of contact (e.g., a service desk agent or customer service representative) should triage it to confirm that it is an actual incident. Queries, planned maintenance, and gaps in user knowledge can all be misconstrued as incidents.

Once an incident has been detected, it must be registered so that a permanent record tracks how it is handled throughout the process and serves as a future reference. Registration adds information to the detection report in one of two ways: the service provider's agent manually populates the incident record with data received from users or monitoring systems, or the monitoring tool automatically populates the record with predefined technical data. With automatic registration, a notification is usually sent to technical specialists.

An incident record contains an identifier for referencing the record and associating it with other ITSM records such as configuration items (CIs), problems, and change requests. Other pertinent information to capture during registration includes the date of occurrence, the person reporting the incident, a description of the issue and its effects, and any associated screenshots or error logs (example in the image below). High-quality information captured at registration can significantly speed up incident investigation and resolution.

An incident log in SolarWinds Service Desk. (Source: SolarWinds)
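
The fields described above can be sketched as a simple data structure. This is a minimal illustration only: the class name, field names, and sample values are hypothetical and do not reflect the schema of SolarWinds Service Desk or any other tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical incident record; all field names are illustrative."""
    incident_id: str        # identifier used to reference the record
    occurred_at: datetime   # date and time of occurrence
    reported_by: str        # person or monitoring tool that reported it
    description: str        # the issue and its effects
    attachments: List[str] = field(default_factory=list)      # screenshots, error logs
    related_records: List[str] = field(default_factory=list)  # CIs, problems, changes


# Example record with made-up values.
record = IncidentRecord(
    incident_id="INC-0001",
    occurred_at=datetime(2024, 7, 19, 9, 30),
    reported_by="jane.doe",
    description="Laptop shows a blue screen on startup; user cannot log in.",
    attachments=["bsod.png"],
    related_records=["CI-LAPTOP-042"],
)
```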

Incident categorization

Once an incident is registered, the service provider’s agent — the first level of support — performs an initial categorization or classification. This involves quantifying the incident’s impact and determining the urgency to resolve it. In an automated setup, the categorization may be performed based on pre-configured rules.

While IT may want to treat all issues with the utmost urgency and attention, that is not practical given resource constraints and business priorities. Therefore, categorization informs the level of responsiveness based on predetermined qualitative scales for both urgency and impact, which are generally mapped to priority as shown in the examples below from the Unified Service Management (USM) Architecture guidelines:

Impact, Urgency, and Priority tables. (Source: USM Architecture)
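
To make the mapping concrete, the sketch below derives a priority from impact and urgency using a lookup table. The three-level scales and the P1 to P5 labels are hypothetical and are not the values from the USM tables; each organization defines its own matrix.

```python
# Hypothetical mapping: (impact, urgency) -> priority, where P1 is the highest.
PRIORITY_MATRIX = {
    ("high", "high"): "P1",
    ("high", "medium"): "P2",
    ("high", "low"): "P3",
    ("medium", "high"): "P2",
    ("medium", "medium"): "P3",
    ("medium", "low"): "P4",
    ("low", "high"): "P3",
    ("low", "medium"): "P4",
    ("low", "low"): "P5",
}


def priority_for(impact: str, urgency: str) -> str:
    """Look up the priority for a given impact and urgency level."""
    key = (impact.lower(), urgency.lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"Unknown impact/urgency combination: {key}")
    return PRIORITY_MATRIX[key]
```

A table-driven lookup like this is also how pre-configured categorization rules are commonly expressed, since the matrix can be changed without touching the routing logic.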

Categorization also helps determine whether the incident has happened before, which team should be assigned to resolve it, and whether the resolution activities need to be coordinated by a separate team such as a major incident team or a computer emergency response team. The timelines for responding to and resolving a particular incident category may be informed by contractual obligations, service level agreements, or internal service specifications drawn up by the service provider. The categorization and the resulting resolution targets may change as new information is discovered or shared with the assigned incident resolution team.

Incident response

According to the VeriSM service management framework, transparency is crucial in handling incidents to build trust and confidence with users and key stakeholders. Once the incident is categorized and the right response team is assigned, teams can send notifications to stakeholders on the status of the incident. The most essential elements to communicate during the entirety of the incident lifecycle are:

Timeframe: expected resolution time
Status: the outcomes of the resolution activities and next steps

Communicating timely resolution updates to end users through practical channels, such as email, collaboration tools, or social media, is crucial. An organization should therefore set up prompt alerts for ticket updates, responses, and status changes. Automating these notifications helps the IT support team avoid gaps in stakeholder communication, which builds trust, understanding, and patience and aids overall coordination.
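
As a small illustration, an automated update could be generated from the two elements listed above, timeframe and status. The function and its message format are hypothetical; a real service desk tool would supply its own templates and delivery channels.

```python
from datetime import datetime


def build_status_update(
    incident_id: str,
    status: str,
    next_steps: str,
    expected_resolution: datetime,
) -> str:
    """Compose a short stakeholder update covering status and timeframe."""
    eta = expected_resolution.strftime("%Y-%m-%d %H:%M")
    return (
        f"[{incident_id}] Status: {status}. "
        f"Next steps: {next_steps}. "
        f"Expected resolution: {eta}."
    )
```

Calling this function on every ticket status change, and sending the result through the agreed channel, is one way to keep updates consistent without relying on manual messages.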

If there is a major incident, teams should have a defined role to manage communication between the team resolving the incident and business stakeholders. Communication with external stakeholders such as regulators, media, and shareholders should be well-structured and governed by corporate communication guidelines. For example, CrowdStrike included an FAQ and statements from the CEO as part of their communication.

Hierarchical escalation, where upper levels of management are informed whenever resolution timelines are breached or results are not meeting expectations, is a key component of the incident response stage. In a hierarchical escalation, senior leadership becomes crucial to decision-making, including notifying external stakeholders, unlocking resources, negotiating with vendor leadership, and triggering significant remedial actions such as failover to disaster recovery sites. An escalation matrix is a useful reference here because it spells out the timelines and associated management levels for escalation. For example, a 30-minute outage of a critical system might require the chief technology officer's intervention, while the CEO may intervene at the 1-hour mark.
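
The escalation example above can be expressed as a simple lookup. This is a sketch only: the two thresholds come from the example in the text, while the data layout and function name are assumptions, and a real matrix would typically hold separate thresholds per priority level.

```python
from typing import List, Tuple

# Escalation thresholds (minutes of outage, role to inform) for a critical
# system, using the figures from the example in the text.
ESCALATION_MATRIX: List[Tuple[int, str]] = [
    (30, "Chief technology officer"),
    (60, "Chief executive officer"),
]


def escalation_levels(outage_minutes: int) -> List[str]:
    """Return every management level that should have been informed by now."""
    return [
        role
        for threshold, role in ESCALATION_MATRIX
        if outage_minutes >= threshold
    ]
```

For a 45-minute outage this returns only the chief technology officer; once the outage reaches 60 minutes, both roles are returned.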

Incident investigation

Once the incident has been assigned, the relevant team collects data on its effects and source by examining the symptoms affecting IT services and their associated components. Team members may talk to the original reporter or reach out to affected users to clarify what errors they encounter when interacting with the IT systems. They can also dig into monitoring tool alerts to pinpoint error codes, analyze system logs to determine dependencies and underlying causes, and liaise with vendors, integrators, and other parties to gain insight into what caused the incident. AI and ML tools come in handy at this stage: advanced analytics can automatically correlate alerts and historical data to help identify the likely cause of the incident, predict its progression, and even trigger resolution actions. The findings should be documented in the incident record as a reference for anyone involved in the investigation.

The information gained from the investigation helps decide the best course of action to address the incident. Depending on the complexity of the issue, this may include strategies such as:

Deployment of predefined resolution procedures including automated solutions such as reboots, reconfiguration, or failover.
Functional escalation to specialist teams or third parties such as vendors for a more expert diagnosis.
Invoking IT disaster recovery plans in the event of a disaster.

Teams should document the selected strategy in the incident record and communicate it to stakeholders. For complex incidents, a technique like swarming, where multiple people or teams work together on an issue until it becomes clear how to address it, may be useful. Incident investigation may also involve carrying out a series of safe-to-fail experiments that aim to improve the understanding of the nature of the incident.

Incident resolution

The outputs of the incident investigation phase determine the resolution actions, and the two phases can follow each other almost immediately. Incident resolution primarily involves managing the source events until their effects decline to an acceptable level.

The fastest way to do this is to deploy a workaround, which ITIL 4 defines as a solution that reduces or eliminates the impact of an incident for which a full resolution is not yet available. Workarounds are especially useful for repeat incidents whose cause is well understood and for which a containment approach is readily available, particularly where automation is involved. The organization may have documented incident models, including standard operating procedures for resolving known IT incident types. Knowledge articles such as self-help guides can enable users to address minor incidents affecting their devices or accounts.

But this may not work for a novel incident, and that’s where experimentation comes in. The swarming team may try different techniques to contain the incident’s effects. For example, teams may reroute traffic, reinstall software components, or switch off some functionality.

Because incidents need to be resolved as quickly as possible, the assigned teams should be careful not to get drawn into an in-depth investigation of the root cause. Root cause analysis should be left to the problem management process, which kicks in after the resolution and review phases. Workarounds promptly restore service to an acceptable quality and should be the primary approach to resolution. However, in some complex cases the only options for containing the incident are to address the root cause or fail over to disaster recovery sites.

In the case of information security incidents, evidence from the incident source should be preserved during this phase through a secure chain-of-custody process. The resolution actions and their outcomes should be documented concisely within the incident record so that they can serve as a reference should the incident recur. The time taken to resolve the incident should also be captured as part of service-level management processes.

Incident review

Once an incident has been resolved, key stakeholders carry out a review, scaled to the incident's magnitude and complexity, to assess whether the approach taken was the best one, agree on next steps, and identify opportunities for improvement. The lessons learned from handling an incident are valuable sources of knowledge that can prevent future occurrences, and the details should be included in the ITSM knowledge base. Some incidents (e.g., critical incidents) will require an individual review upon resolution, while others may be reviewed in a consolidated manner during a scheduled forum. Depending on organizational policy, a major incident report may be needed to capture what went wrong, the handling process, and the next steps.

Some service delivery teams also conduct a satisfaction survey to determine whether stakeholders were happy with how the incident was handled, especially with regard to communication.

The incident review phase can also trigger the problem management process, in which the incident's root cause is investigated, especially if it was not obvious during the investigation and resolution activities. In the case of the CrowdStrike incident, the root cause analysis (RCA) was published over two weeks after the incident. Teams should also report key metrics related to the incident management process during periodic reviews. Examples of incident metrics include:

Number of incidents
Detection success
Resolution time within SLA
Reassignment rate

Reviews should also result in improvement initiatives to update incident models, upskill staff, and invest in technology solutions to automate detection and resolution actions. This phase can also involve the review of service-level targets based on IT’s capability to detect and resolve incidents.

Last Thoughts

Incident management is an essential service management practice at the heart of customer experience. All technology systems are prone to disruption, and when disruptions happen, the best service providers are those that consistently deploy a structured incident management process and invest in automated systems for detection and self-healing.

The two main factors that enable effective incident management are early detection and quick restoration. By linking the metrics associated with these factors to strategic objectives, service providers can validate that their approach delivers value to the organization and its stakeholders.

Do your ITSM practices support digital transformation, fuel business growth, and mitigate risks? And crucially, do the practices help achieve the goals at an acceptable total cost of ownership?

At SolarWinds, we have designed the ITSM Maturity framework as a free, consultative tool to help you figure out your current state, what you intend to achieve, and what tools you need to get there.

To assess your ITSM maturity, try our free interactive ITSM maturity model.
