Is Your ITSM Problem Management Up to Par? 5 Best Practices to Implement

Recurring problems, such as repeated login failures and slow connectivity, are operational challenges that plague IT. Both customers and senior management come down hard on IT for repeat outages and incidents because they imply a lack of care, competence, and diligence in IT systems management.

Repeated incidents often stem from IT solving the immediate problem rather than the root cause. For example, following a system failure, IT may reboot a component or change a configuration without determining why the initial failure occurred. When the service disruption recurs, customers begin to question service reliability and their dissatisfaction grows. This negative sentiment can ultimately result in churn and a damaged reputation.

The ITSM problem management practice addresses this issue and helps teams minimize service disruption by proactively identifying and analyzing incident causes, implementing preventative measures, reducing incident impact, and resolving problems.

The ITIL 4 practice guides define a problem as a cause or potential cause of one or more incidents. There are three main phases involved in tackling problems:

  1. Identify problems by investigating incidents and system data
  2. Control the problem with analysis and workarounds
  3. Control the error through permanent corrective action or containment

This article explores five proven ITSM problem management best practices for effective management of IT services across on-premises, cloud, and vendor-managed deployment models.

Summary of ITSM Problem Management best practices

The table below summarizes the five proven ITSM problem management best practices:

| Best practice | Description |
| --- | --- |
| Apply both reactive and proactive problem-identification techniques | Comprehensively investigate problem sources from both proactive and reactive perspectives. |
| Record and track problems and associated elements in a centralized system | Deploy a centralized system to log, prioritize, and relate problem records with incidents, changes, and configuration items. |
| Train staff on investigative and causal analysis techniques | Build internal competence through thorough root cause analysis and innovative solution development. |
| Deploy and monitor workarounds and permanent solutions | Ensure short- and long-term actions to reduce problem impact are implemented and monitored. |
| Analyze and review permanent solutions to problems | Conduct a formal analysis and regular review of the effectiveness of problem solutions. |

Apply both reactive and proactive problem-identification techniques

When executive leadership asks questions like “Why does our e-commerce website keep crashing?” their interest is not in a quick fix. They are concerned with repeat issues and want the root cause addressed. In other words, they want IT to become proactive and prevent the same issue from recurring.

This is where problem-identification techniques come in. Problem identification primarily uses one of two approaches:

  • Reactive problem identification: This is the default approach to identifying problems as it involves investigating the causes of incidents that have already happened. This starts by understanding the symptoms of the incidents and then drilling down to actual causes by reviewing incident records, user complaints, error logs, and other information sources. This approach aims to prevent recurrence and may contribute to resolving open incidents.
  • Proactive problem identification: This approach aims to identify problems before they can cause incidents. It involves identifying risks to IT system availability and performance so that their probability and/or impact can be minimized should they materialize. Proactive investigations involve the review of associated data such as vendor vulnerability information, pre-deployment testing, and system log analysis.
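As a rough illustration of the reactive approach, the sketch below scans a handful of incident records and flags any combination of configuration item and category that keeps recurring. The record fields (`ci`, `category`), the sample data, and the threshold of three are assumptions made for this example; they do not reflect the schema of any particular ITSM tool.

```python
from collections import Counter

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"id": "INC-101", "ci": "vpn-gateway", "category": "login failure"},
    {"id": "INC-102", "ci": "vpn-gateway", "category": "login failure"},
    {"id": "INC-103", "ci": "web-frontend", "category": "slow response"},
    {"id": "INC-104", "ci": "vpn-gateway", "category": "login failure"},
    {"id": "INC-105", "ci": "mail-server", "category": "delivery delay"},
]


def find_problem_candidates(incidents, threshold=3):
    """Return (ci, category) pairs that appear in at least `threshold` incidents."""
    counts = Counter((i["ci"], i["category"]) for i in incidents)
    return [key for key, n in counts.items() if n >= threshold]


# Each flagged pair is a candidate for a problem record, not a confirmed root cause.
print(find_problem_candidates(incidents))
```

A simple count like this only surfaces candidates. The investigation of symptoms, logs, and user complaints described above is still needed to confirm whether the incidents share a cause.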

As the adage goes, “prevention is better than cure.” Mature organizations promote proactive problem management above the reactive approach since investing in proactive problem identification and resolution yields greater value in service delivery. According to VeriSM, proactive actions require management support and understanding, especially as the results are not as obvious as fixing an IT service outage.

Record and track problems and associated elements in a centralized system

Problem identification leads to registering a problem record in readiness for analysis and prioritization as part of problem control. The information captured during the creation of a problem record should be well articulated, as it serves as the basis for investigation and decision-making. At a bare minimum, the information captured should include the date of registration, a description of the problem event, and categorization based on source, impact, and other dimensions.

Depending on the recording system, the problem is assigned automatically or manually to an owner who coordinates the problem control actions. Ownership may be assigned to a dedicated problem manager role with good knowledge of IT systems configuration and solid analytical skills, or be transferred to a temporary cross-functional team guided by a lead to investigate and address the problem. Without ownership, there is a significant risk of the problem management efforts petering out with no conclusive actions.

Logging and prioritizing problem records in a centralized ITIL problem management system is a good practice. Problem management systems should include workflow management and collaboration capabilities to facilitate record management and integration with other ITSM modules and related information. This ensures that the problem management actions and timelines are tracked as they progress from identification to control and closure. Such a system enables effective problem management through the following capabilities:

  • Categorization, prioritization, assignment, and escalation attributes for each problem record
  • Mapping of problem records to associated workarounds and known errors
  • Association of problem records with related incident and change records and with configuration items from a configuration management database (CMDB) to support identification and control activities
  • Integration with knowledge management systems to facilitate capacity building and collaboration across teams
  • Reporting and dashboard features to report problem management metrics and impacts
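To make the minimum record fields and the linking capabilities above more concrete, here is a minimal sketch of a problem record as a Python data class. The class name, field names, and status values are hypothetical choices for this example and are not taken from SolarWinds Service Desk or any other product.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ProblemRecord:
    """Illustrative problem record; all field names are assumptions."""
    problem_id: str
    registered_on: date
    description: str
    source: str                      # e.g. "incident trend" or "vendor advisory"
    impact: str                      # e.g. "low", "medium", "high"
    owner: Optional[str] = None      # person or team coordinating problem control
    status: str = "open"             # e.g. "open", "known error", "closed"
    related_incidents: List[str] = field(default_factory=list)
    related_changes: List[str] = field(default_factory=list)
    configuration_items: List[str] = field(default_factory=list)

    def is_unowned(self) -> bool:
        """True when nobody has been assigned to drive the problem forward."""
        return self.owner is None


record = ProblemRecord(
    problem_id="PRB-001",
    registered_on=date(2024, 1, 15),
    description="Repeated login failures through the VPN gateway",
    source="incident trend",
    impact="high",
)
record.related_incidents.extend(["INC-101", "INC-102", "INC-104"])
record.configuration_items.append("vpn-gateway")
print(record.is_unowned())  # the record has no owner yet
```

Keeping the links to incidents, changes, and configuration items on the record itself is what lets a report or dashboard trace a problem from identification through to closure, and an `is_unowned` check is one simple way to catch records at risk of stalling.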

 

Problem-management-module.png

SolarWinds Service Desk Problem Management module

Modern solutions incorporate machine learning capabilities to suggest solutions to tickets, perform incident correlation, conduct sentiment analysis, and auto-categorize tickets. Shifting this burden from people to automation means the analysis is not affected by fatigue and can be applied more consistently, which is a significant advantage for problem control.

Train staff on investigative and causal analysis techniques

“Why” is at the heart of effective ITSM problem management. To answer “why?” effectively, organizations should equip the people involved in problem management with the technical and analytical skills to diagnose root causes and develop substantive remedies for repeat failures.

A framework such as Skills for the Information Age (SFIA) can be a helpful reference to determine the right competencies and skills for particular organizational levels. Like many evolving frameworks, SFIA undergoes periodic updates to its competencies and skill definitions.

The current version (9) defines the levels and capabilities in the table below.

| Level | Capabilities |
| --- | --- |
| Level 2 | Assists with problem management tasks under routine supervision. Helps document problems and maintain relevant records. Assists in detecting, logging, classifying, and prioritizing problems in systems, processes, and services. |
| Level 3 | Investigates problems in systems, processes, and services. Contributes to the implementation of agreed remedies and preventative measures. |
| Level 4 | Initiates and monitors actions to investigate and resolve problems in systems, processes, and services. Determines problem fixes and remedies. Collaborates with others to implement agreed remedies and preventative measures. Supports analysis of patterns and trends to improve problem management processes. |
| Level 5 | Ensures appropriate action is taken to anticipate, investigate, and resolve problems in systems and services. Ensures problems are fully documented within the relevant reporting systems. Enables development of problem solutions. Coordinates the implementation of agreed remedies and preventative measures. Analyses patterns and trends and improves problem management processes. |

There are many techniques involved in problem investigation, analysis, and resolution. Examples of popular and proven problem-management techniques include:

  • 5-Whys: This approach helps get to the underlying root cause of a problem. It starts by describing the event and asking, “Why did this happen?” The answer is followed by another round of “Why did this happen?” Usually, by the fifth iteration, an actual root cause is found.
  • Kepner and Tregoe: This approach consists of four main “processes” that structure thinking about problems. Kepner and Tregoe emphasizes a fact-based rational approach that reduces the risk of bad assumptions. The processes are Situation Appraisal, Problem Analysis, Decision Analysis, and Potential Problem Analysis.
  • Ishikawa Diagrams: Also known as fishbone diagrams or cause-and-effect diagrams, this approach is the output of brainstorming sessions that document causes and effects that can be useful for continuous improvement and root cause analysis. The trunk of the diagram visually represents the main goal, primary factors are represented as branches, and secondary factors are added as stems.
  • Pareto Charts: This approach, based on the 80/20 principle, involves the analysis and visualization of data to facilitate the assessment and prioritization of competing problems and focus efforts on the issues that matter most.
  • Fault Tree Analysis: This technique is a systematic graphical method used to analyze the potential causes of system failures. Fault Tree Analysis is a top-down approach that starts with an undesired event and then breaks it down into its contributing factors through a visual representation that looks like branch forks on a tree. It should not be confused with event tree analysis, which works forward from an initiating event to its possible outcomes.
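Of the techniques above, Pareto analysis is the most directly computable. The sketch below ranks incident categories by frequency, accumulates their share of the total, and returns the few categories that together reach an 80% cutoff. The category names and counts are invented for illustration, and the 80% cutoff is a convention rather than a fixed rule.

```python
from collections import Counter

# Hypothetical incident categories over a reporting period (100 incidents in total).
incident_categories = (
    ["login failure"] * 42
    + ["slow connectivity"] * 30
    + ["printer offline"] * 12
    + ["disk full"] * 10
    + ["other"] * 6
)


def pareto_table(categories):
    """Return (category, count, cumulative share) rows, most frequent first."""
    counts = Counter(categories).most_common()
    total = sum(n for _, n in counts)
    rows, running = [], 0
    for category, n in counts:
        running += n
        rows.append((category, n, running / total))
    return rows


def vital_few(categories, cutoff=0.8):
    """Return the top categories whose combined share first reaches `cutoff`."""
    selected = []
    for category, _, cumulative in pareto_table(categories):
        selected.append(category)
        if cumulative >= cutoff:
            break
    return selected


for category, count, cumulative in pareto_table(incident_categories):
    print(f"{category:20s} {count:3d} {cumulative:6.0%}")
print(vital_few(incident_categories))
```

With this sample data, three of the five categories account for 84% of incidents, so those are the ones a team would likely investigate first.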

The human side of ITSM problem management is also important. Organizations should emphasize and encourage collaboration and ethics because teamwork is essential to effective problem management. Additionally, cross-functional swarming techniques can enable more effective investigation of known errors and identification of possible solutions.

Deploy and monitor workarounds and permanent solutions

Once a problem is analyzed, it is assigned the “known error” status. Controlling known errors starts with identifying and documenting workarounds that reduce or eliminate the impact or likelihood of a problem for which a full resolution is not yet available. Examples of workarounds include rebooting, reconfiguration, and load optimization, among other actions.

A prime example is the operational meltdown Southwest Airlines experienced at the end of 2022, when Winter Storm Elliott hit the United States. The extreme weather conditions caused a domino effect of flight cancellations and passengers trying to rebook their flights, which overwhelmed the airline's computer systems and revealed weaknesses in its plans for operational continuity.

The airline came up with an action plan to address the root causes, which included accelerating the operational modernization plan of tools and technology, aligning various Network Planning and Network Operations Control Teams under one Senior Leader for better collaboration, and enhancing data on early-indicator dashboards, among other initiatives. The early-indicator dashboards eventually proved to be a workaround to reduce the probability of IT system outages by having teams take necessary action to forestall system overloads.

Documenting workarounds and sharing them with first-line support teams is crucial to speeding up incident resolution efforts. While teams can maintain formal documentation in a knowledge repository, workarounds should be communicated during an incident through the team's preferred collaboration channels, such as Slack or Teams.

IT should monitor and regularly review the effectiveness of workarounds in curtailing the impact of known errors. A workaround that is no longer relevant in addressing a problem’s impact should have its documentation archived, and support teams notified of the most current version. Linking workaround articles to incident tickets makes it easier for anyone assigned an issue to trace previous incidents and then reference the workaround knowledge articles to see the best approach to tackling future related incidents.

After identifying a permanent solution, teams should use the change management (“change enablement” in ITIL 4 terminology) practice to manage its deployment. Internal teams or vendors may perform the deployment. Linking the change request record to the problem closes the loop with the related incidents and provides a body of knowledge for continuous IT improvement.

Analyze and review permanent solutions to problems

Problem records are closed when the risk impact or probability has been contained to an acceptable level or the context in which the problem exists has been removed. Permanent solutions may take time and can come with high costs.

Southwest Airlines’ decision to fast-track its modernization plan, budgeted at more than $1.3 billion, shows how a permanent solution can affect the outlay. Not all known errors are addressed immediately after they are identified. In some cases, deciding not to implement a solution is reasonable. For example, the cost of implementing a proposed solution (e.g., upgrading a legacy system to a modern tech stack) may be higher than the projected value.

Tracking and reporting metrics on the effectiveness of problem management is a key governance activity. Reports on the effectiveness of problem resolution based on prevented recurrence or reduced impact are important pointers on whether the practice is generating value through improved service availability and performance. Additionally, teams should report the status of problem-resolution activities because delays or funding constraints can prolong the adverse effects on service delivery.
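One simple way to report on prevented recurrence is to compare how many related incidents occurred in equal time windows before and after a permanent solution was deployed. The sketch below assumes a 30-day window and uses invented dates; the function names are hypothetical, and a real report would draw the incident dates from the records linked to the problem.

```python
from datetime import date, timedelta


def incidents_before_and_after(incident_dates, fix_date, window_days=30):
    """Count incidents in equal windows immediately before and after the fix."""
    window = timedelta(days=window_days)
    before = sum(1 for d in incident_dates if fix_date - window <= d < fix_date)
    after = sum(1 for d in incident_dates if fix_date <= d < fix_date + window)
    return before, after


def reduction_percent(before, after):
    """Percentage drop in incidents; None when there is no baseline to compare."""
    if before == 0:
        return None
    return 100.0 * (before - after) / before


# Hypothetical incidents linked to one problem, with a fix deployed on 1 March 2024.
linked_incident_dates = [
    date(2024, 2, 3),
    date(2024, 2, 10),
    date(2024, 2, 17),
    date(2024, 2, 24),
    date(2024, 3, 12),
]
fix_deployed = date(2024, 3, 1)

before, after = incidents_before_and_after(linked_incident_dates, fix_deployed)
print(before, after, reduction_percent(before, after))
```

A before-and-after count is only a starting point, since seasonal load or unrelated changes can also move incident volumes. It does, however, give governance reports a concrete number to track per problem.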

Organizations should share information about solutions with relevant teams to bolster future problem detection and control efforts. A well-maintained knowledge base that captures problem information, including links to associated IT assets, incidents, and changes, can serve as a solid reference for improving IT service performance and informing improvement initiatives such as future architecture changes or upgrades.

Tackling Problem Management with SolarWinds

IT teams often find themselves swamped with similar incidents, making it tough to see the bigger picture. SolarWinds Service Desk helps by grouping these incidents together and giving agents a clear overview of everything that's going on: when the problem started, who it affected first, any patterns in the devices involved, and so on.

Figure: Track changes and configurations that can cause problems with SolarWinds

SolarWinds Service Desk can also help find the source of IT problems by cross-referencing changes and configurations in your IT environment. It integrates with your configuration management database (CMDB) to provide a complete view of hardware and software assets and how they've been modified.

More details on how SolarWinds can help with faster identification of problem sources, quicker resolution, and proactive prevention of future issues can be found here.

ITSM maturity and reliable service delivery require a deliberate, integrated approach: respond to incidents first, and then, once services are restored, ensure the underlying problems are investigated and addressed. The SolarWinds ITSM Maturity framework is a free, consultative tool designed to help you figure out your current state, what you intend to achieve, and what tools you need to get there.

Figure: The interactive SolarWinds ITSM Maturity Model

To analyze your team’s ITSM maturity, check out our free interactive ITSM maturity model.
