No one knows this better than Moustafa Aboelnaga, software engineering manager at SolarWinds. Moustafa leads teams of engineers in detecting and remediating problems in the IT environment. I sat down with him for a masterclass in planning an effective incident response framework.
What Does A Poor Incident Response Framework Look Like?
When Moustafa joined the team, he noticed several processes hindering the team's ability to discover and resolve issues in the IT environment.
- Disconnected communication channels: "We were receiving incidents from multiple platforms, including the service desk, monitoring tools, and even individuals," says Moustafa. "But we didn't have a unified hub that gathered all of these incidents and tickets in one place, so it was hard to track the incidents and gain meaningful insights."
- Communication redundancy: With incidents reported through multiple channels, updates were spread across different platforms, too. This made it difficult to track remediation. "In certain cases, I wasn't sure if my manager was informed or if the whole team had received the information," Moustafa observed.
- Inadequate monitoring: These issues were compounded by a subpar monitoring process. "It was passive monitoring, where 90% of our incidents were reported by our stakeholders rather than us," Moustafa explains. "The overreliance on stakeholder reports made it hard to be proactive."
Having identified areas for improvement, Moustafa and his team set about overhauling the system.
Incident Response Framework Planning Step by Step
STEP 1: ACCELERATING MEAN TIME TO DETECTION
Enhancing monitoring was an early step. The team focused on tracking critical user journeys (CUJs), each of which is a sequence of steps a user must complete to achieve a specific goal, such as opening a website and reading expected content or sharing information with a salesperson on certain pages. SolarWinds® Observability supported more detailed and proactive oversight of these processes. The team could define service level agreements (SLAs), break down the user journeys into individual steps, and then set service level objectives (SLOs) for each step. The tool can be configured to check the journey every 5 minutes, 10 minutes, 1 hour, 2 hours, or at any other frequency deemed appropriate.
The ability of SolarWinds Observability to handle both availability and synthetic transaction checks was also key. "'Availability' simply means checking if the site is operational. If the site is opening, it's working," Moustafa says. "However, determining whether the content displayed to the user is the expected content cannot be done through availability alone. This is where transactions are very useful. You can outline all the steps that the user should expect to see and test them at intervals, say every 15 minutes or 30 minutes, to ensure that the user is seeing the expected content."
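The difference between the two kinds of check can be sketched in a few lines of Python. This is only an illustration and does not use the SolarWinds Observability API: the `Step` type, the `fetch` callable, the `fake_fetch` site, and the example URLs and latency objectives are all hypothetical. An availability check asks only whether a page responds, while a transaction check walks each step of a journey and verifies both the expected content and a per-step objective.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A fetch function returns (status_code, body, elapsed_seconds).
# Hypothetical stand-in for a real HTTP client so the sketch runs offline.
Fetch = Callable[[str], Tuple[int, str, float]]


@dataclass
class Step:
    name: str
    url: str
    expected_text: str   # content the user should see at this step
    max_seconds: float   # illustrative per-step latency objective


def availability_check(fetch: Fetch, url: str) -> bool:
    """Passes if the page responds with a success status, whatever it shows."""
    status, _body, _elapsed = fetch(url)
    return 200 <= status < 300


def transaction_check(fetch: Fetch, journey: List[Step]) -> List[str]:
    """Walks the journey in order and returns a list of failure messages."""
    failures: List[str] = []
    for step in journey:
        status, body, elapsed = fetch(step.url)
        if not 200 <= status < 300:
            failures.append(f"{step.name}: HTTP {status}")
            break  # later steps depend on this one
        if step.expected_text not in body:
            failures.append(f"{step.name}: expected content missing")
        if elapsed > step.max_seconds:
            failures.append(f"{step.name}: slower than objective")
    return failures


# Fake site for the example: the pricing page loads but shows the wrong content.
def fake_fetch(url: str) -> Tuple[int, str, float]:
    pages = {
        "https://example.com/": (200, "Welcome to Example", 0.3),
        "https://example.com/pricing": (200, "Something went wrong", 0.4),
    }
    return pages.get(url, (404, "", 0.1))


journey = [
    Step("home", "https://example.com/", "Welcome", 1.0),
    Step("pricing", "https://example.com/pricing", "Plans and pricing", 1.0),
]

print(availability_check(fake_fetch, "https://example.com/pricing"))  # True
print(transaction_check(fake_fetch, journey))  # ['pricing: expected content missing']
```

In this made-up case the availability check passes even though the user never sees the pricing content, which is the gap Moustafa describes. In practice a scheduler would run the transaction check at the chosen interval.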
Beyond monitoring, deeper integration of key tools was also essential. "Any incident we receive from the SolarWinds Observability solution will be sent directly to SolarWinds Service Desk, and will even be assigned to a member of our team," Moustafa says. He has also improved how stakeholders report the issues they encounter. "We established a distribution list and connected this email to the service desk," he explains. This helps ensure all incidents are gathered in one unified platform, making it easier to collaborate on remediation.
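As a rough illustration of the "one unified platform" idea, the sketch below funnels incidents from different channels into a single queue and assigns each one to a team member. It is not how SolarWinds Service Desk works internally: the `IncidentHub` class, the channel names, the team members, and the round-robin assignment rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Dict, Iterator, List


@dataclass
class Incident:
    source: str    # reporting channel, e.g. "monitoring" or "email"
    title: str
    assignee: str


class IncidentHub:
    """Single queue that every reporting channel feeds into (hypothetical)."""

    def __init__(self, team: List[str]) -> None:
        # Round-robin assignment is an assumption; the article does not say
        # how incidents are assigned.
        self._rotation: Iterator[str] = cycle(team)
        self.incidents: List[Incident] = []

    def ingest(self, source: str, title: str) -> Incident:
        incident = Incident(source, title.strip(), next(self._rotation))
        self.incidents.append(incident)
        return incident

    def count_by_source(self) -> Dict[str, int]:
        """Insight that is hard to get when incidents live in separate tools."""
        counts: Dict[str, int] = {}
        for incident in self.incidents:
            counts[incident.source] = counts.get(incident.source, 0) + 1
        return counts


hub = IncidentHub(team=["engineer_a", "engineer_b"])
hub.ingest("monitoring", "Checkout transaction failing")
hub.ingest("email", "Login page is slow")
hub.ingest("monitoring", "Search returns no results")

print(hub.count_by_source())  # {'monitoring': 2, 'email': 1}
```

With everything in one list, questions such as "what share of incidents did monitoring catch before a stakeholder reported them?" become a simple count.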
STEP 2: STREAMLINING INCIDENT REMEDIATION
Detection is one thing, but what about fixing the problem? It all starts with visibility.
- Custom dashboards: "We created a custom dashboard on the service desk based solely on our incidents," Moustafa says. "There are filters for our ongoing incidents, including categorization, who is assigned to what, who the active requesters are, the average time to resolution, and the average number of assignments; all these metrics are customized."
- Communication enhancements: Keeping people informed throughout the incident resolution process was critical to maintaining trust. The team used Atlassian StatusPage and, later, Squadcast™ StatusPage as their tools for announcing incidents and providing updates. This approach made them more proactive in their communication with stakeholders. "The requester can see all updates among the different stakeholders," Moustafa says. "If we want to keep it private or technical, we can specify comments meant only for our developers or the reliability engineer handling these incidents."
- Deeper integration: "We connected numerous tools, from the SolarWinds Service Desk to SolarWinds Observability and others, to create a more interconnected ecosystem," Moustafa notes. For example, Service Desk is integrated with Jira to improve ticket management. The more systems that can communicate effectively with one another, the more seamless operations can become.
- Refined processes: Implementing protocols like a more effective escalation policy also made a difference. "If the incident occurs during regular working hours, it’s classified as a regular incident," Moustafa explains. "If it happens after working hours, it's treated as an on-call incident, and we have a rotation for someone designated to address it." If the issue isn't resolved within 10 to 15 minutes, it is escalated to the manager.
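The escalation rule in the last bullet can be expressed as a small sketch. The specifics here are assumptions, not the team's actual configuration: working hours are taken to be 09:00 to 17:00, Monday to Friday, the threshold is set to 15 minutes (the upper end of the quoted range), and the escalation check is applied to every incident regardless of its classification.

```python
from datetime import datetime, time, timedelta

# Assumed values for illustration only.
WORK_START = time(9, 0)
WORK_END = time(17, 0)
ESCALATE_AFTER = timedelta(minutes=15)


def classify(opened_at: datetime) -> str:
    """Returns "regular" during working hours, otherwise "on-call"."""
    is_weekday = opened_at.weekday() < 5  # Monday is 0, Saturday is 5
    in_hours = WORK_START <= opened_at.time() < WORK_END
    return "regular" if is_weekday and in_hours else "on-call"


def should_escalate(opened_at: datetime, now: datetime, resolved: bool) -> bool:
    """True when an unresolved incident has been open for the threshold or longer."""
    return not resolved and (now - opened_at) >= ESCALATE_AFTER


monday_morning = datetime(2024, 3, 4, 10, 30)
monday_night = datetime(2024, 3, 4, 22, 0)

print(classify(monday_morning))  # regular
print(classify(monday_night))    # on-call
print(should_escalate(monday_night, monday_night + timedelta(minutes=20), resolved=False))  # True
```

Keeping the hours and threshold as named constants makes the policy easy to review and change without touching the logic.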
The results speak for themselves. Since making these changes, the mean time to detection (MTTD) has improved by approximately 90% and the mean time to resolution (MTTR) has been reduced by approximately 70%.
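For readers who want to track the same metrics, the sketch below shows one common way to compute a mean time and its percentage reduction. The sample durations are invented purely so the arithmetic is easy to follow; they are not the team's data.

```python
from statistics import mean
from typing import List


def percent_reduction(before: float, after: float) -> float:
    """Reduction of a mean time, as a percentage of the earlier value."""
    return (before - after) * 100 / before


# Invented sample data: minutes from the start of an incident to its detection.
detect_before: List[float] = [120, 90, 150]  # mean is 120
detect_after: List[float] = [10, 12, 14]     # mean is 12

mttd_before = mean(detect_before)
mttd_after = mean(detect_after)

print(percent_reduction(mttd_before, mttd_after))  # 90.0
```

The same function applies to MTTR by swapping in minutes from detection to resolution.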
STEP 3: EVOLVING THE SYSTEM WITH SQUADCAST
Despite the transformation, Moustafa still sees room for improvement. "The main disadvantage of our current process is that we are using too many tools. So, gathering reports and insights isn't easy. We have to pull information from here and there and combine it all." Moustafa regards Squadcast, the incident response tool recently acquired by SolarWinds, as the next step in evolving incident response. Artificial intelligence is one reason why: "It's like having real-time machine learning. Because we may receive a lot of incidents and many other alerts, all stemming from the same reason, it takes us a lot of time to merge and analyze all of these. The intelligent alert grouping is an AI model that combines all of these elements into one incident, which means you don't have to look over 15 separate incidents."
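To show what alert grouping achieves, here is a deliberately simple, rule-based sketch. It is not Squadcast's model, which Moustafa describes as AI-based; this version merely merges alerts that share a service and message and arrive close together in time. The `Alert` and `GroupedIncident` types, the grouping key, and the five-minute window are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


@dataclass
class Alert:
    service: str
    message: str
    received_at: datetime


@dataclass
class GroupedIncident:
    service: str
    message: str
    alerts: List[Alert] = field(default_factory=list)


def group_alerts(alerts: List[Alert], window: timedelta) -> List[GroupedIncident]:
    """Merges alerts with the same service and message when each arrives
    within `window` of the previous alert in its group."""
    open_groups: Dict[Tuple[str, str], GroupedIncident] = {}
    incidents: List[GroupedIncident] = []
    for alert in sorted(alerts, key=lambda a: a.received_at):
        key = (alert.service, alert.message.lower().strip())
        group = open_groups.get(key)
        if group is None or alert.received_at - group.alerts[-1].received_at > window:
            # No recent group for this key, so start a new incident.
            group = GroupedIncident(alert.service, alert.message)
            open_groups[key] = group
            incidents.append(group)
        group.alerts.append(alert)
    return incidents


start = datetime(2024, 3, 4, 10, 0)
alerts = [
    Alert("checkout", "Database timeout", start + timedelta(seconds=20 * i))
    for i in range(15)
]
alerts.append(Alert("search", "High latency", start))

incidents = group_alerts(alerts, window=timedelta(minutes=5))
print(len(incidents))  # 2
print(sorted(len(incident.alerts) for incident in incidents))  # [1, 15]
```

Fifteen alerts with the same cause collapse into one incident to review. A learned model would go further by grouping alerts whose wording differs but whose root cause is shared, which a fixed key like this cannot do.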
The Highest Level of Service
Effective incident response is a nuanced process involving the interplay of human talent, strategic workflow structuring, and the right tools for the job. Moustafa and his team have made great strides by streamlining communication, enhancing monitoring, and integrating systems more thoroughly. But the journey doesn’t end there. With the integration of Squadcast, they are poised to take their incident response to the next level. As Moustafa puts it, "Incident response is a constantly evolving field. We’re committed to staying ahead of the curve to provide the best possible service to our stakeholders."