Technical Delivery Expert Solves Systemic Incident Response Issue
Case Study: Delivery Leadership
A multibillion-dollar global online travel company had tremendous difficulty identifying and resolving incidents with its service applications. This resulted in a poor user experience, including interrupted business transactions, for the company’s merchant clients.
Engineering teams would often learn about problems in the software only from the network operations center, customer service, or executive leadership, after clients had reported the incidents themselves. With no best practices or guidelines for building intelligent logs and alerts, teams struggled to identify and solve problems efficiently. The problem was systemic and had gone unaddressed for many years.
The solution in place to monitor live-site incidents was a collection of custom-built and off-the-shelf applications. The primary tool was an outdated, unsupported version of Splunk, a log search and monitoring platform that the organization relied on for incident reporting. The organization struggled with a lack of quality logs and an overabundance of low-value Splunk alerts, which created noise and made it harder to identify real problems. Its procedures for responding to the alerts in use were outdated and inaccurate. In addition, no severity levels had been defined for triggered alerts.
Because the issue was deeply technical and pervasive, affecting the organization at many levels, the company knew it needed a consultant with expertise in engineering as well as program management, change management, and team leadership to right the ship. The company found that expertise in AIM Consulting, whose Delivery Leadership practice includes specialization in deploying experienced Technical Delivery Experts to solve complex technical challenges.
A Technical Delivery Expert (TDE) combines technical skills with program, project, and process management skills to dig into thorny, critical issues and lead teams in solving large and complex technical problems. AIM’s Technical Delivery Expert had the skills and experience to understand the organization’s software stack, application logging, and incident reporting and response, and to provide executive-level strategy and direction to fix the situation.
Working across 10 teams, the AIM Technical Delivery Expert employed a continuous improvement model designed to add immediate value to new software projects while at the same time making improvements to logging and alerting for existing applications. The model was a comprehensive approach designed to eliminate process waste and fuel enthusiasm and responsibility across the organization. The Technical Delivery Expert evangelized goals for the engineering team related to two key ideals:
- Being the first to know about priority incidents 80% of the time: This required engineering to audit existing alerts and the actions expected to remediate issues. It meant creating and executing an organization-wide roadmap to improve alert quality (including eliminating low-value alerts), creating and training an alert monitoring team, categorizing the severity of alerts to match incident severity, and creating metrics and reporting around alert quality and actions to support the overall goals.
- Mitigating priority incidents within 60 minutes at least 80% of the time: This required improving the alert response and incident remediation process, training the NOC (network operations center) on incident response for each application, enhancing alert response documentation and troubleshooting steps, establishing guidelines on how teams fix errors, and strengthening relationships with dependent teams in the organization.
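As an illustration only, the two goals above can be expressed as simple ratios over a set of incident records. The sketch below is not the company's tooling: the `Incident` fields, the `"alert"` detection label, and the function names are all hypothetical, and it assumes that "first to know" means the incident was detected by an engineering alert rather than reported by a client or another group.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Targets taken from the two goals above.
FIRST_TO_KNOW_TARGET = 0.80
MITIGATION_TARGET = 0.80
MITIGATION_WINDOW = timedelta(minutes=60)


@dataclass
class Incident:
    """Hypothetical record of one priority incident."""
    detected_by: str        # assumed labels: "alert", "customer", "noc", ...
    started_at: datetime
    mitigated_at: datetime


def first_to_know_rate(incidents: List[Incident]) -> float:
    """Share of incidents that engineering's own alerts caught first."""
    if not incidents:
        return 0.0
    hits = sum(1 for i in incidents if i.detected_by == "alert")
    return hits / len(incidents)


def mitigated_in_window_rate(incidents: List[Incident]) -> float:
    """Share of incidents mitigated within the 60-minute window."""
    if not incidents:
        return 0.0
    hits = sum(
        1 for i in incidents
        if i.mitigated_at - i.started_at <= MITIGATION_WINDOW
    )
    return hits / len(incidents)


def meets_goals(incidents: List[Incident]) -> bool:
    """True only when both goals are met for the given incidents."""
    return (
        first_to_know_rate(incidents) >= FIRST_TO_KNOW_TARGET
        and mitigated_in_window_rate(incidents) >= MITIGATION_TARGET
    )
```

Keeping the two rates as separate functions mirrors how the goals are stated: a team can be first to know about an incident and still miss the 60-minute mitigation window, so each rate is worth tracking on its own.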
These efforts were combined with an upgrade to Splunk and a cleanup and migration of old alerts and dashboards to the new version.
The Technical Delivery Expert established a weekly review process to go over the live-site incidents that had occurred and the mitigation actions performed in response. Team members were required to provide metrics for every live-site incident, including the following:
- Were we first to know, and if not, why?
- Were we able to remediate the issue in 60 minutes or less, and if not, why?
- What alert was triggered for the incident?
- What steps, in order, were taken to remediate the incident?
- What was the root cause of the incident, as determined using the 5 Whys (a Six Sigma cause-and-effect tool)?
- Was there a dependent team that was the cause of the problem?
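The review questions above map naturally onto a structured record that each team could fill in before the weekly meeting. The following is a minimal sketch under that assumption; the `IncidentReview` class, its field names, and the `missing_answers` helper are hypothetical and are not described in the case study.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IncidentReview:
    """Hypothetical weekly-review record, one per live-site incident."""
    first_to_know: bool
    remediated_within_60_min: bool
    triggering_alert: Optional[str]       # None if no alert fired
    remediation_steps: List[str]          # ordered actions taken
    five_whys: List[str]                  # chain of "why" answers to root cause
    dependent_team: Optional[str] = None  # set if another team caused the issue
    first_to_know_miss_reason: Optional[str] = None
    remediation_miss_reason: Optional[str] = None

    def missing_answers(self) -> List[str]:
        """List the review questions that still need an answer."""
        gaps = []
        if not self.first_to_know and not self.first_to_know_miss_reason:
            gaps.append("why we were not first to know")
        if not self.remediated_within_60_min and not self.remediation_miss_reason:
            gaps.append("why remediation took longer than 60 minutes")
        if not self.remediation_steps:
            gaps.append("remediation steps")
        if not self.five_whys:
            gaps.append("5 Whys root cause")
        return gaps
```

For example, a record with `first_to_know=False` and no miss reason would be flagged before the meeting, which matches the requirement that a missed goal always comes with an explanation.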
The weekly meetings resulted in the creation of the alert process now in use throughout the entire company. With these metrics in place, team members walked into the meetings aware of incidents and prepared to discuss what had happened and how to improve the incident response process.
At the same time, the Technical Delivery Expert built up an incident-monitoring team in China to address alerts during off hours in the U.S. The China team’s work to define incident response actions was fed into the company’s NOC as a repository of information. This two-pronged approach evolved the alert process significantly in a short period of time.
AIM’s Technical Delivery Expert was able to view and approach the struggling incident response process in the organization comprehensively by uniting 10 teams with common goals. Metrics around being the “first to know” about an incident improved from 25 to 86 percent, and the percentage of priority incidents addressed within 60 minutes improved from 19 to 56 percent.
With better processes for incident identification and response, the user experience with the website improved dramatically for the company’s merchant clients, becoming more stable and reliable with a higher perception of quality and trust.
Finally, the expertise brought by AIM’s Technical Delivery Expert established greater trust among software engineers, software engineering managers, NOC managers, and other stakeholders. That trust became the driving force behind a culture change in incident response and led to the implementation of new processes and goals across the entire organization. The best practices implemented are now considered normal operating procedure, and a culture of continuous improvement, previously only aspirational, is now a reality.