AoAD2 Practice: Incident Analysis

Book cover for “The Art of Agile Development, Second Edition.”

Second Edition cover

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 18, 2021

Incident Analysis

Audience
Whole Team

We learn from failure.

Despite your best efforts, your software will sometimes fail to work as it should. Some failures will be minor, such as a typo on a web page. Others will be more significant, such as code that corrupts customer data, or an outage that prevents customer access.

Some failures are called bugs or defects; others are called incidents. The distinction isn’t particularly important. Either way, once the dust has settled and things are running smoothly again, you need to figure out what happened and how you can improve. This is incident analysis.

The details of how to respond during an incident are out of the scope of this book. For an excellent and practical guide to incident response, see Site Reliability Engineering: How Google Runs Production Systems [Beyer et al. 2016], particularly chapters 12-14.

The Nature of Failure

Failure is a consequence of your entire development system.

It’s tempting to think of failure as a simple sequence of cause and effect—A did this, which led to B, which led to C—but that’s not what really happens.1 In reality, failure is a consequence of the entire development system in which work is performed. (Your development system is every aspect of how you build software, from tools to organizational structure. It’s in contrast to your software system, which is the thing you’re building.) Each failure, no matter how minor, is a clue about the nature and weaknesses of that development system.

1My discussion of the nature of failure is based on [Woods et al. 2010] and [Dekker 2014].

Failure is the result of many interacting events. Small problems are constantly occurring, but the system has norms that keep them inside a safe boundary. A programmer makes an off-by-one error, but their pairing partner suggests a test to catch it. An on-site customer explains a story poorly, but notices the misunderstanding during customer review. A team member accidentally erases a file, but continuous integration rejects the commit.

When failure occurs, it’s not because of a single cause, but because multiple things go wrong at once. A programmer makes an off-by-one error, and their pairing partner was up late with a newborn and doesn’t notice, and the team is experimenting with less frequent pair swaps, and the canary server alerts were accidentally disabled. Failure happens, not because of problems, but because the development system—people, processes, and business environment—allows problems to combine.

Furthermore, systems exhibit a drift toward failure. Ironically, for teams with a track record of containing failures, the threat isn’t mistakes, but success. Over time, as no failures occur, the team’s norms change. For example, they might make pairing optional so people have more choice in their work styles. Their safe boundaries shrink. Eventually, the failure conditions—which existed all along!—combine in just the right way to exceed these smaller boundaries, and a failure occurs.

It’s hard to see the drift toward failure. Each change is small, and is an improvement in some other dimension, such as speed, cost, convenience, or customer satisfaction. To prevent drift, you have to stay vigilant. Past success doesn’t guarantee future success.

Small failures are a “dress rehearsal” for large failures.

You might expect large failures to be the result of large mistakes, but that isn’t how failure works. There’s no single cause, and no proportionality. Large failures are the result of the same systemic issues as small failures. That’s good news, because it means small failures are a “dress rehearsal” for large failures. You can learn just as much from them as you do from big ones.

Therefore, treat every failure as an opportunity to learn and improve. A typo is still a failure. A problem detected before release is still a failure. No matter how big or small, if your team thinks something is “done,” and it later needs correction, it’s worthy of analysis.

But it goes even deeper. Failures are a consequence of your development system, as I said, but so are successes. You can analyze them, too.

Conducting the Analysis

Ally
Retrospectives

Incident analysis is a type of retrospective. It’s a joint look back at your development system for the purpose of learning and improving. Textbook. As such, an effective analysis will involve the five stages of a retrospective: [Derby and Larsen 2006]

  1. Set the stage;

  2. Gather data;

  3. Generate insights;

  4. Decide what to do; and

  5. Close the retrospective.

Include your whole team in the analysis, along with anyone else involved in the incident response. Avoid including managers and other observers; you want participants to be able to speak up and admit mistakes openly, and that requires limiting attendance to just the people who need to be there. When there’s a lot of interest in the analysis, you can produce an incident report, as I’ll describe later.

The time needed for the analysis session depends on the number of events leading up to the incident. A complex outage could have dozens of events and take several hours. A simple defect, though, might only have a handful of events, and could take 30-60 minutes. You’ll get faster with experience.

In the beginning, and for sensitive incidents, a neutral facilitator should lead the session. The more sensitive the incident, the more experienced the facilitator needs to be.

This practice, as with all practices in this book, is focused on the team level—incidents which your team can analyze mainly on their own. You can also use it to conduct an analysis of your team’s part in a larger incident.

Set the stage
Ally
Safety

Because incident analysis involves a critical look at successes and failures, it’s vital for every participant to feel safe to contribute, including having frank discussions about the choices they made. For that reason, start the session by reminding everyone that the goal is to use the incident to better understand the way you create software—the development system of people, processes, expectations, environment, and tools. You’re not here to focus on the failure itself or to place blame, but instead to learn how to make your development system more resilient.

Ask everyone to confirm that they can abide by that goal and assume good faith on the part of everyone involved in the incident. Norm Kerth’s prime directive is a good choice:

Regardless of what we discover, we must understand and truly believe that everyone did the best job he or she could, given what was known at the time, his or her skills and abilities, the resources available, and the situation at hand. [Kerth 2001] (ch. 1)

In addition, consider establishing the Vegas Rule: What’s said in the analysis session, stays in the analysis session. Don’t record the session, and ask participants to agree not to repeat any personal details shared in the session.

If the session includes people outside the team, or if your team is new to working together, you might also want to establish working agreements for the session. (See “Create Working Agreements” on page XX.)

Gather data

Once the stage has been set, your first step is to understand what happened. You’ll do so by creating an annotated, visual timeline of events.

Stay focused on facts, not interpretations.

People will be tempted to interpret the data at this stage, but it’s important to keep everyone focused on “just the facts.” They’ll probably need multiple reminders as the stage progresses. With the benefit of hindsight, it’s easy to fall into the trap of critiquing people’s actions, but that won’t help. A successful analysis focuses on understanding what people actually did, and how your development system contributed to them doing those things, not what they could have done differently.

To create the timeline, start by creating a long horizontal space on your virtual whiteboard. If you’re conducting the session in person, use blue tape on a large wall. Divide the timeline into columns representing different periods in time. The columns don’t need to be uniform; weeks or months are often best for the earlier part of the timeline, while hours or days might be more appropriate for the moments leading up to the incident.

Have participants use simultaneous brainstorming to think of events relevant to the incident. (See “Work Simultaneously” on page XX.) Events are factual, non-judgemental statements about something that happened, such as “Deploy script stops all ServiceGamma instances,” “ServiceBeta returns 418 response code,” “ServiceAlpha doesn’t recognize 418 response code and crashes,” “On-call engineer is paged about system downtime,” and “On-call engineer manually restarts ServiceGamma instances.” (You can use people’s names, but only if they’re present and agree.) Be sure to capture events that went well, too, not just ones that went poorly.

Software logs, incident response records, and version control history are all likely to be helpful sources of inspiration. Write each event on a separate sticky note and add it to the board. Use the same color sticky for each event.

Afterwards, invite everyone to step back and look at the big picture. Which events are missing? Working simultaneously, look at each event and ask, “What came before this? What came after?” Add each additional event as another sticky note. You might find it helpful to show before/after relationships with arrows.

How was the automation used? Configured? Programmed?

Be sure to include events about people, not just software. People’s decisions are an enormous factor in your development system. Find each event which involves automation your team controls or uses, then add preceding events about how people contributed to that event. How was the automation used? Configured? Programmed? Be sure to keep these events neutral in tone and blame-free. Don’t second-guess what people should have done; only write what they actually did.

For example, the event “Deploy script stops all ServiceGamma instances” might be preceded by “Op misspells --target command-line parameter as --tagret” and “Engineer inadvertently changes deploy script to stop all instances when no --target parameter found,” which in turn is preceded by “Team decides to clean up deploy script’s command-line processing.”

Events can have multiple predecessors feeding into the same event. Each predecessor can occur at different points in the timeline. For example, the event “ServiceAlpha doesn’t recognize 418 response code and crashes” could have three predecessors: “ServiceBeta returns 418 response code“ (immediately before); “Engineer inadvertently disables ServiceAlpha top-level exception handler” (several months earlier); and “Engineer programs ServiceAlpha to throw exception when unexpected response code received” (a year earlier).

As events are added, encourage participants to share recollections of their opinions and emotions at the time. Don’t ask people to excuse their actions; you’re not here to assign blame. Ask them to explain it was like to be there, in the moment, when the event occurred. This will help your team understand the social and organizational aspects of your development system—not just what choices were made, but why.

Ask participants to add additional stickies, in another color, for those thoughts. For example, if Jarrett says, “I had concerns about code quality, but I felt like I had to rush to meet our deadline,” he could write two sticky notes: “Jarrett has concerns about code quality” and “Jarrett feels he has to rush to meet deadline.” Don’t speculate about the thoughts of people who aren’t present, but you can record things they said at the time, such as “Layla says she has trouble remembering deploy script options.”

Keep these notes focused on what people felt and thought at the time. Your goal is to understand the system as it really was, not to second-guess people.

Finally, ask participants to highlight important events in the timeline—the ones which seem most relevant to the incident. Doublecheck whether people have captured all their recollections about those events.

Generate insights

Now it’s time to turn facts into insights. In this stage, you’ll mine your timeline for clues about your development system. Before you begin, give people some time to study the board. This can be a good point to call for a break.

The events aren’t the cause of failure; they’re a symptom of your system.

Begin by reminding attendees about the nature of failure. Problems are always occurring, but they don’t usually combine in a way that leads to failure. The events in your timeline aren’t the cause of the failure; they’re a symptom of how your development system functions. It’s that deeper system that you want to analyze.

Look at the events you identified as important during the “gather data” activity. Which of them involved people? To continue the example, you would choose the “Op misspells --target command-line parameter as --tagret” and “Engineer inadvertently changes deploy script to stop all instances when no --target parameter found” events, but not “Deploy script stops all ServiceGamma instances,” because that event happened automatically.

Working simultaneously, assign one or more of the following categories2 to each people-involved event. Write each category on a third color of sticky note and add it to the timeline.

2The event categories were inspired by [Woods et al. 2010] and [Dekker 2014].

  • Knowledge and mental models: Involves information and decisions within the team involved in the event. For example, believing a service maintained by the team will never return a 418 response.

  • Communication and feedback: Involves information and decisions from outside the team involved in the event. For example, believing a third-party service will never return a 418 response.

  • Attention: Involves the ability to focus on relevant information. For example, ignoring an alert because several other alerts are happening at the same time, or misunderstanding the importance of an alert due to fatigue.

  • Fixation and plan continuation: Persisting with an assessment of the situation in the face of new information. For example, during an outage, continuing to troubleshoot a failing router after logs show that traffic successfully transitioned over to the backup router. Also involves continuing with an established plan; for example, releasing on the planned date despite beta testers saying the software isn’t ready.

  • Conflicting goals: Choosing between multiple goals, some of which may be unstated. For example, deciding to prioritize meeting a deadline over improving code quality.

  • Procedural adaptation: Involves situations in which established procedure doesn’t fit the situation. For example, abandoning a checklist after one of the steps reports an error. A special case is the responsibility-authority double bind, which requires people to make a choice between being punished for violating procedure or following a procedure that doesn’t fit the situation.

  • User experience: Involves interactions with computer interfaces. For example, providing the wrong command-line argument to a program.

  • Write-in: You can create your own category if the event doesn’t fit into the ones I’ve provided.

The categories apply to positive events, too. For example, “engineer programs back-end to provide safe default when ServiceOmega times out” is a “knowledge and mental models” event.

After you’ve categorized the events, take a moment to consider the whole picture again, then break into small groups to discuss each event. What does each one say about your development system? Focus on the system, not the people.

For example, the event, “Engineer inadvertently changes deploy script to stop all instances when no --target parameter found,” sounds like it’s a mistake on the part of the engineer. But the timeline reveals that Jarrett, the engineer in question, felt he had to rush to meet a deadline, even though it reduced code quality. That means it was a “conflicting goals” event, and it’s really about how priorities are decided and communicated. As the team discusses the event, they realize they all feel pressure from sales and marketing to prioritize deadlines over code quality.

Incident analysis always looks at the system, not individuals.

On the other hand, let’s say the timeline analysis revealed Jarrett also misunderstood the behavior of the team’s command-line processing library. That would make it a “knowledge and mental models” event, too, but you still wouldn’t put the blame on Jarrett. Incident analysis always looks at the system, not individuals. Individuals are expected to make mistakes. In this case, a closer look at the event reveals that, although the team used test-driven development and pairing for production code, they didn’t apply that standard to their scripts. They didn’t have any way to prevent mistakes in their scripts, and it was just a matter of time before one slipped through.

After the breakout groups have had a chance to discuss the events—for speed, you might want to divide the events among the groups, rather than having each group discuss every event—come together to discuss what you’ve learned about the system. Write each conclusion on a fourth color of sticky note and put it on the timeline next to the corresponding event. Don’t make suggestions, yet; just focus on what you’ve learned. For example, “No systematic way to prevent programming mistakes in scripts,” “Engineers feel pressured to sacrifice code quality,” and “Deploy script requires long and error-prone command-line.”

Decide what to do

You’re ready to decide how to improve your development system. You’ll do so by brainstorming ideas, then choosing a few of your best options.

Start by reviewing the overall timeline again. How could you change your system to be more resilient? Consider all possibilities, without worrying about feasibility. Brainstorm simultaneously onto a table or a new area of your virtual whiteboard. You don’t need to match your ideas to specific events or questions. Some will address multiple things at once. Questions to consider include:3

3Thanks to Sarah Horan Van Treese for suggesting most of these questions.

  • How could we prevent this type of failure?

  • How could we detect this type of failure earlier?

  • How could we fail faster?

  • How could we reduce the impact?

  • How could we respond faster?

  • Where did our safety net fail us?

  • What related flaws should we investigate?

To continue the example, your team might brainstorm ideas such as, “stop committing to deadlines,” “update forecast weekly and remove stories that don’t fit deadline,” “apply production coding standards to scripts,” “perform review of existing scripts for additional coding errors,” “simplify deploy script’s command line,” and “perform user experience review of command-line options across all of the team’s scripts.” Some of these ideas are better than others, but at this stage, you’re generating ideas, not filtering them.

Once you have a set of options, group them into “control,” “influence,” and “soup” circles, depending on your team’s ability to make them happen, as described in “Circles and Soup” on page XX. Have a brief discussion about the options’ pros and cons. Then use dot voting, followed by a consent vote (see “Work Simultaneously” on page XX and “Seek Consent” on page XX), to decide which options your team will pursue. You can choose more than one.

As you think about what to choose, remember that you shouldn’t fix everything. Sometimes, introducing a change adds more risk or cost than the thing it solves. In addition, although every event is a clue about the behavior of your development system, not every event is bad. For example, one of the example events was, “Engineer programs ServiceAlpha to throw exception when unexpected response code received.” Even though that event directly led to the outage, it made diagnosing the failure faster and easier. Without it, something still would have gone wrong, and it would have taken longer to solve.

Close the retrospective

Incident analysis can be intense. Close the retrospective by giving people a chance to take a breath and gently shift back to their regular work. That breath can be metaphorical, or you can literally suggest that people stand up and take a deep breath.

Start by deciding what to keep. A screenshot or photo of the annotated timeline and other artifacts is likely to be useful for future reference. First, invite participants to review the timeline for anything they don’t want shared outside the session. Remove those stickies before taking the picture.

Next, decide who will follow through on your decisions, and how. If your team will be producing a report, decide who will participate in writing it.

Finally, wrap up by expressing appreciations to each other for your hard work.4 Explain the exercise and provide an example: “(Name), I appreciate you for (reason).” Sit down and wait. Others will speak up as well. There’s no requirement to speak, but leave plenty of time at the end—a minute or so of silence—because people can take a little while to speak up.

4The “appreciations” activity is based on [Derby and Larsen 2006] (pp. 119-120).

Some people find the “appreciations” activity uncomfortable. An alternative activity is for each participant to take turns saying a few words about how they feel now the analysis is over. It’s okay to pass.

Afterwards, thank everybody for their participation. Remind them of the Vegas rule (don’t share personal details without permission), and end.

Organizational Learning

Organizations will often require a report about the incident analysis’s conclusions. It’s usually called a post-mortem, although I prefer the more neutral incident report.

In theory, part of the purpose of the incident report is to allow other teams to use what you’ve learned to improve their own development systems. Unfortunately, people tend to dismiss lessons learned by other teams. This is called distancing through differencing. [Woods et al. 2010] (ch. 14) “Those ideas don’t apply to us, because we’re an internally-facing team, not externally-facing.” Or, “we have microservices, not a monolith.” Or, “we work remotely, not in-person.” It’s easy to latch onto superficial differences as a reason to avoid change.

Preventing this distancing is a matter of organizational culture, which puts it out of the scope of this book. Briefly, though, people have the most appetite for learning and change after a major failure. Other than that, I’ve had the most success from making the lessons personal. Show how the lessons affect things your audience cares about.

This is easier in conversation than with a written document. In practice, I suspect—but don’t know for sure!—that the most effective way to get people to read and apply the lessons from an incident report is to tell a compelling, but concise story. Make the stakes clear from the outset. Describe what happened and allow the mystery to unfold. Describe what you learned about your system and explain how it affects other teams, too. Describe the potential stakes for other teams and summarize what they can do to protect themselves.

Incident Accountability

Another reason organizations want incident reports is to “hold people accountable.” This tends to be misguided at best.

That’s not to say teams shouldn’t be accountable for their work. They should be! And by performing an incident analysis and working on improving their development system, including working with the broader organization to make changes, they are showing accountability.

Searching for someone to blame makes big incidents worse.

Searching for a “single, wringable neck,” in the misguided parlance of Scrum, just encourages deflection and finger-pointing. It may lower the number of reported incidents, but that’s just because people hide problems. The big ones get worse.

“As the incident rate decreases, the fatality rate increases,” reports The Field Guide to Understanding ‘Human Error’, speaking about construction and aviation. “[T]his supports the importance... of learning from near misses. Suppressing such learning opportunities, at whatever level, and by whatever means, is not just a bad idea. It is dangerous.” [Dekker 2014] (ch. 7)

If your organization understands this dynamic, and genuinely wants the team to show how it’s being accountable, you can share what the incident analysis revealed about your development system. (In other words, the final stickies from the “Generate Insights” activity.) You can also share what you decided to do to improve the resiliency of your development system.

Often, your organization will have an existing report template that you’ll have to conform to. Do your best to avoid presenting a simplistic cause-and-effect view of the situation, and be careful to show how the system, not individuals, allowed problems to turn into failures.

Questions

What if we don’t have time to do a full analysis of every bug and incident?

Incident analysis doesn’t have to be a formal retrospective. You can use the basic structure to explore possibilities informally, with just a few people, or even in the privacy of your own thoughts, in just a few minutes. The core point to remember is that events are symptoms of your underlying development system. They’re clues to teach you how your system works. Start with the facts, discuss how they change your understanding of your development system, and only then think of what to change.

Prerequisites

Ally
Safety

Successful incident analysis depends on psychological safety. Unless participants feel safe to share their perspective on what happened, warts and all, you’ll have trouble achieving a deep understanding of your development system.

The broader organization’s approach to incidents has a large impact on participants’ safety. Even companies that pay lip-service to “blameless postmortems” have trouble moving from a simplistic cause-effect view of the world to a systemic view. They tend to think of “blameless” as “not saying who’s to blame,” but to be truly blameless, they need to understand that no one is to blame. Failures and successes are a consequence of a complex system, not specific individuals’ actions.

You can conduct a successful incident analysis in organizations that don’t understand this, but you’ll need to be extra careful to establish ground rules about psychological safety, and ensure people who have a blame-oriented worldview don’t attend. You’ll also need to exercise care to make sure the incident report, if there is one, is written with a systemic view, not a cause-effect view.

Indicators

When you conduct incident analyses well:

  • Incidents are acknowledged and even incidents with no visible impact are analyzed.

  • Team members see the analysis as an opportunity to learn and improve, and even look forward to it.

  • Your system’s resiliency improves over time, resulting in fewer escaped defects and production outages.

  • No one is blamed, judged, or punished for the incident.

Alternatives and Experiments

Many organizations approach incident analysis through the lens of a standard report template. This tends to result in shallow “quick fixes” rather than a systemic view, because people focus on what they want to report rather than studying the whole incident. The format I’ve described will help people expand their perspective before coming to conclusions. Conducting it as a retrospective will also ensure everybody’s voices are heard, and the whole team buys in to the conclusions.

Many of the ideas in this practice are inspired by books from the field of Human Factors and Systems Safety. Those books are concerned with life-and-death decisions, often made under intense time pressure, in fields such as aviation. Software development has different constraints, and some of those transplanted ideas may not apply perfectly.

In particular, the event categories I’ve provided are likely to have room for improvement. I suspect there’s room to split the “knowledge” category into several categories. Don’t just add categories arbitrarily, though. Check out the further reading section and ground your ideas in the underlying theory, first.

The retrospective format I’ve provided has the most room for experimentation. It’s easy to fixate on solutions or simplistic cause-effect thinking during an incident analysis, and the format I’ve provided is designed to avoid this mistake. But it’s just a retrospective. It can be changed. After you’ve conducted several analyses using the format I’ve provided, see what you can improve by experimenting with new activities. For example, can you conduct parts of the “Gather Information” stage asynchronously? Are there better ways to analyze the timeline during the “Generate Insights” stage? Can you provide more structure to “Decide What to Do?”

Finally, incident analysis isn’t limited to analyzing incidents. You can also analyze successes. As long as you’re learning about your development system, you’ll achieve the same benefits. Try conducting an analysis of a time when the team succeeded under pressure. Find the events that could have led to failure, and the events that prevented failure from occurring. Discover what that teaches you about your system’s resiliency, and think about how you can amplify that sort of resiliency in the future.

Further Reading

The Field Guide to Understanding ‘Human Error’ [Dekker 2014] is a surprisingly easy read that does a great job of introducing the theory underlying much of this practice.

Behind Human Error [Woods et al. 2010] is a much denser read, but it covers more ground than The Field Guide. If you’re looking for more detail, this is your next step.

The previous two books are based on Human Factors and System Safety research. The website learningfromincidents.io is dedicated to bringing those ideas to software development. At the time of this writing, it’s fairly thin, but its heart is in the right place. I’m including it in the hopes that it will have more material by the time you read this.

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

If you liked this entry, check out my best writing and presentations, and consider subscribing to updates by email or RSS.