Agile Adoption on the “Agile in Action” Podcast

I appeared on Bill Raymond’s Agile in Action podcast this week. It’s a nice, wide-ranging discussion about Agile adoption, the Agile Fluency® Model, and the ways companies can invest in agility.

Listen here.

AoAD2 Chapter: Scaling Agility

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 16, 2021

Scaling Agility

In a perfect world, every Agile team would be perfectly isolated, completely owning their product or portfolio of products. Cross-team coordination is a common source of delays and errors. If every team were isolated, that wouldn’t be a problem.

It’s also not at all realistic. A typical Agile team has 4-10 people. That’s often not enough.

So, then, how do you scale? Although this book is focused on individual Agile teams, the question is important enough to deserve a chapter of its own.

Scaling Fluency

Far too often, organizations try to scale Agile without actually having the ability to be Agile in the first place. They invest a lot of time and money in the large-scale Agile flavor of the day, without investing in teams’ fluency or organizational capability. It never works.

In order to scale Agile, you’ll need to scale your organization’s ability to be Agile. This involves three parts: organizational capability, coaching capability, and team capability.

Organizational Capability

One of the biggest mistakes organizations make in trying to introduce Agile is to fail to make the investments described in chapter “Invest in Agility”. But even if your organization takes those investments seriously, there are likely to be some hidden trouble spots.

Before you spend a lot of money on scaling Agile, work out the kinks in your organizational capability. If you’re working with an expert advisor, they’ll have specific suggestions. If you’re going it alone, start with a pilot team, or a small group of teams—no more than five.

As your pilot teams develop their fluency, they’ll identify organizational roadblocks and problems that prevent them from achieving fluency. Make note of those problems. They’re likely to recur. You don’t need to solve them organization-wide, but you do need to solve them for each team you’d like to be Agile.

Once you’ve established your organization’s ability to support fluent teams, then you can scale out further. Until then, tempting though it may be to push Agile aggressively, stick with your existing approach for everyone but your pilot teams.

Coaching Capability

You’ll need a coach or coaches who can help with the big picture of scaling Agile: cross-team coordination and organizational capability, product/portfolio management, and change management. Although you can use books and training to develop these coaches internally, it’s best to hire someone who’s done it before.

You’ll also need skilled team-level coaches, and this is likely to be the main limit on your ability to scale. Team-level coaches are the people who help each team become fluent, and they’re vital. Every team will need at least one, as I discuss in “Coaching Skills” on page XX.

You can either hire experienced team-level coaches or develop your own coaches in-house. If you’re taking a home-grown approach, each coach will need resources, such as this book, to help them learn.

You can scale out more quickly by encouraging experienced team-level coaches to move to another team when their current team approaches fluency. (The checklists at the beginnings of parts 2-4 will help you gauge fluency.) By that point, some team members are likely to be qualified to act as team-level coaches themselves, and can start developing their coaching skills on your more-experienced teams. Be sure this sort of lateral movement enhances, rather than harms, your coaches’ careers, or your supply of coaches will dry up.1

1Thanks to Andrew Stellman for pointing out the dangers of lateral movement on Twitter (https://twitter.com/AndrewStellman/status/1316114014322274304).

Coaching skills are different from development skills. Even your best team members could struggle with learning how to be a good coach. You might be able to scale out your team-level coaching capability faster by hiring people to coach the coaches.

Experienced team-level coaches may be able to work with two teams simultaneously, although it’s not always a good idea for teams pursuing Delivering fluency. Less-experienced coaches should be dedicated to a single team.

Team Capability

Your coaches will help your teams gain fluency. The more experienced your coaches, the faster this will go, but it will still take time. “Make Time for Learning” on page XX gives some ballpark figures.

You can brute-force the problem by hiring a big consultancy to staff your teams with experienced Agile developers at a ratio of 50% or more. With the right people and a high enough ratio, this can result in instant fluency, if you’ve already made the effort to establish organizational capability.

Be cautious of this approach. The strategy is sound, if costly, but execution tends to falter. The people who augment your teams play a huge role in the success of this approach, and there’s a very real risk of hiring the wrong firm. Everybody says their developers have Agile skills, but even for big-name firms, there’s a lot more bandwagon-riding than actual capability. With a few notable exceptions, when the added staff have any Agile skills at all, they’re usually limited to the Focusing zone.

The other risk of the staff augmentation approach is coaching skills. Even if the added staff have the skills needed to create instant fluency—which is far from certain—they aren’t likely to have the skills to coach, too. The Agile changes could fail to stick when the consultancy pulls out.

The staff augmentation approach can work if you hire the right firm. If you go that route, be sure to supplement it with a focus on growing your own coaches. Don’t expect your staff-augmentation firm to do it for you; it’s a very different skill set. Look to smaller “boutique” and independent consultancies that specialize in Agile change and coaching-the-coaches. The people you hire matter more than the vendor, especially for these specialized skills, and small consultancies do a better job of recognizing this.

Scaling Products and Portfolios

Fluency is the basis for successfully scaling Agile, but it isn’t enough on its own. Unless every team works completely independently, you also need a way of coordinating their work. This is harder than it sounds, because teams have dependencies on each other that tend to result in bottlenecks, delays, and communication errors. Successfully scaling Agile is a matter of figuring out how to manage those dependencies.

There are two basic strategies for scaling Agile: vertical scaling, which attempts to increase the number of teams who can work together without bottlenecks, and horizontal scaling, which attempts to remove bottlenecks by isolating teams’ responsibilities. The two strategies can be used together.

Scaling Vertically

Vertical scaling is about increasing the number of teams who can share ownership of a product or portfolio. By “sharing ownership,” I mean that they don’t have a defined area to work on. Every team can work on every part of the product and can touch any code.

I’ll discuss two approaches to doing so: LeSS and FAST. For clarity, I’ll use the terminology in this book rather than the terms they use, but I’ll put their terms in parentheses.

LeSS

LeSS, which stands for “Large-Scale Scrum,” is one of the original Agile scaling approaches. It was created by Craig Larman and Bas Vodde in 2005.2

2Many thanks to Bas Vodde for providing feedback on my discussion of LeSS.

Basic LeSS is suitable for 2-8 teams of up to eight people each. All teams work from the same visual plan (which LeSS calls the “product backlog”) and they share ownership of all their code. There’s also LeSS Huge, which scales to even more teams. I’ll discuss it later.

A group of LeSS teams is guided by a product manager (LeSS calls them a “product owner”) who is responsible for deciding product direction. The teams work in fixed-length iterations which are typically two weeks long. At the beginning of every iteration, the teams come together to look at the visual plan and decide which customer-centric stories (“backlog items” or “features”) each team will work on. The teams only work on the highest-priority stories.

Every so often, the teams come together to play the planning game (“refine the backlog”). This typically happens in the middle of each iteration. Teams are welcome to add stories to the visual plan and suggest priorities to the product manager.

Each LeSS team is a feature team, which means they work on complete stories, from beginning to end, regardless of which code that involves. Once a team takes responsibility for a story, they own it. They’re expected to work with customers and other stakeholders to clarify details, and they’re expected to modify and improve whichever parts of the codebase are necessary to finish each story. There’s no concept of team-level code ownership in LeSS.

Because multiple LeSS teams could end up touching the same code, they’re expected to coordinate with each other to prevent problems. The coordination is typically ad-hoc and peer-to-peer. Team members know when they need to coordinate because they worked together to choose stories, and part of the discussion involves considering how and when to coordinate.

Collective code ownership is made possible through the use of continuous integration, which involves every programmer merging their latest code to a shared branch at least every few hours. LeSS also includes a variety of other mechanisms for coordinating, mentoring, and learning.

Adopting LeSS

The material in this book is completely compatible with LeSS, except that most things related to team ownership are owned by the LeSS teams together, rather than by a specific team. This is particularly true of product management and code ownership. Additionally, some of LeSS’s terms are different from this book’s, but they can be found in the index.

Continuous integration is particularly important for LeSS, and the commit build needs to be fast. You might need to use multi-stage builds (see “Multistage Integration Builds” on page XX) more aggressively than this book recommends. Specifically, you may need to move some or all of your tests to the secondary build, despite the increased risk of breaking the build.
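
To make that concrete, here’s a minimal sketch of a two-stage build script, written in Python purely for illustration. It isn’t from LeSS or this book’s practices; the commands, npm-based tooling, and the way the suites are split between stages are all assumptions you’d replace with your own:

    # A hypothetical two-stage build runner. The commit stage stays fast so
    # continuous integration isn't blocked; slower suites run in a secondary
    # stage after integration.
    import subprocess
    import sys

    def run(command):
        """Run one build step; abort the build with its exit code on failure."""
        result = subprocess.run(command)
        if result.returncode != 0:
            sys.exit(result.returncode)

    def commit_stage():
        run(["npm", "run", "lint"])
        run(["npm", "run", "test:unit"])        # fast tests only

    def secondary_stage():
        # Tests moved here keep the commit build fast, at the cost of
        # discovering some failures after integration instead of before.
        run(["npm", "run", "test:integration"])
        run(["npm", "run", "test:e2e"])

    if __name__ == "__main__":
        stage = sys.argv[1] if len(sys.argv) > 1 else "commit"
        if stage == "commit":
            commit_stage()
        else:
            secondary_stage()

You might run the commit stage on every integration and the secondary stage asynchronously on a build server.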

Allies
Collective Code Ownership
Test-Driven Development
Continuous Integration

If you’re looking for an established, well-tested approach to scaling Agile, start with LeSS. You’ll need to develop fluency in the Focusing and Delivering zones. The Focusing zone is fundamental and the Delivering zone is necessary for teams to share ownership of their code. At a minimum, you’ll need collective code ownership, test-driven development, and continuous integration.

For more about LeSS, see the LeSS website at less.works, or the LeSS book, Large-Scale Scrum: More with LeSS. [Larman and Vodde 2015]

FAST

FAST stands for Fluid Scaling Technology. It’s the brainchild of Ron Quartel, and it’s one of the most promising approaches to scaling I’ve seen. Unfortunately, at the time of this writing, it’s also the least proven. I’m including it because I think it deserves your attention.3

3Many thanks to Ron Quartel for providing feedback on my discussion of FAST.

Ron Quartel created FAST at a health insurance provider in Washington. At its peak, he had 65 people operating as a single team. He started with Extreme Programming (XP) as the base, then layered on Open Space Technology, a technique for helping large groups self-organize around topics.

In comparison to LeSS, FAST is much more, well, fluid. LeSS is based on iterations and long-lived teams that own specific stories. FAST uses a continuous flow of work and forms new teams every few days. There’s no team-level ownership in FAST.

A FAST group is called a “tribe.” Each tribe consists of developers and one or more product managers (which FAST calls “product directors”) who are responsible for setting direction. The whole tribe can consist of up to 150 people, in theory, although that hadn’t been tested at the time of this writing.

Every two days—although this is flexible—the tribe gets together for a “FAST Meeting,” where they decide what to work on. It’s a short, quick meeting. The product managers explain their priorities, and then people volunteer to lead a team to work on something. These leaders are called “team stewards.” Anybody can volunteer to be a steward. It’s a temporary role that only lasts until the next FAST Meeting.

Product managers’ priorities are a guide, not a dictate. Team stewards can choose to work on whatever they like, although they’re expected to act in good faith. That sometimes involves doing something the product managers didn’t explicitly ask for, such as cleaning up crufty code or reducing development friction.

Once the stewards have volunteered, and explained what their team will work on, the rest of the tribe self-selects onto the teams, according to who they want to work with and what they want to work on.

Rather than creating detailed story breakdowns, FAST teams create a “discovery tree” for each valuable increment. (A valuable increment is something that can be released on its own—see “Valuable Increments” on page XX.) A discovery tree is a hierarchical, just-in-time breakdown of the work required to release the increment. It’s represented with sticky notes on a wall, or virtual stickies on a virtual whiteboard.

Teams work for two days, or whatever cadence the tribe has chosen. They’re not expected to finish anything specific in that time. Instead, they just make as much progress as they can. The discovery trees are used to provide continuity and help people see progress. Someone may also volunteer to be a “feature steward” for a particular discovery tree, if needed for additional continuity. Other cross-team coordination happens on an ad-hoc, peer-to-peer basis, similar to LeSS.

After the two days are up, the tribe has another FAST meeting. The teams briefly recap their progress and the cycle repeats. It’s fast, fluid, and low ceremony.

Adopting FAST

FAST isn’t as compatible with this book as LeSS is. Many of the practices in the Focusing zone won’t apply perfectly.

Allies
Team Room
Alignment
Retrospectives
Visual Planning
The Planning Game
Task Planning
Capacity
Slack
Stand-Up Meetings
Forecasting
Team Dynamics

Specifically:

  • Everything that refers to “the team” in this book applies to the overall FAST tribe instead;

  • You will have additional team room needs, although the existing guidance remains relevant, especially for remote teams;

  • Alignment chartering and retrospectives have to be adjusted to work with a larger group of people, and they’re likely to need more experienced facilitation, especially for remote teams;

  • Visual planning applies as-is, but no longer includes anything smaller than a valuable increment;

  • The planning game, task planning, and capacity are no longer needed;

  • Slack needs to be introduced in another way;

  • Stand-up meetings are replaced by the FAST meeting;

  • Forecasting is entirely different (and much simpler, although its accuracy hasn’t been assessed); and

  • Team dynamics are complicated by the lack of stable teams.4

4XXX Add reference to Dynamic Reteaming?

On the other hand, the Delivering and Optimizing practices apply equally well. As with LeSS, you may need to be more aggressive about the speed of continuous integration.

Although FAST hasn’t been proven to the degree LeSS has, I think it’s very promising. If you have Agile experience and are comfortable trying it with a pilot team of 10-30 people, I recommend giving it a shot.

To try FAST, you’ll need experienced coaches. In theory, FAST only requires Focusing fluency, but Ron Quartel included experienced XP coaches in his FAST pilot, and I suspect their familiarity with Delivering as well as Focusing practices is part of what made FAST work. If you try it, I suggest you do the same.

You can find more about FAST at fastagile.io. Look for the “FAST Guide.” It’s a quick and easy read. I also have an interview with Ron Quartel about FAST at [Shore 2021].

Challenges and benefits of vertical scaling
Ally
Collective Code Ownership

The Achilles’ heel of vertical scaling is also its strength: shared ownership. A vertically-scaled group of teams shares responsibility for the entire codebase. This requires people to have familiarity with a wide variety of code. In practice, at least for LeSS and FAST, people do tend to specialize, choosing to work on things that are familiar, but it’s still a lot to learn.

That’s not the biggest problem, though. The real problem is that it’s easy for collective code ownership to turn into no code ownership. You see, collective code ownership doesn’t just grant the ability to change code; it also grants a responsibility to make the code better when you see an opportunity. It’s easy for large groups to assume somebody else will do it. This can be a problem in small teams, too, but it’s magnified in large groups. They require extra coaching to help people follow through on their responsibility.

On the other hand, vertical scaling solves one of the major problems when scaling Agile: creating cross-functional teams. Agile teams need people with specialized skills, such as UX design, operations, and security. If your teams only have six or seven people each, it’s hard to justify including people with those skills on every team. But then you run into an allocation problem. How do you make sure each team has everyone it needs at the time it needs them?

This isn’t a problem for vertically-scaled groups. If you have thirty people, and only enough work for two UX folks, no problem. You can include just two UX people. In FAST, they’ll allocate themselves to the teams that need their skills. In LeSS, they’ll join a specific team or two, and those teams will volunteer for UX-related work.

Scaling Horizontally

Although vertical scaling is my preferred approach to large-scale Agile, many organizations turn to horizontal scaling instead. In horizontal scaling, the focus is on allowing teams to work in isolation. Rather than sharing ownership of a product or portfolio, as vertical scaling does, horizontal scaling slices up the product or portfolio into individual responsibilities which are owned by specific teams.

The challenge in horizontal scaling is to define team responsibilities in a way that keeps teams as isolated as possible. It’s very difficult to get right, and it has trouble adjusting to changes in product priorities.

In theory, each team should own a customer-centric end-to-end slice of the product. In practice, horizontally-scaled teams are so small, they have trouble owning a whole slice. You end up with two teams needing access to the same code. But in the horizontally-scaled model, teams aren’t supposed to share code with other teams.

As a result, although the ideal is for every team to own a slice of the product, you almost always have to introduce other, less ideal types of teams as well. The book Team Topologies [Skelton and Pais 2019] divides them into four categories:

  • Stream-aligned teams. The ideal. Focused on a particular product, customer-facing slice of a product, or customer group.

  • Complicated-subsystem teams. Focused on building a part of the system that requires particularly specialized knowledge, such as a machine-learning component in a larger cloud offering. These types of teams should be created carefully, and only when the knowledge needed is truly specialized.

  • Enabling teams. Focused on providing specialized expertise to other teams, such as UX, operations, or security. Rather than doing work on behalf of other teams, which would cause them to become a bottleneck, they focus on helping teams learn how to do the work themselves. Sometimes this involves providing resources for simplifying complex problems, such as security checklists or UX design guidelines.

  • Platform teams. Similar to enabling teams, except they provide tooling rather than direct help. Like enabling teams, they don’t solve problems for other teams; instead, their tools allow teams to solve their own problems. For example, a platform team may provide tools for deploying software.

The secret to successful horizontal scaling is how you allocate responsibilities to teams. The fewer cross-team dependencies, the better. It’s fundamentally a question of architecture, because the responsibilities of your teams need to mimic your desired system architecture. (This is also called the Inverse Conway Maneuver.)

Horizontal scaling works best when you only have a handful of teams. When the number of teams is small, it’s easy to understand how everyone fits together and to coordinate their work. If there’s a problem, representatives from each team can get together and work things out.

The ad-hoc coordination approach breaks down somewhere between 5-10 teams. Bottlenecks start to form, with some teams stalled and others having too much work. You have to pay particular attention to your team design to keep teams as independent as possible and to minimize cross-team dependencies. Every team, especially non-stream-aligned teams, has to make their dependents’ autonomy their top priority, and product managers have to coordinate carefully to make sure everyone’s work aligns.

When you get up to 30-100 teams, even that approach starts to break down. Changes are more frequent and team responsibilities have to be adjusted to keep up with changes in business priorities. You need multiple layers of coordination and management. It becomes impossible for people to understand the whole system.

In practice, although horizontal scaling can continue indefinitely, it becomes more and more difficult to manage as the number of teams grows. Vertical scaling is more flexible, but it can’t scale as far. Fortunately, you can combine the two approaches to get the best of both worlds.

Scaling Vertically and Horizontally

I worked with a start-up that had reached 300 team members and stalled out. (The overall organization had over 1,000 people, but about 300 were on product development teams.) Their teams were all working on different aspects of the same product and their cross-team dependencies were killing them.

I approached it from a horizontal scaling perspective. I helped them restructure their team responsibilities to minimize dependencies and maximize isolation. They ended up with about 40 teams—about the same as before—but they were much more independent. That unblocked their development efforts, and they resumed growing. They got up to 80 teams before they hit new roadblocks.

Everybody was very happy with the results. If I could do it over again, though, I would have introduced vertical scaling too. Instead of 40 teams, they could have formed six 50-person groups. Coordinating six vertically-scaled groups is dramatically easier than coordinating 40 small teams, and they wouldn’t have had any problem scaling further. Even once they started running into coordination challenges, the horizontal scaling techniques would have allowed them to grow by an order of magnitude.

Better yet, because vertically-scaled groups are so large, they all could have been stream-aligned. The design we created had a bunch of enabling teams and platform teams, some of whom struggled to understand their role. Stream-aligned teams are much more straightforward. With vertically-scaled groups, that’s all they would have needed, except for their operations platform.

Part of the reason things broke down when they reached 80 teams is that they hadn’t kept their team responsibilities up to date. We had designed in a mechanism for reviewing and updating team responsibilities—it was the job of the architecture team—but, as so often happens, it got forgotten in the rush of meeting other responsibilities. Vertically-scaled groups don’t need the same amount of maintenance. They have the ability to adapt to changing business conditions much more easily.

In other words, you can combine horizontal scaling with vertical scaling by treating each vertically-scaled group as a single “team” from a horizontal scaling perspective. If you do, almost every one can be stream-aligned, with the possible exception of a group for your operations platform.

My Recommendation

Bottom line: How should you scale your Agile organization?

Begin by emphasizing team fluency. The most common mistake organizations make is to spread Agile widely without building their fundamental capability. In most cases, to scale well, you’ll need your teams to develop both Focusing and Delivering fluency.

Scale vertically before you scale horizontally. In most cases, LeSS is your best choice. If you’re experienced and willing to experiment, try FAST.

If you reach the limits of vertical scaling—probably somewhere around 60-70 people, although FAST may be able to scale further—split into multiple vertically-scaled groups. Each one should be stream-aligned. You shouldn’t need a complicated-subsystem group or enabling group, because your groups will be large enough to include all the expertise you need. In some cases, you might want to extract out a platform group to take care of common infrastructure—typically, an operations and deployment platform.

If you’re using LeSS, LeSS Huge describes this sort of horizontal scaling split, albeit with a slightly different flavor. It retains LeSS’s emphasis on collective code ownership, even across the two groups (which LeSS calls “areas”). However, in practice, the groups tend to specialize.

But remember: successful scaling depends on fluent teams. That’s what the rest of this book is about. We’ll start with Focusing fluency.

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

AoAD2 Chapter: Into the Future

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 19, 2021

Into the Future

Agile teams never stop learning, experimenting, and improving. The practices in this book are only the starting point. Once you understand a practice, make it yours! Experiment with alternatives and seek out new ideas. As you become more fluent, deliberately break the rules and see what happens. You’ll learn why the rules exist... and what their limits are.

What comes after that? That’s for you to decide. Agile is always customized to the needs of the team.

In the Agile Fluency Model, Diana Larsen and I identified a possible fourth zone: Strengthening. If you look carefully, each zone represents a different expansion of the team’s circle of control: Focusing gives the team ownership of their tasks; Delivering gives them ownership of their releases; Optimizing gives them ownership of their product.

Strengthening continues this trend by expanding teams’ ownership over organizational strategy. People don’t just make decisions focused on their teams; they come together to make decisions affecting many teams. One example that’s starting to enter the mainstream is team self-selection. In team self-selection, team members decide for themselves which team they’ll be part of, rather than being assigned by management.

Sounds crazy? It’s not. It’s carefully structured, not a free-for-all. (See [Mamoli and Mole 2015] for details.) I’ve used team self-selection myself and it’s surprisingly effective. The results are better than I’ve seen from traditional manager-driven selection. It leads to teams that are highly productive out of the gate.

The Strengthening zone is about this sort of bottom-up decision-making. Governance approaches such as Sociocracy (sociocracy.info) and Holacracy (holacracy.org) are experimenting in this space, as are companies such as Valve Software, Semco, and W. L. Gore & Associates. Jutta Eckstein and John Buck’s book, Company-wide Agility with Beyond Budgeting, Open Space & Sociocracy [Eckstein and Buck 2020]1 goes into more detail. For a lighter-weight introduction to the philosophy, see Ricardo Semler’s Maverick. [Semler 1995] It’s a fascinating account of the author’s revitalization of his company’s management approach.

1XXX Review new edition

That said, the Agile Fluency Model has never been a maturity model. You’re not required to pass through the zones in order, or achieve fluency in every zone. Although individual practices, such as team self-selection, have their place, I suspect full Strengthening fluency is inappropriate for most companies. But if you want to live on the cutting edge and join the ranks of the innovators who made Agile what it is today, the Strengthening zone is one place to start. Beyond that... who knows? There are additional zones waiting to be discovered.

Ultimately, though, Agile doesn’t matter. Really! What matters is success, for your team members, organization, and stakeholders, in whatever way they define it. Agile practices, principles, and ideas are merely guides along the way. Start by following the practices rigorously. Learn how to apply the principles and key ideas. Break the rules, experiment, see what works, and learn some more. Share your insights and passion, and learn even more.

Over time, with discipline and experience, the practices and principles will become less important. When doing the right thing is a matter of instinct and intuition, finely honed by experience, it’s time to leave rules and principles behind. It won’t matter what you call it. When your intuition leads to great software that serves a valuable purpose, and your wisdom inspires the next generation of teams, you will have mastered the art of Agile development.

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

AoAD2 Chapter: Discovery

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 19, 2021

Discovery

Optimizing teams make their own product decisions. How do they know what to build?

Ally
Whole Team

Partly, they know what to build because they include people with product expertise. Those team members have the background and training to decide what to do.

But the fact is, at least at the beginning of a new product, nobody is 100% sure what to do. Some people pretend to know, but Optimizing teams don’t. Their ideas are, at best, very good guesses about what will lead to success.

So the job of the Optimizing team isn’t to know what to build, but to discover what to build. Steve Blank, whose work was the basis for the Lean Startup movement, put it this way:

[T]he task is unambiguous—learn and discover what problems customers have, and whether your product concept solves that problem; understand who will buy it; and use that knowledge to build a sales roadmap so a sales team can sell it to them. And [you] must have the agility to move with sudden and rapid shifts based on what customers have to say and the clout to reconfigure [your team] when customer feedback requires it. [Blank 2020] (app. A)

Steve Blank, The Four Steps to the Epiphany

Steve Blank was talking about startups, but this quote applies equally well to Optimizing teams. Even if you aren’t selling your software! No matter who your customers and users are—even if they’re Keven and Kyla, who sit in the next cubicle over—your job is to figure out how to bring them value. And, just as importantly, how to do so in a way they will actually buy or use.

Validated Learning

I can’t count the number of times I’ve had a good idea, put it in front of real customers or users, and found out that it didn’t work out. Sure, they would tell me they loved the idea when I told them about it. Sometimes, even after they tried a prototype! It was only when I asked people to make a real expenditure—of time, money, or political capital—that I learned my “good idea” wasn’t good enough.

Product ideas are like a perpetual motion machine: if you believe hard enough, and have enough inertia, they look like they’ll last forever. Put a real load on them, though, and they grind to a halt.

Allies
Blind Spot Discovery
Real Customer Involvement

Validated learning is one of your best tools for testing ideas. I discussed it in “Validated Learning” on page XX, but to recap, validated learning involves making a hypothesis about your market, building something you can put in front of them, and measuring what happens. Use what you’ve learned to adjust your plans, then repeat. This is often referred to as the Build-Measure-Learn loop.

To truly validate your learning, you need real customers (or users) and real costs. If you show what you’ve built to people who aren’t part of your target market, you’ll get feedback, but it might not be relevant to your actual situation. And if you don’t ask them to commit something in exchange, you’ll learn more about people’s desire to avoid hurting your feelings than about the actual value of your idea. Everybody will praise your idea for a luxury vacation... until you ask them for their down payment.1

1Then it’s all, “Oh, I don’t have time,” “I couldn’t leave my chihuahua Fluffles all alone,” and “I hate tropical sand. It’s rough and irritating, and it gets everywhere.”

Adaptability

Ally
Adaptive Planning

Every time you go through the Build-Measure-Learn loop, you’ll learn something new. To take advantage of what you learned, you’ll have to change your plans. As a result, Optimizing teams tend to keep their planning horizons short and their plans adaptable. They keep their valuable increments small so they can change direction without waste.

Valuable increments (see “Valuable Increments” on page XX) aren’t just about features and capabilities. Remember, there are three common categories of value:

  • Direct value. You’ve built something that provides one of the types of value described in “What Do Organizations Value” on page XX.

  • Learning value. You’ve built something that helps you understand your market and future prospects better.

  • Option value. You’ve built something that allows you to change direction for less cost.

For Optimizing teams, learning and options are just as important as direct value. In the beginning, they can even be more important than direct value, because they allow the team to avoid wasting time building the wrong things. Every Build-Measure-Learn loop is an example of a “learning value” increment.

Options thinking is also common in Optimizing teams. The future is uncertain, and no plans are set in stone, so Optimizing teams ensure they have the ability to adapt. They do so by thinking about future possibilities and building “option value” increments. A prospective analysis, described in “Prospective Analysis” on page XX, is one way to start identifying those options.

Options are also an important technique for managing risk. If your prospective analysis shows a substantial risk—for example, a competitor providing a less lucrative, but more attractive pricing model—you could build an option that allows you to change your pricing model with the flip of a switch.
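
As a concrete illustration, here’s a minimal sketch of such a switch in Python. Everything in it is hypothetical: the pricing models, the rates, and the flag itself, which in a real system would come from configuration or a feature-flag service rather than a constant:

    # A hypothetical "option value" increment: a pricing-model switch built
    # in advance, so a later business decision costs a configuration change
    # rather than a development project.
    from enum import Enum

    class PricingModel(Enum):
        PER_SEAT = "per_seat"          # the current model
        USAGE_BASED = "usage_based"    # the pre-built option

    # Illustrative stand-in for a configuration value or feature flag.
    ACTIVE_MODEL = PricingModel.PER_SEAT

    def monthly_charge(seats, api_calls):
        """Compute a customer's bill under whichever model is switched on."""
        if ACTIVE_MODEL is PricingModel.PER_SEAT:
            return seats * 30.00       # illustrative rate
        return api_calls * 0.002       # illustrative rate

Flipping the flag is trivial; the real investment is the up-front analysis and testing that made both branches safe to ship in advance.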

Another sort of option involves deadlines. Although Optimizing teams avoid arbitrary deadlines, sometimes value depends on releasing before a certain date. For example, video games need to be delivered in time for the holiday season, tax software needs to be updated yearly, and new regulations can have strict deadlines with harsh compliance penalties.

To meet these deadlines, Optimizing teams will often build a “safety” increment before embarking on a more ambitious idea. The “safety” increment fulfills the demands of the deadline, in a minimal way, leaving the team free to work on their more ambitious ideas without worry. If those ideas don’t pan out, or can’t be completed in time, the team releases the “safety” increment instead.

For example, reviewer Bill Wake shared the (possibly apocryphal) story of a printer company which needed to deliver a red eye removal feature for their new photo printer. The hardware had a strict release date, so the software team started with a primitive red eye algorithm, then worked on a more sophisticated approach.

Experiments and Further Reading

There’s much, much more to deciding product direction than I can cover in this book. Opportunities for further reading abound; look in the product management category. One place to start is Marty Cagan’s Inspired.2

2XXX review, add bibliography reference

The point to remember is that, in addition to normal product management, Optimizing teams engage with their customers to understand their market and validate their ideas. They exist to learn as much as they do to build, and the flexibility of their plans reflects that focus. The Lean Startup movement calls this customer discovery and customer validation.

For much more detail about these ideas, see The Startup Owner’s Manual [Blank and Dorf 2020]. It’s an updated version of Steve Blank’s book, The Four Steps to the Epiphany. [Blank 2020] Blank’s ideas, combined with Extreme Programming, formed the basis of Eric Ries’ Lean Startup movement. [Ries 2011]

As you can imagine, The Startup Owner’s Manual is focused on startups, so its advice will need customization to your situation, but Optimizing teams have a lot of similarities to startups. A successful Optimizing team isn’t just carrying on with the status quo. If it were, Focusing and Delivering fluency would be sufficient. Instead, it’s seeking ways to lead its market and develop new markets. Lean Startup ideas, including the foundational ideas of customer discovery and customer validation, are a key part of how you can do so.

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

AoAD2 Chapter: Autonomy

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 19, 2021

Autonomy

Optimizing fluency is fairly rare, but it’s not because the Optimizing zone represents a big change in Agile practices. On the contrary: Optimizing is mostly an application of the practices found throughout the rest of this book. Optimizing fluency isn’t rare because it’s hard; it’s rare because it requires a level of team autonomy most organizations aren’t ready to support.

Everybody knows Agile teams are supposed to be autonomous, but organizations with Optimizing teams really mean it. For them, autonomy is more than just enabling teams to work independently. They give their teams full responsibility for their finances and product plans, too.

Business Expertise

Allies
Whole Team
Team Dynamics
Stakeholder Trust

Of course, in order for your team to own its financial and product decisions, the team needs to have the ability to make good decisions. A whole team with both business and development expertise has always been the goal, but many organizations short-change the business side of their teams. They assign a product manager who can only participate a few hours a week, or assign product “owners” who have no real decision-making authority. Some teams get the worst of both worlds: product owners who are spread too thin and have no decision-making authority.

Optimizing teams have real business authority and expertise. It’s not siloed behind a single person, either. Everybody on the team takes an interest in producing value. Some more than others, of course, but there’s no jealous hoarding of responsibility. You’ll get the best results when your entire team sees their job as learning how to better serve customers, users, and stakeholders.

Business Decisions

One of the most striking things about Optimizing teams is their lack of emphasis on user stories. They have stories, of course, as a planning mechanism, but they’re not the topic of their conversations with stakeholders. Instead, they’re all about business results and value. They’re not trying to deliver a set of stories; that’s a detail. They’re trying to make a meaningful difference to their organization.

This is particularly true of their relationship with management. Optimizing teams have the trust of their organization. Executives and managers know they can give the team funding and a mission, then stand back. The team will work out how to achieve their mission on their own. They’ll let their executives know how the funding is being spent, what results they’re achieving, and what support they need to be more successful.

Ally
Adaptive Planning

One of the consequences of this approach is that Optimizing teams rarely follow a predetermined plan. In general, their valuable increments are small, their plans highly adaptive, and their planning horizons are short. Rather than working a big, static plan, they’re constantly testing ideas and making incremental progress. (At least from the perspective of internal stakeholders. They can still choose to save up work for a big splashy release.)

Ally
Forecasting

As a result, Optimizing teams tend not to have traditional deadlines or roadmaps. When they do set a deadline, it’s a choice they make for themselves. They do so because there’s a compelling business reason, such as coordinating with a marketing effort, not because it satisfies a bureaucratic requirement. If they realize they won’t be able to achieve a deadline, they decide for themselves how and when to change their plans.

Accountability and Oversight

Optimizing teams aren’t without oversight. They may have control over their budget and plans, but that doesn’t mean they get to do whatever they want. They still have to show their work and justify their big-picture decisions. They just don’t have to get advance approval for their decisions, so long as they relate to the team’s purpose and don’t require additional resources from the organization.

Ally
Purpose

The organization uses the team’s purpose to put guide rails around the team’s work. The team’s purpose sets out the big-picture direction for the team (the vision), their current near-term goal (the mission), and the signposts that lead to success (the indicators). Management provides the general direction, and the team collaborates with them and other stakeholders to work out the details. When the team sees an opportunity to change their purpose to be more valuable, they talk it over with management.

The team demonstrates its accountability, not by showing the stories it’s delivered, but by focusing on business results: both what they’ve achieved so far, and what they hope to achieve in the future. These results may be straightforward, such as revenue numbers, or more subtle, such as employee satisfaction scores. Either way, the emphasis is on outcomes, not deliverables and dates.

Allies
Stakeholder Demos
Roadmaps

Optimizing teams aren’t just trying to achieve short-term outcomes, though. They’re also constantly learning how to better serve their users and their market. So they also talk about what they’ve learned, what they want to learn next, and how they plan to do so. All this information is shared through the team’s internal demos, their internal roadmaps, and private conversations with management.

Funding

The team’s funding is another of the organization’s oversight mechanisms. Optimizing teams are typically funded on an ongoing “business as usual” basis (see “Agile Governance” on page XX). The organization allocates those funds based on the outcomes they expect from the team. The team can also procure one-off funds and resources by going to management with their justification.

Ally
Context

If the team doesn’t think they can achieve their purpose with the funds and other resources they have, they can ask their sponsor for more. If the sponsor doesn’t agree, the team and their sponsor collaborate to find a balance that can be achieved, or the team pivots to a new, more valuable purpose. This discussion typically happens during context chartering.

As the team’s work progresses, the organization’s predictions about value will come true... or not. This is an opportunity to adjust the team’s purpose. If the team is producing more value than expected, the funding can be increased, and the team can double down on their successes. If it’s producing less, the funding can be decreased, or the team can pivot to a more valuable purpose.

Experiments and Further Reading

I’ve only touched on the possibilities for autonomy in Optimizing teams. The Agile community is full of interesting ideas and experiments. Many of these experiments push into the Strengthening zone of fluency, which I touch upon in chapter “Into the Future”.

In the realm of Optimizing teams, though, one of the most interesting ideas is “Beyond Budgeting.” It has an emphasis on disseminating decision-making to customer-focused teams, similar to what I’ve described here, but it goes into much more depth on the management side of things. To learn more, see Jeremy Hope and Robin Fraser’s book, Beyond Budgeting. [Hope and Fraser 2003]

XXX Pat Reed? Johanna Rothman?

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

AoAD2 Part IV: Optimizing Outcomes (Introduction)

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 19, 2021

Optimizing Outcomes

October has rolled around again. Last year, your team achieved Delivering fluency (see Part III). At the time, some team members wanted to push for Optimizing fluency, too, but management was skeptical. You couldn’t get the support you needed.

Since you’ve achieved Delivering fluency, though, your team has been firing on all cylinders. Productivity went way up; defects, way down. Hanna, your product manager, was having trouble keeping up. She delegated more and more responsibilities to the team, which rose to the challenge.

It got noticed. Hanna was singing your praises to the marketing director, and your boss was talking you up to the engineering director. The time was right to push for Optimizing fluency again. This time, it worked. Hanna was assigned to join your team full time. Not only that, she got permission to try “the Agile experiment.”

“The Agile experiment” is what they’re calling the way Hanna works with your team. Instead of having to go through a yearly planning exercise like the rest of Marketing, she got permission to own your team’s financials. She meets with her boss regularly to share statistics such as revenue and customer retention, and she’s constantly trying out new ideas and experiments. (Her colleagues are jealous. They still have to go through six weeks of budget and target-setting hell every year.)

It’s not just Hanna. The whole team is getting in on the action. Although Hanna is first among equals when it comes to product marketing expertise, other members of the team have developed their own areas of expertise. Shayna, in particular, loves visiting customer sites to see how people work.

Shayna’s just asked for the team’s attention. “I just finished a remote session with Magda,” she says. “You all remember Magda, right?” Nods all around. Magda is a developer who works for one of your new customers. Her company’s bigger than your normal customers, so they’ve been pretty demanding.

“Magda’s company has been dealing with an increasingly complex tax situation,” Shayna continues. “They have remote employees in more and more countries all over the world, and dealing with the various taxes and employment law is overwhelming. Magda’s heading up a team to automate some of that work, and she wanted to know how to integrate with our API.”

“But it got me thinking.” Shayna’s voice rises in excitement. “That isn’t too far off from what we do already. What if we sold an add-on module for international employment? It’s a lot of work, but we could start one country at a time. And Bo, you have some experience in this area, right?” Bo nods thoughtfully.

Hanna purses her lips. “It’s a big bet,” she says. “But it could have a huge pay-off. This could crack open the market for more companies like Magda’s. It would definitely widen our moat. None of our direct competitors have anything like that, and the big players charge two arms, a leg, and half your torso in professional services fees. Plus, we’re a lot more user-friendly.” She grins. It has a lot of teeth. “We’d only need to charge an arm and a leg. What do the rest of you think?”

Your team engages in a rapid-fire discussion of the idea. As you come to the consensus that it’s worth pursuing, Hanna nods sharply. “I love it. We’ll need to validate the market and figure out how to break it down into smaller bets. I’ll put a story on next week’s plan to come up with Build-Measure-Learn experiments. We can start on them after we release our current increment. In the meantime, I’ll do some research and run it by the boss. If the experiments work out, we’ll need her to approve more funding and a change to our mission.”

“Thanks, Shayna,” she finishes. “This is why I love being part of this team.”

Welcome to the Optimizing Zone

The Optimizing zone is for teams who want to create more value. They take ownership of their product plans and budget so they can experiment, iterate, and learn. This allows them to produce software that leads their market. Specifically, teams who are fluent at Optimizing:1

1These lists are derived from [Shore and Larsen 2018].

  • Deliver products that meet business objectives and market needs. (Teams fluent in the other zones deliver what they’re asked to deliver, which isn’t necessarily the same.)

  • Include broad-based expertise that promotes optimal cost/value decisions.

  • Understand where their products stand in the market and how they’ll improve their position.

  • Coordinate with leadership to cancel or pivot low-value products early.

  • Learn from market feedback to anticipate customer needs and create new business opportunities.

  • Make business decisions quickly and effectively.

To achieve these benefits, teams need to develop the following skills. Doing so requires the investments described in chapter “Choose Your Agility”.

The team responds to business needs:

  • The team describes its plans and progress in terms of business metric outcomes jointly identified with management.

  • The team collaborates with internal and external stakeholders to determine when and how roadmaps will provide the best return on investment.

The team works as a trusted, autonomous team:

  • The team coordinates with management to understand and refine their role in achieving the organization’s overall business strategy.

  • The team jointly takes responsibility, and accepts accountability, for achieving the business outcomes they identify.

  • Management gives the team the resources and authority they need to autonomously achieve their business outcomes.

  • Management ensures the team includes dedicated team members who have all the day-to-day skills the team needs to understand the market and achieve their business outcomes.

The team pursues product greatness:

  • The team engages with their customers and market to understand product needs and opportunities.

  • The team creates hypotheses about business opportunities and conducts experiments to test them.

  • The team plans and develops their work in a way that allows them to completely change plans, without waste, given less than a month’s notice.

Achieving Optimizing Fluency

The investments needed for Optimizing fluency challenge the preconceptions and established order of most companies. They require giving up a lot of control and putting a lot of trust in the team. There’s oversight, but it can still be scary.

As a result, you’ll usually need to demonstrate success with Focusing and Delivering fluency for a few years before your company will give you the authority and autonomy needed for Optimizing fluency. Early-stage startups tend to be an exception, but everyone else will have some trust-building to do.

By the time you’re ready for Optimizing, your team is likely to have mastered the rest of the practices in this book. You won’t need a how-to guide any more. You’ll have mastered the art.

So the chapters in this part are short and sweet. They provide clues about what to try next. It’s up to you to take what you’ve learned about Agile development, combine it with these ideas, and create something great of your own. These chapters will help you get started:

  • Chapter “Autonomy” discusses the nature of autonomous teams.

  • Chapter “Discovery” discusses ways your team can learn.

  • Chapter “Into the Future” wraps up with a look at what comes next.

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

AoAD2 Practice: Team Dynamics

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 16, 2021

Team Dynamics

Audience
Whole Team

XXX by Diana Larsen

We steadily improve our ability to work together.

Your team’s ability to work together forms the bedrock of their ability to develop and deliver software. You need collaboration skills, the ability to share leadership roles, and an understanding of how teams evolve over time. Together, these skills determine your team dynamics.

Team dynamics are the invisible undercurrents that determine your team’s culture. They’re the way people interact and cooperate. Healthy team dynamics lead to a culture of achievement and well-being. Unhealthy team dynamics lead to a culture of disappointment and dysfunction.

Anyone on the team can have a role in influencing these dynamics. Use the ideas in this practice to suggest ways to improve team members' capability to work together.

What Makes a Team?

A team isn’t just a group of people. In their classic book, The Wisdom of Teams, Jon Katzenbach and Douglas Smith describe six characteristics that differentiate teams from other groups:

[A real team] is a small number of people with complementary skills who are committed to a common purpose, performance goals, and approach for which they hold themselves mutually accountable. [Katzenbach and Smith 2015] (ch. 5, emphasis mine)

The Wisdom of Teams

Arlo Belshee suggests another characteristic: a shared history. A group of people gain a sense of themselves as a team by spending time working together.

If you’ve followed the practices in this book, you have all the preconditions necessary to create a great team. Now you need to develop your ability to work together.

Team Development

In 1965, Bruce W. Tuckman created a well-known model of group development. [Tuckman 1965] In it, he described four—later, five—stages of group development: forming, storming, norming, performing, and adjourning. His model outlines shifts in familiarity and interactions over time.

Don’t interpret the Tuckman model as an inevitable, purely linear progression.

No model is perfect. Don’t interpret the Tuckman model as an inevitable, purely linear progression. Teams can exhibit behaviors from any of the first four stages. Changes in membership, such as gaining members or losing valued teammates, may cause a team to slip into an earlier stage. When experiencing changes in environment, such as a move from colocated to remote work, or vice versa, a team may regress from later stages to earlier ones. Nevertheless, Tuckman’s model offers useful clues. You can use it to perceive patterns of behavior among your teammates and as a basis for discussions about how best to support each other.

Forming: The new kid in class

The team forms and begins working together. Individual team members recognize a sensation not unlike being the new kid in class: They’re not committed to working with others, but they want to feel included—or rather, not excluded—by the rest of the group. Team members are busy gaining the information they need to feel oriented and safe in their new territory.

You’re likely to see responses such as:

  • Excitement, anticipation, and optimism.

  • Pride in individual skills.

  • Concern about imposter syndrome (fear of being exposed as unqualified).

  • An initial, tentative attachment to the team.

  • Suspicion and anxiety about the expected team effort.

While forming, the team may produce little, if anything, toward its task goals. This is normal. The good news is, with support, most teams can move through this phase relatively quickly. Teams in the Forming stage may benefit from the wisdom of a senior team member’s prior team experiences, from a team member who gravitates toward group cohesion activities, or from coaching in team collaboration.

Allies
Purpose
Context
Alignment

Support your teammates with leadership and clear direction. (More on team leadership roles later.) Start out by looking for ways for team members to become acquainted with the work and each other. Establish a shared sense of the team’s combined strengths and personalities. Purpose, context, and alignment chartering are excellent ways to do so. You may also benefit from other exercises to get to know each other, such as “A Connection-Building Exercise” on page XX.

Along with chartering, take time to discuss and develop your team’s plan. Focus on the “do-able”; getting things done will build a sense of early success. (“Your First Week” on page XX describes how to get started.) Find and communicate resources available to the team, such as information, training, and support.

Acknowledge feelings of newness, ambivalence, confusion, or annoyance. They’re natural at this stage. Although the chartering sessions should have helped make team responsibilities clear, clarify any remaining questions about work expectations, boundaries of authority and responsibility, and working agreements. Make sure people know how their team fits with other teams working on the same product. For in-person teams, explain what nearby teams are working on, even if it isn’t related to the team’s work.

During the Forming stage, team members need the following skills:

  • Peer-to-peer communication and feedback

  • Group problem-solving

  • Interpersonal conflict management

Ensure the team has coaching, mentoring, or training in these skills as needed.

Storming: Group adolescence

The team begins its shift from a collection of individuals to a team. Though they aren’t yet fully effective, they have the beginnings of mutual understanding.

During the Storming stage, the team deals with contentious issues. It’s a turbulent time of collaboratively choosing direction and making decisions together. That’s why Tuckman called it “Storming.” Team members have achieved a degree of comfort—enough to begin challenging each other’s ideas. They understand each other well enough to know where areas of disagreement surface, and they willingly air differences of opinion. This dynamic can lead to creative tension or destructive conflict, depending on how it’s handled.

Expect the following behaviors:

  • Reluctance to get on with tasks, or many differing opinions about how to do so.

  • Wariness about continuous improvement approaches.

  • Sharp fluctuations in attitude about the team and its chances of success.

  • Frustration with lack of progress or other team members.

  • Arguments between team members, even when they agree on the underlying issue.

  • Questioning the wisdom of the people who selected the team structure.

  • Suspicion about the motives of the people who appointed other members to the team. (These suspicions may be specific or generalized, and are often based more on past experience than the current situation.)

Support your Storming team by keeping an eye out for disruptive actions, such as defensiveness, competition between team members, factions or choosing sides, and jealousy. Expect increased tension and stress.

Ally
Safety

As you see these behaviors, be ready to intervene by describing the patterns you see. For example, “I notice that there’s been a lot of conflict around design approaches, and people are starting to form sides. Is there a way to bring it back to a more collegial discussion?” Maintain transparency, candor, and feedback, and surface typical conflict issues. Openly discuss the role of conflict and pressure in creative problem solving, including the connection between psychological safety and healthy conflict. Celebrate small team achievements.

When you notice an accumulation of storming behaviors on the team, typically a few weeks after the team first forms, pull the team together for a discussion of trust:

  1. Think back on all your experiences as part of any kind of team. When did you have the most trust in your teammates? Tell us a short story about that time. What conditions allowed trust to build?

  2. Reflect on the times and situations in your life when you have been trustworthy. What do you notice about yourself that you value? How have you built trust with others?

  3. In your opinion, what is the core factor that creates and sustains trust in organizations? What is the core factor that creates, nurtures, and sustains trust among team members?

  4. What three wishes would you make to heighten trust and healthy communication in this team?

This is a difficult stage, but it will help team members gain wisdom and lay the groundwork for the next stage. Watch for a sense of growing group cohesion. As cohesion grows, ensure that each member continues to express their diverse opinions, rather than shutting them down in favor of false harmony. (See “Don’t Shy Away From Conflict” on page XX.)

Norming: We’re #1

Team members have bonded into a cohesive group. They’ve found a comfortable working cadence and enjoy their collaboration. They identify as part of the team. In fact, they may identify so closely, and enjoy working together so much, that symbols of belonging appear in the workspace. You might notice matching or very similar t-shirts, coffee cups with the team name, or coordinated laptop stickers. Remote teams might have “wear a hat” or “Hawaiian shirt” days.

Norming teams have created agreement on structure and working relationships. Informal, implicit behavior norms that supplement the team’s working agreements develop through their collaboration. People outside the team may notice and comment on the team’s “teamliness.” Some may envy it—particularly if team members begin to flaunt their successes or declare their team “the best.”

Their pride is warranted. Teams in the Norming stage make significant, regular progress toward their goals. Team members face risks together and work well with one another. You’ll see the following behaviors:

  • A new ability to express criticism constructively.

  • Acceptance and appreciation of differences among team members.

  • Relief that this just might all work out well.

  • More friendliness.

  • More sharing of personal stories and confidences.

  • Open discussions of team dynamics.

  • Desire to review and update working agreements and boundary issues with other teams.

How do you encourage your Norming team? Look outside your team boundaries and broaden team members' focus. Facilitate contact with customers and suppliers. (Field trips!) If the team's work relates to the work of other teams, ask to train in cross-team groups.

Build your team's cohesiveness and open your horizons, as well. Look for opportunities for team members to share experiences, such as volunteering together or presenting to other parts of the organization. Make sure these opportunities are suitable for all team members, so your good intentions don’t create in- and out-groups.

The skills needed by Norming teams include:

  • Feedback and listening.

  • Group decision-making processes.

  • Understanding the organizational perspective on their work.

Books such as What Did You Say? The Art of Giving and Receiving Feedback [Seashore et al. 2013] and Facilitator’s Guide to Participatory Decision-Making [Kaner 1998] will help the team learn the first two skills, and including the whole team in discussions with organizational leaders will help with the third.

Watch out for attempts to preserve harmony by avoiding conflicts.

Watch out for attempts to preserve harmony by avoiding conflicts. In their reluctance to return to Storming, team members may display groupthink: a form of false harmony where team members avoid disagreeing with each other, even when it’s justified. Groupthink: Psychological Studies of Policy Decisions and Fiascoes [Janis 1982] is a classic book that explores this phenomenon.

Ally
Safety

Discuss team decision-making approaches when you see the symptoms of groupthink. One sign is team members holding back on critical remarks to keep the peace, especially if they bring up their critiques later, after it’s too late to change course. Ask for critiques, and make sure team members feel safe to disagree.

One way to avoid groupthink is to start discussions by defining the desired outcome. Work toward an outcome rather than away from a problem. Experiment with the following ground rules for team decisions:

  • Agree that each team member will act as a critical evaluator.

  • Promote open inquiry rather than stating positions.

  • Adopt a decision process that includes identifying at least three viable options before a choice is made.

  • Appoint a “contrarian” to search for counter-examples.

  • Split the team into small groups for independent discussion.

  • Schedule a “second chance” meeting to review the decision.

Performing: Team synergy

The team’s focus has shifted to getting the job done. Performance and productivity are the order of the day. The team connects with their part in the mission of the larger organization. They follow familiar, established procedures for making decisions, solving problems, and maintaining a collaborative work climate. Now the team is getting a lot of work done.

“Performing” teams transcend expectations. They exhibit greater autonomy, reach higher achievements, and have developed the ability to make rapid, high-quality decisions. Team members achieve more together than anyone would have expected from the sum of their individual effort. Team members continue to show loyalty and commitment to each other, while expressing less emotion about interactions and tasks than in earlier stages.

You’ll see these behaviors:

  • Significant insights into personal and team processes.

  • Little need for facilitative coaching. Such coaches will spend more time on liaising and mediating with the broader organization than on internal team needs.

  • Collaboration that accounts for team members’ strengths and limits.

  • Remarks such as “I look forward to working with this team,” “I can’t wait to come to work,” “This is my best job ever,” and “How can we reach even greater success?”

  • Confidence in each other, and trust that each team member will do their part toward accomplishing team goals.

  • Preventing, or working through, problems and destructive conflicts.

Individuals who have worked on Performing teams always remember their experience. They have stories about feeling closely attached to their teammates. If the team spends much time in Performing, team members may be very emotional about potential team termination or rearrangement.

Although Performing teams are at the pinnacle of team development, they still need to learn to work well with people outside the team. They’re not immune to reverting to earlier stages, either. Changes in team membership can disrupt their equilibrium, as can significant organizational change and disruptions to their established work habits. And there are always opportunities for further improvement. Keep learning, growing, and improving.

Adjourning: Separating and moving on

The team inevitably separates. They achieve their final purpose, or team members decide it’s time to move on.

Effective, highly productive teams acknowledge this stage. They recognize the benefit of farewell “ceremonies” that celebrate the team’s time together and help team members move on to their next challenge.

Communication, Collaboration, and Interaction

Team members’ communication, interaction, and collaboration create group cohesion. These exchanges influence the team’s ability to work effectively—or not.

Consider my Team Communication Model, shown in figure “Larsen’s Team Communication Model”. It illustrates how effective team communication requires developing an interconnected, interdependent series of communication skills. The model starts with just enough trust to get started. Each new skill pulls the team upward, while strengthening the supporting skills below.

A cutaway diagram of a hill with a tree on it. The hill shows four layers of strata. The bottom-most layer is labelled “Trust.” Above that is “commitment,” then “conflict,” then “creativity.” Above the hill, birds fly in the sky near the tree. This layer is labelled “high performance.” The diagram is marked “copyright 1993-2014 Diana Larsen, FutureWorks Consulting LLC. Permission to use with copyright information attached.”

Figure 1. Larsen’s Team Communication Model

Start with a strong base of trust
Ally
Alignment
Safety
Purpose

As you form your team, concentrate on helping team members find trust in one another. It doesn’t need to be a deep trust; just enough to agree to work together and commit to the work. Alignment chartering and an emphasis on psychological safety both help.

Support your growing trust with three-fold commitment

From a foundation of trust, your team will begin exploring the three-fold nature of team commitment:

  • Commitment to the team’s purpose;

  • Commitment to each other’s well-being; and

  • Commitment to the well-being of the team as a whole.

Chartering purpose and alignment will help build commitment. As commitment solidifies, trust will continue to grow. People’s sense of psychological safety will grow along with it.

Once commitment and trust start improving psychological safety, it’s a good time to examine the power dynamics of the team. No matter how egalitarian your team may be, power dynamics always exist. They’re part of being human. Left unaddressed or hidden, power dynamics turn destructive. It’s best to keep them out in the open, so the team can attempt to level the field.

Power dynamics come from individual perceptions of each other’s influence, ability to make things happen, and preferential treatment. Bring them into the open by holding a discussion of the power dynamics that exist in the team, and how they affect collaboration. Discuss how the team’s collective and diverse powers can be used to help the whole team.

Right-size conflicts with feedback

The more team members recognize each other’s commitment, the more their approach to conflict adapts. Rather than “you against me,” they start approaching conflicts as “us against the problem.” Focus on developing team members’ ability to give and receive feedback, as described in “Learn How to Give and Receive Feedback” on page XX. Approach feedback with the following goals:

  • The feedback we give and get is constructive and helpful.

  • Our feedback is caring and respectful.

  • Feedback is an integral part of our work.

  • No one is surprised by feedback; we wait for explicit agreement before giving it.

  • We offer feedback to encourage behavior as well as to discourage or change behavior.

Peer-to-peer feedback helps to deal with interpersonal conflicts while they’re small. Unaddressed, molehill resentments have the potential to grow into mountains of mistrust. The skills team members develop for feedback within the team will help them in larger conflicts with forces outside the team.

Spark creativity and innovation

What is team innovation but the clash of ideas that sparks new potential? Retaining healthy working relationships while the sparks fly is a team skill. It rises from the ability to engage and redirect conflicts toward desired outcomes. It stimulates greater innovation and creativity. Team problem-solving capability soars.

Allies
Retrospectives

Develop team creativity by offering learning challenges and playful approaches. Build it into the team’s routine. Use slack to explore new technologies, as described in “Dedicate Time to Exploration and Experimentation” on page XX. Use retrospectives to experiment with new ideas. Make space for whimsy and inventive irrelevance. (Teach each other to juggle!)

Sustain high performance

When collaboration and communication skills join with task-focused skills, high performance becomes routine. The challenge lies in sustaining high performance. Avoid complacency. As a team, continue to refine your skills in building trust, committing to the work and each other, providing feedback, and sparking creativity. Look for opportunities to build resilience and further improve.

Shared Leadership

Mary Parker Follett, a management expert also known as “the mother of modern management,” was a pioneer in the fields of organizational theory and behavior. In discussing the role of leadership, she wrote:

It seems to me that whereas power usually means power-over, the power of some person or group over some other person or group, it is possible to develop the conception of power-with, a jointly developed power, a co-active, not a coercive power... Leader and followers are both following the invisible leader—the common purpose. [Graham 1995] (pp. 103, 172)

Mary Parker Follett

Ally
Whole Team

Effective Agile teams develop “power with” among all team members. They share leadership. (See “Key Idea: Self-Organizing Teams” on page XX.) By doing so, they make the most of their collaboration and the skills of the whole team.

Follett also described “the law of the situation,” arguing for following the lead of the person with the most knowledge of the situation at hand. This is exactly how Agile teams are meant to function. Every team member has the potential to step into a leadership role. Everyone leads at times, and follows a peer leader at others.

Team members can play a variety of leadership roles, as summarized in table “Leadership Roles”.1 People can play multiple roles, including switching at will, and multiple people can fill the same role. The important thing is coverage. Teams need all these kinds of leadership from their team members.

1With the exception of “Diplomats,” these roles were developed by Diana Larsen and Esther Derby, based on [Benne and Sheats 1948]. XXX Double-check rendering of table with final stylesheet.

Table 1. Leadership Roles

             Task-Oriented                    Collaboration-Oriented
Direction    Pioneer, Instructor              Diplomat, Influencer, Follower
Guidance     Commentator, Coordinator         Promoter, Peacemaker
Evaluation   Critic, Gatekeeper, Contrarian   Reviewer, Monitor

  • Pioneers (task-oriented direction) ask questions and seek data. They scout what’s coming next, looking for new approaches and bringing fresh ideas to the team.

  • Instructors (task-oriented direction) answer questions, supply data, and coach others in task-related skills. They connect the team to relevant sources of information.

  • Diplomats (collaboration-oriented direction) connect the team with people and groups outside the team, act as a liaison, and represent the team in outside meetings.

  • Influencers (collaboration-oriented direction) encourage the team in chartering, initiating working agreements, and other activities that build awareness of team culture.

  • Followers (collaboration-oriented direction) provide support and encouragement. They step back, allowing others to take the lead in their areas of strength, or where they’re developing strength. They conform to team working agreements.

  • Commentators (task-oriented guidance) explain and analyze data. They put information into context.

  • Coordinators (task-oriented guidance) pull threads of work together in a way that makes sense. They link and integrate data and align team activities with their tasks.

  • Promoters (collaboration-oriented guidance) focus on equitable team member participation. They ensure every team member has the chance to participate and help. They encourage quieter team members to contribute their perspectives on issues that affect the team.

  • Peacemakers (collaboration-oriented guidance) work for common ground. They seek harmony, consensus, and compromise when needed. They may mediate disputes that team members have difficulty solving on their own.

  • Critics (task-oriented evaluation) evaluate and analyze relevant data, looking for risks and weaknesses in the team’s approach.

  • Gatekeepers (task-oriented evaluation) encourage work discipline and maintain working agreements. They also manage team boundaries to keep interference at bay.

  • Contrarians (task-oriented evaluation) protect the team from groupthink by deliberately seeking alternative views and opposing habitual thinking. They also vet the team’s decisions against the team’s values and principles.

  • Reviewers (collaboration-oriented evaluation) ensure the team is meeting acceptance criteria and responding to customer needs.

  • Monitors (collaboration-oriented evaluation) attend to how the whole team is working together. (Are they working well, or not?) They protect the team’s psychological safety and foster healthy working relationships among team members.

“Follower” is a particularly powerful role for people who are expected to lead.

Although it may seem strange to include “follower” as a leadership role, actively following other people’s lead helps the team learn to share leadership responsibilities. It’s a particularly powerful role for people who are expected to lead, such as senior team members.

Teams that share leadership across these roles can be called leaderful teams. To develop a leaderful team, discuss these leadership roles together. A good time to do so is when you notice uneven team participation or over-reliance on a single person to make decisions. Share the list of roles and ask the following questions:

  • How many of the leadership roles does each team member naturally enact?

  • Is anyone overloaded with leadership roles? Or filling a role they don’t want?

  • Which of these roles need multiple people to fill? (For example, the Contrarian role is best rotated among several team members.)

  • Which of these roles are missing on our team? What’s the impact of having no one to fill them?

  • How might we fill the missing roles? Who wants practice in this aspect of leadership?

  • What else do we notice about these roles?

Focus the team on choosing how they’ll cover the leadership roles to ensure effective collaboration. Be open to creating new working agreements in response to this conversation.

Some team members may be natural Contrarians, but if they always play that role, the rest of the team may fall into the trap of discounting their comments. “Oh, never mind. Li always sees the bleakest, most pessimistic side of things!” For the Contrarian role in particular, ensure that it’s shared among various team members, so it remains effective.

Toxic Behavior

Toxic behavior is any behavior that produces an unsafe environment, degrades team dynamics, or damages the team’s ability to achieve their purpose.

If a team member is exhibiting toxic behaviors, start by remembering the Retrospective Prime Directive: “Regardless of what we discover, we must understand and truly believe that everyone did the best job he or she could, given what was known at the time, his or her skills and abilities, the resources available, and the situation at hand.” [Kerth 2001] (ch. 1) Assume the person is doing the best job they can.

Look for environmental pressures first. For example, a team member may have a new baby and not be getting enough sleep. Or a new team member may be solely responsible for a vital subsystem they don't yet know well. Together, the team can make adjustments that help people improve their behavior. For example, agreeing to move the morning stand-up so the new parent can come in later, or sharing responsibility for the vital subsystem.

The next step is giving feedback to the person in question. Use the process described in “Learn How to Give and Receive Feedback” on page XX to describe the impact of their behavior and request a change. Very often, that’s enough: they didn’t realize how their behavior affected the team, and once they do, they do better.

Be careful not to misidentify Contrarians as toxic.

Sometimes, teams can label colleagues as toxic when they aren’t actually doing anything wrong. This can easily happen to people who regularly take the Contrarian leadership role. They don’t go along with the rest of the team’s ideas, or they perceive a risk or obstacle that others miss, and won’t let it go. Be careful not to misidentify Contrarians as toxic. Teams need Contrarians to avoid groupthink. However, it may be worth having a discussion about rotating the role.

Ally
Safety

If a person really is showing toxic behavior, they may ignore the team’s feedback, or refuse to adjust to the team’s psychological safety needs. If that happens, they are no longer a good match for the team. Sometimes, it’s just a personality clash, and they’ll do well on another team.

At this point, it’s time to bring in your manager, or whoever assigns team membership. Explain the situation. Good managers understand that every team member’s performance depends on every other team member. An effective leader will step in to help the team, but for them to do so, the team needs to explain what it needs, as well as the steps it has already taken to encourage a change in behavior.

Some managers may resist removing a person from the team, especially if they identify the team member as a “star performer.” They could suggest the team should accommodate the behavior instead. Unfortunately, this tends to damage the team’s performance as a whole. Ironically, it can make the “star performer” seem like even more of a star, as they push the people around them down.

In this situation, you can only decide for yourself whether the benefits from being part of the team are worth the toxic behavior you experience. If they’re not, your best option is to move to another team or organization.

Questions

Isn’t it important that a team have one leader—a “single, wringable neck”? How does that work with leaderful teams?

A “single, wringable neck” is a satisfying way to simplify a complex problem, but it’s not so satisfying for the person whose neck is being wrung. It’s also contrary to the Agile ideal of collective ownership (see “Key Idea: Collective Ownership” on page XX). The team as a whole is responsible. There’s no scapegoat to take the fall when things go wrong, or reap the rewards when things go well, because success and failure are the result of a complex interaction among multiple participants and factors. Every team member’s contribution is vital.

This isn’t just abstract philosophy. Leaderful teams do better work, and develop into high-performing teams more quickly. Sharing leadership builds stronger teams.

What if I don't have the skills to help improve our team dynamics?

If you’re not comfortable working on teamwork skills, that’s okay. You can still help. Watch for the folks who adopt the “collaboration” leadership roles. Make sure you support their efforts. If your team doesn't have members willing to assume those roles, talk with your manager or sponsor about providing a coach or other team member skilled in team dynamics. (See “Coaching Skills” on page XX.)

Prerequisites

Allies
Energized Work
Whole Team
Team Room
Management

For these ideas to become reality, both your team and organization need to be on board. Team members need to be energized and motivated to do good work together. It won’t work if people are just interested in punching a clock and being told what to do. Similarly, your organization needs to invest in teamwork. This includes creating a whole team, a team room, and an Agile-friendly approach to management.

Indicators

When your team has healthy team dynamics:

  • Team members enjoy coming to work.

  • Team members say they can rely on their teammates to follow through on their commitments, or communicate when they can’t.

  • Team members trust that everyone on the team is committed to achieving the team’s purpose.

  • Team members know each other’s strengths and support each other’s limits.

  • Team members work well together and celebrate progress and successes.

Alternatives and Experiments

The material in this practice only represents a tiny portion of the valuable knowledge available about teams, team dynamics, managing conflicts, leadership, and many more topics that affect team effectiveness. The references throughout this practice and in the “Further Reading” section have a wealth of information. But even that only begins to scratch the surface. Ask a mentor for their favorites. Keep learning and experimenting. It’s a lifelong journey.

Further Reading

Keith Sawyer has spent his career exploring creativity, innovation, and improvisation, and their roots in effective collaborative effort. In Group Genius: The Creative Power of Collaboration [Sawyer 2017], he offers insightful anecdotes and ideas.

Roger Nierenberg’s memoir and instruction guide for leaders, Maestro: A Surprising Story about Leading by Listening [Nierenberg 2009], contributes “out of the box” ways of thinking about leadership. He also has a website with videos that demonstrate his techniques at http://www.musicparadigm.com/videos/.

The Wisdom of Teams: Creating the High Performance Organization [Katzenbach and Smith 2015] is the classic, foundational book about high-performing teams, their characteristics, and the environments that help them flourish.

Shared Leadership: Reframing the Hows and Whys of Leadership [Pearce and Conger 2002] is a compilation of the best ideas about leaderful teams and organizations. It can be a challenging read, but it’s well worth exploring to expand your ideas about who, and what, is a leader.


AoAD2 Chapter: Improvement (introduction)


Improvement

At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly.

Manifesto for Agile Software Development

Feedback and adaptation are central to Agile, and that applies to the team’s approach to Agile itself. Although you might start with an off-the-shelf Agile method, every team is expected to customize the method for themselves.

As with everything else in Agile, this customization happens through iteration, reflection, and feedback. Emphasize the things that work; improve the things that don’t. These practices will help you do so:

  • “Retrospectives” on page XX helps your team continually improve.

  • “Team Dynamics” on page XX improves your team’s ability to work together.

  • “Impediment Removal” on page XX focuses your team’s improvement efforts where they’ll make the most difference.


AoAD2 Practice: Safety


Safety

Audience
Whole Team

XXX with Gitte Klitgaard

We share conflicting viewpoints without fear.

In 2012, Google launched Project Aristotle, an internal research effort intended to identify why some teams excelled and others did not. They looked at a number of factors: team composition, socialization outside of work, educational background, extroversion vs. introversion, colocation vs. remote, seniority, team size, individual performance, and more. None of them made a significant difference to effectiveness. Not even seniority or individual performance.

What mattered? Psychological safety.

Of the five key dynamics of effective teams that the researchers identified, psychological safety was by far the most important. The Google researchers found that individuals on teams with higher psychological safety are less likely to leave Google, more likely to harness the power of diverse ideas from their teammates, bring in more revenue, and are rated as effective twice as often by executives. [Google 2021]

Understanding Team Effectiveness

Although Google’s findings have brought psychological safety into the limelight, it’s not a new idea. It was originally introduced in 1965 by Edgar Schein and Warren Bennis, in the context of making personal and organizational changes. “In order for [discomfort] to lead to an increased desire to learn rather than heightened anxiety... An environment must be created with maximum psychological safety.” [Schein and Bennis 1965] (p. 44)

Understanding Psychological Safety

Psychological safety—often abbreviated to just “safety,” because modern offices have physical safety covered—is the ability to be yourself without fear of negative consequences, whether to your career, status, or self-image. It’s the ability to propose ideas, ask questions, raise concerns, and even make mistakes without being punished or humiliated.1

1The first half of this definition of psychological safety (“ability to be yourself”) is based on [Kahn 1990]. The second half (“ability to propose ideas”) is based on [Edmonson 2014].

Safety means team members are safe to disagree.

Safety doesn’t mean your team has no conflicts. It means the exact opposite. It means that everyone on your team is able to express their opinion without fear of retribution or belittlement. They are safe to disagree with each other. And they do. It may be uncomfortable, yet it is still safe.

Through that creative tension, they consider ideas that might have been forgotten. They take into account objections that could have been swept under the rug. In the end, everyone’s voice is heard, and that creates better results.

How to Create Safety

Safety is very individual. It’s context-based and intangible. An exchange that’s safe for some participants can still feel unsafe to others. For example, you might start a conversation with a bit of small talk: “What did you do this weekend?” One man speaks up confidently: “I went to the mountains with my wife.” Another is reluctant to speak. He spent the weekend with his boyfriend, and he worries that bringing it up will lead to uncomfortable conversations about his sexual orientation.

Past experiences are a factor, too. I (Gitte) worked with a 60-year-old who always avoided mentioning the gender of his husband. He said “my partner,” rather than “my husband,” and never used pronouns. Intellectually, he knew that I was comfortable with his relationship and would never treat him badly, but he grew up in a time when being gay was punished, and instinctively protected himself.

People’s personalities, the neurodiversity in the team, and other differences in the way people think also play a part. Does that mean safety is impossible? Not at all! But it does mean there’s no magic answer. You can do everything right, and people can still feel unsafe. You can’t force safety on anyone. You can only create a situation where safety is possible, and you can have discussions to figure out what safety means to your team.

The following techniques will help create safety.

Enable all voices

One of the big benefits of safety is that it creates space for everyone’s voice. When team members feel safe, they speak up, disagree, suggest new ideas, bring up problems, and generally bring in options. That doesn’t mean all ideas are implemented; it means your team has considered more options before making a decision—options you might not otherwise see.

Even if people feel safe enough to speak up, some people are naturally shy, have social anxiety, or are just uncomfortable speaking up in group settings. You can make it easier for them to participate by taking these differences into account.

One way is to start each meeting with a brief check-in. It can be small, such as “Say one word about your mood today,” or “Tell us what the weather outside your window looks like right now.” When a person has spoken once in a meeting, it’s safer for them to speak again later. Be sure to give the option to pass, too. It shows that it’s also safe to not speak up.

Another option is to split large discussions into small groups of 2-4 people each. One person from each group shares the group’s conclusions with everyone else. This allows people who are uncomfortable speaking up in large settings to have a voice, without requiring them to speak to the larger group.

Be open about mistakes

When we make a mistake, it’s easy to want to defend ourselves, especially if we’ve just committed a social faux pas. Resist the urge to ignore or cover up your mistakes. Instead, admit them. You’ll make it safe for others to admit their mistakes too.

Matt Smith has a technique called “the failure bow.” [Smith 2012] It works best when it’s a shared team norm. When you make a mistake, stand up and stretch your hands high in the air. “I failed!” Say it with a big smile. Everyone else will smile too, and maybe even applaud. Make failures fun; it takes the sting out of them.

Some people will have trouble admitting mistakes. A person may blame themselves for a mistake, then assume that the team will hate them for it and that they’ll be fired. I’ve done this myself, as a recovering perfectionist.

In other words, although you can create an environment where it’s perfectly safe to speak up about mistakes, some people may still feel unsafe making mistakes. Allow people to share their mistakes, or not, in the way that works best for them.

Be curious

Show genuine interest in other people’s opinions. If someone is quiet or reluctant to speak, ask them what they think. It lets them know their voice has value. But keep in mind that they may not feel safe to be called upon in a group setting. If you’re in doubt, take the discussion to a smaller setting—perhaps just the two of you.

Listen to understand, not to respond.

Listen to understand, not to respond. It’s all too easy to focus on what you want to say next, rather than listen to what the other person is saying. If you already have your next question or statement lined up, you’re listening to respond. Instead, focus on what they’re saying and what they’re trying to convey.

Learn how to give and receive feedback

In an effective team, disagreements and conflicting opinions are not only normal, they’re expected. They’re how the best ideas emerge. Make disagreements safe by focusing on things and ideas, not the people suggesting them. Use an old improv trick: say “yes, and...” to build on each other’s ideas.

For example, if someone is proposing a change to the code, but didn’t consider error handling, don’t say “You forgot to include error handling.” That puts the focus on them, and what they forgot to do. Instead, focus on the idea, and build on it. “Where should we put error handling?” Or, “Let’s add error handling here.”

Some disagreements will be personal. For example, someone might make an insensitive joke. Diana Larsen provides the following process for giving interpersonal feedback:

  1. Create an opening. Ask permission to provide feedback. Don’t blindside them. “Georgetta, can I give you feedback about something you said in today’s standup meeting?”

  2. Describe the behavior. Be specific about what happened. “Today in the stand-up meeting, you made a joke about short people not getting dates. That was your third short joke this week.”

  3. State the impact. Describe how it affected you. “I’m sensitive about my height, and although I laughed it off, I felt down all morning.”

  4. Make the request. Explain what you would like to change, encourage, or discourage. “I’d like you to stop making short jokes.”

  5. Listen to the response. The other person will respond. Listen to understand and let them finish their thoughts.

  6. Negotiate next steps. Focus on what you can both do going forward, with an eye towards building the relationship. “I love your sense of humor, and I hope you’ll keep making jokes about other things. I’m working on being less sensitive, but it’s not easy. I appreciate you agreeing to make this change for me.”

Be sure to give feedback to encourage behavior you want to see as well as to discourage behavior you want to change.

People may need a few days to digest feedback, especially if it’s something serious. Don’t expect them to respond right away. A lot of people find it hard to receive positive feedback, too. It took me a few years of consciously training myself before I was able to take positive feedback gracefully, and I still have trouble on bad days.

Receiving interpersonal feedback can be particularly uncomfortable if you’ve done something that hurt someone’s feelings. When that happens, avoid being defensive or minimizing the other person’s concerns. Don’t make a non-apology, such as “I’m sorry if you felt offended.” Instead, acknowledge your error and make amends. “I didn’t intend to upset you with my jokes, but I did. I apologize. I’ll avoid those sorts of jokes in the future. Please remind me if I slip up.”

Consider establishing working agreements around giving and receiving interpersonal feedback. “Right-Size Conflicts with Feedback” on page XX has suggestions.

Use empathy

People are prone to the Fundamental Attribution Error: we tend to assume people do things because of their underlying personality, not the situation they’re in. For example, if we cut someone off on the highway, it’s because we almost missed our exit. If someone else cuts us off, it’s because they’re a bad driver with no respect for others.

When you disagree with someone, assume positive intent.

When you disagree with someone, put yourself in their shoes. Rather than assuming malice or incompetence, assume positive intent: the other person is just as smart and well-meaning as you are, but coming to a different conclusion. Try to understand their position and why their conclusion is different from yours.

You can develop your empathy by roleplaying disagreements after the fact. Ask someone to listen as you explain the disagreement. Explain it from your point of view, then from the other person’s point of view. Make the best, most reasonable argument you can for their position.

Agile Conversations [Squirrel and Fredrick 2020] is an excellent resource for understanding the impact of your conversations and how to be more effective.

Allow yourself to be vulnerable

Share personal details and allow people to see your vulnerabilities. This can start small: a hobby, favorite toy as a kid, or pet. This creates relationships and trust. Over time, as you grow to trust your teammates, you can open up further.

Whole humans go to work. In other words, things that happen at home affect us at work, too, as do all the other things that make us who we are. Sharing a good day, or bad day, helps people understand your mood, which creates safety. For example, if you didn’t get enough sleep, you might be grumpy. Sharing that information will help people understand that you’re not upset with them... you’re just grumpy.

In 2007, I was under examination for uterine cancer. I panicked and cried, and told my team. It was very uncomfortable, but it was safe. When it was time to go to the doctor for the examination, three people on the team called me at home to make sure I had someone to take me there. They knew I lived alone, and they supported me. This was a wonderful feeling. (Eventually, the diagnosis came back. There was no cancer.)

Leaders’ Role

People in a position of power have an outsized effect on safety. That includes traditional sources of power, such as a team lead or manager, but it also includes informal power, such as when a senior developer speaks to a junior developer.

If you’re in a position of power, your words and actions have more weight. Take this seriously. It means that you can’t speak as casually as you might like, at least not at first. Learn to read the room: pay attention to how your words and actions affect others.

The following techniques will help you create safety in your teams.

Model the behaviors you want to see

Demonstrate all the behaviors you want to see from the rest of the team. Enable everyone’s voice, be open about your mistakes, be curious, give feedback, show empathy, and allow yourself to be vulnerable. It’s not enough to tell people to be safe, or to assume that they’re safe. Show it.

When discussing mistakes, be careful not to place or take blame. Don’t say things like, “Oh, I made a mistake, I’m so stupid.” That sends the message that mistakes are stupid. Instead, frame the work as a vehicle for learning, where mistakes are expected, and learning is part of the result. “I made a mistake, and this is what I learned.”

Be explicit about expectations

Agile teams are self-organizing and own their work, but that doesn’t mean they have no expectations or direction. Be clear about what you expect from your fellow team members, and what you can do to help. During meetings and activities, such as a retrospective, start by clearly stating your expectations for the session.

Don’t shy away from conflict

Safety doesn’t mean people always get what they want.

Safety doesn’t mean people always get what they want. It means everyone’s opinion has been taken into consideration.

In an effort to create a sense of safety, some teams will engage in false harmony instead. They’ll avoid conflict and suppress dissenting opinions. This may feel safe, but the conflict doesn’t go away. It just bubbles and grows under the surface.

Some leaders make the mistake of emphasizing positivity on their teams. They say things like, “don’t be so negative,” or “be a team player”—by which they mean, “go along with the rest of the group.” This tells people they aren’t safe to express disagreement.

Instead, if you notice people suppressing their opinions, ask them to share. If people seem to be indulging in false harmony, ask them about the downsides of an idea. If you see a problem that no one else mentions, bring it up, in a kind way.

At the same time, be prepared to be fallible. Don’t focus on being right; focus on getting every opinion out in the open, where it can be discussed, debated, and improved.

Ally
Team Dynamics

False harmony and groupthink are a common challenge for teams in the “Norming” stage of development. See “Norming” on page XX for more ideas.

Questions

No matter what I do, one of our team members doesn’t like to speak up. How can I help them?

As with many team issues, listening is a good first step. Talk to them about why they don’t like to speak up. Be sure to emphasize that this isn’t a problem they need to solve, but a challenge for the team. You want to make it easier for them to contribute their voice.

As you discuss options, keep in mind that, while you want to ensure their voice is heard, it doesn’t have to be their literal voice. For some people, carefully organizing their thoughts in writing is more comfortable than sharing in the spur of the moment.

Another option is for them to talk their ideas through with a buddy on the team. This can be a way to practice what they want to say in advance, or they can ask their buddy to represent their point of view.

I’ve seen something that I know impacted a person, but I’m concerned that they don’t feel safe enough to speak up. What should I do?

It depends on the severity of the situation, and also whether you feel safe enough to act on it yourself.

In most cases, start by talking to the person who was impacted. Ask them if they’re okay and whether they’d like to talk about it. If you feel safe doing so, offer to bring up the problem on their behalf. Even if you don’t, it can help the other person to know they’ve been seen, and that someone cares.

If I feel that something has crossed a line, I’ll speak up on the spot. For example, imagine Von says something in a meeting, but is ignored. Normally, I’d discuss this with Von afterwards. But later in the meeting, Max repeats what Von said, and this time, everyone listens. At this point, I’ll step in. I have a standard phrase for this situation: “Max, I really like how you rephrased what Von said earlier.”

Isn’t our time better spent getting work done, not talking about our feelings?

Simple, repetitive tasks may not need safety, but software development requires creativity, learning, and thinking. Your team needs all brains and voices to create the best possible results. Without safety, you could miss out on ideas, people could be reluctant to point out mistakes, and opportunities could be ignored due to perceived risk.

Remember the Project Aristotle findings. Safety was the number one predictor of team performance at Google. And even if that weren’t true, work is an enormous part of your life. Don’t you want it to be an environment where you and your teammates can express yourselves without fear?

Prerequisites

Almost any team can establish psychological safety. Some organizations have a culture of suppressing disagreement or placing blame, and they can make safety difficult, but you can still establish a pocket of safety within your team.

If your team is remote, be careful about recording conversations. If the team has safety and people express themselves freely, you don’t want the broader organization to use that against team members in the future. If you can, set your chat room to delete old conversations, and default to not recording video calls, unless there’s a specific reason to do so.

Indicators

When your team has psychological safety:

  • Team members speak up about mistakes and what they’ve learned.

  • Team members disagree constructively.

  • Team members bring up ideas and problems.

  • The team creates better products that incorporate more ideas.

  • It’s easier to hire and retain people with diverse backgrounds.

Alternatives and Experiments

Psychological safety is a way for people to learn, share their learning, disagree, and speak up. This practice has focused on ways to change your environment to make that easier.

An alternative is to try to change the people, rather than the environment. In theory, you can work on developing their courage, so they’re comfortable speaking up even when they don’t feel safe. But I don’t recommend this approach. People can only change themselves; you can’t do it for them. Even if they do have the courage to speak up when they’re feeling unsafe, that fear will reduce their creativity.

Experiments, on the other hand, are a great way to improve safety. Be explicit about the experiments you try: even framing something as an experiment, with a follow-up date, can create more safety, because people know the change can be reverted if it doesn’t work. Create a culture of trying new ideas, both regarding safety and within your team in general.

Ally
Retrospectives

One way to get started is to conduct a retrospective with “safety” as a theme. Discuss what you’ve noticed regarding safety, on this team and others, and choose some experiments to try.

Further Reading

The Fearless Organization: Creating Psychological Safety in the Workplace for Learning, Innovation, and Growth [Edmonson 2018] is the latest book from Amy Edmondson, Professor at Harvard Business School. It’s a good book about the many aspects of psychological safety she has researched.

“Building a Psychologically Safe Workplace,” [Edmonson 2014] Amy Edmondson’s TEDx talk, is a nice, quick introduction to the topic.

Time to Think: Listening to Ignite the Human Mind [Kline 2015] is about creating space and time to think at work. The book includes practical advice that I personally apply in most of my meetings.

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

AoAD2 Practice: Incident Analysis

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 18, 2021

Incident Analysis

Audience
Whole Team

We learn from failure.

Despite your best efforts, your software will sometimes fail to work as it should. Some failures will be minor, such as a typo on a web page. Others will be more significant, such as code that corrupts customer data, or an outage that prevents customer access.

Some failures are called bugs or defects; others are called incidents. The distinction isn’t particularly important. Either way, once the dust has settled and things are running smoothly again, you need to figure out what happened and how you can improve. This is incident analysis.

The details of how to respond during an incident are out of the scope of this book. For an excellent and practical guide to incident response, see Site Reliability Engineering: How Google Runs Production Systems [Beyer et al. 2016], particularly chapters 12-14.

The Nature of Failure

Failure is a consequence of your entire development system.

It’s tempting to think of failure as a simple sequence of cause and effect—A did this, which led to B, which led to C—but that’s not what really happens.1 In reality, failure is a consequence of the entire development system in which work is performed. (Your development system is every aspect of how you build software, from tools to organizational structure. It’s in contrast to your software system, which is the thing you’re building.) Each failure, no matter how minor, is a clue about the nature and weaknesses of that development system.

1My discussion of the nature of failure is based on [Woods et al. 2010] and [Dekker 2014].

Failure is the result of many interacting events. Small problems are constantly occurring, but the system has norms that keep them inside a safe boundary. A programmer makes an off-by-one error, but their pairing partner suggests a test to catch it. An on-site customer explains a story poorly, but notices the misunderstanding during customer review. A team member accidentally erases a file, but continuous integration rejects the commit.

When failure occurs, it’s not because of a single cause, but because multiple things go wrong at once. A programmer makes an off-by-one error, and their pairing partner was up late with a newborn and doesn’t notice, and the team is experimenting with less frequent pair swaps, and the canary server alerts were accidentally disabled. Failure happens, not because of problems, but because the development system—people, processes, and business environment—allows problems to combine.

Furthermore, systems exhibit a drift toward failure. Ironically, for teams with a track record of containing failures, the threat isn’t mistakes, but success. Over time, as no failures occur, the team’s norms change. For example, they might make pairing optional so people have more choice in their work styles. Their safe boundaries shrink. Eventually, the failure conditions—which existed all along!—combine in just the right way to exceed these smaller boundaries, and a failure occurs.

It’s hard to see the drift toward failure. Each change is small, and is an improvement in some other dimension, such as speed, cost, convenience, or customer satisfaction. To prevent drift, you have to stay vigilant. Past success doesn’t guarantee future success.

Small failures are a “dress rehearsal” for large failures.

You might expect large failures to be the result of large mistakes, but that isn’t how failure works. There’s no single cause, and no proportionality. Large failures are the result of the same systemic issues as small failures. That’s good news, because it means small failures are a “dress rehearsal” for large failures. You can learn just as much from them as you do from big ones.

Therefore, treat every failure as an opportunity to learn and improve. A typo is still a failure. A problem detected before release is still a failure. No matter how big or small, if your team thinks something is “done,” and it later needs correction, it’s worthy of analysis.

But it goes even deeper. Failures are a consequence of your development system, as I said, but so are successes. You can analyze them, too.

Conducting the Analysis

Ally
Retrospectives

Incident analysis is a type of retrospective. It’s a joint look back at your development system for the purpose of learning and improving. As such, an effective analysis will involve the five stages of a retrospective: [Derby and Larsen 2006]

  1. Set the stage;

  2. Gather data;

  3. Generate insights;

  4. Decide what to do; and

  5. Close the retrospective.

Include your whole team in the analysis, along with anyone else involved in the incident response. Avoid including managers and other observers; you want participants to be able to speak up and admit mistakes openly, and that requires limiting attendance to just the people who need to be there. When there’s a lot of interest in the analysis, you can produce an incident report, as I’ll describe later.

The time needed for the analysis session depends on the number of events leading up to the incident. A complex outage could have dozens of events and take several hours. A simple defect, though, might only have a handful of events, and could take 30-60 minutes. You’ll get faster with experience.

In the beginning, and for sensitive incidents, a neutral facilitator should lead the session. The more sensitive the incident, the more experienced the facilitator needs to be.

This practice, as with all practices in this book, is focused on the team level—incidents which your team can analyze mainly on their own. You can also use it to conduct an analysis of your team’s part in a larger incident.

Set the stage
Ally
Safety

Because incident analysis involves a critical look at successes and failures, it’s vital for every participant to feel safe to contribute, including having frank discussions about the choices they made. For that reason, start the session by reminding everyone that the goal is to use the incident to better understand the way you create software—the development system of people, processes, expectations, environment, and tools. You’re not here to focus on the failure itself or to place blame, but instead to learn how to make your development system more resilient.

Ask everyone to confirm that they can abide by that goal and assume good faith on the part of everyone involved in the incident. Norm Kerth’s prime directive is a good choice:

Regardless of what we discover, we must understand and truly believe that everyone did the best job he or she could, given what was known at the time, his or her skills and abilities, the resources available, and the situation at hand. [Kerth 2001] (ch. 1)

In addition, consider establishing the Vegas Rule: What’s said in the analysis session, stays in the analysis session. Don’t record the session, and ask participants to agree not to repeat any personal details shared in the session.

If the session includes people outside the team, or if your team is new to working together, you might also want to establish working agreements for the session. (See “Create Working Agreements” on page XX.)

Gather data

Once the stage has been set, your first step is to understand what happened. You’ll do so by creating an annotated, visual timeline of events.

Stay focused on facts, not interpretations.

People will be tempted to interpret the data at this stage, but it’s important to keep everyone focused on “just the facts.” They’ll probably need multiple reminders as the stage progresses. With the benefit of hindsight, it’s easy to fall into the trap of critiquing people’s actions, but that won’t help. A successful analysis focuses on understanding what people actually did, and how your development system contributed to them doing those things, not what they could have done differently.

To create the timeline, start by creating a long horizontal space on your virtual whiteboard. If you’re conducting the session in person, use blue tape on a large wall. Divide the timeline into columns representing different periods in time. The columns don’t need to be uniform; weeks or months are often best for the earlier part of the timeline, while hours or days might be more appropriate for the moments leading up to the incident.

Have participants use simultaneous brainstorming to think of events relevant to the incident. (See “Work Simultaneously” on page XX.) Events are factual, non-judgemental statements about something that happened, such as “Deploy script stops all ServiceGamma instances,” “ServiceBeta returns 418 response code,” “ServiceAlpha doesn’t recognize 418 response code and crashes,” “On-call engineer is paged about system downtime,” and “On-call engineer manually restarts ServiceGamma instances.” (You can use people’s names, but only if they’re present and agree.) Be sure to capture events that went well, too, not just ones that went poorly.

Software logs, incident response records, and version control history are all likely to be helpful sources of inspiration. Write each event on a separate sticky note and add it to the board. Use the same color sticky for each event.

Afterwards, invite everyone to step back and look at the big picture. Which events are missing? Working simultaneously, look at each event and ask, “What came before this? What came after?” Add each additional event as another sticky note. You might find it helpful to show before/after relationships with arrows.

How was the automation used? Configured? Programmed?

Be sure to include events about people, not just software. People’s decisions are an enormous factor in your development system. Find each event which involves automation your team controls or uses, then add preceding events about how people contributed to that event. How was the automation used? Configured? Programmed? Be sure to keep these events neutral in tone and blame-free. Don’t second-guess what people should have done; only write what they actually did.

For example, the event “Deploy script stops all ServiceGamma instances” might be preceded by “Op misspells --target command-line parameter as --tagret” and “Engineer inadvertently changes deploy script to stop all instances when no --target parameter found,” which in turn is preceded by “Team decides to clean up deploy script’s command-line processing.”
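
To make this example concrete, here is a sketch of the kind of command-line handling that could produce that event. It’s a hypothetical reconstruction in TypeScript; the script, its names, and its logic are invented for illustration, not taken from a real incident:

  // Hypothetical reconstruction of the deploy script's flaw. All names
  // and behavior are invented for illustration.
  const args = process.argv.slice(2);
  const targetIndex = args.indexOf("--target");

  // The latent bug: when --target is missing (or misspelled, such as
  // "--tagret"), the script silently falls back to stopping every
  // instance instead of failing fast with a usage error.
  const target = targetIndex >= 0 ? args[targetIndex + 1] : "ALL_INSTANCES";

  console.log(`Stopping instances: ${target}`);

A more resilient script would reject unrecognized arguments outright, which is exactly the sort of improvement the analysis can surface later.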

Events can have multiple predecessors feeding into the same event. Each predecessor can occur at different points in the timeline. For example, the event “ServiceAlpha doesn’t recognize 418 response code and crashes” could have three predecessors: “ServiceBeta returns 418 response code” (immediately before); “Engineer inadvertently disables ServiceAlpha top-level exception handler” (several months earlier); and “Engineer programs ServiceAlpha to throw exception when unexpected response code received” (a year earlier).

As events are added, encourage participants to share recollections of their opinions and emotions at the time. Don’t ask people to excuse their actions; you’re not here to assign blame. Ask them to explain what it was like to be there, in the moment, when the event occurred. This will help your team understand the social and organizational aspects of your development system—not just what choices were made, but why.

Ask participants to add additional stickies, in another color, for those thoughts. For example, if Jarrett says, “I had concerns about code quality, but I felt like I had to rush to meet our deadline,” he could write two sticky notes: “Jarrett has concerns about code quality” and “Jarrett feels he has to rush to meet deadline.” Don’t speculate about the thoughts of people who aren’t present, but you can record things they said at the time, such as “Layla says she has trouble remembering deploy script options.”

Keep these notes focused on what people felt and thought at the time. Your goal is to understand the system as it really was, not to second-guess people.

Finally, ask participants to highlight important events in the timeline—the ones that seem most relevant to the incident. Double-check that people have captured all their recollections about those events.

Generate insights

Now it’s time to turn facts into insights. In this stage, you’ll mine your timeline for clues about your development system. Before you begin, give people some time to study the board. This can be a good point to call for a break.

The events aren’t the cause of failure; they’re a symptom of your system.

Begin by reminding attendees about the nature of failure. Problems are always occurring, but they don’t usually combine in a way that leads to failure. The events in your timeline aren’t the cause of the failure; they’re a symptom of how your development system functions. It’s that deeper system that you want to analyze.

Look at the events you identified as important during the “gather data” activity. Which of them involved people? To continue the example, you would choose the “Op misspells --target command-line parameter as --tagret” and “Engineer inadvertently changes deploy script to stop all instances when no --target parameter found” events, but not “Deploy script stops all ServiceGamma instances,” because that event happened automatically.

Working simultaneously, assign one or more of the following categories2 to each people-involved event. Write each category on a third color of sticky note and add it to the timeline.

2The event categories were inspired by [Woods et al. 2010] and [Dekker 2014].

  • Knowledge and mental models: Involves information and decisions within the team involved in the event. For example, believing a service maintained by the team will never return a 418 response.

  • Communication and feedback: Involves information and decisions from outside the team involved in the event. For example, believing a third-party service will never return a 418 response.

  • Attention: Involves the ability to focus on relevant information. For example, ignoring an alert because several other alerts are happening at the same time, or misunderstanding the importance of an alert due to fatigue.

  • Fixation and plan continuation: Persisting with an assessment of the situation in the face of new information. For example, during an outage, continuing to troubleshoot a failing router after logs show that traffic successfully transitioned over to the backup router. Also involves continuing with an established plan; for example, releasing on the planned date despite beta testers saying the software isn’t ready.

  • Conflicting goals: Choosing between multiple goals, some of which may be unstated. For example, deciding to prioritize meeting a deadline over improving code quality.

  • Procedural adaptation: Involves situations in which established procedure doesn’t fit the situation. For example, abandoning a checklist after one of the steps reports an error. A special case is the responsibility-authority double bind, which requires people to make a choice between being punished for violating procedure or following a procedure that doesn’t fit the situation.

  • User experience: Involves interactions with computer interfaces. For example, providing the wrong command-line argument to a program.

  • Write-in: You can create your own category if the event doesn’t fit into the ones I’ve provided.

The categories apply to positive events, too. For example, “engineer programs back-end to provide safe default when ServiceOmega times out” is a “knowledge and mental models” event.

After you’ve categorized the events, take a moment to consider the whole picture again, then break into small groups to discuss each event. What does each one say about your development system? Focus on the system, not the people.

For example, the event, “Engineer inadvertently changes deploy script to stop all instances when no --target parameter found,” sounds like it’s a mistake on the part of the engineer. But the timeline reveals that Jarrett, the engineer in question, felt he had to rush to meet a deadline, even though it reduced code quality. That means it was a “conflicting goals” event, and it’s really about how priorities are decided and communicated. As the team discusses the event, they realize they all feel pressure from sales and marketing to prioritize deadlines over code quality.

Incident analysis always looks at the system, not individuals.

On the other hand, let’s say the timeline analysis revealed Jarrett also misunderstood the behavior of the team’s command-line processing library. That would make it a “knowledge and mental models” event, too, but you still wouldn’t put the blame on Jarrett. Incident analysis always looks at the system, not individuals. Individuals are expected to make mistakes. In this case, a closer look at the event reveals that, although the team used test-driven development and pairing for production code, they didn’t apply that standard to their scripts. They didn’t have any way to prevent mistakes in their scripts, and it was just a matter of time before one slipped through.

After the breakout groups have had a chance to discuss the events—for speed, you might want to divide the events among the groups, rather than having each group discuss every event—come together to discuss what you’ve learned about the system. Write each conclusion on a fourth color of sticky note and put it on the timeline next to the corresponding event. Don’t make suggestions, yet; just focus on what you’ve learned. For example, “No systematic way to prevent programming mistakes in scripts,” “Engineers feel pressured to sacrifice code quality,” and “Deploy script requires long and error-prone command-line.”

Decide what to do

You’re ready to decide how to improve your development system. You’ll do so by brainstorming ideas, then choosing a few of your best options.

Start by reviewing the overall timeline again. How could you change your system to be more resilient? Consider all possibilities, without worrying about feasibility. Brainstorm simultaneously onto a table or a new area of your virtual whiteboard. You don’t need to match your ideas to specific events or questions. Some will address multiple things at once. Questions to consider include:3

3Thanks to Sarah Horan Van Treese for suggesting most of these questions.

  • How could we prevent this type of failure?

  • How could we detect this type of failure earlier?

  • How could we fail faster?

  • How could we reduce the impact?

  • How could we respond faster?

  • Where did our safety net fail us?

  • What related flaws should we investigate?

To continue the example, your team might brainstorm ideas such as, “stop committing to deadlines,” “update forecast weekly and remove stories that don’t fit deadline,” “apply production coding standards to scripts,” “perform review of existing scripts for additional coding errors,” “simplify deploy script’s command line,” and “perform user experience review of command-line options across all of the team’s scripts.” Some of these ideas are better than others, but at this stage, you’re generating ideas, not filtering them.

Once you have a set of options, group them into “control,” “influence,” and “soup” circles, depending on your team’s ability to make them happen, as described in “Circles and Soup” on page XX. Have a brief discussion about the options’ pros and cons. Then use dot voting, followed by a consent vote (see “Work Simultaneously” on page XX and “Seek Consent” on page XX), to decide which options your team will pursue. You can choose more than one.

As you think about what to choose, remember that you shouldn’t fix everything. Sometimes, introducing a change adds more risk or cost than the problem it solves. In addition, although every event is a clue about the behavior of your development system, not every event is bad. For example, one of the example events was, “Engineer programs ServiceAlpha to throw exception when unexpected response code received.” Even though that event directly led to the outage, it made diagnosing the failure faster and easier. Without it, something still would have gone wrong, and it would have taken longer to solve.

Close the retrospective

Incident analysis can be intense. Close the retrospective by giving people a chance to take a breath and gently shift back to their regular work. That breath can be metaphorical, or you can literally suggest that people stand up and take a deep breath.

Start by deciding what to keep. A screenshot or photo of the annotated timeline and other artifacts is likely to be useful for future reference. Before taking the picture, invite participants to review the timeline for anything they don’t want shared outside the session, and remove those stickies.

Next, decide who will follow through on your decisions, and how. If your team will be producing a report, decide who will participate in writing it.

Finally, wrap up by expressing appreciations to each other for your hard work.4 Explain the exercise and provide an example: “(Name), I appreciate you for (reason).” Sit down and wait. Others will speak up as well. There’s no requirement to speak, but leave plenty of time at the end—a minute or so of silence—because people can take a little while to speak up.

4The “appreciations” activity is based on [Derby and Larsen 2006] (pp. 119-120).

Some people find the “appreciations” activity uncomfortable. An alternative activity is for each participant to take turns saying a few words about how they feel now that the analysis is over. It’s okay to pass.

Afterwards, thank everybody for their participation. Remind them of the Vegas rule (don’t share personal details without permission), and end.

Organizational Learning

Organizations will often require a report about the incident analysis’s conclusions. It’s usually called a post-mortem, although I prefer the more neutral incident report.

In theory, part of the purpose of the incident report is to allow other teams to use what you’ve learned to improve their own development systems. Unfortunately, people tend to dismiss lessons learned by other teams. This is called distancing through differencing. [Woods et al. 2010] (ch. 14) “Those ideas don’t apply to us, because we’re an internally-facing team, not externally-facing.” Or, “we have microservices, not a monolith.” Or, “we work remotely, not in-person.” It’s easy to latch onto superficial differences as a reason to avoid change.

Preventing this distancing is a matter of organizational culture, which puts it out of the scope of this book. Briefly, though, people have the most appetite for learning and change after a major failure. Other than that, I’ve had the most success from making the lessons personal. Show how the lessons affect things your audience cares about.

This is easier in conversation than with a written document. In practice, I suspect—but don’t know for sure!—that the most effective way to get people to read and apply the lessons from an incident report is to tell a compelling, but concise story. Make the stakes clear from the outset. Describe what happened and allow the mystery to unfold. Describe what you learned about your system and explain how it affects other teams, too. Describe the potential stakes for other teams and summarize what they can do to protect themselves.

Incident Accountability

Another reason organizations want incident reports is to “hold people accountable.” This tends to be misguided at best.

That’s not to say teams shouldn’t be accountable for their work. They should be! And by performing an incident analysis and working on improving their development system, including working with the broader organization to make changes, they are showing accountability.

Searching for someone to blame makes big incidents worse.

Searching for a “single, wringable neck,” in the misguided parlance of Scrum, just encourages deflection and finger-pointing. It may lower the number of reported incidents, but that’s just because people hide problems. The big ones get worse.

“As the incident rate decreases, the fatality rate increases,” reports The Field Guide to Understanding ‘Human Error’, speaking about construction and aviation. “[T]his supports the importance... of learning from near misses. Suppressing such learning opportunities, at whatever level, and by whatever means, is not just a bad idea. It is dangerous.” [Dekker 2014] (ch. 7)

If your organization understands this dynamic, and genuinely wants the team to show how it’s being accountable, you can share what the incident analysis revealed about your development system. (In other words, the final stickies from the “Generate Insights” activity.) You can also share what you decided to do to improve the resiliency of your development system.

Often, your organization will have an existing report template that you’ll have to conform to. Do your best to avoid presenting a simplistic cause-and-effect view of the situation, and be careful to show how the system, not individuals, allowed problems to turn into failures.

Questions

What if we don’t have time to do a full analysis of every bug and incident?

Incident analysis doesn’t have to be a formal retrospective. You can use the basic structure to explore possibilities informally, with just a few people, or even in the privacy of your own thoughts, in just a few minutes. The core point to remember is that events are symptoms of your underlying development system. They’re clues to teach you how your system works. Start with the facts, discuss how they change your understanding of your development system, and only then think of what to change.

Prerequisites

Ally
Safety

Successful incident analysis depends on psychological safety. Unless participants feel safe to share their perspective on what happened, warts and all, you’ll have trouble achieving a deep understanding of your development system.

The broader organization’s approach to incidents has a large impact on participants’ safety. Even companies that pay lip-service to “blameless postmortems” have trouble moving from a simplistic cause-effect view of the world to a systemic view. They tend to think of “blameless” as “not saying who’s to blame,” but to be truly blameless, they need to understand that no one is to blame. Failures and successes are a consequence of a complex system, not specific individuals’ actions.

You can conduct a successful incident analysis in organizations that don’t understand this, but you’ll need to be extra careful to establish ground rules about psychological safety, and ensure people who have a blame-oriented worldview don’t attend. You’ll also need to exercise care to make sure the incident report, if there is one, is written with a systemic view, not a cause-effect view.

Indicators

When you conduct incident analyses well:

  • Incidents are acknowledged and even incidents with no visible impact are analyzed.

  • Team members see the analysis as an opportunity to learn and improve, and even look forward to it.

  • Your system’s resiliency improves over time, resulting in fewer escaped defects and production outages.

  • No one is blamed, judged, or punished for the incident.

Alternatives and Experiments

Many organizations approach incident analysis through the lens of a standard report template. This tends to result in shallow “quick fixes” rather than a systemic view, because people focus on what they want to report rather than studying the whole incident. The format I’ve described will help people expand their perspective before coming to conclusions. Conducting it as a retrospective will also ensure everybody’s voices are heard, and the whole team buys in to the conclusions.

Many of the ideas in this practice are inspired by books from the field of Human Factors and Systems Safety. Those books are concerned with life-and-death decisions, often made under intense time pressure, in fields such as aviation. Software development has different constraints, and some of those transplanted ideas may not apply perfectly.

In particular, the event categories I’ve provided are likely to have room for improvement. I suspect the “knowledge and mental models” category could usefully be split into several finer-grained categories. Don’t just add categories arbitrarily, though. Check out the further reading section and ground your ideas in the underlying theory first.

The retrospective format I’ve provided has the most room for experimentation. It’s easy to fixate on solutions or simplistic cause-effect thinking during an incident analysis, and the format I’ve provided is designed to avoid this mistake. But it’s just a retrospective. It can be changed. After you’ve conducted several analyses using the format I’ve provided, see what you can improve by experimenting with new activities. For example, can you conduct parts of the “Gather Data” stage asynchronously? Are there better ways to analyze the timeline during the “Generate Insights” stage? Can you provide more structure to “Decide What to Do”?

Finally, incident analysis isn’t limited to analyzing incidents. You can also analyze successes. As long as you’re learning about your development system, you’ll achieve the same benefits. Try conducting an analysis of a time when the team succeeded under pressure. Find the events that could have led to failure, and the events that prevented failure from occurring. Discover what that teaches you about your system’s resiliency, and think about how you can amplify that sort of resiliency in the future.

Further Reading

The Field Guide to Understanding ‘Human Error’ [Dekker 2014] is a surprisingly easy read that does a great job of introducing the theory underlying much of this practice.

Behind Human Error [Woods et al. 2010] is a much denser read, but it covers more ground than The Field Guide. If you’re looking for more detail, this is your next step.

The previous two books are based on Human Factors and System Safety research. The website learningfromincidents.io is dedicated to bringing those ideas to software development. At the time of this writing, it’s fairly thin, but its heart is in the right place. I’m including it in the hopes that it will have more material by the time you read this.

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

AoAD2 Practice: Blind Spot Discovery

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 18, 2021

Blind Spot Discovery

Audience
Testers, Whole Team

We discover the gaps in our thinking.

Fluent Delivering teams are very good at building quality into their code, as you saw in the previous practice. But nobody’s perfect, and teams have blind spots. Blind spot discovery is a way of finding those gaps.

To find blind spots, look at the assumptions your team makes, and consider the pressures and constraints they’re under. Imagine what risks the team might be facing and what they might falsely believe to be true. Make a hypothesis about the blind spots that could occur as a result and investigate to see if your guess is right. Testers tend to be particularly good at this.

When you find a blind spot, don’t just fix the problem you found. Fix the gap. Think about how your approach to development allowed the bug to occur, then change your approach to prevent that category of bugs from happening again, as described in “Prevent Systemic Errors” on page XX.

Validated Learning

When people think about bugs, they often think about logic errors, user interface errors, or production outages. But the blind spot I see most often is more fundamental, and more subtle.

More than anything else, teams build the wrong thing. To use Lean Startup terminology, they lack product-market fit. I think this happens because so many teams think of their job as building the product they were told to build. They act as obedient order-takers: a software factory designed to ingest stories in one end and plop software out the other.

Nobody really knows what you should build, not even the people asking for it.

Don’t just assume that your team should build what it’s told to build. Instead, assume the opposite: nobody really knows what you should build, not even the people asking for it. Your team’s job is to take those ideas, test them, and learn what you should really build. To paraphrase The Lean Startup [Ries 2011], the fundamental activity of an Agile team is to turn ideas into products, observe how customers and users respond, and then decide whether to pivot or persevere. This is called validated learning.

Allies
Purpose
Visual Planning
Real Customer Involvement
Incremental Requirements

For many teams, the first time they test their ideas is when they release their software. That’s pretty risky. Instead, use Ries’ Build-Measure-Learn loop:

  1. Build. Look at your team’s purpose and plan. What core assumptions are you making about your product, customers, and users? Choose one to test, then think, “What’s the smallest thing we can put in front of real customers and users?” It doesn’t have to be a real product—in some cases, a mock-up or paper prototype will work—and you don’t have to involve every user, but you do need to involve people who will actually buy or use your product.

  2. Measure. Prior to showing people what you’ve built, decide what data you need to see in order to say that the assumption has been proven or disproven. The data can be subjective, but the measurement should be objective. For example, “70% of our customers say they like us” is an objective measurement of subjective data.

  3. Learn. Your measurement will either validate your hypothesis or disprove it. If you validated the hypothesis, continue with the next one. If you disproved your hypothesis, change your plans accordingly.

For example, one team’s purpose was to improve surgical spine care outcomes. They planned to do so by building a tool to give clinical leads a variety of views into surgical data. One of their core assumptions was that clinical leads would trust the underlying data presented by the tool. But the data could be poor, and leads tended to be skeptical.

To test their assumption, the team decided to: (build) use real data from seven clinics to create a mock-up of the tool; (measure) show it to those seven clinics’ leads; (learn) if at least five said the data was of acceptable quality, the assumption would be validated. If not, they would come up with a new plan.

Validated learning is one of the hallmarks of an Optimizing team.

Validated learning is one of the hallmarks of an Optimizing team. Depending on your organizational structure, you may not be able to use it to its fullest. Still, the fundamental idea applies. Don’t just assume delivering stories will make people happy. Do everything you can to check your assumptions and get feedback.

For more about validated learning and the related concept of customer discovery, see [Ries 2011] and [Blank and Dorf 2020].

Exploratory Testing

Ally
Test-Driven Development

Test-driven development ensures that programmers’ code does what they intended it to do, but what if the programmer’s intention is wrong? For example, a programmer might think the correct way to determine the length of a string in JavaScript is to use string.length, but that can result in counting six letters in the word “naïve.”1

1The count can be off because string.length reports the number of UTF-16 code units (roughly, the number of codepoints), not the number of graphemes—what people usually think of as characters—and it’s possible for Unicode to store the grapheme “ï” as two codepoints: a normal “i” plus a “combining diaeresis” (the umlaut). String manipulation has similar issues. Reversing a string containing the Spanish flag will convert Spain 🇪🇸 to Sweden 🇸🇪, which is sure to surprise beach-goers.
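
If you want to see this blind spot for yourself, here’s a short sketch you can run in Node.js. Intl.Segmenter is a standard JavaScript API (available in Node 16 and later), and the decomposed spelling of “naïve” is constructed explicitly:

  const decomposed = "nai\u0308ve"; // "i" followed by a combining diaeresis

  // string.length counts UTF-16 code units, not perceived characters:
  console.log(decomposed.length); // 6

  // Intl.Segmenter counts graphemes, the characters people perceive:
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  console.log([...segmenter.segment(decomposed)].length); // 5

  // And reversing by codepoint really does turn Spain into Sweden:
  console.log([..."🇪🇸"].reverse().join("")); // 🇸🇪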

Exploratory testing is a technique for finding these blind spots. It’s a rigorous approach to testing which involves “designing and executing tiny experiments in rapid succession using the results from the last experiment to inform the next.” [Hendrickson 2013] (ch. 1) It involves these steps:

  1. Charter. Start by deciding what you’re going to explore, and why. A new technology the team recently adopted? A recently-released user interface? A critical piece of security infrastructure? Your charter should be general enough to give you an hour or two of work, and specific enough to help you focus.

  2. Observe. Use the software. You’ll often do so via the user interface, but you can also use tools to explore APIs and network traffic, and you can also observe hidden parts of the system, such as logs and databases. Look for two things: anything that’s out of the ordinary, and anything you can modify, such as a URL, form field, or file upload, that might lead to unexpected behavior. Take notes as you go, so you can retrace your steps when necessary.

  3. Vary. Don’t just use the software normally; push its boundaries. Put an emoji in a text field. Enter a size as zero or negative. Upload a zero-byte file, a corrupted file, or an “exploding” zip file that expands to terabytes of data. Edit URLs. Modify network traffic. Artificially slow down your network, or write to a file system with no free space.

As you go, use your observations and your understanding of the system to decide what to explore next. You’re welcome to supplement those insights by looking at code and production logs. If you’re exploring security capabilities, you can use your team’s threat model as a source of inspiration, or create your own. (See “Threat Modeling” on page XX.)

There’s much more to exploratory testing than I have room for in this book. For more detail, and a great set of heuristics about what to vary, see [Hendrickson 2013].

Chaos Engineering

In large networked systems, failures are an everyday occurrence. Your code must be resilient to those failures, and that requires careful attention to error handling and recovery. Unfortunately, error handling is a common blind spot for less experienced programmers and teams, and even experienced teams can’t predict every failure mode of a complex system.

Chaos engineering can be considered a specialized form of exploratory testing, one that focuses on system architecture.2 It involves deliberately injecting failures into running systems—often, live production systems—to learn how they respond to failure. Although this may seem risky, it can be done in a controlled way. It allows you to identify issues that only appear as a result of complex interactions.

2Some people in the chaos engineering community object to use of the word “testing” in relationship to chaos engineering. They prefer the term “experiment.” I think that objection misunderstands the nature of testing. As Elisabeth Hendrickson writes in Explore It!: “This is the essence of testing: designing an experiment to gather empirical evidence to answer a question about a risk.” [Hendrickson 2013] (ch. 1) That’s exactly what chaos engineering is, too.

Chaos engineering is similar to exploratory testing in that it involves finding opportunities to vary normal behavior. Rather than thinking in terms of unexpected user input and API calls, though, you think in terms of unexpected system behavior: nodes crashing, high latency network links, unusual responses, and so forth. Fundamentally, it’s about conducting experiments to determine if your software system is as resilient as you think it is.

  1. Start with an understanding of your system’s “steady state.” What does your system look like when it’s functioning normally? What assumptions does your team or organization make about your system’s resiliency? Which of those would be most valuable to check first? When you perform the experiment, how will you know if it succeeded or failed?

  2. Prepare to vary the system in some way: remove a node, introduce latency, change network traffic, artificially increase demand, etc. (If this is your first test, start small, so the impact of failure is limited.) Form a hypothesis about what will happen. Make a plan for aborting the experiment if things go badly wrong.

  3. Make the change and observe what happens. Was your hypothesis correct? Is the system still performing adequately? If not, you’ve identified a blind spot. Either way, discuss the results with your team and improve your collective mental model of the system. Use what you’ve learned to decide which experiment you should conduct next.

Many of the stories surrounding chaos engineering involve automated tools, such as Netflix’s Chaos Monkey. To use chaos engineering within your team, though, don’t focus on building tools. It’s more valuable to conduct a breadth of experiments than to automatically repeat a single experiment. You’ll need some basic tooling to support your work, and that tooling will grow in sophistication over time, but try to conduct the broadest set of experiments you can for the least amount of work.
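
As an illustration of how small that tooling can start, here’s a minimal sketch of a failure-injection wrapper, written in TypeScript with Node’s built-in http module. The environment variables, rates, and port are assumptions made up for the example, not part of any real chaos tool:

  // Minimal illustrative chaos tooling: an HTTP service wrapper that
  // injects failures and latency. All names and defaults are invented.
  import http from "node:http";

  const FAILURE_RATE = Number(process.env.CHAOS_FAILURE_RATE ?? "0.1");
  const LATENCY_MS = Number(process.env.CHAOS_LATENCY_MS ?? "2000");

  const server = http.createServer((_req, res) => {
    if (Math.random() < FAILURE_RATE) {
      res.statusCode = 503; // simulate a crashed dependency
      res.end("injected failure");
      return;
    }
    setTimeout(() => res.end("ok"), LATENCY_MS); // simulate a slow link
  });

  server.listen(3000);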

The principles of chaos engineering can be found at principlesofchaos.org. For a book-length treatment of the topic, see [Rosenthal and Jones 2020].

Penetration Testing and Vulnerability Assessments

Although exploratory testing can find some security-related blind spots, security-sensitive software warrants testing by experts.

Penetration testing, also known as pentesting, involves having people attempt to defeat the security of your system in the way a real attacker would. It can involve probing the software your team writes, but it also considers security more holistically. Depending on the rules of engagement you establish, it can involve probing your production infrastructure, your deployment pipeline, human judgment, and even physical security such as locks and doors.

Penetration testing requires specialized expertise. You’ll typically need to hire an outside firm. It’s expensive, and your results depend heavily on the skill of the testers. Exercise extra diligence when hiring a penetration testing firm, and remember that the individuals performing the test are at least as important as the firm you choose.

Vulnerability assessments are a less-costly alternative to penetration testing. Although penetration testing is technically a type of vulnerability assessment, most firms advertising “vulnerability assessments” perform an automated scan.

Some vulnerability assessments perform static analysis of your code and dependencies. If they’re fast enough, they can be included in your continuous integration build. (If not, you can use multistage integration, as described in “Multistage Integration Builds” on page XX.) Over time, the assessment vendor will add additional scans to the tool, which will alert your team to new potential vulnerabilities.
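
For example, a Node.js team might fail its build when a dependency scan reports high-severity vulnerabilities. This is a sketch under assumptions, not an endorsement of any particular tool:

  // Illustrative build step: fail the build when `npm audit` finds
  // high-severity or critical vulnerabilities in dependencies.
  import { execSync } from "node:child_process";

  try {
    // npm exits non-zero when findings meet or exceed --audit-level.
    execSync("npm audit --audit-level=high", { stdio: "inherit" });
  } catch {
    console.error("Build failed: high-severity vulnerabilities found.");
    process.exit(1);
  }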

Other assessments probe your live systems. For example, a vendor might probe your servers for exposed administration interfaces, default passwords, and vulnerable URLs. You’ll typically receive a periodic report (such as once a month) describing what the assessment found.

Vulnerability assessments can be noisy. You’ll typically need someone with security skills to go through them and triage their findings, and you may need some way of safely ignoring irrelevant findings. For example, one assessment scanned for vulnerable URLs, but it wasn’t smart enough to follow HTTP redirects. Every month, it reported every URL in its scan as a vulnerability, even though the server was just performing a blanket redirect.

In general, start by using threat modeling (see “Threat Modeling” on page XX) and security checklists, such as the OWASP Top 10 at owasp.org, to inform your programming and exploratory testing efforts. Use automated vulnerability assessments to address additional threats and find blind spots. Then turn to penetration testing to learn what you missed.

Questions

Should these techniques be performed individually, in pairs, or as a mob?

Allies
Pair Programming
Mob Programming

It’s up to your team. It’s fine to perform these techniques individually. On the other hand, pairing and mobbing are good for coming up with ideas and disseminating insights, and they can help break down the barriers that tend to form between testers and other team members. Experiment to see which approach works best for your team. It might vary by technique.

Won’t the burden of blind spot discovery keep getting bigger as the software gets bigger?

It shouldn’t. Blind spot discovery isn’t like traditional testing, which tends to grow along with the codebase. It’s for checking assumptions, not validating an ever-increasing codebase. As the team addresses blind spots and gains confidence in its ability to deliver high-quality results, the need for blind spot discovery should go down, not up.

Prerequisites

Any team can use these techniques. But remember that they’re for discovering blind spots, not checking that the software works. Don’t let them be a bottleneck. You don’t need to check before you release, and you don’t need to check everything. You’re looking for flaws in your development system, not your software system.

Ally
No Bugs

On the other hand, releasing without additional checks requires your team to be able to produce code with nearly no bugs. If you aren’t there yet, or if you just aren’t ready to let go, it’s okay to delay releasing until you’ve checked for blind spots. Just be sure not to use blind spot discovery as a crutch. Fix your development system so you can release without manual testing.

Indicators

When you use blind spot discovery well:

  • The team trusts the quality of their software.

  • The team doesn’t use blind spot discovery as a form of pre-release testing.

  • The number of defects found in production and by blind-spot techniques declines over time.

  • The amount of time needed for blind spot discovery declines over time.

Alternatives and Experiments

Allies
No Bugs
Test-Driven Development

This practice is based on an assumption that it’s possible for developers to build systems with nearly no bugs—that defects are the result of fixable blind spots, not a lack of manual testing. So the techniques are geared around finding surprises and testing hypotheses.

The most common alternative is traditional testing: building repeatable test plans that comprehensively validate the system. Although this may seem more reliable, those test plans have blind spots of their own. Most of the tests end up being redundant to the tests programmers create with test-driven development. At best, they tend to find the same sorts of issues that exploratory testing does, at much higher cost, and they rarely expose problems that the other techniques reveal.

In terms of experimentation, the techniques I’ve described are just the beginning. The underlying idea is to validate your hidden assumptions. Anything you can do to identify and test those assumptions is fair game. One additional technique you can explore is fuzzing, which involves generating large numbers of randomized inputs and monitoring for unexpected results.
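
A basic fuzz loop can be only a few lines. In this sketch, parseConfig is a placeholder for whatever code you want to probe, and the input generator is deliberately crude:

  // Minimal fuzz loop: throw random strings at the code under test and
  // flag anything that fails in an unexpected way.
  function parseConfig(input: string): unknown {
    return JSON.parse(input); // placeholder for the real code under test
  }

  for (let i = 0; i < 100_000; i++) {
    const length = Math.floor(Math.random() * 64);
    const input = Array.from({ length }, () =>
      String.fromCharCode(Math.floor(Math.random() * 0xffff))
    ).join("");

    try {
      parseConfig(input);
    } catch (error) {
      // A SyntaxError is the expected rejection of bad input; anything
      // else is a surprise worth investigating.
      if (!(error instanceof SyntaxError)) {
        console.log("Unexpected failure:", JSON.stringify(input), error);
      }
    }
  }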

Share your feedback about this excerpt on the AoAD2 mailing list! Sign up here.

For more excerpts from the book, or to get a copy of the Early Release, see the Second Edition home page.

AoAD2 Practice: No Bugs

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 18, 2021

No Bugs

Audience
Whole Team

We release with confidence.

If you’re on a team with a bug count in the hundreds or thousands, the idea of “no bugs” probably sounds ridiculous. I’ll admit it: No bugs is an ideal to strive for, not something your team will completely achieve. There will always be some bugs. (Or defects; I use “bug” and “defect” interchangeably.)

But you can get closer to the “no bugs” ideal than you might think. Consider Nancy van Schooenderwoert’s experience with Extreme Programming. She led a team of novices working on a real-time embedded system for farm combines: a concurrent system written in C, with some assembly. If that’s not a recipe for bugs, I don’t know what is. According to van Schooenderwoert’s analysis of data by Capers Jones, the average team developing this software would produce 1,035 defects and deliver 207 to their customer.

Here’s what actually happened:

The GMS team delivered this product after three years of development, having encountered a total of 51 defects during that time. The open bug list never had more than two items at a time. Productivity was measured at almost three times the level for comparable embedded software teams. The first field test units were delivered after approximately six months into development. After that point, the software team supported the other engineering disciplines while continuing to do software enhancements. [Van Schooenderwoert 2006]

Over three years, they generated 51 defects and delivered 21 to their customer. That’s a 95% reduction in generated defects and a 90% reduction in delivered defects.

We don’t have to rely on self-reported data. QSM Associates is a well-regarded company that performs independent audits of software development teams. In an early analysis of a company practicing a variant of XP, they reported an average reduction from 2,270 defects to 381 defects, an 83% decrease. Furthermore, the XP teams delivered 24% faster with 39% fewer staff. [Mah 2006]

More recent case studies confirmed those findings. QSM found 11% defect reduction and 58% schedule reduction on a Scrum team; 75% defect reduction and 53% schedule reduction on an XP team; and 75% defect reduction and 30% schedule reduction in a multi-team analysis of thousands of developers. [Mah 2018]

Eliminate errors at their source rather than finding and fixing them after the fact.

How do you achieve these results? It’s a matter of building quality in, rather than testing defects out. Eliminate errors at their source rather than finding and fixing them after the fact.

Don’t Play the Bug Blame Game

Is it a bug or a feature?

I’ve seen companies waste inordinate amounts of time on this question. In an attempt to apportion blame “correctly,” they make elaborate distinctions between bugs, defects, errors, issues, anomalies, and, of course... unintentional features.

What really matters is whether you’re going to do something about it.

None of that matters. What really matters is whether you’re going to do something about it. If there’s something your team needs to do—whatever the reason—it’s a story in your plan.

For the purposes of this chapter, I’m defining bugs as follows:

A bug is anything your team considers “done” that later needs correction.

For your purposes, though, even that distinction doesn’t matter. If something needs work, it gets a story card. That’s all there is to it.

How to Build Quality In

Before I describe how to build quality in, I need to clarify what I mean by “quality.” Roughly speaking, quality can be divided into “internal quality” and “external quality.” Internal quality is the way your software is constructed. It’s things like good names, clear software design, and simple architecture. Internal quality controls how easy your software is to extend, maintain, and modify. The better the internal quality, the faster you go.

The better your internal quality, the faster you go.

External quality is the user-visible aspects of your software. It’s your software’s user experience, functionality, and reliability. You can spend infinite amounts of time on these things. The right amount of time depends on your software, market, and value. Figuring out the balance is a question of product management.

“Building quality in” means keeping internal quality as high as possible while keeping external quality at the level needed to satisfy your stakeholders. That involves keeping your design clean, delivering the stories in your plan, and revising your plan when your external quality falls short of what’s needed.

Now, let’s talk about how to do it. To build quality in and achieve zero bugs, you’ll need to prevent four types of errors.

Prevent programmer errors

Programmer errors occur when a programmer knows what to program, but makes a mistake. It could be an incorrect algorithm, a typo, or some other mistake made while translating ideas to code.

Allies
Test-Driven Development
Energized Work
Pair Programming
Mob Programming
Alignment
Done Done

Test-driven development is your defect-elimination workhorse. Not only does it ensure that you program what you intended to, it gives you a comprehensive regression suite you can use to detect future errors.

To enhance the benefits of test-driven development, work sensible hours and use pairing or mobbing to bring multiple perspectives to bear on every line of code. This improves your brainpower, which helps you make fewer mistakes and allows you to see mistakes more quickly.

Supplement these practices with good standards (which are part of your alignment discussion) and a “done done” checklist. These will help you remember and avoid common mistakes.

Prevent design errors

Design errors create breeding grounds for bugs. According to Barry Boehm, 20% of the modules in a program are typically responsible for 80% of the defects. [Boehm 1987] It’s an old statistic, but it matches my experience with modern software, too.

Even with test-driven development, design errors will accumulate over time. Sometimes a design that looks good when you first create it won’t hold up over time. Sometimes a shortcut that seems like an acceptable compromise will come back to bite you. Sometimes your requirements change and your design no longer fits.

Allies
Collective Code Ownership
Simple Design
Incremental Design
Reflective Design
Slack

Whatever the cause, design errors manifest as complicated, confusing code that’s hard to get right. Although you could take a week or two off to fix these problems, it’s better to continuously improve your internal quality.

Use collective code ownership to give programmers the right and responsibility to fix problems wherever they live. Use evolutionary design to continuously improve your design. Make time for improvements by including slack in your plans.

Prevent requirements errors

Requirements errors occur when a programmer creates code that does exactly what they intended it to do, but their intention was wrong. Perhaps they misunderstood what they were supposed to do, or perhaps nobody really understood what needed to be done. Either way, the code works, but it doesn’t do the right thing.

Allies
Whole Team
Purpose
Context
Team Room
Ubiquitous Language
Customer Examples
Incremental Requirements
Stakeholder Demos
Stories
Done Done

A cross-functional, whole team is essential for preventing requirements errors. Your team needs to include on-site customers with the skills to understand, decide, and explain the software’s requirements. Clarifying the team’s purpose and context is vital to this process.

A shared team room is also important. When programmers have a question about requirements, they need to be able to turn their head and ask. Use a ubiquitous language to help programmers and on-site customers understand each other, and supplement your conversations with customer examples.

Confirm that the software does what it needs to do with frequent customer reviews and stakeholder demos. Perform those reviews incrementally, as soon as programmers have something to show, so misunderstandings and refinements are discovered early, in time to be corrected. Use stories to focus the team on customers’ perspective. Finally, don’t consider a story “done done” until on-site customers agree it’s done.
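Customer examples often end up captured as simple table-driven tests that customers, testers, and programmers can all read. Here’s a hypothetical sketch in TypeScript; the shipping rule and the shippingCost function are invented for illustration:

```typescript
import assert from "node:assert";
import { test } from "node:test";

// Hypothetical rule from an on-site customer: orders of $50 or more
// ship free; everything else costs a flat $5.
const examples = [
  { order: 49.99, shipping: 5 },
  { order: 50.0, shipping: 0 },
  { order: 120.0, shipping: 0 },
];

for (const { order, shipping } of examples) {
  test(`an order of $${order} ships for $${shipping}`, () => {
    assert.strictEqual(shippingCost(order), shipping);
  });
}

function shippingCost(orderTotal: number): number {
  return orderTotal >= 50 ? 0 : 5;
}
```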

Prevent systemic errors

If everyone does their job perfectly, these practices yield software with no defects. Unfortunately, perfection is impossible. Your team is sure to have blind spots: subtle areas where they make mistakes, but they don’t know it. These blind spots lead to repeated, systemic errors. They’re “systemic” because they’re a consequence of your entire development system: your team, its process, the tools you use, the environment you work in, and more.

Escaped defects are a clear signal of trouble. Although errors are inevitable, most are caught quickly. Defects found by end-users have “escaped.” Every escaped defect indicates a need to improve your development system.

Allies
Blind Spot Discovery
Incident Analysis

Of course, you don’t want your end-users to be your beta testers. That’s where blind spot discovery comes in. It’s a variety of techniques, such as chaos engineering and exploratory testing, for finding gaps in your understanding. I discuss them in the next practice.

Some teams use these techniques to check the quality of their software system: they’ll code a story, search for bugs, fix them, and repeat. But to build quality in, treat your blind spots as a clue about how to improve your development system, not just your software system. The same goes for escaped defects. They’re all clues about what to improve.

Incident analysis helps you decipher those clues. No matter the impact, if your team thought something was done and it later needs fixing, your team can benefit from incident analysis. This applies to well-meaning mistakes, too: if everybody thinks a particular new feature is a great idea, and it turns out to enrage your customers, it deserves just as much analysis as a production outage.

When you find a bug, write a test and fix the bug, but then fix the underlying system. Even if it’s just in the privacy of your thoughts, think about how you can improve your design and process to prevent that type of bug from happening again.
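For example, the fix for a reported bug might start with a regression test that reproduces it, so the bug can never return unnoticed. This sketch is hypothetical; the parseQuantity function and its bug are invented:

```typescript
import assert from "node:assert";
import { test } from "node:test";

// Regression test: reproduces the reported bug before fixing it.
// (The invented bug: blank input used to throw instead of returning zero.)
test("treats blank quantity as zero", () => {
  assert.strictEqual(parseQuantity("  "), 0);
});

function parseQuantity(input: string): number {
  const trimmed = input.trim();
  return trimmed === "" ? 0 : Number.parseInt(trimmed, 10);
}
```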

Fix Bugs Immediately

Do. Or do not. There is no //TODO.

As the great Master Yoda never said, “Do. Or do not. There is no //TODO.”

Each defect is the result of a flaw that’s likely to breed more mistakes. Improve quality and productivity by fixing them right away.

Allies
Collective Code Ownership
Team Room

Fixing bugs quickly requires the whole team to participate. Programmers, use collective code ownership so anyone can fix each bug. Customers and testers, personally bring each new bug to the attention of a programmer and help them reproduce it. This is easiest when the team shares a team room.

In practice, it’s not possible to fix every bug right away. You may be in the middle of something else when you learn about a bug. When this happens to me, I ask my navigator to make a note. We come back to it 10-20 minutes later, when we come to a good stopping point.

Allies
Slack
Task Planning
Visual Planning

Some bugs are too big to fix quickly. For these, I gather the team for a quick huddle. We collectively decide if we have enough slack to fix the bug and still meet our other commitments. If we do, we create tasks for the bug, put them in our plan, and people volunteer for them as normal. (If you’re using estimates, these tasks don’t get estimates or count toward your capacity.)

If there isn’t enough slack to fix the bug, decide as a team whether it’s important enough to fix before your next release. If it is, create a story for it and schedule it immediately for your next iteration or story slot. If it isn’t, add it to your visual plan in the appropriate release.

Bugs that aren’t important enough to fix should be discarded. If you can’t bring yourself to discard a bug, it’s important enough to fix. The “fix,” though, can be a matter of documenting a workaround, or making a record of your decision not to fix it. An issue tracker might be the right way to do so.

Testers’ Role

Because fluent Delivering teams build quality in, rather than testing defects out, people with testing skills shift left. Instead of focusing their skills on the completed product, they focus on helping the team build a quality product from the beginning.

In my experience, some testers are business-oriented: they’re very interested in getting business requirements right. They work with on-site customers to uncover all the nit-picky details the customers would otherwise miss. They’ll often prompt people to think about edge cases during requirements discussions.

Other testers are more technically oriented. They’re interested in test automation and nonfunctional requirements. These testers act as technical investigators for the team. They create the testbeds that look at issues such as scalability, reliability, and performance. They review logs to understand how the software system works in production. Through these efforts, they help the team understand the behavior of their software and decide when to devote more effort to operations, security, and “nonfunctional” stories.

Ally
Blind Spot Discovery

Testers also help the team identify blind spots. Although anybody on the team can use blind spot discovery techniques, people with testing skills tend to be particularly good at it.

‘Tude

Bugs are for other people.

I encourage an attitude among my teams—a bit of elitism, even snobbery. It goes like this: “Bugs are for other people.”

If you do everything I’ve described, bugs should be a rarity. Your next step is to treat them that way. Rather than shrugging your shoulders when a bug occurs—“Oh yeah, another bug, that’s what happens in software”—be shocked and dismayed. Bugs aren’t something to be tolerated; they’re a sign of underlying problems to be solved.

Allies
Pair Programming
Mob Programming
Team Room
Collective Code Ownership

Ultimately, “no bugs” is about establishing a culture of excellence. When you learn about a bug, fix it right away, then figure out how to prevent that type of bug from happening again.

You won’t be able to get there overnight. All the practices I’ve described take discipline and rigor. They’re not necessarily difficult, but they break down if people are sloppy or don’t care about their work. A culture of “no bugs” helps the team maintain the discipline required, as do pairing or mobbing, a team room, and collective ownership.

You’ll get there eventually. Agile teams can and do achieve nearly zero bugs. You can too.

Questions

How do we prevent security defects and other challenging bugs?

Allies
Build for Operation
Done Done
Alignment
Blind Spot Discovery

Threat modeling (see “Threat Modeling” on page XX) can help you think of security flaws in advance. Your “done done” checklist and coding standards can remind you of issues to address. That said, you can only prevent bugs you think to prevent. Security, concurrency, and other difficult problem domains may introduce defects you never considered. That’s why blind spot discovery is also important.

How should we track our bugs?

You shouldn’t need a bug database or issue tracker for new bugs, assuming your team isn’t generating a lot of bugs. (If they are, focus on solving that problem first.) If a bug is too big to fix right away, turn it into a story, and track its details in the same way you handle other requirements details.

How long should we work on a bug before we turn it into a story?

Ally
Slack

It depends on how much slack you have. Early in an iteration, when there’s still a lot of slack, I might spend half a day on a defect before turning it into a story. Later, when there’s less slack, I might only spend ten minutes on it.

We have a lot of legacy code. How can we adopt a “no bugs” policy without going mad?

It will take time. Start by going through your bug database and identifying the ones you want to fix in the current release. Schedule at least one to be fixed every week, with a bias towards fixing them sooner rather than later.

Ally
Incident Analysis

Every week or two, randomly choose a recent bug to subject to an incident analysis, or at least an informal one. This will allow you to gradually improve your development system and prevent bugs in the future.

Prerequisites

“No Bugs” is about a culture of excellence. It can only come from within the team. Managers, don’t ask your teams to report defect counts, and don’t reward or punish them based on the number of defects they have. You’ll just drive the bugs underground, and that will make quality worse, not better. I’ll discuss this further in “Incident Accountability” on page XX.

Achieving the “no bugs” ideal depends on a huge number of Agile practices—essentially, every Focusing and Delivering practice in this book. Until your team reaches fluency in those practices, don’t expect dramatic reductions in defects.

Conversely, if you’re using the Focusing and Delivering practices, more than a few new bugs per month may indicate a problem with your approach. You’ll need time to learn the practices and refine your process, of course, but you should see an improvement in your bug rates within a few months. If you don’t, check the “Troubleshooting Guide” on page XX.

Indicators

When your team has a culture of “no bugs”:

  • Your team is confident in the quality of their software.

  • You’re comfortable releasing to production without a manual testing phase.

  • Stakeholders, customers, and users rarely encounter unpleasant surprises.

  • Your team spends their time producing great software instead of fighting fires.

Alternatives and Experiments

One of the revolutionary ideas Agile incorporates is that low-defect software can be cheaper to produce than high-defect software. This is made possible by building quality in. To experiment further, look at the parts of your process that check quality at the end, and think of ways to build that quality in from the beginning.

You can also reduce defects by adding more, higher-quality testing to find and fix a greater percentage of bugs. However, this doesn’t work as well as building quality in from the beginning. It will also slow you down and make releases more difficult.

Some companies invest in separate QA teams in an effort to improve quality. Although occasional independent testing can be useful for discovering blind spots, a dedicated QA team isn’t a good idea. Paradoxically, it tends to reduce quality, because the development team then spends less effort on quality themselves. Elisabeth Hendrickson explores this phenomenon in her excellent article, “Better Testing, Worse Quality?” [Hendrickson 2000]


AoAD2 Chapter: Quality (introduction)

Quality

For many people, “quality” means “testing,” but Agile teams treat quality differently. Quality isn’t something you test for; it’s something you build in. Not just into your code, but your entire development system: the way your team approaches its work, the way people think about mistakes, and even the way your organization interacts with your team.

This chapter has three practices to help your team dedicate itself to quality:

  • “No Bugs” on page XX builds quality in.

  • “Blind Spot Discovery” on page XX helps your team learn what they don’t know.

  • “Incident Analysis” on page XX focuses your team on systemic improvements.

AoAD2 Practice: Evolutionary System Architecture

Evolutionary System Architecture

Audience
Programmers, Operations

We build our infrastructure for what we need today, without sacrificing tomorrow.

Allies
Simple Design
Incremental Design
Reflective Design

Simplicity is at the heart of Agile, as discussed in “Key Idea: Simplicity” on page XX. It’s particularly apparent in the way fluent Delivering teams approach evolutionary design: they start with the simplest possible design, layer on more capabilities using incremental design, and constantly refine and improve their code using reflective design.

What about your system architecture? By system architecture, I mean the components that make up your deployed system. The applications and services built by your team, and the way they interact. Your network gateways and load balancers. Even third-party services. What about them? Can you start simple and evolve from there?

That’s evolutionary system architecture, and I’ve seen it work on small systems. But system architectures are slow to evolve, so there isn’t the same depth of industry experience behind evolutionary system architecture that there is behind evolutionary design. Use your own judgment about how and when it should be applied.

I make a distinction between system architecture and application architecture. Application architecture is the design of your code, including decisions about how to call other components in your system. It’s discussed in “Application Architecture” on page XX. This practice discusses system architecture: decisions about which components to create and use, and the high-level relationships between them.

Are You Really Gonna Need It?

The software industry is awash with stories of big companies solving big problems. Google has a database that is replicated around the world! Netflix shut down their data centers and moved everything to the cloud! Amazon mandated that every team publish their services, and by doing so, created the billion-dollar Amazon Web Services business!

It’s tempting to imitate these success stories, but the problems those companies were solving are probably not the problems you need to solve. Until you’re the size of Google, or Netflix, or Amazon... YAGNI. You aren’t gonna need it.

Consider Stack Overflow, the popular programming Q&A site. They serve 1.3 billion pages per month, rendering each one in less than 19 milliseconds. How do they do it?1

1Stack Overflow publishes their system architecture and performance stats at https://stackexchange.com/performance, and Nick Craver has an in-depth series discussing their architecture at [Craver 2016]. The quoted data was accessed on May 4th, 2021.

  • 2 HAProxy load balancers, one live and one failover, peaking at 4,500 requests per second and 18% CPU utilization;

  • 9 IIS web servers, peaking at 450 requests per second and 12% CPU utilization;

  • 2 Redis caches, one master and one replica, peaking at 60,000 operations per second and 2% CPU utilization;

  • 2 SQL Server database servers for Stack Overflow, one live and one standby, peaking at 11,000 queries per second and 15% CPU utilization;

  • 2 more SQL Server servers for the other Stack Exchange sites, one live and one standby, peaking at 12,800 queries per second and 14% CPU utilization;

  • 3 tag engine and ElasticSearch servers, averaging 3,644 requests per minute and 3% CPU utilization for the custom tag engine, and 34 million searches per day and 7% utilization for ElasticSearch;

  • 1 SQL Server database server for HTTP request logging;

  • 6 LogStash servers for all other logs; and

  • Approximately the same thing again in a redundant data center (for disaster recovery).

As of 2016, they deployed the Stack Overflow site 5-10 times per day. The full deploy took about eight minutes. Other than localization and database migration, deploying was a matter of looping through the servers, taking each out of the HAProxy rotation, copying the files, and putting it back into rotation. Their primary application is a single multi-tenant monolith which serves all Q&A websites.

This is a decidedly unfashionable approach to system architecture. There are no containers, no microservices, no autoscaling, not even any cloud. Just some beefy rack-mounted servers, a handful of applications, and file-copy deployment. It’s straightforward, robust, and it works.

One of the common reasons people provide for complex architecture is “scaling.” But Stack Overflow is one of the 50 highest-trafficked websites in the world.2 Is your architecture more complex than theirs? If so... does it need to be?

2Stack Overflow traffic ranking retrieved from alexa.com on May 6th, 2021.

Aim for Simplicity

I’m not saying you should blindly copy Stack Overflow’s architecture. Don’t blindly copy anyone! Instead, look at the problems you need to solve. (“A more impressive resume” doesn’t count.) What’s the simplest architecture that will solve them?

One way to approach this question is to start with an idealized view of the world. You can use this thought experiment for existing architectures as well as new ones.

1. Start with an ideal world

Imagine your components are coded magically and perfectly, but not instantly. The network is completely reliable, but network latency still exists. Every node has unlimited resources. Where would you put your component boundaries?

You might need to segregate components into separate servers or geographies for security, regulatory, and latency reasons. You’re likely to make a distinction between client-side processing and server-side processing. You’ll still save time and effort by using third-party components.

2. Introduce imperfect components and networks

Now remove the assumption of perfect components and networks. Components fail; networks go down. Now you need redundancy, which necessitates components for handling replication and failover. What’s the simplest way you can meet those needs? Can you reduce complexity by using a third-party tool or service? For example, Stack Overflow has to worry about redundant power supplies and generators. If you use a cloud provider, that’s their problem, not yours.

3. Limit resources

Next, remove the assumption of unlimited resources. You might need multiple nodes to handle load, along with components for load balancing. You might need to split a CPU-intensive operation out into its own component, and introduce a queue to feed it. You might need a shared cache, and a way to populate it.

Are you speculating about future load, or addressing real issues?

But be careful: are you speculating about future load, or addressing real issues based on real-world usage and trends? Can you simplify your architecture by using more capable hardware, or by waiting to address future loads?

4. Consider humans and teams

Finally, take out the idealized coding. Who will be coding each component? How will they coordinate with each other? Do you need to split components to make cross-team communication easier, or to limit the complexity of any one component? Think about how you can simplify these constraints, too.

Controlling Complexity

Some architectural complexity is necessary. Although your system might be simpler if you didn’t have to worry about load balancing or component failure, you do have to worry about those things. As Fred Brooks said in his famous essay, “No Silver Bullet: Essence and Accident in Software Engineering” [Brooks 1995], some complexity is essential. It can’t be eliminated.

But other complexity is accidental. Sometimes, you split a large component into multiple small components just to make the human side easier, not because it’s an essential part of the problem you’re solving. Accidental complexity can be removed, or at least reduced.

Evolutionary design

One of the most common reasons I see for splitting components is to prevent “big balls of mud.” Small components are simple and easy to maintain.

Small components tend to increase overall complexity.

Unfortunately, this doesn’t reduce complexity; it just moves it from application architecture to system architecture. In fact, splitting a large component into multiple small components tends to increase overall complexity. It makes individual components easier to understand, but cross-component interactions are worse. Error tracing is harder. Refactoring is harder. And distributed transactions... well, they’re best avoided entirely.

Allies
Simple Design
Incremental Design
Reflective Design

You can reduce the need to split a component by using evolutionary design. It allows you to create large components that aren’t big balls of mud.

Self-discipline

Another reason teams split their components is to provide isolation. When a component is responsible for multiple types of data, it’s tempting to tangle the data together, making it difficult to refactor later.

Of course, there’s no inherent reason data has to be tangled together. It’s just a design decision, and if you can design isolated components, you can design isolated modules within a single component. You can even have each module use a separate database. It’s not like network calls magically create good design!

Allies
Collective Code Ownership
Pair Programming
Mob Programming
Energized Work
Reflective Design

But network calls do enforce isolation. If you don’t use the network to enforce isolation, you need a team with self-discipline instead. Collective code ownership, pairing or mobbing, and energized work all help, and reflective design allows you to fix any mistakes that slip through.

Fast deployment

Large components are often difficult to deploy. In my experience, it’s not the deployment itself that’s difficult, but the build and tests that have to run before the component is deployed. This is doubly true if the component has to be tested manually.

Allies
Zero Friction
Test-Driven Development
Continuous Integration
Fast Reliable Tests

Address this problem by creating a zero friction build, introducing test-driven development and continuous integration, and creating fast, reliable tests. If your build and tests are fast, you don’t have to split a component just to make deployment easier.

Vertical scaling

To paraphrase Conway’s Law, organizations tend to ship their org charts. Many organizations default to scaling horizontally (see chapter “Scaling Agility”), which results in lots of small, isolated teams. They need correspondingly small components.

Vertical scaling enables your teams to work together on the same components. It gives you the ability to design your architecture to match the problem you’re solving, rather than designing your architecture to match your teams.

Refactoring System Architecture

I have a friend who works at a large, well-known company. Due to a top-down architectural mandate, his team of three programmers maintains 21 separate services—one for each entity they control. Twenty-one! We took some time to think about how his team’s code could be simplified.

  • Originally, his team was required to keep each service in a separate git repository. They got permission to combine the services into a single monorepo. That allowed them to eliminate duplicated serialization/deserialization code and dramatically ease refactorings. Previously, a single change could result in 16 separate commits across 16 repositories. Now, it only takes one.

  • With a few exceptions, the CPU requirements of his team’s services are minimal. Thanks to an organization-wide service locator, the services could be combined into a single component without changing their endpoints. This would allow them to deploy to fewer VMs, lowering their cloud costs; replace network calls with function calls, speeding up response times; and simplify their end-to-end tests, making deployments easier and faster.

  • About half of his team’s services are only used within his team. Each service has a certain amount of boilerplate and overhead. That overhead could be eliminated if the internal services were turned into libraries. It would also eliminate a bunch of slow end-to-end tests.

All in all, his team could remove a lot of costs and development friction by simplifying their system architecture, if they could get permission to do it.

I can imagine several of these sorts of system-level refactorings. Unfortunately, they don’t yet have the rich history the rest of the ideas in this book do. The “microlith” refactorings are particularly unproven. So I’ll just provide brief sketches, without much detail. Treat them as a set of ideas to consider, not a cookbook to follow.

Multi-repo Components → Monorepo Components

If your team’s components are located in several repositories, you can combine them into a single repository, and factor out shared code for common types and utilities.

Components → Microliths

If your team owns multiple components, you can combine them into a single component while keeping the basic architecture the same. Isolate them in separate directory trees and use a top-level interface file, rather than a server, to translate between your serialized payload and the component’s data structures. Replace the network calls between components with a function call, but keep the architecture the same in every other way, including the use of primitive data types rather than objects or custom types.

I call these in-process components microliths.3 You can see an example of this refactoring in episode 21 of [Shore 2020b]. They provide the isolation of a component without the operational complexity.

3I call them microliths because I originally envisioned them as a combination of the best parts of monoliths and microservices. “Microlith” is also a real word, referring to a tiny stone tool chipped off a larger stone, which almost works as a metaphor.

My microlith refactorings are the most speculative. I’ve only tried them on toy problems. I’m including them because they provide an intermediate step between components and modules.
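To illustrate the idea, here’s a sketch of what a microlith’s top-level interface file might look like, in TypeScript. Everything in it is hypothetical; the invoice names and payload shape are invented for illustration:

```typescript
// invoice_interface.ts: a microlith's top-level interface file. It plays
// the role a server would play for a networked component: callers pass
// serialized, primitive payloads, but the "endpoint" is a function call.

// Internal data structure (hypothetical).
interface Invoice {
  id: string;
  customerId: string;
  amountInCents: number;
}

function createInvoice(customerId: string, amountInCents: number): Invoice {
  return { id: `inv-${Date.now()}`, customerId, amountInCents };
}

// Serialized data in, serialized data out, exactly as it would be on the wire.
export function createInvoiceEndpoint(requestJson: string): string {
  const request = JSON.parse(requestJson) as {
    customerId: string;
    amountInCents: number;
  };
  const invoice = createInvoice(request.customerId, request.amountInCents);
  return JSON.stringify({ invoiceId: invoice.id });
}
```

Because callers still send serialized payloads through a single entry point, converting the microlith back into a networked component later is mostly mechanical.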

Microliths → Modules

Microliths are strongly isolated. They’re effectively components running in a single process. That introduces some complexity and overhead.

If you don’t need such strong isolation, you can remove the top-level interface file and serialization/deserialization. Just call the microlith’s code normally. The result is a module. (Not to be confused with a source code file, which can also be called a module.)

A component composed of modules is typically called a modular monolith, but modules aren’t just for monoliths. You can use them in any component, no matter how big or small.

Modules → New Modules

If your modules have a lot of cross-module dependencies, you might be able to simplify them by refactoring their responsibilities. This is really a question of application architecture, not system architecture (see “Application Architecture” on page XX for more about evolving application architecture), but I’m including it because it can be an intermediate step in a larger system refactoring.

Big Ball of Mud → Modules

Allies
Incremental Design
Reflective Design

If you have a large component that’s turned into a mess, you can use evolutionary design to gradually convert it to modules, disentangling and isolating data as you go. Praful Todkar has a good example of doing so in [Todkar 2018]. This is also a matter of application architecture, not system architecture.

Modules → Microliths

If you want strong isolation, or you think you might want to split a large component into multiple small components, you can convert a module into a microlith. To do so, introduce a top-level interface file and serialize complex function parameters.

Treat the microlith as if it were a separate component. Callers should only call it through the top-level interface file, and should abstract those calls behind an infrastructure wrapper, as described in “Third-Party Components” on page XX. The microlith’s code should be similarly isolated; other than the common types and utilities a component might use, it should only reference other components and microliths, and only through their top-level interfaces. You might need to refactor your module to be more component-like first.

Network calls are far slower and less reliable than function and method calls. Converting a module to a microlith won’t guarantee your microlith will work well as a networked component. In theory, you could confirm your microliths will work as proper networked components by introducing a 1-2ms delay in your top-level API, or even random failures. In practice, that sounds ridiculous, and I’ve yet to try it.
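If you did want to try it, the wrapper might be as simple as this hypothetical sketch in TypeScript:

```typescript
// Hypothetical: wrap calls to a microlith's top-level interface with a
// short artificial delay (and rare simulated failures) to mimic network
// conditions before extracting the microlith as a networked component.
async function withSimulatedNetwork<T>(call: () => T): Promise<T> {
  await new Promise((resolve) => setTimeout(resolve, 1 + Math.random()));
  if (Math.random() < 0.001) {
    throw new Error("simulated network failure");
  }
  return call();
}
```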

Microliths → Components

If a microlith is suitable for use as a networked component, converting it into a component is fairly straightforward. It’s a matter of converting the top-level interface file into a server, and converting callers to use network calls. This is easiest if you remembered to isolate callers’ requests behind an infrastructure wrapper.

Converting a microlith to a component will likely require callers to introduce error handling, timeouts, retries, exponential backoff, and backpressure, in addition to the operational and infrastructure changes required by the new component. It’s a lot of work, but that’s the cost of networking.

Modules → Components

Rather than using microliths, you can jump straight from a module to a component. Although this can be done by extracting the code, I often see people rewriting modules instead. This is a common strategy when refactoring a big ball of mud, because the modules’ code often isn’t worth keeping. [Todkar 2018] demonstrates this approach.

Monorepo Components → Multi-repo Components

If you have multiple components in the same repository, you can extract them into separate repositories. One reason to do so is if you’re moving ownership of a component to another team. You might need to duplicate common types and utilities.

Compound refactorings

You’ll typically string these system-level refactorings together. For example, the most common approach I see is to clean up legacy code by using “Big Ball of Mud → Modules” and “Modules → Components.” Or, more compactly: Big Ball of Mud → Modules → Components.

Combining components is a similar operation in reverse: Multi-repo Components → Monorepo Components → Microliths → Modules.

If you have a bunch of components with tangled responsibilities, you might be able to refactor to new responsibilities instead of rewriting: Components → Microliths → Modules → New Modules → Microliths → Components.

Prerequisites

Ally
Impediment Removal

You’ll likely only be able to use the ideas in this practice with the components your team owns. Architectural standards and components owned by other teams are likely to be out of your direct control, but you might be able to influence people to make the changes you need.

Changes to system architecture depend on a close relationship between developers and operations. You’ll need to work together to identify a simple architecture for the system’s current needs, including peak loads and projected increases, and you’ll need to continue to coordinate as needs change.

Indicators

When you evolve your system architecture well:

  • Small systems have small architectures. Large systems have manageable architectures.

  • The system architecture is easy to explain and understand.

  • Accidental complexity is kept to a minimum.

Alternatives and Experiments

Allies
Simple Design
Incremental Design
Reflective Design

Many of the things people think of as “evolutionary system architecture” are actually just normal evolutionary design. For example, migrating a component from using one database to another is an evolutionary design problem, because it’s mainly about the design of a single component. The same is true for migrating a component from using one third-party service to another. Those sorts of changes are covered by the evolutionary design practices: simple design, incremental design, and reflective design.

Evolving system architecture means deliberately starting with the simplest system possible and growing it as your needs change. It’s an idea that has yet to be explored fully. Pick and choose the parts of this practice that work for your situation, then see how far you can push it. The underlying goal is to reduce developer and operations friction and make troubleshooting easier, without sacrificing reliability and maintainability.

Further Reading

Building Evolutionary Architectures [Ford et al. 2017] goes into much more detail about architectural options. It takes an architect-level view rather than the team-level view I’ve provided.

XXX Sarah Horan Van Treese recommends: Building Microservices by Sam Newman

AoAD2 Practice: Continuous Deployment

Continuous Deployment

Audience
Programmers, Operations

Our latest code is in production.

If you use continuous integration, your team has removed most of the risk of releasing. Done correctly, continuous integration means your team is ready to release at any time. You’ve tested your code and exercised your deployment scripts.

Ally
Continuous Integration

One source of risk remains. If you don’t deploy your software to real production servers, it’s possible that your software won’t actually work in production. Differences in environment, traffic, and usage can all result in failures, even in the most carefully-tested software.

Continuous deployment resolves this risk. It follows the same principle as continuous integration: by deploying small pieces frequently, you reduce the risk that a big change will cause problems, and you make it easy to find and fix problems when they do occur.

Although continuous deployment is a valuable practice for fluent Delivering teams, it’s optional. If your team is still developing their fluency, focus on the other practices first. Full adoption of continuous integration, including automated deployments to a test environment (which some people call “continuous delivery”), will give you nearly as much benefit.

How to Use Continuous Deployment

Allies
Zero Friction
Continuous Integration
No Bugs
Feature Flags
Build for Operation

Continuous deployment isn’t hard, but it has a lot of preconditions:

  • Create a zero-friction, zero-downtime deploy script that automatically deploys your code.

  • Use continuous integration to keep your code ready to release.

  • Improve quality to the point that your software can be deployed without manual testing.

  • Use feature flags or keystones to decouple deployments from releases.

  • Establish monitoring to alert your team of deployment failures.

Once these preconditions are met, enabling continuous deployment is just a matter of running deploy in your continuous integration script.
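For example, the tail end of an integration script might look something like this sketch in TypeScript. The npm script names are hypothetical; your build will have its own steps:

```typescript
// integrate.ts (hypothetical): the tail end of a continuous integration
// script. Once the build and tests pass, deploying is just one more step.
import { execSync } from "node:child_process";

function run(command: string): void {
  execSync(command, { stdio: "inherit" }); // throws if the command fails
}

run("npm run build");  // compile, lint, package
run("npm test");       // full test suite; must be fast and reliable
run("npm run deploy"); // continuous deployment: ship every integration
```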

Ally
Whole Team

The details of your deploy script will depend on your organization. Your team should include people with operations skills who understand what’s required. If not, ask your operations department for help. If you’re on your own, Continuous Delivery [Humble and Farley 2010] and The DevOps Handbook [Kim et al. 2016] are both useful resources.

Your deploy script must be 100% automated. You’ll be deploying every time you integrate, which will be multiple times per day, and could even be multiple times per hour. Manual steps introduce delays and errors.

Detecting Deployment Failures

Your monitoring system should alert you if a deployment fails. At a minimum, this involves monitoring for an increase in errors or a decrease in performance, but you can also look at business metrics such as user sign-up rates. Be sure to program your deploy script to detect errors, too, such as network failures during deploy. When the deploy is complete, have your deploy script tag the deployed commit with “success” or “failure.”

To reduce the impact of failure, you can deploy to a subset of servers, called canary servers, and automatically compare metrics from the old deploy to the new deploy. If they’re substantially different, raise an alert and stop the deploy. For systems with a lot of production servers, you can also have multiple waves of canary servers. For example, you could start by deploying to 10% of servers, then 50%, and finally all.
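At its core, a canary check compares metrics from servers running the new deploy against servers still running the old one. This sketch is hypothetical; the errorRate function is assumed, and real checks usually use statistical tests rather than a fixed threshold:

```typescript
// Hypothetical canary check: abort the rollout if the canary servers'
// error rate is substantially worse than the baseline servers'.
async function canaryIsHealthy(
  errorRate: (group: "canary" | "baseline") => Promise<number>,
): Promise<boolean> {
  const canary = await errorRate("canary");
  const baseline = await errorRate("baseline");
  return canary <= baseline * 1.25;
}
```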

Resolving Deployment Failures

One of the advantages of continuous deployment is that it reduces the risk of deployment. Because each deploy only represents a few hours of work, they tend to be low impact. If something does go wrong, the change can be reverted without affecting the rest of the system.

When a deployment does go wrong, immediately “stop the line” and focus the entire team on fixing the issue. Typically, this will involve rolling back the deploy.

Roll back the deploy

Start by restoring the system to its previous, working state. That typically means a rollback: restoring the previous deploy’s code and configuration. To do so, you can keep each deploy in a version control system, or just keep a copy of the most recent deploy.

One of the simplest ways to enable rollback is to use blue/green deployment. To do so, create two copies of your production environment, arbitrarily labeled “blue” and “green,” and configure your system to route traffic to one of the two environments. Each deploy toggles back and forth between the two environments, allowing you to roll back by routing traffic to the previous environment.

For example, if “blue” is active, deploy to “green.” When the deploy is complete, stop routing traffic to “blue” and route it to “green” instead. If the deploy fails, rolling back is a simple matter of routing traffic back to “blue.”
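Here’s that toggle as a sketch in TypeScript. The deployTo and routeTrafficTo operations are hypothetical stand-ins for whatever your infrastructure provides:

```typescript
// Hypothetical blue/green deploy: deploy to the idle environment, then
// switch traffic. Rolling back is just switching traffic back.
type Environment = "blue" | "green";

async function blueGreenDeploy(
  active: Environment,
  deployTo: (env: Environment) => Promise<void>,
  routeTrafficTo: (env: Environment) => Promise<void>,
): Promise<Environment> {
  const idle: Environment = active === "blue" ? "green" : "blue";
  await deployTo(idle);       // "active" keeps serving traffic during the deploy
  await routeTrafficTo(idle); // cut over; to roll back, routeTrafficTo(active)
  return idle;                // the idle environment is now active
}
```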

Occasionally, the rollback will fail. This may indicate a data corruption issue or a configuration problem. Either way, it’s all hands on deck until the problem is solved. Site Reliability Engineering [Beyer et al. 2016] has practical guidance about how to respond to such incidents in chapters 12-14.

Fix the deploy

Rolling back the bad deploy will usually solve the immediate production problem, but your team isn’t done yet. You need to fix the underlying issue. The first step is to get your integration branch back into a known-good state. You’re not trying to fix the issue yet; you’re just trying to get your code and production environment back into sync.

Start by reverting the changes in the code repository, so your integration branch matches what’s actually in production. If you use merge commits in git, you can just run git revert on the integration commit (you’ll need the -m option to tell git which parent is the mainline). Then use your normal continuous integration process to integrate and deploy the reverted code.

Deploying the reverted code should proceed without incident, because you’re deploying the same code that’s already running. It’s important to do so anyway, because it ensures your next deploy starts from a known-good state. And if this second deploy also has problems, it narrows the issue down to a deployment problem, not a problem in your code.

Ally
Incident Analysis

Once you’re back in a known-good state, you can fix the underlying mistake. Create tasks for debugging the problem—usually, the people who deployed it will fix it—and everybody can go back to working normally. After it’s been resolved, schedule an incident analysis session to learn how to prevent this sort of deployment failure from happening in the future.

Alternative: Fix forward

Some teams, rather than rolling back, fix forward. They make a quick fix—possibly by running git revert—and deploy again. The advantage of this approach is that you fix problems using your normal deployment script. Rollback scripts can go out of date, causing them to fail just when you need them the most.

On the other hand, deploy scripts tend to be slow, even if you have an option to disable testing (which isn’t necessarily a good idea). A well-executed rollback script can complete in a few seconds. Fixing forward can take a few minutes. During an outage, those seconds count. For this reason, I tend to prefer rolling back, despite the disadvantages.

Incremental Releases

Ally
Feature Flags

For large or risky changes, run the code in production before you reveal it to users. This is similar to a feature flag, except that you’ll actually exercise the new code. (Feature flags typically prevent the hidden code from running at all.) For additional safety, you can release the feature gradually, enabling a subset of users at a time.

The DevOps Handbook [Kim et al. 2016] calls this a dark launch. Chapter 12 has an example of Facebook using this approach to release Facebook Chat. The chat code was loaded onto clients and programmed to send invisible test messages to the back-end service, allowing Facebook to load-test the code before rolling it out to customers.

Data Migration

Database changes can’t be rolled back—at least, not without risking data loss—so data migration requires special care. It’s similar to performing an incremental release: first you deploy, then you migrate. There are three steps:

  1. Deploy code that understands both the new and old schemas. Deploy the data migration code at the same time.

  2. After the deploy is successful, run the data migration code. It can be started manually, or automatically as part of your deploy script.

  3. When the migration is complete, manually remove the code that understands the old schema, then deploy again.

Separating data migration from deployment allows each deploy to fail, and be rolled back, without losing any data. The migration only occurs after the new code has proven to be stable in production. It’s slightly more complicated than migrating data during deployment, but it’s safer, and it allows you to deploy with zero downtime.

Migrations involving large amounts of data require special care, because the production system needs to remain available while the data is migrating. For these sorts of migrations, write your migration code to work incrementally—possibly with a rate limiter, for performance reasons—and have it use both schemas simultaneously. For example, if you’re moving data from one table to another, your code might look at both tables when reading and updating data, but only insert data into the new table.
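For example, a dual-schema read during such a migration might look something like this sketch in TypeScript. The table names and the queryOne helper are hypothetical:

```typescript
// Hypothetical incremental migration: addresses are moving from a column
// on the "customers" table to a new "customer_addresses" table. Reads
// check both schemas; inserts go only to the new table.
interface Db {
  queryOne(sql: string, params: unknown[]): Promise<{ address: string } | null>;
}

async function findAddress(db: Db, customerId: string): Promise<string | null> {
  const migrated = await db.queryOne(
    "SELECT address FROM customer_addresses WHERE customer_id = ?",
    [customerId],
  );
  if (migrated !== null) return migrated.address;

  // Fall back to the old schema until the migration completes.
  const legacy = await db.queryOne(
    "SELECT address FROM customers WHERE id = ?",
    [customerId],
  );
  return legacy?.address ?? null;
}
```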

Allies
Task Planning
Visual Planning

After the migration is complete, be sure to keep your code clean by removing the outdated code. If the migration needs more than a few minutes, add a reminder to your team’s task plan. For very long migrations, you can add a reminder to your team calendar or schedule a “finish data migration” story into your team’s visual plan.

This three-step migration process applies to any change to external state. In addition to databases, it also includes configuration settings, infrastructure changes, and third-party service changes. Be very careful when external state is involved, because errors are difficult to undo. Smaller, more frequent changes are typically better than big, infrequent changes.

Prerequisites

Allies
Zero Friction
Continuous Integration
Feature Flags
No Bugs
Build for Operation

To use continuous deployment, your team needs a rigorous approach to continuous integration. You need to integrate multiple times per day and create a known-good, deploy-ready build each time. “Deploy-ready,” in this case, means unfinished features are hidden from users and your code doesn’t need manual testing. Finally, your deploy process needs to be completely automated, and you need a way of automatically detecting deployment failures.

Continuous deployment only makes sense when deployments are invisible to users. Practically speaking, that typically means back-end systems and web-based front-ends. Desktop and mobile front-ends, embedded systems, and so forth usually aren’t a good fit for continuous deployment.

Indicators

When your team deploys continuously:

  • Deploying to production becomes a stress-free non-event.

  • When deployment problems occur, they’re easily resolved.

  • Deploys are unlikely to cause production issues, and when they do, they’re usually quick to fix.

Alternatives and Experiments

The typical alternative to continuous deployment is release-oriented deployment: only deploying when you have something ready to release. Continuous deployment is actually safer and more reliable, once the preconditions are in place, even though it sounds scarier at first.

You don’t have to switch from release-oriented deployment directly to continuous deployment. You can take it slowly, starting out by writing a fully-automated deploy script, then automatically deploying to a staging environment as part of continuous integration, and finally moving to continuous deployment.

In terms of experimentation, the core ideas of continuous deployment are to minimize work in progress and speed up the feedback loop (see “Key Idea: Minimize Work in Progress” on page XX and “Key Idea: Fast Feedback” on page XX). Anything that you can do to speed up that feedback loop and decrease the time required to deploy is moving in the right direction. For extra points, look for ways to speed up the feedback loop for release ideas, too.

Further Reading

The DevOps Handbook [Kim et al. 2016] is a thorough look at all aspects of DevOps, including continuous deployment, with a wealth of case studies and real-world examples.

XXX Continuous Delivery

XXX Avi Kessner recommends: https://dzone.com/articles/safe-database-migration-pattern-without-downtime (regarding large database migration) (look for original)
