James Shore: Large-Scale Agile

February 24, 2010

There's been interest in "scaling Agile" since the beginning of the movement, but most of the early discussion was theoretical. There wasn't anybody doing Agile on a large scale at the time. Ron Jeffries summed up my early feelings on the subject perfectly on the C2 Wiki: "A hundred-person project is a ten-person project, with overhead." I dabbled with some of the challenging questions, such as how evolutionary architecture works in a multi-team environment, but I never tried to address the whole picture.

That's been changing. Agile has matured beyond the pilot project stage, and more companies are interested in rolling it out organization-wide. There's a genuine need for large-scale Agile that's continuing to grow, and even my smaller, entrepreneurial clients are facing the issue.

So I'm taking a second look at large-scale Agile systems. This essay shares some of my latest ideas. I've tried a lot of these ideas, but not as a complete package, and none of them have the solid years of practice required to turn supposition into reliable recommendation.

The Constraint

When does Agile go from "normal" to "large-scale?" In my mind, it's the moment when you go from one intact team--that is, a cross-functional, co-located team that has the resources necessary to finish the job--to multiple interdependent teams. As a practical matter, this happens once you need more than ten to twenty people on a project. You can fudge along with modified "single team" practices up until about fifty people, but it becomes increasingly unwieldy. By the time you reach 100 people, even modified "single team" approaches break down.

Why ten? According to my colleague Diana Larsen, the optimum size for a team is "seven plus or minus two." Once a team gets much larger than that, it starts breaking down into sub-teams. Even if they're all part of the same "team" on paper, the people form cliques and stop working as a single unit. You end up with multiple interdependent teams whether you admit it or not.

Now, you can certainly have multiple interdependent teams with fewer than ten people. This is the de facto result of having a multi-location "team" unless you're very, very careful. But it's not optimal. Coordinating those teams adds a lot of overhead and opportunities for error. That's why I recommend creating intact teams when you can.

One last thing: there's no need for the cost and waste of large-scale Agile if your teams aren't interdependent. If you have five ten-person teams, each working on a separate product, with no shared code or other resources, then they're not interdependent. You can be quite successful with five normal Agile teams. It's once those teams start collaborating on a single product suite, or trying to reuse each other's code, or otherwise depending on each other to get the job done, that the need for large-scale Agile arises.

Collected Wisdom

There isn't a lot out there on large-scale Agile. There's a lot of interest in the subject, but no approach that's risen to prominence. The closest thing we have is Scrum's "Scrum of Scrums" idea, which is essentially a second (and possibly third) level of Daily Scrums which includes one representative from each team.

"Scrum of Scrums" seems fine for sharing status and engaging in light coordination, but in typical Scrum fashion it's silent on the tough details. It doesn't discuss how you address architectural issues, how you plan the work, or even how to divide work among teams. The folks I've talked to who have experience with Scrum of Scrums said it wasn't sufficient on its own.

Other than Scrum of Scrums, I haven't seen any consensus around large-scale Agile techniques. Most of the discussions get side-tracked by the politics and constraints of large organizations and end up focusing on how to water down Agile ideas to make them more palatable. A particularly common subset is "distributed Agile," often with regard to constructing a single team from people in multiple locations.

That's not really large-scale Agile. It's mediocre small-scale Agile repeated many times. Not what I'm looking for.

The Ideas

Maybe there are brilliant ideas out there that I've missed. If so, please let me know. In the absence of anything else, though, I've put together my own package of ideas. (Brilliance optional.) As usual when I modify an Agile method, these ideas are inspired by Agile values and principles more than existing practices. In this case, I found myself gravitating towards Lean principles, first introduced to the Agile community by the Poppendiecks in Lean Software Development.

I started with Kanban. Kanban's an up-and-comer in the Agile community, but the popular implementation of it bothers me. It's heavily phase-based and typically ignores what the Agile community has learned about simultaneous phases.

At the recent Agile Open Northwest conference, I realized my core objection: Kanban enthusiasts are using it to manage the work of the team, a problem that the Agile community has already solved. Instead, they should be thinking of the team as a Lean work cell, and using Kanban to solve the hard problem: managing the interaction between teams!

That realization was the seed for my large-scale Agile ideas. I filled it out with other ideas that have been germinating for years. Here's the whole package:

1. Create High-Performance Work Cells
2. Use Kanban to Manage Cross-Team Workflow
3. Manage the Portfolio with a Dedicated Team
4. Use Bounded Contexts to Minimize Dependencies
5. Monitor the System Using Lean Techniques
6. Sacrifice Reuse in Favor of Throughput
7. Keep Communication Flowing with Scrums of Scrums

1. Create High-Performance Work Cells

One of the key ideas of Lean is "one-piece flow." The goal is to take one car, or widget, or whatever entirely from start to finish without accumulating any inventory between steps. This has numerous benefits, including higher quality, higher throughput, and lower inventory. In practice, it's not possible to have the entire resources of a factory focused on just one widget. But that's the goal.

To this end, Lean has the concept of the work cell. In a work cell, an operator takes a piece of work from one machine to the next until it's been processed by all of the machines in the cell. This is quite a contrast to the traditional assembly line approach where each person operates just one machine all the time and inventory piles up between machines. Jeffrey Liker explains why:

When you link operations together in a one-piece flow, your entire cell goes down if any one piece of equipment fails. You sink or swim together as a unit. So why not have some inventory to make life more comfortable? Because whether it is a pile of material or a virtual pile of information waiting to be processed, inventory hides problems and inefficiencies. Inventory enables the bad habit of not having to confront your problems. If you don't confront your problems, you can't improve your process. One-piece flow and continuous improvement (kaizen) go hand in hand!

Jeffrey Liker, The Toyota Way

The Lean work cell maps almost perfectly to the Agile idea of the cross-functional, co-located team using simultaneous phases. Rather than using the phase-based approach of handing off responsibility from group to group (requirements, then design, then coding, then test), an Agile team takes responsibility for one or two features at a time and works on them as a whole until they're done.

So the foundation of my approach to large-scale Agile is the high-performance Agile team, or work cell. Rather than water down Agile to fit in a large organization, I expect the work cell to use the same rigorous Agile practices normal Agile teams do, and to produce the same high-quality, potentially-shippable results.

2. Use Kanban to Manage Cross-Team Workflow

Kanban has attracted a lot of attention lately in the Agile world, but I think the attention is misplaced. Kanban enthusiasts are using it to manage the work of the team, which ironically adds process steps and increases inventory over using the "work cell" approach I described above.

On the other hand, Kanban is a great way of coordinating work between multiple teams. I've found it to be a useful way of managing the backlog of a single Agile team. For large-scale Agile systems, the parallels with Lean are even stronger.

In this approach, "features" (or stories) are the unit of work. Each team gets a single backlog with a strict size limit that operates as their input buffer. When the team is ready to work on something new, they take the next item off of the backlog, work on it until it's done, and then deliver it to whichever team requested it. Teams should work on just a few features at a time--preferably, only one.

When there's a cross-team dependency, you can use feature cards to create a pull system. When a team gets a new feature, they take a look at it to see if the work has any dependencies on another team. If it does, they deliver the card to the other team. If the other team doesn't have room for the work, the card waits in the first team's backlog, taking up space, until the other team is ready for it.

For example, imagine a large-scale Agile system at Motokia, a fictional cell phone manufacturer. One of the teams in the system is the S42 Web Browser team. They're responsible for the web browser that's shipped with the Motokia S42 smartphone.

Assume the S42 Web Browser team receives a feature card that says, "Support tabbed browsing." In order to do that, they need the Mobile Sachromafox Web Engine team to implement support for multiple tabs. They take the card, walk over to the Mobile Sachromafox Web Engine team's shared workspace, and discuss it with them. The Web Engine team takes it and a week later (or whatever), the S42 Web Browser team gets the card back, along with a new build of the engine they need.
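As a rough illustration, the pull system described above can be sketched in a few lines of Python. The class names, the backlog limits, and the "Faster page rendering" card are invented for this sketch; it assumes that a delegated card leaves the requesting team's backlog only once the other team has room for it.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Card:
    title: str
    requested_by: str = "Product Portfolio Team"


@dataclass
class Team:
    name: str
    limit: int  # strict size limit on the team's input buffer
    backlog: deque = field(default_factory=deque)

    def accept(self, card: Card) -> bool:
        """Take a card into the backlog only if the size limit allows it."""
        if len(self.backlog) >= self.limit:
            return False
        self.backlog.append(card)
        return True

    def pull_next(self) -> Optional[Card]:
        """Start work on the next item, freeing one backlog slot."""
        return self.backlog.popleft() if self.backlog else None

    def delegate(self, card: Card, provider: "Team") -> bool:
        """Deliver a card from our backlog to the team we depend on.

        If the provider has no room, the card stays here and keeps
        occupying a slot until a later attempt succeeds.
        """
        request = Card(card.title, requested_by=self.name)
        if not provider.accept(request):
            return False
        self.backlog.remove(card)
        return True


browser = Team("S42 Web Browser", limit=2)
engine = Team("Mobile Sachromafox Web Engine", limit=1)

engine.accept(Card("Faster page rendering"))  # engine's buffer is now full
tabs = Card("Support tabbed browsing")
browser.accept(tabs)

first_try = browser.delegate(tabs, engine)   # False: card waits with the browser team
engine.pull_next()                           # engine starts its queued work
second_try = browser.delegate(tabs, engine)  # True: card moves to the engine team
```

The size limit is what makes this a pull system: a full downstream backlog pushes the waiting back onto the requesting team, where it is visible and takes up a slot.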

3. Manage the Portfolio with a Dedicated Team

At the very front (or back, depending on how you look at it) of the workflow is the Product Portfolio Team. This team is a work cell, too--it's a co-located, cross-functional, intact team consisting of product managers, business experts, UX designers, architects, developers, testers, and anyone else needed. Their job is to craft the big picture ideas (like "provide a high performance, compelling mobile web browser that blows away the competition") and make sure they come together smoothly. They're the ones who push the "ship it" button.

Product managers on the team come up with the big ideas and prioritize them.

Product managers, business experts, and UX designers break the big ideas down into specific features and parcel out those features to various teams. They also provide direction on how those features should work and how they come together into a compelling whole.

Developers and testers perform ongoing integration development and testing as various teams come back with their components.

Architects float from team to team providing big-picture guidance, sharing patterns and lessons learned, and looking for opportunities for reuse. They also keep an eye on the flow of work-in-progress and provide insight into staffing levels, bounded contexts, and how to divide responsibilities among teams.

To continue the example, let's assume that Motokia's large-scale Agile system is dedicated to smartphone development. They're responsible for the development of three phones: the Motokia S42, the Motokia B52, and the Motokia K9 (for when you really need to email your pet). At the front of the system's workflow is the Motokia Smartphone Portfolio Team.

The portfolio team decides that, in order to remain competitive, they need a high-performance, compelling mobile web browser across the entire smartphone line. They break this idea down into specific features, such as "pages display before they finish downloading," "tabbed browsing," "high-performance JavaScript," and "integration with email," and hand them off to the Motokia S42 team, the Motokia B52 team, and the Motokia K9 team. As the features come back from the various smartphone teams, the portfolio team confirms that they work consistently with each other and does any other integration work needed.

Meanwhile, the portfolio team's architects each go join the other teams in the large-scale Agile system for a few weeks at a time. They check back with each other frequently to share information about roadblocks, challenges, and insights, and take what they learn back to individual teams. They also discuss how responsibilities have been divided between the teams, such as the decision to have a shared Mobile Sachromafox Web Engine team, and keep an eye out for bottlenecks and opportunities for improvement in the overall system.

4. Use Bounded Contexts to Minimize Dependencies

The more dependencies between teams, the more hand-offs. The more hand-offs, the greater the opportunity for delay and error. These delays and errors are the reason large-scale Agile systems are so much more wasteful than individual Agile teams. Minimizing the number of dependencies is critical to reducing waste.

My favorite technique for managing dependencies between teams is the bounded context. The term comes from Eric Evans' excellent book, Domain-Driven Design. It's part of a discussion of cross-team dependencies hidden away in Chapter 14.

To put it simply, a bounded context is a collection of code, database schema, and other artifacts that are managed as a single unit. Within the bounded context, changes to one part of the system are allowed to affect any other part of the system. Any part of the bounded context may be refactored, enhanced, or changed at any time, and everybody working within a bounded context is expected to be aware of what everyone else in that context is doing.

Between bounded contexts, things are different. Communication can't be assumed, which means changes that affect other bounded contexts must be made carefully. To manage changes, bounded contexts might provide a published and versioned API, service level agreements, or formal specifications. Evans provides a good overview of the various ways bounded contexts can interact in Chapter 14 of Domain-Driven Design.

Bounded contexts allow large-scale Agile to retain the fluidity of normal Agile without running into an exponential communication problem as more people are added. Within bounded contexts, the normal Agile free-for-all takes place. Between bounded contexts, changes are made much more carefully and slowly.

So, where do you draw the boundaries? That's easy: bounded contexts correspond to Agile work cells--that is, teams. Every team gets a single bounded context. Sometimes a team might have several small bounded contexts, but you never share bounded contexts across teams. Agile teams have the pervasive communication necessary to manage a bounded context, and the work cell approach allows even a large team to share ownership of a single bounded context.

This does raise the question of how to partition the work among teams. That's the responsibility of the architects on the Product Portfolio Team, and it's a decision that will gradually evolve over time. As the architects consider this question, they should look for structures that have the fewest links between bounded contexts.
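One simple way to compare candidate partitions is to count the dependency links that cross a team boundary. The sketch below is only an illustration: the component names, the dependency links, and the two candidate assignments are invented, and a real decision would weigh more than a raw count.

```python
def cross_context_links(dependencies, owner):
    """Return the dependency links whose two ends belong to different teams."""
    return [(a, b) for a, b in dependencies if owner[a] != owner[b]]


# Invented component dependencies: (component, component it depends on).
dependencies = [
    ("browser_ui", "layout"),
    ("layout", "network"),
    ("layout", "scripting"),
]

# Two candidate ways of assigning components to teams (bounded contexts).
option_a = {"browser_ui": "Browser", "layout": "Engine",
            "network": "Engine", "scripting": "Scripting"}
option_b = {"browser_ui": "Browser", "layout": "Engine",
            "network": "Scripting", "scripting": "Scripting"}

links_a = cross_context_links(dependencies, option_a)
links_b = cross_context_links(dependencies, option_b)
print(len(links_a), len(links_b))  # 2 3
```

Under these assumptions, option A keeps the layout-to-network link inside one team, so it has fewer cross-team hand-offs than option B.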

Returning to the Motokia example, Mobile Sachromafox consists of three major components: the JavaScript interpreter, the layout engine, and the network layer. It's too much work for a single team, so the work has been split into two bounded contexts, each with its own team. The JavaScript Engine team is responsible for the JavaScript interpreter and the Web Engine team is responsible for layout and networking.

As the Web Engine team does its work, they discover that their layout engine is too slow, and they engage in a huge refactoring effort to fix it. It ends up requiring changes to methods throughout their codebase.

Nonetheless, the Web Engine team makes those changes without worrying about their impact on other teams. They have a published API that they provide to their clients (such as the S42 Web Browser team), and the layout engine's changes didn't require any changes to the API. They do identify some changes that they would like from the JavaScript interpreter, but they have to put those changes off because the interpreter isn't part of their bounded context.

Clear boundaries allow the Web Engine team to make the rest of their changes quickly, secure in the knowledge that they have free rein to make the refactorings they need.
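In code, the boundary might look something like the following Python sketch. Everything here is hypothetical: the `WebEngine` class, its `layout` method, and the version string stand in for whatever published API a real team would offer. The subclass is just a compact way to show "same published API, different internals."

```python
import textwrap


class WebEngine:
    """Hypothetical published API of the Web Engine team's bounded context."""

    API_VERSION = "1.0"  # changes only when the published contract changes

    def layout(self, text: str, width: int) -> list:
        """Published call: break text into lines no wider than `width`."""
        return self._wrap(text, width)

    # Internal to the bounded context: the owning team may rewrite this freely.
    def _wrap(self, text, width):
        lines, current = [], ""
        for word in text.split():
            candidate = f"{current} {word}".strip()
            if len(candidate) <= width or not current:
                current = candidate
            else:
                lines.append(current)
                current = word
        if current:
            lines.append(current)
        return lines


class RefactoredWebEngine(WebEngine):
    """After the refactoring: new internals, same published API and version."""

    def _wrap(self, text, width):
        return textwrap.wrap(text, width, break_long_words=False)


def browser_page(engine, text):
    """Client code (say, a browser team) touches only the published call."""
    return "\n".join(engine.layout(text, width=16))


sample = "pages display before they finish downloading"
before = browser_page(WebEngine(), sample)
after = browser_page(RefactoredWebEngine(), sample)
```

Because the client depends only on `layout` and `API_VERSION`, the internal rewrite needs no coordination with other teams as long as the published behavior stays the same.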

5. Monitor the System Using Lean Techniques

A large-scale Agile system with dozens of interacting teams is quite a beast. Keeping track of what's happening seems like it would be difficult at best.

Lean techniques seem like the best fit to manage the complexity. Visual control, measuring throughput, and value-stream mapping all seem like good ideas. I'm sure more techniques will emerge with practice.

Visual control means making progress and problems clearly visible. Fly a red flag above a team's shared workspace when they're blocked by another team. Monitor the number of days a team spends on a particular feature, and change the number from green to red once a certain threshold is reached. Use physical cards and whiteboards for team planning. Use kanban boards to show the flow of work between teams. And so on.
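Physical flags and whiteboards are the point here, but the same signals are easy to compute if you also want them on a screen. This is a minimal sketch; the ten-day threshold, the dates, and the blocked-by relationship are made up for the example.

```python
from datetime import date


def age_colour(started: date, today: date, threshold_days: int) -> str:
    """Green while the feature is young, red once the threshold is reached."""
    return "red" if (today - started).days >= threshold_days else "green"


def board_line(team, feature, started, today, threshold_days=10, blocked_by=None):
    """One line of a plain-text status board for a single team."""
    age = (today - started).days
    colour = age_colour(started, today, threshold_days)
    flag = f" RED FLAG: blocked by {blocked_by}" if blocked_by else ""
    return f"{team}: {feature} (day {age}, {colour}){flag}"


today = date(2010, 2, 24)
line = board_line("S42 Web Browser", "Support tabbed browsing",
                  date(2010, 2, 8), today,
                  blocked_by="Mobile Sachromafox Web Engine")
print(line)
```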

Throughput is my favorite productivity metric. It seems particularly apt for the complexity of large-scale Agile. It's also very easy to measure: just track how long it takes for an idea to go from concept to cash--from initial idea to revenue in the door. Improvements in throughput reflect improvements to the overall system.

Value-stream mapping is a useful technique for finding waste in a system and identifying opportunities to improve throughput. The Poppendiecks explain it well in their book Lean Software Development; in summary, you document the progress of a typical feature through the system, from concept to cash, keeping track of the amount of time spent adding value versus the amount of time wasted in waiting or rework.
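The arithmetic behind a value-stream map is simple enough to sketch. In the Python below, the steps and day counts are invented for illustration, and the ratio of value-adding time to total time is one common way to summarize the map.

```python
def summarize_value_stream(steps):
    """steps: (name, value_added_days, waiting_days) tuples, in order.

    Returns (total days from concept to cash, fraction spent adding value).
    """
    value_added = sum(v for _, v, _ in steps)
    waiting = sum(w for _, _, w in steps)
    lead_time = value_added + waiting
    return lead_time, value_added / lead_time


# Invented numbers for one "typical" feature.
steps = [
    ("Portfolio team defines feature", 2, 5),
    ("Waits in browser team backlog", 0, 8),
    ("Web engine team builds support", 5, 3),
    ("Browser team builds feature", 6, 2),
    ("Integration and release", 2, 7),
]

lead_time, efficiency = summarize_value_stream(steps)
worst_wait = max(steps, key=lambda step: step[2])[0]
print(lead_time, efficiency, worst_wait)
```

With these made-up figures, the feature takes 40 days from concept to cash, of which only 15 add value, and the longest wait is in the browser team's backlog. That is where to look first for waste.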

Another Lean concept is the idea of takt time, which is the maximum time allowed to produce a unit in order for things to finish on schedule. In Lean systems, it becomes the heartbeat of the operation. Takt time seems like a good concept for large-scale Agile, but I'm not sure how to fit it in.
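For reference, the usual Lean calculation divides the available working time by the number of units required. How that maps onto features is an open question, so treat the following as a sketch only: it assumes "unit" means one feature, and the 60-day, 24-feature release and the observed pace are invented.

```python
def takt_time(available_days: float, units_required: int) -> float:
    """Maximum time allowed per unit in order to finish on schedule."""
    if units_required <= 0:
        raise ValueError("units_required must be positive")
    return available_days / units_required


# Invented example: 24 features wanted in a release 60 working days away.
takt = takt_time(available_days=60, units_required=24)

observed_days_per_feature = 3.0  # hypothetical measured pace
behind_schedule = observed_days_per_feature > takt
print(takt, behind_schedule)  # 2.5 True
```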

6. Sacrifice Reuse in Favor of Throughput

Large-scale Agile systems are likely to create more duplicate code than normal Agile teams, because there's so much code being created and it isn't possible for everyone to be aware of everything that's going on. Part of the role of the architects on the Product Portfolio Team is to identify when code needs to be made reusable.

Be careful, though. In The Mythical Man-Month, Fred Brooks estimated that making code reusable tripled its cost of development. Technology has changed a lot since Dr. Brooks wrote that essay, but reuse still isn't free. In addition, the reusable software will form a new bounded context that will likely need its own team for ongoing enhancements and support. Users of the code won't be able to make changes when they need to. Instead, they'll have to make a hand-off to the other team and wait for the change, which increases delay and error. Even the team that originally developed the code will likely slow down, because the component will now be outside of their bounded context too.

These hand-offs mean that extracting a component for reuse by multiple teams can actually degrade throughput. As a result, the threshold for reuse in a large-scale Agile system should be much higher than it is on a normal Agile team. Track throughput and use this information to help guide, and even reverse, your decisions about when to reuse code and when to permit duplicate work.

7. Keep Communication Flowing with Scrums of Scrums

I haven't used a Scrum of Scrums myself, but I can see its value in keeping teams in touch with each other. In the large-scale Agile system I'm proposing, the Scrum of Scrums would play out a little differently than described in the Scrum literature. Rather than using a Scrum of Scrums to coordinate all of the teams, I would ask teams to create a Scrum of Scrums among the teams that have dependencies on each other. A large-scale Agile system might end up with multiple Scrums of Scrums.

For example, the Mobile Sachromafox Web Engine team might form a Scrum of Scrums with the JavaScript Engine team and the various web browser teams (the S42 Web Browser team, the B52 Web Browser team, and the K9 Web Browser team). Those teams form a web of dependencies and need to be aware of what each of the others is doing.

At the same time, other teams would form independent but overlapping Scrums of Scrums. The S42 Web Browser team could form a Scrum of Scrums with all of the other S42 application teams, and the S42 Smartphone team might form a Scrum of Scrums with the Product Portfolio Team and the other top-level smartphone teams.
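If the dependencies between teams are written down, the membership of each Scrum of Scrums can be derived from them. The sketch below assumes one rule, which is only one possible interpretation: a team's Scrum of Scrums includes the team itself, the teams it depends on, and the teams that depend on it. The direction of the Web Engine to JavaScript Engine dependency is also an assumption.

```python
def scrum_of_scrums(team, depends_on):
    """Return the team plus every team it depends on or that depends on it."""
    group = {team} | set(depends_on.get(team, ()))
    group |= {other for other, deps in depends_on.items() if team in deps}
    return group


# Team -> teams it depends on (dependency directions are assumed).
depends_on = {
    "S42 Web Browser": ["Web Engine"],
    "B52 Web Browser": ["Web Engine"],
    "K9 Web Browser": ["Web Engine"],
    "Web Engine": ["JavaScript Engine"],
}

engine_group = scrum_of_scrums("Web Engine", depends_on)
s42_group = scrum_of_scrums("S42 Web Browser", depends_on)
print(sorted(engine_group))
print(sorted(s42_group))
```

The Web Engine team ends up in a five-team group, while the S42 Web Browser team's dependency-based group contains only itself and the Web Engine team. The groups overlap without everyone attending every meeting.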

Putting It Together

These ideas need further experimentation to see how they play out in practice. I've tried most of these ideas in one form or another on real projects, but this is the first time I've put them together as I describe here. It's my experience that every new process idea has hidden flaws that only get revealed when you put it into practice. I'm sure that's true of these ideas, too. I like to try my ideas in at least three different organizations, refining them over the course of several years, before I'm comfortable that they're as good as they look.

Unfortunately, large-scale Agile implementations are hard to come by. If you have one, I hope you'll give these ideas a try. And if you'd like me to come by and try them out with you, I'd love the opportunity. These theories need to be tempered in the crucible of experience. Let me know how they work for you.