Large-Scale Agile: Where Do You Want Your Complexity?
January 18, 2019
One of the pernicious problems in large-scale software development is cross-team coordination. Most large-scale Agile methods focus on product and portfolio coordination, but there's a harder problem out there: coordinating the engineering work.
Poor engineering coordination leads to major problems: bugs, delays, production outages. Cross-team inefficiencies creep in, leading to Ron Jeffries' description: "a hundred-person project is a ten-person project, with overhead." One of my large-scale clients had teams that were taking three months to deliver five days of work—all due to cross-team coordination challenges.
How do you prevent these problems? One of the key ideas is to isolate your teams: to carefully arrange responsibilities so they don't need to coordinate so much. But even then, as Michael Feathers says, there's a law of conservation of complexity in software. We can move the complexity around, but we can't eliminate it entirely.
So where should your complexity live?
Monolith: Design Complexity
In the beginning, applications were monoliths. A monolith is a single program that runs in a single process, perhaps replicated across multiple computers, maybe talking to a back-end database or two. Different teams can be assigned to work on different parts of the program. Nice and simple.
Too simple. A monolith encourages bad habits. If two parts of the program need to communicate, sometimes the easiest way is to create a global variable or a singleton. If you need a piece of data, sometimes the easiest way is to duplicate the SQL query. These shortcuts introduce coupling that makes the code harder to understand and change.
To be successful with a monolith, you need careful design to make sure different parts of the program are isolated. The more teams you have, the more difficult this discipline is to maintain. Monoliths don't provide any guardrails to prevent well-meaning programmers from crossing the line.
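To make that kind of isolation concrete, here's a minimal Python sketch. Everything in it is hypothetical (the BillingStore class, the invoices table, and the render_statement function are invented for illustration). It shows one team's code owning its SQL behind a small interface, while another team's code receives that interface as an argument instead of reaching for a global or copying the query.

```python
import sqlite3


class BillingStore:
    """Hypothetical owner of the invoices table; the only code that queries it."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS invoices (customer TEXT, cents INTEGER)"
        )

    def add_invoice(self, customer: str, cents: int) -> None:
        self._conn.execute("INSERT INTO invoices VALUES (?, ?)", (customer, cents))

    def total_owed(self, customer: str) -> int:
        # The SQL lives in exactly one place, so a schema change touches one class.
        row = self._conn.execute(
            "SELECT COALESCE(SUM(cents), 0) FROM invoices WHERE customer = ?",
            (customer,),
        ).fetchone()
        return row[0]


def render_statement(billing: BillingStore, customer: str) -> str:
    """Reporting code: takes its dependency explicitly, no global, no copied SQL."""
    return f"{customer} owes {billing.total_owed(customer) / 100:.2f}"
```

Nothing in the language stops the reporting code from opening its own connection and querying the table directly. The boundary holds only as long as people respect it.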
Microservices: Ops Complexity
At first glance, microservices seem like the perfect solution to the promiscuous chaos of the monolith. Each microservice is a small, self-contained program. Each is developed completely independently, with its own repository, database, and tooling. Services are deployed and run separately, so it's literally impossible for one team to inappropriately touch the implementation details of another.
Unfortunately, microservices move the coordination burden from development to operations. Now, instead of deploying and monitoring a single application, ops has to deploy and monitor dozens or even hundreds of services. Applications often can't be tested on developers' machines, requiring additional ops effort to create test environments. And when a service's behavior changes, other services that are affected by those changes have to be carefully orchestrated to ensure that they're deployed in the right order.
The microservice ops burden is often underestimated. I see it so often, I've given it a name: "Angry Ops Syndrome." As dev grows, the ops burden multiplies, but ops hiring doesn't keep pace. Problems pile up and a firefighting mentality takes over, leaving no time for systemic improvements. Sleepless nights, bitterness, and burnout result, leading to infighting between ops and dev.
Microservices also introduce complexity for programmers. Because of the stringent isolation, it's difficult to refactor across service boundaries. This can result in janky cross-service division of responsibilities. Programmers also need to be careful about dealing with network latency and errors. It's easy to forget, leading to production failures and more work for ops.
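As a rough illustration of the extra care a network call needs, here's a minimal Python sketch of retrying a request with exponential backoff. The call_with_retries function and its parameters are my own invention for this example, not any particular library's API. Real services usually also need timeouts on the request itself, idempotency guarantees, and limits on total retry time.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run request(), retrying transient network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return request()
        except (TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait base_delay, then 2x, then 4x, ... before trying again.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

An in-process function call needs none of this, which is exactly why it's easy to forget when the call crosses a service boundary.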
To be successful with microservices, you need well-staffed ops, a culture of dev-ops collaboration, and tooling that allows you to coordinate all your disparate services. Because refactoring across service boundaries is so difficult, Martin Fowler also suggests you start with a monolith so you can get your division of responsibilities correct.
Nanoservices: Versioning Complexity
What's a nanoservice? It's just like a microservice, but without all the networking complexity and overhead. In other words, it's a library.
Joking aside, the key innovation in microservices isn't the network layer, which introduces all that undesired complexity, but the idea of small, purpose-specific databases [1]. Instead of microservices, you can create libraries that each connect to their own database, just like a microservice. (Or even a separate set of tables in a single master database. The key here is separate, though.) And each library can have its own repository and tooling.
[1] Thanks to Matteo Vaccari for this insight.
Using a library is just a matter of installing it and making a function call. No network or deployment issues to worry about.
But now you have versioning hell. When you update your library, how do you make sure everybody installs the new version? It's particularly bad when you want to update your database schema. You either have to maintain multiple versions of your database or wait for 100% adoption before deploying the updates.
To be successful with libraries, you need a way to ensure that new versions are consumed quickly and the software that uses those versions is deployed promptly. Without it, your ability to make changes will stall.
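To show where the schema problem bites, here's a minimal Python sketch of a library that owns its own SQLite database and stamps it with a schema version. The profiles table, the LIBRARY_SCHEMA_VERSION constant, and the open_store function are hypothetical names I made up for illustration. The version stamp uses SQLite's built-in user_version pragma.

```python
import sqlite3

LIBRARY_SCHEMA_VERSION = 2  # hypothetical: the schema this library release expects


class SchemaMismatch(RuntimeError):
    """Raised when the database was migrated past what this library version knows."""


def open_store(conn: sqlite3.Connection) -> sqlite3.Connection:
    """Check and upgrade the library's private database before anyone uses it."""
    (found,) = conn.execute("PRAGMA user_version").fetchone()
    if found > LIBRARY_SCHEMA_VERSION:
        # An application still on an old library release lands here.
        raise SchemaMismatch(
            f"database is at schema {found}, "
            f"library supports {LIBRARY_SCHEMA_VERSION}; upgrade the library"
        )
    if found < 1:
        conn.execute("CREATE TABLE profiles (user_id TEXT PRIMARY KEY, name TEXT)")
    if found < 2:
        conn.execute("ALTER TABLE profiles ADD COLUMN email TEXT")
    # PRAGMA values can't be bound as parameters; the constant is a trusted integer.
    conn.execute(f"PRAGMA user_version = {LIBRARY_SCHEMA_VERSION}")
    return conn
```

The first application to deploy a release with schema 3 would migrate the shared database, and every application still on this release would start raising SchemaMismatch. That's the choice between supporting multiple schema versions and waiting for full adoption.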
Monorepo: Tooling Complexity
Maybe total isolation isn't the answer. Some large companies, including Google and Facebook, take a hybrid approach. They keep all their code in a single repository, like a monolith, but they divide the repository into multiple projects that can be built and deployed independently. When you update your project, you can directly make changes to its consumers, but the people who are affected get to sign off on your changes.
The problem with this approach is that it requires custom tooling. Google built their own version control system. Facebook patched Mercurial. Microsoft built a virtual filesystem for Git.
You also need a way of building software and resolving cross-team dependencies. Google's solution is called Blaze, and they've open-sourced a version of it called Bazel. Additional tooling is needed for cross-team ownership and sign-offs.
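The core question that dependency tooling has to answer is "if this project changes, who else is affected?" Here's a minimal Python sketch of that idea. It is not how Blaze or Bazel is implemented; the affected_projects function and the project names in the usage example are hypothetical. It just walks a declared dependency graph in reverse.

```python
from collections import deque
from typing import Dict, List, Set


def affected_projects(deps: Dict[str, List[str]], changed: str) -> Set[str]:
    """Return every project that depends, directly or transitively, on `changed`."""
    # Invert the graph: for each project, record who consumes it.
    consumers: Dict[str, Set[str]] = {}
    for project, requirements in deps.items():
        for required in requirements:
            consumers.setdefault(required, set()).add(project)

    # Breadth-first walk outward from the changed project through its consumers.
    affected: Set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in consumers.get(current, ()):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected


# Usage with made-up projects: each key lists the projects it depends on.
deps = {
    "auth": [],
    "billing": ["auth"],
    "web": ["billing", "auth"],
    "reports": ["billing"],
    "docs": [],
}
to_rebuild = affected_projects(deps, "auth")  # billing, web, and reports
```

The same set tells you which projects to rebuild and retest, and whose owners would need to sign off on the change.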
These tools are starting to enter the mainstream, but they aren't there yet. Until then, to be successful with a monorepo, you need to devote extra time to tooling.
Which Way is Best?
I'm partial to the monorepo approach. Of all the options, it seems to have the best ability to actually reduce coordination costs (via tooling) rather than just sliding the costs around to different parts of the organization. If I were starting from scratch, I would start with a monorepo and scale my tooling support for it along with the rest of the organization. In a large organization, a percentage of development teams should be devoted to enabling other developers, and one of them can be responsible for monorepo tooling.
But you probably don't have the luxury of starting from scratch. In that case, it's a matter of choosing your poison. What is your organization best at? If you have a great ops team and embedded devops, microservices could work very well for you. On the other hand, if you aren't great at ops, but your programmers are great at keeping their designs clean, a monolith could work well.
As always, engineering is a matter of trade-offs. Which one is best? Wherever you want your complexity to live.
(Thanks to Michael Feathers for reviewing an early draft of this essay.)