AoAD2 Practice: Evolutionary System Architecture


This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Revised: July 18, 2021

Evolutionary System Architecture

Audience
Programmers, Operations

We build our infrastructure for what we need today, without sacrificing tomorrow.

Allies
Simple Design
Incremental Design
Reflective Design

Simplicity is at the heart of Agile, as discussed in “Key Idea: Simplicity” on page XX. It’s particularly apparent in the way fluent Delivering teams approach evolutionary design: they start with the simplest possible design, layer on more capabilities using incremental design, and constantly refine and improve their code using reflective design.

What about your system architecture? By system architecture, I mean the components that make up your deployed system: the applications and services built by your team, and the way they interact; your network gateways and load balancers; even third-party services. Can you start simple and evolve from there?

That’s evolutionary system architecture, and I’ve seen it work on small systems. But system architectures are slow to evolve, so there isn’t the same depth of industry experience behind evolutionary system architecture that there is behind evolutionary design. Use your own judgment about how and when it should be applied.

I make a distinction between system architecture and application architecture. Application architecture is the design of your code, including decisions about how to call other components in your system. It’s discussed in “Application Architecture” on page XX. This practice discusses system architecture: decisions about which components to create and use, and the high-level relationships between them.

Are You Really Gonna Need It?

The software industry is awash with stories of big companies solving big problems. Google has a database that is replicated around the world! Netflix shut down their data centers and moved everything to the cloud! Amazon mandated that every team publish their services, and by doing so, created the billion-dollar Amazon Web Services business!

It’s tempting to imitate these success stories, but the problems those companies were solving are probably not the problems you need to solve. Until you’re the size of Google, or Netflix, or Amazon... YAGNI. You aren’t gonna need it.

Consider Stack Overflow, the popular programming Q&A site. They serve 1.3 billion pages per month, rendering each one in less than 19 milliseconds. How do they do it?1

1Stack Overflow publishes their system architecture and performance stats at https://stackexchange.com/performance, and Nick Craver has an in-depth series discussing their architecture at [Craver 2016]. The quoted data was accessed on May 4th, 2021.

  • 2 HAProxy load balancers, one live and one failover, peaking at 4,500 requests per second and 18% CPU utilization;

  • 9 IIS web servers, peaking at 450 requests per second and 12% CPU utilization;

  • 2 Redis caches, one master and one replica, peaking at 60,000 operations per second and 2% CPU utilization;

  • 2 SQL Server database servers for Stack Overflow, one live and one standby, peaking at 11,000 queries per second and 15% CPU utilization;

  • 2 more SQL Server servers for the other Stack Exchange sites, one live and one standby, peaking at 12,800 queries per second and 14% CPU utilization;

  • 3 tag engine and Elasticsearch servers, averaging 3,644 requests per minute and 3% CPU utilization for the custom tag engine, and 34 million searches per day and 7% utilization for Elasticsearch;

  • 1 SQL Server database server for HTTP request logging;

  • 6 LogStash servers for all other logs; and

  • Approximately the same thing again in a redundant data center (for disaster recovery).

As of 2016, they deployed the Stack Overflow site 5-10 times per day. The full deploy took about eight minutes. Other than localization and database migration, deploying was a matter of looping through the servers, taking each out of the HAProxy rotation, copying the files, and putting it back into rotation. Their primary application is a single multi-tenant monolith which serves all Q&A websites.
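
To make the shape of that deploy loop concrete, here’s a minimal sketch in TypeScript. Every name in it—the server list and the deploy-tools commands—is a hypothetical placeholder, not Stack Overflow’s actual tooling.

    // Hedged sketch of a rolling, file-copy deploy. All names here are
    // hypothetical placeholders, not Stack Overflow's actual tooling.
    import { execSync } from "node:child_process";

    const webServers = ["web-01", "web-02", "web-03"]; // illustrative names

    for (const server of webServers) {
      execSync(`deploy-tools disable-haproxy ${server}`); // out of rotation
      execSync(`deploy-tools copy-build ${server}`);      // copy the files
      execSync(`deploy-tools enable-haproxy ${server}`);  // back into rotation
    }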

This is a decidedly unfashionable approach to system architecture. There are no containers, no microservices, no autoscaling, not even any cloud. Just some beefy rack-mounted servers, a handful of applications, and file-copy deployment. It’s straightforward, robust, and it works.

One of the common reasons people provide for complex architecture is “scaling.” But Stack Overflow is one of the 50 highest-trafficked web sites in the world.2 Is your architecture more complex than theirs? If so... does it need to be?

2Stack Overflow traffic ranking retrieved from alexa.com on May 6th, 2021.

Aim for Simplicity

I’m not saying you should blindly copy Stack Overflow’s architecture. Don’t blindly copy anyone! Instead, look at the problems you need to solve. (“A more impressive resume” doesn’t count.) What’s the simplest architecture that will solve them?

One way to approach this question is to start with an idealized view of the world. You can use this thought experiment for existing architectures as well as new ones.

1. Start with an ideal world

Imagine your components are coded magically and perfectly, but not instantly. The network is completely reliable, but network latency still exists. Every node has unlimited resources. Where would you put your component boundaries?

You might need to segregate components into separate servers or geographies for security, regulatory, and latency reasons. You’re likely to make a distinction between client-side processing and server-side processing. You’ll still save time and effort by using third-party components.

2. Introduce imperfect components and networks

Now remove the assumption of perfect components and networks. Components fail; networks go down. Now you need redundancy, which necessitates components for handling replication and failover. What’s the simplest way you can meet those needs? Can you reduce complexity by using a third-party tool or service? For example, Stack Overflow has to worry about redundant power supplies and generators. If you use a cloud provider, that’s their problem, not yours.

3. Limit resources

Next, remove the assumption of unlimited resources. You might need multiple nodes to handle load, along with components for load balancing. You might need to split a CPU-intensive operation out into its own component, and introduce a queue to feed it. You might need a shared cache, and a way to populate it.
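
As a sketch of that last idea, here’s what feeding a CPU-intensive operation through a queue might look like. The names are hypothetical, and a real system would use a durable queue service rather than an in-memory array.

    // Hypothetical sketch: a queue decouples request handling from a
    // CPU-intensive operation so that operation can move to its own component.
    interface RenderJob { id: string; payload: string; }

    const queue: RenderJob[] = []; // stand-in for a durable queue service

    // Producer: the request-handling tier enqueues work rather than doing it inline.
    queue.push({ id: "1", payload: "expensive-render-request" });

    // Consumer: a worker drains the queue at its own pace.
    setInterval(() => {
      const job = queue.shift();
      if (job !== undefined) {
        console.log(`processing job ${job.id}`); // the expensive work goes here
      }
    }, 100);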

Are you speculating about future load, or addressing real issues?

But be careful: are you speculating about future load, or addressing real issues based on real-world usage and trends? Can you simplify your architecture by using more capable hardware, or by waiting to address future loads?

4. Consider humans and teams

Finally, take out the idealized coding. Who will be coding each component? How will they coordinate with each other? Do you need to split components to make cross-team communication easier, or to limit the complexity of any one component? Think about how you can simplify these constraints, too.

Controlling Complexity

Some architectural complexity is necessary. Although your system might be simpler if you didn’t have to worry about load balancing or component failure, you do have to worry about those things. As Fred Brooks said in his famous essay, “No Silver Bullet: Essence and Accident in Software Engineering” [Brooks 1995], some complexity is essential. It can’t be eliminated.

But other complexity is accidental. Sometimes, you split a large component into multiple small components just to make the human side easier, not because it’s an essential part of the problem you’re solving. Accidental complexity can be removed, or at least reduced.

Evolutionary design

One of the most common reasons I see for splitting components is to prevent “big balls of mud.” Small components, the reasoning goes, are simple and easy to maintain.

Small components tend to increase overall complexity.

Unfortunately, this doesn’t reduce complexity; it just moves it from application architecture to system architecture. In fact, splitting a large component into multiple small components tends to increase overall complexity. It makes individual components easier to understand, but cross-component interactions are worse. Error tracing is harder. Refactoring is harder. And distributed transactions... well, they’re best avoided entirely.

Allies
Simple Design
Incremental Design
Reflective Design

You can reduce the need to split a component by using evolutionary design. It allows you to create large components that aren’t big balls of mud.

Self-discipline

Another reason teams split their components is to provide isolation. When a component is responsible for multiple types of data, it’s tempting to tangle the data together, making it difficult to refactor later.

Of course, there’s no inherent reason data has to be tangled together. It’s just a design decision, and if you can design isolated components, you can design isolated modules within a single component. You can even have each module use a separate database. It’s not like network calls magically create good design!
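
To illustrate, here’s a hedged sketch of two isolated modules living in one component, each owning its own database. The module names and the Database interface are hypothetical.

    // Hypothetical sketch: two isolated modules in one component, each with
    // its own database. Neither module reaches into the other's data.
    interface Database { query(sql: string, params: unknown[]): Promise<unknown[]>; }

    // The billing module is the only code that touches the billing database.
    function createBillingModule(billingDb: Database) {
      return {
        invoicesFor: (customerId: string) =>
          billingDb.query("SELECT * FROM invoices WHERE customer_id = ?", [customerId]),
      };
    }

    // The accounts module has its own database and talks to billing only
    // through billing's public interface: a function call, not a network call.
    function createAccountsModule(
      accountsDb: Database,
      billing: ReturnType<typeof createBillingModule>,
    ) {
      return {
        statementFor: async (customerId: string) => ({
          customer: await accountsDb.query("SELECT * FROM customers WHERE id = ?", [customerId]),
          invoices: await billing.invoicesFor(customerId),
        }),
      };
    }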

Allies
Collective Code Ownership
Pair Programming
Mob Programming
Energized Work
Reflective Design

But network calls do enforce isolation. If you don’t use the network to enforce isolation, you need a team with self-discipline instead. Collective code ownership, pairing or mobbing, and energized work all help, and reflective design allows you to fix any mistakes that slip through.

Fast deployment

Large components are often difficult to deploy. In my experience, it’s not the deployment itself that’s difficult, but the build and tests that have to run before the component is deployed. This is doubly true if the component has to be tested manually.

Allies
Zero Friction
Test-Driven Development
Continuous Integration
Fast Reliable Tests

Address this problem by creating a zero friction build, introducing test-driven development and continuous integration, and creating fast, reliable tests. If your build and tests are fast, you don’t have to split a component just to make deployment easier.

Vertical scaling

To paraphrase Conway’s Law, organizations tend to ship their org charts. Many organizations default to scaling horizontally (see chapter “Scaling Agility”), which results in lots of small, isolated teams. They need correspondingly small components.

Vertical scaling enables your teams to work together on the same components. It gives you the ability to design your architecture to match the problem you’re solving, rather than designing your architecture to match your teams.

Refactoring System Architecture

I have a friend who works at a large, well-known company. Due to a top-down architectural mandate, his team of three programmers maintains 21 separate services—one for each entity they control. Twenty-one! We took some time to think about how his team’s code could be simplified.

  • Originally, his team was required to keep each service in a separate git repository. They got permission to combine the services into a single monorepo. That allowed them to eliminate duplicated serialization/deserialization code and dramatically ease refactorings. Previously, a single change could result in 16 separate commits across 16 repositories. Now, it only takes one.

  • With a few exceptions, the CPU requirements of his team’s services are minimal. Thanks to an organization-wide service locator, the services could be combined into a single component without changing their endpoints. This would allow them to deploy to fewer VMs, lowering their cloud costs; replace network calls with function calls, speeding up response times; and simplify their end-to-end tests, making deployments easier and faster.

  • About half of his team’s services are only used within his team. Each service has a certain amount of boilerplate and overhead. That overhead could be eliminated if the internal services were turned into libraries. It would also eliminate a bunch of slow end-to-end tests.

All in all, his team could remove a lot of costs and development friction by simplifying their system architecture, if they could get permission to do it.

I can imagine several of these sorts of system-level refactorings. Unfortunately, they don’t yet have the rich history the rest of the ideas in this book do. The “microlith” refactorings are particularly unproven. So I’ll just provide brief sketches, without much detail. Treat them as a set of ideas to consider, not a cookbook to follow.

Multi-repo Components → Monorepo Components

If your team’s components are located in several repositories, you can combine them into a single repository, and factor out shared code for common types and utilities.

Components → Microliths

If your team owns multiple components, you can combine them into a single component while keeping the basic architecture the same. Isolate them in separate directory trees and use a top-level interface file, rather than a server, to translate between your serialized payload and the component’s data structures. Replace the network calls between components with a function call, but keep the architecture the same in every other way, including the use of primitive data types rather than objects or custom types.

I call these in-process components microliths.3 You can see an example of this refactoring in episode 21 of [Shore 2020b]. They provide the isolation of a component without the operational complexity.

3I call them microliths because I originally envisioned them as a combination of the best parts of monoliths and microservices. “Microlith” is also a real word, referring to a tiny stone tool chipped off a larger stone, which almost works as a metaphor.

My microlith refactorings are the most speculative. I’ve only tried them on toy problems. I’m including them because they provide an intermediate step between components and modules.
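
To give a flavor of the idea, here’s a minimal sketch of a microlith’s top-level interface file. The names are mine, invented for illustration; see [Shore 2020b] for a real example.

    // invoices/interface.ts -- hypothetical top-level interface file for an
    // "invoices" microlith. The contract is JSON in, JSON out, exactly as if
    // this were a networked service, but the call is an ordinary function call.

    // Internal implementation, stubbed for the sketch. In a real microlith,
    // this would live in the microlith's own source files.
    function lookUpInvoice(invoiceId: string): { id: string; total: number } {
      return { id: invoiceId, total: 0 };
    }

    // The only entry point callers may use. Note the primitive, serialized
    // types: no objects or custom types cross the boundary.
    export function getInvoice(requestJson: string): string {
      const request = JSON.parse(requestJson) as { invoiceId: string };
      return JSON.stringify(lookUpInvoice(request.invoiceId));
    }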

Microliths → Modules

Microliths are strongly isolated. They’re effectively components running in a single process. That introduces some complexity and overhead.

If you don’t need such strong isolation, you can remove the top-level interface file and serialization/deserialization. Just call the microlith’s code normally. The result is a module. (Not to be confused with a source code file, which can also be called a module.)

A component composed of modules is typically called a modular monolith, but modules aren’t just for monoliths. You can use them in any component, no matter how big or small.
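
Continuing the hypothetical invoice example from the microlith sketch above, the module version drops the serialization and exposes ordinary types:

    // invoices/index.ts -- the same hypothetical entry point as a module.
    // The serialization layer is gone; callers pass and receive ordinary types.
    export interface Invoice { id: string; total: number; }

    export function getInvoice(invoiceId: string): Invoice {
      return { id: invoiceId, total: 0 }; // stubbed for the sketch
    }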

Modules → New Modules

If your modules have a lot of cross-module dependencies, you might be able to simplify them by refactoring their responsibilities. This is really a question of application architecture, not system architecture (see “Application Architecture” on page XX for more about evolving application architecture), but I’m including it because it can be an intermediate step in a larger system refactoring.

Big Ball of Mud → Modules

Allies
Incremental Design
Reflective Design

If you have a large component that’s turned into a mess, you can use evolutionary design to gradually convert it to modules, disentangling and isolating data as you go. Praful Todkar has a good example of doing so in [Todkar 2018]. This is also a matter of application architecture, not system architecture.

Modules → Microliths

If you want strong isolation, or you think you might want to split a large component into multiple small components, you can convert a module into a microlith. To do so, introduce a top-level interface file and serialize complex function parameters.

Treat the microlith as if it were a separate component. Callers should only call it through the top-level interface file, and should abstract those calls behind an infrastructure wrapper, as described in “Third-Party Components” on page XX. The microlith’s code should be similarly isolated; other than the common types and utilities a component might use, it should only reference other components and microliths, and only through their top-level interfaces. You might need to refactor your module to be more component-like first.
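
Here’s a sketch of what a caller-side infrastructure wrapper for the hypothetical invoice microlith might look like. The file paths and names are illustrative.

    // invoiceClient.ts -- hypothetical infrastructure wrapper. Callers depend
    // on this wrapper, never on the microlith's interface file directly, so
    // converting the microlith to a networked component only changes this file.
    import { getInvoice } from "./invoices/interface"; // hypothetical path

    export interface Invoice { id: string; total: number; }

    export function fetchInvoice(invoiceId: string): Invoice {
      const responseJson = getInvoice(JSON.stringify({ invoiceId }));
      return JSON.parse(responseJson) as Invoice;
    }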

Network calls are far slower and less reliable than function and method calls. Converting a module to a microlith won’t guarantee your microlith will work well as a networked component. In theory, you could confirm your microliths will work as proper networked components by introducing a 1-2ms delay in your top-level API, or even random failures. In practice, that sounds ridiculous, and I’ve yet to try it.
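
If you did want to try it, the change to the hypothetical interface file would be small, along these lines:

    // Hypothetical and untested, as noted above: wrap the interface file's
    // entry point with a simulated network delay and occasional failures.
    import { getInvoice } from "./invoices/interface"; // hypothetical path

    export async function getInvoiceSimulatingNetwork(requestJson: string): Promise<string> {
      await new Promise((resolve) => setTimeout(resolve, 1 + Math.random())); // ~1-2ms delay
      if (Math.random() < 0.001) throw new Error("simulated network failure");
      return getInvoice(requestJson); // callers must now handle async calls and errors
    }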

Microliths → Components

If a microlith is suitable for use as a networked component, converting it into a component is fairly straightforward. It’s a matter of converting the top-level interface file into a server, and converting callers to use network calls. This is easiest if you remembered to isolate the callers’ calls behind an infrastructure wrapper.
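
As a sketch, the hypothetical invoice interface file from earlier might become a server like this. It uses Node’s built-in HTTP module; everything else is illustrative.

    // Hypothetical sketch: the microlith's interface file converted to a
    // server. The JSON-in, JSON-out contract is unchanged; only the transport is new.
    import { createServer } from "node:http";
    import { getInvoice } from "./invoices/interface"; // hypothetical path

    createServer((request, response) => {
      let body = "";
      request.on("data", (chunk) => (body += chunk));
      request.on("end", () => {
        response.setHeader("Content-Type", "application/json");
        response.end(getInvoice(body)); // same entry point, now behind HTTP
      });
    }).listen(8080);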

Converting a microlith to a component will likely require callers to introduce error handling, timeouts, retries, exponential backoff, and backpressure, in addition to the operational and infrastructure changes required by the new component. It’s a lot of work, but that’s the cost of networking.
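
For example, the caller-side wrapper might grow along these lines. This is a hedged sketch; the timeout, retry count, and backoff numbers are arbitrary illustrations.

    // Hypothetical sketch of caller-side resilience: a timeout on each attempt,
    // a bounded number of retries, and exponential backoff between them.
    export async function fetchInvoiceOverNetwork(invoiceId: string): Promise<unknown> {
      const maxAttempts = 3;
      for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
          const response = await fetch("http://localhost:8080", { // illustrative URL
            method: "POST",
            body: JSON.stringify({ invoiceId }),
            signal: AbortSignal.timeout(1000), // fail fast rather than hang
          });
          return await response.json();
        } catch (error) {
          if (attempt === maxAttempts) throw error; // out of retries
          // Exponential backoff: 100ms, 200ms, 400ms...
          await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** (attempt - 1)));
        }
      }
      throw new Error("unreachable"); // satisfies the type checker
    }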

Modules → Components

Rather than using microliths, you can jump straight from a module to a component. Although this can be done by extracting the code, I often see people rewriting modules instead. This is a common strategy when refactoring a big ball of mud, because the modules’ code often isn’t worth keeping. [Todkar 2018] demonstrates this approach.

Monorepo Components → Multi-repo Components

If you have multiple components in the same repository, you can extract them into separate repositories. One reason to do so is if you’re moving ownership of a component to another team. You might need to duplicate common types and utilities.

Compound refactorings

You’ll typically string these system-level refactorings together. For example, the most common approach I see is to clean up legacy code by using “Big Ball of Mud → Modules” and “Modules → Components.” Or, more compactly: Big Ball of Mud → Modules → Components.

Combining components is a similar operation in reverse: Multi-repo Components → Monorepo Components → Microliths → Modules.

If you have a bunch of components with tangled responsibilities, you might be able to refactor to new responsibilities instead of rewriting: Components → Microliths → Modules → New Modules → Microliths → Components.

Prerequisites

Ally
Impediment Removal

You’ll likely only be able to use the ideas in this practice with the components your team owns. Architectural standards and components owned by other teams are likely to be out of your direct control, but you might be able to influence people to make the changes you need.

Changes to system architecture depend on a close relationship between developers and operations. You’ll need to work together to identify a simple architecture for the system’s current needs, including peak loads and projected increases, and you’ll need to continue to coordinate as needs change.

Indicators

When you evolve your system architecture well:

  • Small systems have small architectures. Large systems have manageable architectures.

  • The system architecture is easy to explain and understand.

  • Accidental complexity is kept to a minimum.

Alternatives and Experiments

Allies
Simple Design
Incremental Design
Reflective Design

Many of the things people think of as “evolutionary system architecture” are actually just normal evolutionary design. For example, migrating a component from using one database to another is an evolutionary design problem, because it’s mainly about the design of a single component. The same is true for migrating a component from using one third-party service to another. Those sorts of changes are covered by the evolutionary design practices: simple design, incremental design, and reflective design.

Evolving system architecture means deliberately starting with the simplest system possible and growing it as your needs change. It’s an idea that has yet to be fully explored. Pick and choose the parts of this practice that work for your situation, then see how far you can push it. The underlying goal is to reduce developer and operations friction and make troubleshooting easier, without sacrificing reliability and maintainability.

Further Reading

Building Evolutionary Architectures [Ford et al. 2017] goes into much more detail about architectural options. It takes an architect-level view rather than the team-level view I’ve provided.

Sarah Horan Van Treese recommends Building Microservices by Sam Newman.
