AoAD2 Practice: Evolutionary System Architecture

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Evolutionary System Architecture

Audience
Programmers, Operations

We build our infrastructure for what we need today, without sacrificing tomorrow.

Allies
Simple Design
Incremental Design
Reflective Design

Simplicity is a key Agile idea, as discussed in “Key Idea: Simplicity” on p.XX. It’s particularly apparent in the way fluent Delivering teams approach evolutionary design: they start with the simplest possible design, layer on more capabilities using incremental design, and constantly refine and improve their code using reflective design.

What about your system architecture? By system architecture, I mean all the components that make up your deployed system: the applications and services built by your team, and the way they interact; your network gateways and load balancers; even third-party services. Can you apply the ideas of evolutionary design to such a complex environment?

You can. When you do, you get evolutionary system architecture.

Start With Simplicity

The software industry is awash with stories of big companies solving big problems. Google has a database that is replicated around the world! Netflix shut down their data centers and moved everything to the cloud! Amazon mandated that every team publish their services, and by doing so, created the billion-dollar Amazon Web Services business!

It’s tempting to imitate these success stories, but the problems those companies were solving are probably not the problems you need to solve. Until you’re the size of Google, or Netflix, or Amazon... YAGNI. You aren’t gonna need it.

Consider Stack Exchange, the company behind Stack Overflow, the popular programming Q&A site. They serve 1.3 billion pages per month, rendering each one in less than 19 milliseconds. How do they do it?[1]

[1] Stack Overflow publishes their system architecture and performance stats at https://stackexchange.com/performance, and Nick Craver has an in-depth series discussing their architecture at [Craver 2016]. The quoted data was accessed on May 4th, 2021.

  • 2 HAProxy load balancers, one live and one failover, peaking at 4,500 requests per second and 18% CPU utilization;

  • 9 IIS web servers, peaking at 450 requests per second and 12% CPU utilization;

  • 2 Redis caches, one master and one replica, peaking at 60,000 operations per second and 2% CPU utilization;

  • 2 SQL Server database servers for Stack Overflow, one live and one standby, peaking at 11,000 queries per second and 15% CPU utilization;

  • 2 more SQL Server servers for the other Stack Exchange sites, one live and one standby, peaking at 12,800 queries per second and 14% CPU utilization;

  • 3 tag engine and Elasticsearch servers, averaging 3,644 requests per minute and 3% CPU utilization for the custom tag engine, and 34 million searches per day and 7% utilization for Elasticsearch;

  • 1 SQL Server database server for HTTP request logging;

  • 6 LogStash servers for all other logs; and

  • Approximately the same thing again in a redundant data center (for disaster recovery).

As of 2016, they deployed the Stack Overflow site 5-10 times per day. The full deploy took about eight minutes. Other than localization and database migration, deploying was a matter of looping through the servers, taking each out of the HAProxy rotation, copying the files, and putting it back into rotation. Their primary application is a single multi-tenant monolith, which serves all Q&A websites. [Craver 2016]
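
To make the mechanics concrete, here's a schematic sketch of that style of rolling, file-copy deployment. This is not Stack Exchange's actual tooling; the host names, paths, and backend name are hypothetical, and the HAProxy admin-socket commands assume the runtime socket is enabled:

```typescript
// deploy.ts: a schematic sketch of rolling, file-copy deployment.
// Hypothetical hosts and paths; not Stack Exchange's actual scripts.
import { execSync } from "child_process";

const SERVERS = ["web01", "web02", "web03"]; // hypothetical host names
const HAPROXY = "socat stdio /var/run/haproxy.sock"; // HAProxy admin socket

for (const server of SERVERS) {
  // Take the server out of the load balancer rotation...
  execSync(`echo "disable server be_web/${server}" | ${HAPROXY}`);
  // ...copy the new build onto it...
  execSync(`rsync -a ./build/ ${server}:/srv/app/`);
  // ...and put it back into rotation before moving on to the next one.
  execSync(`echo "enable server be_web/${server}" | ${HAPROXY}`);
}
```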

This is a decidedly unfashionable approach to system architecture. There’s no Kubernetes, no Docker, no microservices, no autoscaling, not even any cloud. Just some beefy rack-mounted servers, a handful of applications, and file-copy deployment. It’s straightforward, robust, and it works.

Does your architecture need to be more complex than a top-50 website’s?

Stack Overflow is one of the 50 highest-trafficked websites in the world.[2] Is your architecture more complex than theirs? If so... does it need to be?

[2] Stack Overflow traffic ranking retrieved from alexa.com on May 6th, 2021.

I’m not saying you should blindly copy Stack Exchange’s architecture. Don’t blindly copy anyone! Instead, look at the problems you need to solve. (“A more impressive resume” doesn’t count.) What’s the simplest architecture that will solve them? Start with what you need and grow from there.

Microservices and Monoliths

Microservices are the most common reason I see for complex system architectures. Should you use them? Rather than copying someone else’s answer, think about the problems microservices solve, and whether they apply to your situation.

Just in case you aren’t familiar with them, a microservice architecture involves building your system out of small, loosely coupled programs (“microservices”) that communicate via network calls. In contrast, a monolithic architecture involves building your system out of a few large programs (“monoliths”) that perform as much communication as possible internally, via function and method calls.

Microservices are commonly touted as a solution to application complexity woes. Monoliths, the story goes, inevitably result in a big ball of mud. No matter how carefully you modularize your code, microservice proponents say, you won’t be able to resist the convenience of having everything in the same process, and that careful modularity will eventually break down. Microservices allow multiple teams to work independently without creating a big ball of mud.

Allies
Collective Code Ownership
Reflective Design

They’re not entirely wrong. Without collective code ownership and reflective design, designs do break down over time, and a lot of it is due to people “cheating” for convenience. Rather than fixing a design that doesn’t meet their needs, they’ll work around it. They’ll introduce a global variable, directly modify a database table, or add a special case to a vaguely related function.

Microservices solve this problem by creating small, independent units. It’s impractical for one microservice to reach into the internals of another microservice, so the “cheating” goes away. Each microservice is so small, it’s hard for design to become a problem. Test suites are smaller, too, so builds are faster, which benefits continuous integration and deployment.

But microservices don’t reduce complexity. They just move it from application architecture to system architecture. In so doing, they multiply operational complexity.

Rather than having to deploy and manage a few processes, now you have to deploy and manage dozens, or even hundreds. Your application becomes a complex distributed system, one of the most challenging problems in computer science. Monitoring is harder. Error tracing is harder. Distributed transactions are a nightmare. Function calls morph from a single line of code to a behemoth of error handling, timeouts, retries, and backpressure... if you do it right. It’s all too easy to make a mistake, and those mistakes are often invisible until you’re in the middle of an outage.

What’s worse than a big ball of monolith mud? A big tangle of microservice worms.

Worse, microservices lock your architecture in place, because each microservice publishes an interface that needs to remain relatively stable. Changing the interface requires careful coordination with callers, who may not be easy to determine. As a result, your ability to refactor is limited. If you designed your microservice responsibilities poorly, or your application needs change, it’s difficult to fix. What’s worse than a big ball of monolith mud? A big tangle of microservice worms.

Microliths: An Evolvable Compromise

The fundamental challenge of microservices is that they move design decisions from the application to the system architecture, where they’re harder to change. At the same time, they solve a real problem. What if there were a simpler way to solve that problem?

There is. I call it a microlith architecture, because it combines the best parts of microservices and monoliths. (It’s also a real word, referring to a tiny stone tool chipped off a larger stone, which almost works as a metaphor.)

The problem with monoliths is that they make it easy to “cheat” and poke into parts of the system you’re not supposed to. They also become harder to deploy and scale as they get bigger, although that isn’t inevitable, as the Stack Exchange example shows.

Microlith architecture solves this problem by structuring a monolith into individual “microliths.” Each microlith is designed like a microservice: each has a single, well-defined API which only accepts and returns primitive values; each is responsible for managing its own database; each has its own configuration and logging. Callers use wrappers (see “Third-Party Components” on p.XX) to encapsulate the call to the microlith, just like they would for a microservice. The “Microservice Architecture Without Microservice Overhead” episode of [Shore 2020b] has an example.
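
As an illustration of that structure, here's a minimal sketch of a microlith and a caller-side wrapper. The names (an inventory microlith, getStockLevel, and so on) are hypothetical, not taken from that episode:

```typescript
// inventory/api.ts: the inventory microlith's only public surface.
// The API accepts and returns only primitive values and plain objects.

export interface StockLevel {
  sku: string;
  quantity: number;
}

// The microlith manages its own data; nothing else touches this table.
const stockTable = new Map<string, number>([["SKU-1", 12]]);

export function getStockLevel(sku: string): StockLevel {
  return { sku, quantity: stockTable.get(sku) ?? 0 };
}
```

```typescript
// orders/inventoryClient.ts: the calling microlith's wrapper.
// Callers never import inventory internals, only this wrapper, which
// in turn only uses the inventory api module.
import * as inventoryApi from "../inventory/api";

export function isInStock(sku: string): boolean {
  return inventoryApi.getStockLevel(sku).quantity > 0;
}
```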

Unlike a microservice, though, microliths share a single codebase, just like a monolith. They run in a single process. Rather than communicating via network calls, microliths communicate via straightforward function calls. To prevent “cheating,” they’re only allowed to interact with other microliths via their API. (If you’re concerned that people won’t follow this rule, you can write a simple static analysis tool to check files’ imports—it doesn’t take much more than a regex—or use language features such as C#’s internal keyword.)
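
As an example of how small that tool can be, here's a sketch of a regex-based import checker. The directory layout (one top-level folder per microlith under src/) and the rule it enforces (cross-microlith imports must target the other microlith's api module) are assumptions for illustration:

```typescript
// checkImports.ts: a sketch of a regex-based import checker.
import * as fs from "fs";
import * as path from "path";

const IMPORT_RE = /from\s+["']([^"']+)["']/g;

// e.g. "src/orders/shipping.ts" belongs to the "orders" microlith.
function microlithOf(file: string): string {
  return path.relative("src", file).split(path.sep)[0];
}

function violationsIn(file: string): string[] {
  const found: string[] = [];
  for (const match of fs.readFileSync(file, "utf8").matchAll(IMPORT_RE)) {
    if (!match[1].startsWith(".")) continue; // ignore package imports
    const target = path.join(path.dirname(file), match[1]);
    // Reaching into another microlith is only legal via its api module.
    if (microlithOf(target) !== microlithOf(file) && path.basename(target) !== "api") {
      found.push(`${file}: illegal import "${match[1]}"`);
    }
  }
  return found;
}

function sourceFiles(dir: string): string[] {
  return fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) =>
    entry.isDirectory()
      ? sourceFiles(path.join(dir, entry.name))
      : entry.name.endsWith(".ts")
      ? [path.join(dir, entry.name)]
      : []
  );
}

const problems = sourceFiles("src").flatMap(violationsIn);
problems.forEach((problem) => console.error(problem));
process.exit(problems.length === 0 ? 0 : 1);
```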

Ally
Incremental Design

Because microliths are all part of the same codebase and application, they’re easy to refactor. You can use incremental design to evolve microlith responsibilities. As those responsibilities stabilize and the need for independent deployment grows, you can split individual microliths off into real microservices.

Converting a microlith to a microservice, and back, is fairly easy. Because each microlith is already isolated from the rest of its application, it’s just a matter of introducing the network call. Replace the API module with a server and modify callers’ microlith wrappers to use network requests instead of function calls. This will require introducing error handling, timeouts, retries, exponential backoff, and backpressure, as well as infrastructure changes... but that’s the cost of a microservice, and starting with a microlith allows you to delay that cost until it’s really needed.
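
To show what the conversion looks like on the calling side, here's the hypothetical wrapper from earlier rewritten to call a remote inventory service. Note that the wrapper also becomes asynchronous, which does ripple out to its callers. The URL, timeout, and retry count are placeholders, and a production client would add exponential backoff, jitter, and backpressure handling (this sketch assumes Node 18+ for the built-in fetch and AbortSignal.timeout):

```typescript
// orders/inventoryClient.ts: the same wrapper after the inventory
// microlith becomes a microservice. (Sketch only; real code needs
// exponential backoff, jitter, circuit breaking, and backpressure.)
const INVENTORY_URL = "http://inventory.internal/stock"; // hypothetical

export async function isInStock(sku: string): Promise<boolean> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      const response = await fetch(`${INVENTORY_URL}/${sku}`, {
        signal: AbortSignal.timeout(250), // per-request timeout (ms)
      });
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
      const { quantity } = (await response.json()) as { quantity: number };
      return quantity > 0;
    } catch (error) {
      lastError = error; // retry on timeout or failure
    }
  }
  throw lastError;
}
```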

If you wanted to, you could force your microliths to take microservice limitations into account from the beginning, by programming the microlith APIs to take a few milliseconds to respond, or even to randomly fail... but that seems like overkill to me.
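
If you did want to experiment with that, the simulation could be as small as a wrapper around each API call. A hypothetical sketch:

```typescript
// pessimism.ts: a hypothetical wrapper that adds a few milliseconds of
// latency and rare random failures to a microlith call, so callers are
// forced to handle microservice realities from day one.
export async function pessimistically<T>(call: () => T): Promise<T> {
  await new Promise((resolve) => setTimeout(resolve, 2 + Math.random() * 3));
  if (Math.random() < 0.001) throw new Error("simulated network failure");
  return call();
}
```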

Microservices are often used to define boundaries between teams who don’t share code ownership. Microliths work for this purpose, too. Each team takes ownership of a microlith (or several), using the microlith API as the team boundary, just like they would with a microservice API. Because the microliths are all part of the same codebase, you can easily look up callers and perform system-wide refactorings as needed.

As the system grows, you can split the codebase into multiple parts (such as DLLs or JARs), while still keeping the benefits of simpler function-call APIs. Monorepo tools such as Bazel will help you manage builds and cross-team dependencies. And, of course, you can always split off microliths and convert them to microservices, or vice versa, whenever needed.

Ally
Incremental Design

You don’t have to start with a microlith architecture, though. Remember: start simple, and adopt solutions based on the problems you actually have. Microservices and microliths are strategies for large systems and multiple teams. An even simpler approach would be to start with a normal monolith, then factor out individual microliths as the application grows.

Questions

Our microservices have turned into a big tangle of worms. How can we evolve to a better architecture?

One option is to simplify your system by converting the most tightly-coupled microservices into microliths, combining them into a single codebase. You can use routing code to keep the existing APIs unchanged, at first.
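
That routing code can be a thin HTTP layer in the combined codebase: the old microservice endpoints stay alive, but each one dispatches to an ordinary in-process call. Here's a hypothetical sketch (Express is an assumption; any HTTP framework works the same way):

```typescript
// server.ts: routing code that preserves a former microservice's API
// while the service itself becomes an in-process microlith.
import express from "express";
import * as inventoryApi from "./inventory/api"; // now a microlith

const app = express();

// Existing callers keep using the old network endpoint...
app.get("/stock/:sku", (req, res) => {
  // ...but it now dispatches to an ordinary function call.
  res.json(inventoryApi.getStockLevel(req.params.sku));
});

app.listen(8080);
```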

Once you’ve done so, gradually combine the microliths that involve distributed transactions or other complex interactions, then look for better ways to divide their responsibilities. When you find those responsibilities, factor them out into new modules and, after they’ve been proven to work well in isolation, back into new, independent microliths. You can then convert the new microliths back into microservices when needed.

Do this work incrementally, as with any architectural change. See “Application Architecture” on p.XX.

Prerequisites

Evolutionary system architecture depends on a close relationship between developers and operations. They need to work together to identify the “simplest thing that could possibly work” for the system’s current needs, including peak loads and projected increases, and they need to continue to coordinate as needs change.

It also depends on people being willing to choose “boring” technologies rather than focusing on the latest and greatest.

Be sure to remember that “simple” doesn’t mean “simplistic.” Stack Exchange has a simple architecture, but it’s still very rigorous, with just 75 minutes of downtime in 2020.[3] Nick Craver described the ground rules in [Craver 2016]:

[3] Stack Overflow uptime based on https://uptime.com/stackoverflow.com. Accessed May 12, 2021.

  • Everything is redundant.

  • All servers and network gear have at least 2x 10Gbps connectivity.

  • All servers have 2 power feeds via 2 power supplies from 2 UPS units backed by 2 generators and 2 utility feeds.

  • All servers have a redundant partner between rack A and B.

  • All servers and services are doubly redundant via another data center (Colorado), though I’m mostly talking about New York here [in this article].

  • Everything is redundant.

Nick Craver, Stack Overflow Architecture Lead

Indicators

When you evolve your system architecture well:

  • Small systems have small architectures. Large systems have manageable architectures.

  • The system architecture is easy to explain and understand.

  • Multiple team members are capable of maintaining and updating the team’s deployment scripts.

Alternatives and Experiments

Some teams simplify their architecture by using third-party services. There’s nothing wrong with that, if it solves the problems your team is facing. For example, an organization that doesn’t want to manage data center hardware can use a cloud-based infrastructure-as-a-service (IaaS) solution, such as AWS or Azure. An organization that wants to minimize their operational burden can use a platform-as-a-service (PaaS) such as Heroku.

Don’t confuse “easy” with “simple,” though. Some services make it very easy to get up and running, but do so at the cost of greater complexity as you get into the details. For example, “serverless”—short-lived servers that run code on demand—is great for bursty loads, but can be hard to test locally and makes managing latency more difficult.

As you experiment with ways to simplify your system architecture, focus on things that reduce developer friction and make troubleshooting easier, without sacrificing reliability and maintainability.

Further Reading

XXX Recommendations?
