This is an excerpt from The Art of Agile Development, Second Edition. Visit the Second Edition home page for additional excerpts and more!
This excerpt is copyright 2007, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.
- Programmers, Operations
When we’re ready to code, nothing gets in our way.
Imagine you’ve just started working with a new team. One of your new teammates, Pedro, walks you over to a development workstation.
“Since you’re new, we’ll start by deploying a small change,” he says, sitting down next to you. “This machine is brand new, so we’ll have to set it up from scratch. First, clone the repo.” He tells you the command. “Now, run the `build` script.”
Commands start scrolling up the screen. “We use a tool for reproducible builds,” Pedro explains. “It’s detected that you don’t have anything installed, so it’s installing the IDE, development tools, and images needed to develop and run the system locally.”
“This will take a while,” he continues. “After the first run, though, it’s instantaneous. It updates again only when we commit changes to the config. Come on, I’ll show you around the office.”
When you come back, the build is done. “Okay, let me show you the app,” Pedro says. “Type `rundev` to start it up.” Once again, information starts scrolling by. “This is all running locally,” Pedro explains proudly. “We used to have a shared test environment, and we were constantly stepping on each other’s toes. Now that’s all in the past. It even knows which services to restart depending on which files you change.”
Pedro walks you through the application. “Now, let’s make a change. Run `watch quick`. It will build and test the files we change.”
You follow his instructions and the script starts up, then immediately reports `BUILD OK` in green. “Nothing’s changed since we last ran the build,” Pedro explains, “so the script didn’t do anything. Now, let’s make a small change.” He directs you to a test file and has you add a test. When you save the changes, the `watch` script runs again and reports a test failure. It takes less than a second.
“We’ve put a lot of work into our build and test speed,” Pedro tells you. He’s clearly proud of it. “It wasn’t easy, but it’s totally worth it. We get feedback on most changes in a second or two. It’s done wonders for our ability to iterate and be productive. I’m not lying when I say this is the best development environment I’ve ever been in.”
“Now let’s finish up this change and deploy.” He shows you the production change needed to get the new test to pass. Once again, when you save, the `watch` script runs the tests in about a second. This time, it reports success.
“Okay, we’re ready to deploy,” he says. “This is going into production, but don’t worry. The `deploy` script will run the full test suite, and we also have a canary server that checks to see if anything goes wrong. Type `deploy` to kick things off.”
You run the script and watch it go through its paces. A few minutes later, it says `INTEGRATION OK`, then starts deploying the code. “That’s it!” Pedro beams. “Once the integration succeeds, you can assume the deploy will too. If something goes wrong, we’ll get paged. Welcome to the team!”
It’s been less than an hour, and you’ve already deployed to production. This is zero-friction development: when you’re ready to code, nothing gets in your way.
When you make a change, you need to get feedback in less than five seconds.
Development speed is the most important area for eliminating friction. When you make a change, you need to get feedback about that change in less than five seconds. Less than one second is best. Ten seconds at the very most.
This type of fast feedback is a game changer. You’re able to experiment and iterate so easily. Rather than making big changes, you can work in very small steps. Each change can be a line or two of code, which means that you always know where your mistakes are. Debugging becomes a thing of the past.
If feedback takes less than a second, it’s functionally instantaneous. You’ll make a change, see the feedback, and keep working. If it takes between 1 and 5 seconds, it won’t feel instantaneous, but it’s still acceptable. If it takes between 5 and 10 seconds, it will feel slow. You’ll start being tempted to batch up changes. And if it’s more than 10 seconds, you won’t be able to take small steps, and that will slow you down.
- Test-Driven Development
To achieve one-second feedback, set up a watch script that automatically checks your code when you make a change. Inside the script, use a compiler or linter to tell you when you make syntax errors, and tests to tell you when you make semantic errors.
Alternatively, you can configure your IDE to check syntax and run tests, rather than writing a script. This can be an easy way to get started, although you’ll have to migrate to a script eventually. If you do start with an IDE-based approach, make sure its configuration can be committed to your repository and used by everyone on the team. You need the ability to share improvements easily.
When you save your changes, the script (or IDE) should give you immediate, unambiguous feedback. If everything worked, it should say `OK`. If anything failed, it should say `FAILED` and provide information to help you troubleshoot the error. Most people make their tools display a green bar for success and a red bar for failure. I also program mine to play a sound (one for compile/lint failure, another for test failure, and a third for success), but that’s just me.
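As a rough illustration of the watch script described above, here is a minimal Python sketch. It is not the author’s script: the names `snapshot`, `check`, and `watch`, the polling approach, and the `.py` file filter are all assumptions made for the example. It polls file modification times and reruns a list of named stages (for instance a lint stage, then a test stage) whenever something changes, reporting `OK` or `FAILED`.

```python
import os
import subprocess
import time

# Assumption: only files with these extensions trigger a recheck.
WATCHED_EXTENSIONS = (".py",)


def snapshot(root):
    """Map each watched file under root to its last-modified time."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(WATCHED_EXTENSIONS):
                path = os.path.join(dirpath, name)
                try:
                    mtimes[path] = os.stat(path).st_mtime_ns
                except FileNotFoundError:
                    pass  # file vanished between listing and stat
    return mtimes


def check(stages):
    """Run (name, command) stages in order, stopping at the first failure.

    Returns "OK" or "FAILED (<stage name>)" so the failing stage is obvious.
    """
    for name, command in stages:
        if subprocess.run(command).returncode != 0:
            return f"FAILED ({name})"
    return "OK"


def watch(root, stages, poll_seconds=0.2):
    """Recheck whenever a watched file changes. Runs until interrupted."""
    previous = None  # forces one check immediately on startup
    while True:
        current = snapshot(root)
        if current != previous:
            print(check(stages))
            previous = current
        time.sleep(poll_seconds)
```

A call such as `watch(".", [("lint", ["python", "-m", "compileall", "-q", "."]), ("test", ["python", "-m", "unittest"])])` would recheck the current directory on every save. Polling is the simplest option rather than the fastest one; a file-system notification mechanism may be worth the extra complexity once the source tree grows.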
- Fast Reliable Tests
As your codebase gets larger, one-second feedback will become harder to achieve. The first culprit is usually test speed. Focus on writing fast, reliable tests.
As your system continues to grow, build speeds (compiling or linting) will become a problem. The solution will depend on your language. A web search for “speed up <language> build” will get you started. Typically, it will involve incremental builds: caching parts of the build so that only code that has changed gets rebuilt. The larger your system gets, the more creative you’ll have to be.
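To make the idea of an incremental build concrete, the following Python sketch caches a content hash per source file and reports only the files that changed since the last successful build. The function names and the JSON cache format are assumptions for illustration. Real incremental builds usually also track which files depend on which, and this sketch deliberately leaves that out.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_since_last_build(source_files, cache_file):
    """Return the source files whose contents differ from the cached digests."""
    cache_path = Path(cache_file)
    cached = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    return [
        str(path)
        for path in source_files
        if cached.get(str(path)) != file_digest(path)
    ]


def record_build(source_files, cache_file):
    """Store the current digests. Call this only after a successful build."""
    digests = {str(path): file_digest(path) for path in source_files}
    Path(cache_file).write_text(json.dumps(digests, indent=2))
```

The build would rebuild only what `changed_since_last_build` returns, then call `record_build`. Hashing contents rather than comparing timestamps avoids rebuilding files that were merely touched, at the cost of reading every file.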
- Continuous Integration
Eventually, you’ll probably need to set up two builds: one for fast feedback and one for production deployment. Although it’s preferable for your local build to be the same as your production build, fast feedback is more important. Your integration script can run your tests against the production build. As long as you have a good test suite and practice continuous integration, you’ll learn about discrepancies between the two builds before they’ve had a chance to get out of control.
Although good tests run at a rate of hundreds or thousands per second, you’ll eventually have too many tests to run them all in less than a second. When you do, you’ll need to revise your script to run only a subset of the tests. The easiest way is to group your tests into clusters and run specific clusters based on the files that have changed.
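One possible way to select test clusters from changed files is sketched below in Python. The directory layout, the `CLUSTERS` mapping, and the function names are hypothetical. The sketch assumes each source directory has one matching test directory, and it falls back to the full suite when a changed file is not covered by the mapping.

```python
from pathlib import PurePosixPath

# Hypothetical layout: source directory -> test cluster that covers it.
CLUSTERS = {
    "src/billing": "tests/billing",
    "src/accounts": "tests/accounts",
}
ALL_TESTS = "tests"


def _inside(path, directory):
    """True when path lies somewhere beneath directory."""
    return PurePosixPath(directory) in PurePosixPath(path).parents


def clusters_for(changed_files, clusters=None):
    """Return the sorted test clusters to run for the given changed files.

    A file outside every known directory selects the full suite, so an
    unmapped change is never silently left untested.
    """
    clusters = CLUSTERS if clusters is None else clusters
    selected = set()
    for changed in changed_files:
        matches = [
            tests
            for source, tests in clusters.items()
            if _inside(changed, source) or _inside(changed, tests)
        ]
        if not matches:
            return [ALL_TESTS]
        selected.update(matches)
    return sorted(selected)
```

Falling back to the whole suite is a deliberately conservative choice: it trades speed for safety whenever the mapping is incomplete.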
Eventually, you may want to do a more sophisticated dependency analysis that detects exactly which tests to run for any given change. Some test runners can do this for you. It’s also not as hard to implement as you might think. The trick is to focus on what your team needs rather than making a generic solution that handles all possible edge cases.
Know Your Editor
Don’t let your code editor get in the way of your thoughts. This is particularly important when pairing or mobbing; when you’re navigating, there are few things more frustrating than watching a driver struggle with the editor.
Take the time to get to know your editor really, really well. If the editor provides automated refactorings, learn how to use them. (If it doesn’t, look for a better editor.) Take advantage of autoformatting, and commit the formatting configuration file to your repository so your whole team is in sync. Learn how to use code completion, automatic fixes, function and method lookup, and reference navigation. Learn the keyboard shortcuts.
For an example of how much of a difference editor proficiency can make, see Emily Bache’s virtuoso performance in her Gilded Rose videos, particularly part 2. [Bache2018]
What happens when you check out an arbitrary commit from your repository? Say, from a year ago. (Go on, try it!) Does it still run? Do the tests still pass? Or does it require some esoteric combination of tooling and external services that have passed into oblivion?
Every commit should work the same way for every developer.
A reproducible build is a build that continues to work and pass its tests no matter which development machine you use to build it, and no matter how old the code you’re building is. You should be able to check out any commit and expect it to work the same way for every developer. Generally speaking, this requires two things: managing your dependencies and being able to run everything locally.
Dependencies are the libraries and tools your code requires to run. This includes your compiler or interpreter, runtime environment, packages downloaded from your language’s package repository, code created by other teams in your organization, and so forth. For your build to be reproducible, everybody needs to have the exact same dependencies.
Program your build to ensure that you have the correct version of every dependency. It can either exit with an error when the wrong version is installed, or (preferably) automatically install the correct version. Tools to do so include Nix, Bazel, and Docker. Check the version of your dependency management tool, too.
An easy way to ensure your software has the correct dependencies is to check them into your repository. This is called vendoring. You can mix the two approaches: for example, a team with a Node.js codebase vendored its `node_modules` directory, but didn’t vendor the Node executable. Instead, they programmed the build to fail if the wrong version of Node was running.
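A version check like the one in that example might look something like the following Python sketch. The pinned version string and the function names are hypothetical, and this is not the code that team used. It relies only on the fact that `node --version` prints a string such as `v18.17.1`.

```python
import subprocess

# Hypothetical pin; replace with the version your team has agreed on.
EXPECTED_NODE_VERSION = "v18.17.1"


def installed_version(command):
    """Run a version command and return its trimmed standard output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def check_version(tool, actual, expected):
    """Stop the build with a clear message when the versions differ."""
    if actual != expected:
        raise SystemExit(f"{tool}: expected version {expected} but found {actual}")
```

The build would call `check_version("node", installed_version(["node", "--version"]), EXPECTED_NODE_VERSION)` before doing anything else. If the tool is not installed at all, `subprocess.run` raises `FileNotFoundError`, which also stops the build.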
Dependency management ensures your code runs the same way on every machine, but it doesn’t ensure that your tests will. Your tests need to run entirely locally, without communicating over the network. Otherwise, you’re likely to get inconsistent results when two people run the tests at the same time, and you won’t be able to test old versions. The services and data they depend on will have changed, and tests that used to pass will fail.
The same is true for when you run the code manually. To get consistent results and to be able to run old versions, everything the code depends on needs to be installed locally.
There may be some dependencies you can’t run locally. If so, you need to program your tests to run independently of those dependencies, or you won’t be able to reproduce your test results in the future. The “Simulate Non-Local Dependencies” section describes how.
- Continuous Integration
If you use continuous integration, you’ll integrate several times per day. This process needs to be bulletproof, and fast. That means scripting it. Your script should report success or failure within five minutes—ten at most.
Five-minute results are surprisingly important. Five minutes is enough for a stretch break and a new cup of coffee while you keep an eye on the results. Ten minutes is tolerable, but gets tedious. More than that, and people will start working on other tasks before the results are in. Then, when the integration fails, the code will be left in limbo until somebody gets back to it. In practice, this leads to systemic integration and build problems.
The script doesn’t need to literally complete within five minutes, although that’s preferable. Instead, it needs to validate the code and report success or failure, before performing longer-running checks. The “Multistage Integration Builds” section explains how it works.
- Fast Reliable Tests
For most teams, the thing standing between them and a five-minute integration is the speed of their test suite. Fast, reliable tests are the solution.
Key Idea: Optimize for Maintenance
Code is written once, but read and modified over and over again. In a professional development environment, you’re more likely to be looking at and modifying code someone else wrote—or code that you wrote a while ago—than writing new code. Even if you are writing new code, it’s going to need to be maintained for several years. As a result, it’s much more important to decrease costs of maintenance than to make it easier to write new code.
This has far-reaching implications. A framework that makes creating software “easy,” but is difficult to understand and doesn’t integrate well with other systems, is a poor choice. A build tool that automatically handles everything you need today, but can’t be extended without deep knowledge of its internals, is a similarly poor choice.
Optimizing for maintenance means choosing simple tools and libraries that are easy to understand, easy to compose, and easy to replace when they no longer fit your needs.
- Simple Design
An oft-overlooked source of friction for development teams is the complexity of their development environment. In their rush to get work done quickly, teams pull in popular tools, libraries, and frameworks to solve common development problems.
There’s nothing wrong with these tools, in isolation. But any long-lived software development effort is going to have specialized needs, and that’s where the quick and easy approach starts to break down. All those tools, libraries, and frameworks add up to an enormous cognitive burden, especially when you have to dive into their internals to make them work together nicely. That ends up causing a lot of friction.
It’s more important to optimize maintenance costs than initial development, as the “Key Idea: Optimize for Maintenance” sidebar explains. Be thoughtful about the third-party dependencies you use. When you choose one, don’t just think about the problem it’s solving; think about the maintenance burden the dependency will add, and how well it will play with your existing systems. A simple tool or library your scripts can call is a great choice. A complex black box that wants to own the world probably isn’t.
In most cases, it’s best to wrap the third-party tool or library in code you control. The job of your code is to hide the underlying complexity and present a simple interface customized for your needs. The “Third-Party Components” section explains further.
Automate every activity that your team performs repeatedly. Not only will this decrease friction, it will decrease errors, too. To begin with, this means five scripts:
- `build`: compile and/or lint, run tests, and report success or failure
- `watch`: automatically run `build` when files change
- `integrate`: run `build` in a production-like environment and integrate your code
- `deploy`: run `integrate`, then deploy the integration branch
- `rundev`: run the software locally for manual review and testing
You’re free to use whichever names you prefer, of course.
Use a real programming language for your scripts. Your scripts can call out to tools, and some of those tools might have their own proprietary configuration languages, but orchestrate them all with real code that you control. As your automation becomes more sophisticated, you’ll appreciate the power a real programming language provides.
Treat your scripts with the same respect as real production code. You don’t have to write tests for them—scripts can be very hard to test—but do pay attention to making your scripts well-written, well-factored, and easy to understand. You’ll thank yourself later.
You may be tempted to use your IDE instead of a `watch` script. That’s okay to start with, but you’ll still need to automate your build for the `integrate` script, so you could end up maintaining two separate builds. Beware of lock-in, too: eventually, the IDE won’t be able to provide one-second feedback. When that happens, rather than fighting the IDE, switch to a proper script-based approach. It’s more flexible.
Improve your automation continuously and incrementally, starting with your very first story. In a brand-new codebase, that means that your first development tasks are to set up your scripts.
Keep your automation simple. In the beginning, you don’t need sophisticated incremental builds or dependency graph analysis. Before you write any code, start by writing a `build` script that simply says `BUILD OK`. Nothing else! It’s like a “hello world” for your build. Then write a `watch` script that does nothing but run `build` when files change.
When `build` and `watch` are working, create a similarly bare-bones `integrate` script. At first, it just needs to run `build` in a pristine environment and integrate your code. The “The Continuous Integration Dance” section describes how it works.
When `integrate` is working, you’re ready to flesh out `build`. Write a do-nothing entry point for your application. Maybe it just says “Hello world.” Make `build` compile or lint it, then add dependency management for the compiler or linter. It can check the version against a constant to start with, or you can install a dependency management tool. Alternatively, you can vendor your dependencies.
Next, add a unit testing tool and a failing test. Be sure to add dependency management for the testing tool too. Make the build run the test, fail appropriately, and exit with an error code. Next, check that `watch` and `integrate` both handle failures correctly, then make the test pass.
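A build at this stage could be as small as the Python sketch below. It is only an illustration under stated assumptions: the `build` function, its `steps` parameter, and the example step commands are hypothetical. With an empty list of steps it is the “hello world” build that just prints `BUILD OK`; as steps are added, any failing step makes it return a nonzero exit code that other scripts can detect.

```python
import subprocess
import sys

# Hypothetical steps; start with an empty list and grow it over time.
STEPS = [
    [sys.executable, "-m", "compileall", "-q", "src"],
    [sys.executable, "-m", "unittest", "discover", "-s", "tests"],
]


def build(steps, run=subprocess.run):
    """Run each step in order and return a process exit code.

    Returns 0 when every step succeeds and 1 at the first failure.
    """
    for step in steps:
        if run(step).returncode != 0:
            print("BUILD FAILED: " + " ".join(step))
            return 1
    print("BUILD OK")
    return 0
```

The script’s entry point would finish with `sys.exit(build(STEPS))`, so that anything invoking the build sees the failure through the exit code rather than by parsing output.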
Now you can add the `rundev` script. Make `rundev` compile (if needed) and run your do-nothing application, then make it automatically restart when the source files change. Refactor so `rundev` and your other scripts don’t have duplicated file-watching or compilation code.
Finally, create `deploy`. Have it run `integrate` (don’t forget to handle failures) and then deploy the integration branch. Start by deploying to a staging server. The right way to do so depends on your system architecture, but you have only one production file, so you don’t need to do anything complicated. Just deploy that one file and its runtime environment to one server. It can be as simple as using `rsync`. Anything more complicated, such as crash handling, monitoring, or provisioning, needs a story. (For example, “Site keeps working after crash.”) As your system grows, your automation will grow with it.
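The shape of such a bare-bones `deploy` can be sketched in Python as follows. Everything specific here is an assumption: the `./integrate` command, the `dist/` directory, the staging host, and the destination path are placeholders. The `rsync` options used are the standard `-a` (archive), `-z` (compress), and `--delete` (remove files at the destination that no longer exist in the source).

```python
import subprocess

# Hypothetical commands and destination; substitute your own.
INTEGRATE = ["./integrate"]
DEPLOY = ["rsync", "-az", "--delete", "dist/", "deploy@staging.example.com:/srv/app/"]


def deploy(integrate_command, deploy_command, run=subprocess.run):
    """Integrate first, and deploy only if integration succeeded.

    Returns 0 on success and 1 on any failure.
    """
    if run(integrate_command).returncode != 0:
        print("INTEGRATION FAILED: nothing was deployed")
        return 1
    print("INTEGRATION OK")
    if run(deploy_command).returncode != 0:
        print("DEPLOY FAILED")
        return 1
    print("DEPLOY OK")
    return 0
```

Calling `deploy(INTEGRATE, DEPLOY)` runs the two commands in order. The `run` parameter exists only so the function can be exercised without touching a real server.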
If you don’t deploy to a server, but instead distribute installation packages, make `deploy` build a simple distribution package. Start with a bare-bones package, such as a ZIP file, that just contains your one production file and its runtime. Fancier and more user-friendly installation can be scheduled with user stories.
From this point forward, update your automation with every story. When you add dependencies, don’t install them manually (unless you vendor them); add them to your dependency manager’s configuration and let it install them. That way, you know it will work for other people too. When a story first involves a database, update `deploy` to automatically install, configure, and deploy it. Same for stories that involve additional services, servers, and so forth.
When written out in this way, automation sounds like a lot of work. But when you build your automation incrementally, you start simple and grow your automation along with the rest of your code. Each improvement is only a day or two of work, at most, and most of your time is focused on your production code.
Automating Legacy Code
You may not have the luxury of growing your automation alongside your code. Often, you’ll add automation to an existing codebase instead.
Start by creating empty `build`, `rundev`, `integrate`, and `deploy` scripts. Don’t automate anything yet; just find the documentation for each of these tasks and have the script output it to the console. For example, the `deploy` script might say “1. Run `esoteric_command` 2. Load `https://obscure_web_page`,” and so forth. Wait for a keypress after each step.
Such simple “automation” shouldn’t take long, so you can create each script as part of your slack. When you create each one, the script becomes your new, version-controlled source of truth. Either remove the old documentation or change it to describe how to run the script.
Next, use your slack to gradually automate each step. Automate the easiest steps first, then focus on the steps that introduce the most friction. For a while, your scripts will have a mix of automation and step-by-step instructions. Keep going until the scripts are fully automated, then start looking for opportunities to further improve and simplify.
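A script that mixes printed instructions with automated steps might be structured like this Python sketch. The `walk_through` function and its parameters are hypothetical, and the manual steps reuse the placeholder examples from the text. Manual steps are plain strings that the script prints before waiting; automated steps are functions that the script simply calls. Note that `input` waits for the Enter key rather than any keypress.

```python
# Manual steps copied from existing documentation (placeholders from the text).
DEPLOY_STEPS = [
    "Run `esoteric_command`",
    "Load `https://obscure_web_page`",
]


def walk_through(steps, wait=input, show=print):
    """Call automated steps; print manual ones and wait for Enter.

    A step is either a string (manual instruction) or a callable
    (an instruction that has already been automated).
    """
    for number, step in enumerate(steps, start=1):
        if callable(step):
            name = getattr(step, "__name__", "step")
            show(f"{number}. (automated) {name}")
            step()
        else:
            show(f"{number}. {step}")
            wait("Press Enter when done... ")
    show("All steps complete.")
```

Automating a step then means replacing its string with a function in the list, one step at a time, until no strings remain.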
Once `build` is fully automated, you’ll probably find that it’s too slow for one-second feedback (or even ten-second feedback). Eventually, you’ll want to have a sophisticated incremental approach, but you can start by identifying small chunks of your codebase. Provide build targets that allow you to build and test each one in isolation. The more finely you chop up the chunks, the easier it will be to get below the 10-second threshold.
Once a commonly used build target is below 10 seconds, it’s fast enough to be worth creating a `watch` script. Continue optimizing, using your slack to improve a bit at a time, until you get all the targets below five seconds. At some point, modify the build to automatically choose targets based on what’s changed.
- Fast Reliable Tests
Next, improve your deployment speed and reliability. This will probably require improving the tests, so it will take a while. As before, use your slack to improve a piece at a time. When a test fails randomly, make it deterministic. When you’re slowed down by a broad test, replace it with narrow tests. The “Adding Tests to Existing Code” section explains what to do.
The code will never be perfect, but eventually, the parts you work with most frequently will be polished smooth. Continue using your slack to make improvements whenever you encounter friction.
How do we find time to automate?
- The Planning Game
The same way you find time for coding and testing: it’s simply part of the work to be done. During the planning game, when you think about the size of each story, include any automation changes the story needs.
Similarly, use your slack to make improvements when you encounter friction. But remember that slack is for extra improvement. If a story requires automation changes, building the automation—and leaving the scripts you touched at least a bit better than you found them—is part of developing the story, not part of your slack. The story’s not done until the automation is too.
Who’s responsible for writing and maintaining the scripts?
- Collective Code Ownership
They’re collectively owned by the whole team. In practice, team members with programming and operations skills take responsibility for them.
We have another team that’s responsible for build and deployment automation. What should we do?
Treat their automation the same way you treat any third-party dependency. Encapsulate their tools behind scripts you control. That will give you the ability to customize as needed.
When does database migration happen?
It’s part of your deployment, but it may happen after the deployment is complete. See the “Data Migration” section for details.
Every team can work on reducing friction. Some languages make fast feedback more difficult, but you can usually get meaningful feedback about the specific part of the system you’re currently working on, even if that means running a small subset of your tests. Fast feedback is so valuable, it’s worth taking the time to figure it out.
Your ability to run the software locally may depend on your organization’s priorities. In a multiteam environment, it’s easy to accidentally create a system that can’t be run locally. If that’s the case for you, you can still program your tests to run locally, but a way to run the whole system locally might be out of your control.
When your team has zero-friction development:
- You spend your time developing, not struggling with tools, checklists, and dependency documentation.
- You’re able to work in very small steps, which allows you to catch errors earlier and spend less time debugging.
- Setting up a new development workstation is a simple matter of cloning the repository and running a script.
- You’re able to integrate and deploy multiple times per day.
Alternatives and Experiments
Zero-friction development is an ideal that every team should strive for. The best way to do it depends on your situation, so feel free to experiment.
Some teams rely on their IDE, rather than scripting, to provide the automation they need. Others use large “kitchen-sink” tools with complicated configuration languages. I find that these approaches tend to break down as the needs of the team grow. They can be a convenient way to get started, but when you outgrow them, switching tends to be painful, and difficult to do incrementally. Be skeptical when evaluating complicated tools that promise to solve all your automation needs.