AoAD2 Practice: Zero Friction

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Zero Friction

Audience
Programmers, Operations

When we’re ready to code, nothing gets in our way.

Imagine you’ve just started working with a new team. One of your new teammates, Pedro, walks you over to a development workstation.

“Since you’re new, we’ll start by deploying a small change,” he says, sitting down next to you. “This machine is brand-new, so we’ll have to set it up from scratch. First, clone the repo.” He tells you the command. “Now, run the build script.”

Commands start scrolling up the screen. Pedro explains. “We use a tool for reproducible builds. It uses a configuration file in the repo to make sure we all have the same tooling installed. Right now, it’s detected that you don’t have anything installed, so it’s installing the IDE, development tools, and images needed to develop and run the system locally.”

“This will take a while,” he continues. “After the first run, though, it’s instantaneous. It only updates again when we commit changes to the config. Come on, I’ll show you around the office.”

When you come back, the build is done. “Okay, let me show you the app,” Pedro says. “Type rundev to start it up.” Once again, information starts scrolling by. “This is all running locally,” Pedro explains proudly. “We used to have a shared test environment, and we were constantly stepping on each other’s toes. Now that’s all in the past. It even knows which services to restart depending on which files you change.”

Pedro walks you through the application. “Now let’s make a change,” he says. “Open up the IDE and run the watch script with the quick command. It will run the build when files change. The quick command tells it to only build and test the files that have changed.”

You follow his instructions and the script starts up, then immediately reports BUILD OK in green. “Nothing’s changed since we last ran the build,” Pedro explains, “so the script didn’t do anything. Now, let’s make a small change.” He directs you to a test file and has you add a test. When you save the changes, the watch script runs again and reports a test failure. It takes less than a second.

“We’ve put a lot of work into our build and test speed,” Pedro tells you. He’s clearly proud of it. “It wasn’t easy, but it’s totally worth it. We get feedback on most changes in a second or two. It’s done wonders for our ability to iterate and be productive. I’m not lying when I say this is the best development environment I’ve ever been in.”

“Now let’s finish up this change and deploy.” He shows you the production code change needed to get the new test to pass. Once again, when you save, the watch script runs the tests in about a second. This time, it reports success.

“Okay, we’re ready to deploy,” he says. “This is going into production, but don’t worry. The deploy script will run the full test suite, and we also have a canary server that checks to see if anything goes wrong. Type deploy to kick things off.”

You run the script and watch it go through its paces. A few minutes later, it says INTEGRATION OK, then starts deploying the code. “That’s it!” Pedro beams. “Once the integration succeeds, you can assume the deploy will too. If something goes wrong with the canary server, it will roll back the deploy and we’ll get a page. Welcome to the team!”

It’s been less than an hour, and you’ve already deployed to production. This is zero-friction development: when you’re ready to code, nothing gets in your way.

One-Second Feedback

When you make a change, get feedback in less than a second, or five at most.

Development speed is the most important area for eliminating friction. When you make a change, you need to get feedback about that change in less than a second, or five seconds at the very most.

This type of fast feedback is a game changer. You’re able to experiment and iterate so easily. Rather than making big changes, you can work in very small steps. Each change can be a line or two of code, which means that you always know where your mistakes are. Debugging becomes a thing of the past.

If feedback takes less than a second, it’s functionally instantaneous. You’ll make a change, see the feedback, and keep working. If it takes between one and five seconds, it won’t feel instantaneous, but it’s still acceptable. If it takes between five and ten seconds, it will feel slow. You’ll start being tempted to batch up changes. And if it’s more than ten seconds, you won’t be able to take small steps, and that will slow you down.

Ally
Test-Driven Development

To achieve one-second feedback, set up a watch script that automatically checks your code when you make a change. Inside the script, use a compiler or linter to tell you when you make syntax errors, and tests to tell you when you make semantic errors.

Alternatively, you can configure your IDE to check syntax and run tests, rather than writing a script. This can be an easy way to get started, although you’ll have to migrate to a script eventually. If you do start with an IDE-based approach, make sure its configuration can be committed to your repository and used by everyone on the team. You need the ability to share improvements easily.

When you save your changes, the script (or IDE) should give you immediate, unambiguous feedback. If everything worked, it should say OK. If anything failed, it should say FAILED, and provide information to help you troubleshoot the error. Most people make their tools display a green bar for success and a red bar for failure. I also program mine to play a sound—one for compile/lint failure, another for test failure, and a third for success—but that’s entirely optional.
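
As a minimal sketch of that kind of feedback, assuming Python as the scripting language, the function below runs a list of build steps in order and prints a green OK or a red FAILED along with the failing step's output. The step names and commands are placeholders, not a prescribed toolchain.

```python
import subprocess
import sys
import time

# ANSI escape codes for a green or red "bar" in most terminals.
GREEN = "\033[42;30m"
RED = "\033[41;37m"
RESET = "\033[0m"


def run_step(command):
    """Run one build step and return (succeeded, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def build(steps):
    """Run (name, command) steps in order, stopping at the first failure.

    Returns True if every step succeeded, False otherwise.
    """
    start = time.perf_counter()
    for name, command in steps:
        succeeded, output = run_step(command)
        if not succeeded:
            print(output)  # troubleshooting information comes first
            print(f"{RED} FAILED: {name} {RESET}")
            return False
    elapsed = time.perf_counter() - start
    print(f"{GREEN} OK {RESET} ({elapsed:.2f}s)")
    return True
```

A caller might pass steps such as `("lint", [sys.executable, "-m", "compileall", "-q", "src"])` followed by `("test", [sys.executable, "-m", "unittest", "discover", "-s", "tests"])`. The `src` and `tests` directory names are assumptions for illustration.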

As your codebase gets larger, one-second feedback will become harder to achieve. The first culprit is usually test speed. Instead of writing broad tests that check the whole system, write narrow tests that focus on the behavior of a small amount of code. Stub out slow and brittle parts of the system, such as file system, network, and database access. “Fast and Reliable Tests” on p.XX describes how.

As your system continues to grow, build speeds (compiling or linting) will become a problem. The solution will depend on your language. A web search for “speed up <language> build” will get you started. Typically, it will involve incremental builds: caching parts of the build so that only code that has changed gets rebuilt. The larger your system gets, the more creative you’ll have to be.

Ally
Continuous Integration

Eventually, you’ll probably need to set up two builds: one for fast feedback, and one for production deployment. Although it’s preferable for your local build to be the same as your production build, fast feedback is more important. Your deploy script can run your tests against the production build. As long as you have a good test suite and practice continuous integration, you’ll learn about discrepancies between the two builds before they’ve had a chance to get out of control.

Although good tests run at a rate of hundreds or thousands per second, you’ll eventually have too many tests to run them all in less than a second. When you do, you’ll need to revise your script to only run a subset of the tests. The easiest way is to group your tests into clusters, and run specific clusters based on the files that have changed.
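
Here is one way the cluster idea could look, as a sketch only. The directory names and the mapping itself are hypothetical; a real team would derive them from its own source layout. Files that don't belong to any known cluster fall back to running everything, which trades speed for safety.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from a source (or test) directory to the test
# clusters that should run when a file under that directory changes.
CLUSTERS = {
    "src/billing": ["tests/billing"],
    "src/accounts": ["tests/accounts"],
    "src/shared": ["tests/billing", "tests/accounts"],
    "tests/billing": ["tests/billing"],
    "tests/accounts": ["tests/accounts"],
}

ALL_TESTS = sorted({cluster for group in CLUSTERS.values() for cluster in group})


def clusters_for(changed_files):
    """Return the sorted test clusters to run for the given changed files."""
    selected = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        matches = [
            group for prefix, group in CLUSTERS.items()
            if path.is_relative_to(prefix)  # requires Python 3.9+
        ]
        if not matches:
            return ALL_TESTS  # unknown file: be safe and run everything
        for group in matches:
            selected.update(group)
    return sorted(selected)
```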

Eventually, you may want to do a more sophisticated dependency analysis that detects exactly which tests to run for any given change. Some test runners can do this for you. It’s also not as hard to implement as you might think. The trick is to focus on what your team needs rather than making a generic solution that handles all possible edge cases.
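
To give a feel for the size of the problem, the sketch below walks a module dependency graph backward from the changed modules to find every test that could be affected. It assumes you already have a map of which modules each module imports directly (how you obtain that map depends on your language), and the module names used in the usage note are made up.

```python
from collections import deque


def affected_tests(changed, imports, is_test):
    """Return the sorted test modules that depend on any changed module.

    `imports` maps each module name to the modules it imports directly.
    `is_test` is a predicate that says whether a module is a test.
    """
    # Invert the graph: for each module, which modules import it?
    importers = {}
    for module, dependencies in imports.items():
        for dependency in dependencies:
            importers.setdefault(dependency, set()).add(module)

    # Breadth-first walk from the changed modules to everything that
    # depends on them, directly or indirectly.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        module = queue.popleft()
        for importer in importers.get(module, ()):
            if importer not in seen:
                seen.add(importer)
                queue.append(importer)

    return sorted(module for module in seen if is_test(module))
```

With a graph in which `test_cart` imports `cart`, and both `cart` and `test_pricing` import `pricing`, a change to `pricing` selects `test_cart` and `test_pricing` and leaves unrelated tests alone.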

Know Your Editor

Don’t let your code editor get in the way of your thoughts. This is particularly important when pairing or mobbing; when you’re navigating, there are few things more frustrating than watching a driver struggle with the editor.

Take the time to get to know your editor really, really well. Learn the keyboard shortcuts. If the editor provides automated refactorings, learn how to use them. (If it doesn’t, look for a better editor.) Learn their keyboard shortcuts, too. Take advantage of auto-formatting, and commit the formatting configuration file to your repository so your whole team is in sync. Learn how to use code completion, automatic fixes, function and method lookup, and reference navigation. And learn the keyboard shortcuts.

For an example of how much of a difference editor proficiency can make, see Emily Bache’s virtuoso performance in her Gilded Rose kata videos, particularly part 2. [Bache 2018]

Reproducible Builds

It worked on my machine!

Overheard

What happens when you check out an arbitrary commit from your repository? Say, from a year ago. (Go on, try it!) Does it still run? Do the tests still pass? Or does it require some esoteric combination of tooling and external services that have long since passed from memory into oblivion?

You should be able to pull any commit and expect it to work the same for every developer.

A reproducible build is a build that continues to work and pass its tests no matter which development machine you use to build it, and no matter how old the code you’re building is. You should be able to pull any commit and expect it to work the same way for every developer. Generally speaking, this requires two things:

1. Dependency Management

Dependencies are the libraries and tools your code requires to run. This includes your compiler or interpreter, run-time environment, packages downloaded from your language’s package repository, code created by other teams in your organization, and so forth. For your build to be reproducible, everybody needs to have the exact same dependencies.

In your build, check the version of every dependency, including tools such as your compiler. If a dependency is missing or using the wrong version, the build should either exit with an error or (preferably) install the correct version. Tools to do so include Nix, Bazel, and Docker. Check that you’re using the right version of your dependency management tool, too.
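
The simplest form of this check can be sketched in a few lines. The example below assumes each tool prints a three-part version number when asked, and it takes the "exit with an error" route rather than installing anything. The function names are illustrative, and the expected versions would be constants committed to your repository.

```python
import re
import subprocess
import sys


def parse_version(text):
    """Extract the first major.minor.patch version number from tool output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def check_tool(name, command, expected):
    """Exit with an error unless the tool reports exactly the expected version."""
    wanted = ".".join(str(part) for part in expected)
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except FileNotFoundError:
        raise SystemExit(
            f"BUILD FAILED: {name} is not installed (expected {wanted})"
        ) from None
    # Some tools print their version to stderr, so look at both streams.
    actual = parse_version(result.stdout + result.stderr)
    if actual != expected:
        found = ".".join(str(part) for part in actual)
        raise SystemExit(f"BUILD FAILED: {name} is {found}, expected {wanted}")
    return actual
```

A build script would call something like `check_tool("python", [sys.executable, "--version"], (3, 12, 4))` before doing any other work, where the pinned version shown here is only an example.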

An easy way to ensure your software has the correct dependencies is to check them into your repository. This is called vendoring. It works best when your dependencies come in the form of source code rather than binaries. You can mix the two approaches: for example, a team with a Node.js codebase vendored its node_modules directory, but didn’t vendor the Node executable. Instead, they programmed the build to fail if the wrong version of Node was running.

2. Local Builds

Dependency management will ensure that your code runs the same way on every machine, but it won’t ensure that your tests pass. For your tests to pass, they need to run entirely locally, without communicating over the network. Otherwise, you’re likely to get inconsistent results when two people run the tests at the same time, and you won’t be able to build old versions. The services and data they depend on will have changed, and tests that used to pass will fail.

The same is true for when you run the code manually. To get consistent results and to be able to run old versions, everything the code depends on needs to be installed locally.

In some cases, it’s too difficult or expensive to run every dependency locally. You still need to be able to run your tests locally, though, for both reproducibility and speed. To do so, write your tests to use fake versions of any service you can’t run locally. The “Spy Server” pattern in [Shore 2018] describes how. For videos demonstrating the technique, see episodes 17 and 18 of [Shore 2020].
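
As a rough illustration of testing against a local fake, the sketch below starts a tiny HTTP server on the local machine, returns a canned response, and records the requests it receives. It is a simplified stand-in written for this example, not necessarily the implementation the cited pattern describes. The `fetch_rate` function and the `/rates/` endpoint are hypothetical.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class SpyServer:
    """A local stand-in for a remote HTTP service, for use in tests."""

    def __init__(self, response_body):
        self.requests = []  # paths of the GET requests received so far
        spy = self

        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                spy.requests.append(self.path)
                body = json.dumps(response_body).encode("utf-8")
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

            def log_message(self, format, *args):
                pass  # keep test output quiet

        # Port 0 asks the operating system for any free port.
        self._server = HTTPServer(("127.0.0.1", 0), Handler)
        self.host = "127.0.0.1"
        self.port = self._server.server_port
        self._thread = threading.Thread(
            target=self._server.serve_forever, daemon=True
        )

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._server.shutdown()
        self._server.server_close()
        self._thread.join()


def fetch_rate(host, port, currency):
    """Hypothetical production code that calls an external rate service."""
    connection = http.client.HTTPConnection(host, port, timeout=5)
    try:
        connection.request("GET", f"/rates/{currency}")
        return json.load(connection.getresponse())["rate"]
    finally:
        connection.close()
```

A test would wrap its assertions in `with SpyServer({"rate": 1.25}) as spy:`, call `fetch_rate(spy.host, spy.port, "EUR")`, and then check both the returned value and `spy.requests`. Nothing leaves the machine, so the test gives the same result for everyone.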

This raises the question: if you don’t test your software against its real dependencies, how do you know that it works? Because external services can change or fail at any time, the real answer is “monitoring.” (See the “Paranoic Telemetry” pattern of [Shore 2018].) But you can also run additional tests as a safety net in your deployment script.

Five-Minute Deploy

Your deploy script should report success or failure within five minutes—ten at most.

Create scripts for everything your team repeats. One of the most common examples is deployment. Your deploy script should create a production-grade build, run the full test suite against that build, integrate your code, deploy the code to production, and report success or failure within five minutes—ten at most.

A five-minute deploy is important because you need to remain available to fix any problems. Five minutes is enough for a stretch break and a new cup of coffee. Ten minutes is tolerable, but gets tedious. More than ten minutes and people start working on other tasks. Then, when a deploy fails, the code is left in limbo until somebody gets back to it.

Ally
Continuous Deployment

The deploy doesn’t need to literally complete within five minutes, although that’s preferable. Instead, it needs to report success or failure. After that, failures should be exceedingly rare. Typically, that means creating a production-grade build and running the main test suite. Additional pre-deployment actions, such as deploying to a canary server or running additional tests, can take longer. But they should only fail rarely, and when they do fail, the team needs to be alerted in some way.

For most teams, the thing standing between them and a five-minute deploy is the speed of their test suite. Focus on building narrow tests rather than broad end-to-end tests. Unreliable tests—tests that fail randomly—are another common problem that slows down deployment. “Fast and Reliable Tests” on p.XX explains how to fix both problems.

Control Complexity

An oft-overlooked source of friction for development teams is the complexity of their development environment. In their rush to get work done quickly, teams pull in popular tools, libraries, and frameworks to solve common development problems.

There’s nothing wrong with these tools, in isolation. But any long-lived software development effort is going to have specialized needs, and that’s where the quick and easy approach starts to break down. All those tools, libraries, and frameworks add up to an enormous cognitive burden, especially when you have to start diving into their internals to make them work together nicely. That ends up causing a lot of friction.

It’s more important to optimize maintenance costs than initial development, as “Key Idea: Optimize for Maintenance” on p.XX explains. Be thoughtful about the third-party dependencies you use. When you choose one, don’t just think about the problem it’s solving; think about the maintenance burden the dependency will add, and how well it will play with your existing systems. A simple tool or library your scripts can call is a great choice. A complex black box that wants to own the world probably isn’t.

Ally
Simple Design

In most cases, it’s best to wrap the third-party tool or library in code you control. The job of your code is to hide the underlying complexity and present a simple interface customized for your needs. The “Simple Design” practice explains further.
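
A wrapper like this can be very small. In the sketch below, which uses invented names, the rest of the build only ever sees `Linter.check` and `ToolResult`. The command line of the underlying tool lives in one place, so swapping or upgrading the tool doesn't ripple through the scripts.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ToolResult:
    ok: bool
    output: str


class Linter:
    """The only code in the build that knows how the lint tool is invoked."""

    def __init__(self, command):
        # `command` is the tool's base command line, e.g. a program plus flags.
        self._command = list(command)

    def check(self, paths):
        """Run the tool on the given paths and report a simple result."""
        result = subprocess.run(
            [*self._command, *paths], capture_output=True, text=True
        )
        return ToolResult(
            ok=result.returncode == 0,
            output=(result.stdout + result.stderr).strip(),
        )
```

For example, `Linter([sys.executable, "-m", "py_compile"]).check(["src/app.py"])` would use Python's built-in byte-compiler as a stand-in linter; the `src/app.py` path is an assumption.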

Automate Incrementally

Improve your automation continuously and incrementally, starting with your very first story. In a brand-new codebase, that means that your first development task is to set up your scripts.

Automate every repeated activity. To begin with, this means writing four scripts:

  • build: compile and/or lint, run tests, and report success or failure

  • watch: automatically run build when files change

  • deploy: run build in a production-like environment, integrate your code, and deploy

  • rundev: run the software locally for manual review and testing

You’re free to use whichever names you prefer, of course.

Keep your automation simple. For that first story, you don’t need sophisticated incremental builds or dependency graph analysis. Before you write any code, start by writing a build script that simply says BUILD OK. Nothing else! It’s like a “hello world” for your build.

Next, write a watch script that runs the build when files in your source tree are added, removed, or changed. Make sure it handles changes to the watch and build scripts themselves. Have it report how long the build takes, too. When that time exceeds five seconds, you’ll know it’s time to optimize.

The best way to detect file changes depends on your scripting language, but somebody’s probably written a library you can use. Try searching the web for “<language> watch for file changes.”
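
If you'd rather not add a library at first, polling the file system is a workable starting point. The sketch below is one possible bare-bones watch loop using only Python's standard library. It rebuilds when files are added, removed, or changed, and it reports how long each build takes. It does not restart itself when the watch script changes, which a fuller version would need to handle.

```python
import os
import time


def snapshot(root):
    """Map every file path under root to its last-modified time."""
    state = {}
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(directory, filename)
            try:
                state[path] = os.stat(path).st_mtime_ns
            except FileNotFoundError:
                pass  # the file vanished between listing and stat
    return state


def changed_paths(before, after):
    """Return the sorted paths that were added, removed, or modified."""
    added_or_removed = before.keys() ^ after.keys()
    modified = {
        path for path in before.keys() & after.keys()
        if before[path] != after[path]
    }
    return sorted(added_or_removed | modified)


def timed_build(build):
    """Run the build function and report how long it took, in seconds."""
    start = time.perf_counter()
    build()
    elapsed = time.perf_counter() - start
    warning = " (over five seconds: time to optimize)" if elapsed > 5 else ""
    print(f"build took {elapsed:.2f}s{warning}")
    return elapsed


def watch(root, build, poll_seconds=0.2):
    """Build once, then rebuild whenever anything under root changes."""
    previous = snapshot(root)
    timed_build(build)
    while True:  # runs until interrupted, e.g. with Ctrl-C
        time.sleep(poll_seconds)
        current = snapshot(root)
        if changed_paths(previous, current):
            previous = current
            timed_build(build)
```

Calling `watch("src", lambda: print("BUILD OK"))` pairs this loop with the "hello world" build; the `src` directory name is an assumption.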

You may be tempted to use your IDE instead of a watch script. That’s okay, to start with, but you’ll still need to automate your build for the deploy script, so you could end up maintaining two separate builds. Beware of lock-in, too: eventually, the IDE won’t be able to provide one-second feedback. When that happens, rather than fighting the IDE, switch to a proper script-based approach. It’s more flexible.

Speaking of scripting languages, use a real programming language for your scripting. Your scripts can call out to tools, and some of those tools might have their own proprietary configuration languages, but orchestrate them all with real code that you control. As your automation becomes more sophisticated, you’ll appreciate the power a real programming language provides.

Treat your scripts with the same respect as real production code, too. You don’t have to write tests for them—if you want to, it’s a good idea, but scripts can be very hard to test—but do pay attention to making your scripts well-written, well-factored, and easy to understand. You’ll thank yourself later.

Once you have a bare-bones watch script, create a similarly bare-bones deploy script. At first, it just needs to run build in a pristine environment and integrate your code. There are many tools that will do this for you, typically under the name “continuous integration server” or “build server.” Be sure to get one that integrates after the build succeeds, not before.

When deploy is working, you’re ready to flesh out build. Write a do-nothing entry point for your application. Maybe it just says “Hello world.” Make build compile or lint it, then add dependency management for the compiler or linter. It can just check the version against a constant, to start with, or you can install a dependency management tool. Alternatively, you can vendor your dependencies.

Next, add a unit testing tool and a failing test. Be sure to add dependency management for the testing tool too. Make the build run the test, fail appropriately, and exit with an error code. Next, check that watch and deploy both handle failures correctly, then make the test pass.

Now you can add the rundev script. Make rundev compile (if needed) and run your do-nothing application, then make it recompile and rerun when the source files change. Refactor so that build, watch, and rundev don’t have duplicated file-watching or compilation code.

Ally
Continuous Deployment

Finally, flesh out deploy with a simple deployment step. Start by deploying to a staging server. The right way to do so depends on your system architecture, but you only have one production file, so you don’t need to do anything complicated. Just deploy that one file to one server. It can be as simple as using scp or rsync. Anything more complicated—crash handling, monitoring, provisioning—needs a story. (For example, “Site keeps working after crash.”) As your system grows, your automation will grow with it.
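
A first deployment step along those lines might be no more than the sketch below, which copies one file to one server with rsync. The host name, paths, and function names are placeholders, and the `run` parameter exists only so the step can be exercised without touching a real server.

```python
import subprocess


def rsync_command(local_file, host, remote_dir):
    """Build the rsync command line that copies one file to one server."""
    return ["rsync", "--archive", "--compress", local_file, f"{host}:{remote_dir}/"]


def deploy_file(local_file, host, remote_dir, run=subprocess.run):
    """Copy the file to the server, raising an error if rsync fails."""
    command = rsync_command(local_file, host, remote_dir)
    run(command, check=True)  # check=True raises CalledProcessError on failure
    print("DEPLOY OK")
```

For instance, `deploy_file("dist/app.py", "staging.example.com", "/srv/app")` would deploy a single hypothetical production file to a hypothetical staging host.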

If you don’t deploy to a server, but instead distribute installation packages, make deploy build a simple distribution package. Start with a bare-bones package, such as a .zip file, that just contains your one production file. Fancier and more user-friendly installation can be scheduled with user stories.
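
For the packaging route, a bare-bones version could look like the following sketch, which puts a single production file into a .zip archive. The function name and any file names passed to it are illustrative.

```python
import zipfile
from pathlib import Path


def package(production_file, output_zip):
    """Create a .zip containing just the one production file."""
    output = Path(output_zip)
    output.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store the file at the top level of the archive, without its directory.
        archive.write(production_file, arcname=Path(production_file).name)
    return output
```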

From this point forward, update your automation with every story. When you add dependencies, don’t install them manually (unless you vendor them); add them to your dependency manager’s configuration and let it install them. That way, you know it will work for other people too. When a story first involves a database, update build, rundev, and deploy to automatically install, configure, and deploy it. Same for stories that involve additional services, servers, and so forth.

When written out in this way, automation sounds like a lot of work. But when you build your automation incrementally, you start simple and grow your automation along with the rest of your code. Each improvement is only a day or two of work, at most, and most of your time is focused on your production code.

Automating Legacy Code

You may not have the luxury of growing your automation alongside your code. Often, you’ll add automation to an existing codebase instead.

Start by creating empty build, rundev, and deploy scripts. Don’t automate anything yet; just find the documentation for each of these tasks and copy it into the corresponding script. For example, the rundev script might say “1. Run `esoteric_command` 2. Load `https://obscure_web_page`,” and so forth. Wait for a keypress after each step.
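
Such a script can start out as little more than a checklist. In the sketch below, a step is either a string (an instruction to carry out by hand) or a function (a step that has since been automated), so the script can shift from instructions to automation one step at a time. For simplicity it waits for the Enter key rather than any keypress, and the step text is the placeholder from the example above.

```python
def run_steps(steps, wait=input):
    """Walk through the steps in order.

    A string is printed as a manual instruction, and the script waits for
    Enter. A callable is a step that has already been automated; it just runs.
    """
    for number, step in enumerate(steps, start=1):
        if callable(step):
            step()
        else:
            print(f"{number}. {step}")
            wait("   Press Enter when done... ")


# Copied straight from the old documentation; nothing is automated yet.
RUNDEV_STEPS = [
    "Run `esoteric_command`",
    "Load `https://obscure_web_page`",
]
```

Running `run_steps(RUNDEV_STEPS)` prints each instruction and pauses until you confirm it. The `wait` parameter is there so the script can be exercised without a keyboard.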

Ally
Slack

Such simple automation shouldn’t take long, so you can create each script as part of your slack. When you create each one, the script becomes your new, version-controlled source of truth. Either remove the old documentation or change it to describe how to run the script.

Next, use your slack to gradually automate each step. Start with the low-hanging fruit and automate the easiest steps first, then focus on the steps that introduce the most friction. For a while, your scripts will have a mix of automation and step-by-step instructions. Keep going until the scripts are fully automated, then start looking for opportunities to further improve and simplify.

When build is fully automated, you’ll probably find that it’s too slow for one-second feedback (or even five-second feedback). Eventually, you’ll want to have a sophisticated incremental approach, but you can start by identifying small chunks of your codebase. Provide build targets that allow you to build and test each one in isolation. The more finely you chop up the chunks, the easier it will be to get below the five-second target.

Once a commonly used build target is below ten seconds, it’s fast enough to be worth creating a watch script. Continue optimizing, using your slack to improve a bit at a time, until you get all the targets below five seconds. At some point, modify the build to automatically choose targets based on what’s changed.

Next, improve your deployment speed and reliability. This will probably require improving the tests, so it will take a while. As before, use your slack to improve a piece at a time. When a test fails randomly, make it deterministic. When you’re slowed down by a broad test, replace it with narrow tests. “Adding Tests to Existing Code” on p.XX explains what to do.

The code will never be perfect, but eventually, the parts you work with most frequently will be polished smooth. Continue using your slack to make improvements whenever you encounter friction.

Questions

How do we find time to automate?

Ally
The Planning Game
Done Done

The same way you find time for coding and testing: it’s simply part of the work to be done. During the planning game, when you size each story, include any automation changes the story needs.

Use your slack to make improvements when you encounter friction.

Similarly, use your slack to make improvements when you encounter friction. But remember that slack is for extra improvement. If a story requires automation changes, building the automation—and leaving the scripts you touched at least a bit better than you found them—is part of developing the story, not part of your slack. The story’s not done until the automation is too.

Who’s responsible for writing and maintaining the scripts?

Ally
Collective Code Ownership

They’re collectively owned by the whole team. In practice, team members with programming and operations skills take responsibility for them.

We have another team that’s responsible for build and deployment automation. What should we do?

Treat their automation in the same way you treat any third-party dependency. Encapsulate their tools behind scripts you control. That will give you the ability to customize as needed.

When does database migration happen?

It’s part of your deployment, but it may happen after the deployment is complete. See “Continuous Deployment” on p.XX for details.

Prerequisites

Every team can work on getting one-second feedback. Some languages make fast feedback more difficult, but you can usually get meaningful feedback about the specific part of the system you’re currently working on, even if that means running a small subset of your tests. Fast feedback is so valuable, it’s worth taking the time to figure it out.

Your ability to run the software locally may depend on your organization’s priorities. In a multi-team environment, it’s easy to accidentally create a system that can’t be run locally. If that’s the case for you, you can still program your tests to run locally, but running the whole system manually might be out of your control.

Ally
Continuous Integration

Your operations team or organization may not want you to use continuous deployment. If so, you can create an integrate script instead of a deploy script. It’s the same thing, but without the deployment part: it runs build in a pristine environment, then integrates the code.

In some cases, your company may not allow you to install a continuous integration server. You don’t need a tool for continuous integration, though; just a spare development machine. See “Continuous Integration” on p.XX for details.

Indicators

When your team has zero-friction development:

  • You spend your time developing, not struggling with tools, checklists, and dependency documentation.

  • You’re able to work in very small steps, which allows you to catch errors earlier and spend less time debugging.

  • Setting up a new development workstation is a simple matter of cloning the repository and running a script.

  • You’re able to integrate and deploy multiple times per day.

Alternatives and Experiments

Zero-friction development is an ideal that every team should strive for. The best way to do it depends on your situation, so feel free to experiment.

Some teams rely on their IDE, rather than scripting, to provide the automation they need. Others use large “kitchen-sink” tools with complicated configuration languages. I find that these approaches tend to break down as the needs of the team grow. They can be a convenient way to get started, but when you outgrow them, switching tends to be painful and difficult to do incrementally. Use caution when evaluating complicated tools that promise to solve all your automation needs.
