May 27, 2022
Art of Agile Development Book Club: Continuous Deployment

Fridays from 8:00 – 8:45am Pacific, I host an online discussion inspired by the new edition of my book, The Art of Agile Development. Each session uses a chapter from the book as a jumping-off point for a wide-ranging discussion about Agile ideas and practices.

Attendance is free! No sign-up needed.

To learn more about The Art of Agile Development, see the book home page. You can buy a copy from Amazon or your favorite purveyor of software development books.

May 27th: Continuous Deployment

Most software development has a hidden delay between the team saying “We’re done” and when it’s actually ready to release. Sometimes that delay can stretch on for months. It’s the little things: getting everyone’s code to work together, writing a deploy script, updating data, and so forth. Continuous integration and deployment resolve this risk. In this session, Kelsey Hightower joins us to talk about how it works.

Kelsey Hightower got his start as an entrepreneur at a young age. He was an early adopter of cloud technologies and now works as a Principal Engineer at Google Cloud, where he helps people and organizations learn to be better. He’s a popular speaker and past chair of several high-profile technology conferences. His book, Kubernetes: Up & Running, was co-authored with two of the creators of Kubernetes. It’s now available in its second edition.

When: May 27th, 8-8:45am Pacific (calendar invite)
Where: 🎙 Zoom link
Reading: 📖 Continuous Integration, 📖 Continuous Deployment, 📖 Feature Flags
Discussion: 🧑‍💻 Discord invite

Discussion prompts:

  • Let’s talk about organizational change. Continuous deployment and continuous integration can be a big mindset shift for a lot of organizations. What’s involved with making that shift?

  • Continuous integration and deployment also rely on a good automated build and automated deployment. What are some important things to keep in mind when building this automation?

  • Data migration often involves big, irreversible changes. What are some tricks for performing data migration safely?

  • Feature flags and keystone interfaces allow teams to deploy changes without releasing in-progress features. What are the plusses and minuses of these techniques?

Future Sessions

We’ve got a great lineup of topics and guest speakers coming up. Add them to your calendar!

  • June 3rd: Forecasting and Roadmaps with Todd Little
  • June 10th: Agile Management with Johanna Rothman and Elisabeth Hendrickson
  • June 17th: (no session)
  • June 24th: Evolutionary System Architecture
  • ...and more to come!

Session Recordings

Note: The Art of Agile Development Book Club sessions are recorded. By appearing on the show, you consent to be recorded and for your appearance to be edited, broadcast, and distributed in any format and for any purpose without limitation, including promotional purposes. You agree Titanium I.T. LLC owns the copyright to the entire recording, including your contribution, and has no financial obligations to you as the result of your appearance. You acknowledge that your attendance at the book club is reasonable and fair consideration for this agreement.

If you don’t want to be recorded, that’s fine—just keep your camera and microphone muted. You’re still welcome to attend!

AoAD2 Practice: Feature Flags

This is an excerpt from The Art of Agile Development, Second Edition. Visit the Second Edition home page for additional excerpts and more!

This excerpt is copyright 2007, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

📖 The full text of this section is available below, courtesy of the Art of Agile Development book club! Join us on Fridays from 8-8:45am Pacific for wide-ranging discussions about Agile. Details here.

Feature Flags

Audience
Programmers

We deploy and release independently.

For many teams, releasing their software is the same as deploying their software. They deploy a branch of their code repository into production, and everything in that branch is released. If there’s anything they don’t want to release, they store it in a separate branch.

Allies
Continuous Integration
Continuous Deployment

That doesn’t work for teams using continuous integration and deployment. Other than short-lived development branches, they have only one branch: their integration branch. There’s nowhere for them to hide unfinished work.

Feature flags, also known as feature toggles, solve this problem. They hide code programmatically, rather than using repository branches. This allows teams to deploy unfinished code without releasing it.

Feature flags can be programmed in a variety of ways. Some can be controlled at runtime, allowing people to release new features and capabilities without redeploying the software. This puts releases in the hands of business stakeholders, rather than programmers. They can even be set up to release the software in waves, or to limit releases to certain types of users.

Keystones

Strictly speaking, the simplest type of feature flag isn’t a feature flag at all. Kent Beck calls it a “keystone.” [Beck2004] (ch. 9) It’s easy: when working on something new, wire up the UI last. That’s the keystone. Until the keystone is in place—until the UI is wired up—nobody will know the new code exists, because they won’t have any way to access it.

For example, when I migrated a website to use a different authentication service, I started by implementing an infrastructure wrapper for the new service. I was able to do most of that work without wiring it up to the login button. Until I did, users were unaware of the change, because the login button still used the old authentication infrastructure.

Allies
Test-Driven Development
Fast Reliable Tests

This does raise the question: if you can’t see your changes, how do you test them? The answer is test-driven development and narrow tests. Test-driven development allows you to check your work without seeing it run. Narrow tests target specific functions without requiring them to be hooked up to the rest of your application.

Eventually, of course, you’ll want to see the code run, either to fine-tune the UI (which can be difficult to test-drive), for customer review, or just to double-check your work. TDD isn’t perfect, after all.

Design your new code to be “wired up” with a single line. When you want to see it run, add that line. If you need to integrate before you’re ready to release, comment that line out. When you’re ready to release, write the appropriate test and uncomment the line one final time.
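As a rough illustration of the single-line wiring described above, here is a minimal sketch built around a hypothetical route table. The names createRoutes, oldLogin, and newLogin are invented for this example and are not taken from the site discussed in the text.

```javascript
// Hypothetical handlers. Both are deployed; only one is reachable.
function oldLogin() { return "old login page"; }
function newLogin() { return "new login page"; }

function createRoutes() {
  const routes = { "/login": oldLogin };
  // Keystone: uncomment the next line to wire up the new code.
  // routes["/login"] = newLogin;
  return routes;
}

console.log(createRoutes()["/login"]());
```

Narrow tests can still call newLogin() directly, which is how the hidden code stays tested while the keystone line is commented out.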

Keystones don’t have to involve a user interface. Anything that hides your work from customers can be used as a keystone. For example, one team used continuous deployment for a rewrite of its website. The team deployed the new site to a real production server, but the server didn’t receive any production traffic. Nobody outside the company could see the new site until the team switched production traffic from the old server to the new one.

Keystones are my preferred approach to hiding incomplete work. They’re simple, straightforward, and don’t require any special maintenance or design work.

Feature Flags

Feature flags are just like keystones, except they use code to control visibility, not a comment. Usually, it’s a simple if statement.

To continue the authentication example, remember that I programmed my new authentication infrastructure without wiring it up to the login button. Before I could wire it up, I needed to test it in production because there were complicated interactions between the third-party service and my systems. But I didn’t want my users to use the new login before I tested it.

I solved this dilemma with a feature flag. My users saw the old login; I saw the new one. The code worked like this (Node.js):

if (siteContext.useAuth0ForAuthentication()) {
  // new Auth0 HTML
}
else {
  // old Persona HTML
}

As part of the change, I had to implement a new email validation page. It wasn’t exposed to existing users, but it was still possible for people to manually type in the URL, so I also used the feature flag to redirect them away:

httpGet(siteContext) {
  if (!siteContext.useAuth0ForAuthentication()) return redirectToAccountPage();
  ⋮
}

Feature flags are real code. They need the same attention to quality as the rest of your code. For example, the email validation page had this test:

it("redirects to account page if Auth0 feature flag is off", function() {
  const siteContext = createContext({ auth0: false });
  const response = httpGet(siteContext);
  assertRedirects(response, "/v3/account");
});

Be sure to remove feature flags after they’re no longer needed. This can be easy to forget, which is one of the reasons I prefer keystones to feature flags. To help you remember, you can add a reminder to your team calendar or a “remove flag” story to your team’s plan. Some teams program their flag code to log an alert or fail tests after its expiration date has passed.
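One way to make a forgotten flag fail tests is to record a removal date next to each flag. The sketch below assumes a hypothetical FLAGS registry and a findExpiredFlags helper; neither name comes from the original code, and the dates are made up.

```javascript
// Hypothetical flag registry: each flag records when it should be gone.
const FLAGS = {
  auth0: { enabled: true, removeBy: "2022-06-30" },
};

// Returns the names of flags whose removal date has passed.
function findExpiredFlags(flags, today) {
  return Object.keys(flags).filter(
    (name) => new Date(flags[name].removeBy) < today
  );
}

// In a test suite, fail the build when this list is not empty.
const expired = findExpiredFlags(FLAGS, new Date("2022-07-15"));
console.log(expired);
```

Passing the current date in as a parameter, rather than reading the clock inside the function, keeps the check itself easy to test.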

How does your code know when the flag is enabled? In other words, where do you implement your equivalent of useAuth0ForAuthentication()? You have several options.

Application configuration

Application configuration is the most straightforward way to control your feature flags. Your configuration code can pull the state of the flag from a constant, an environment variable, a database, or whatever you like. A constant is simplest, so it’s my first choice, but an environment variable or database will allow you to enable or disable the flag on a machine-by-machine basis, which allows you to perform incremental releases.
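A minimal sketch of the constant-plus-environment-variable approach might look like the following. The variable name USE_AUTH0 and the function isAuth0Enabled are assumptions made for illustration; in a real Node.js program you would pass in process.env.

```javascript
// Simplest option: a constant. Change it and redeploy to flip the flag.
const USE_AUTH0_DEFAULT = false;

// Assumed environment variable name. It lets each machine override the
// constant, which is what enables machine-by-machine releases.
function isAuth0Enabled(env) {
  const override = env.USE_AUTH0;
  if (override === undefined) return USE_AUTH0_DEFAULT;
  return override === "true";
}

console.log(isAuth0Enabled({ USE_AUTH0: "true" }));
```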

User configuration

If you want to enable your flag based on who’s logged in, make it a privilege attached to your user or account abstraction. For example, user.privileges.logsInWithAuth0(). You can use it to perform incremental releases based on subsets of users and selectively release features for the purpose of testing ideas.

Feature flags are easy to implement, but they can be complicated to manage. Once you start getting into incremental releases and user segmentation, it’s worth looking into the many tools and services for managing them.

Don’t confuse feature flags with user access control. Although feature flags can be used to hide a feature from a user, they’re a way of temporarily hiding new features users would otherwise have access to. User access control, in contrast, is for hiding features users should never have access to. They might both be implemented with user privileges, but they should be managed separately.

For example, if you create a new white-labeling feature for your enterprise customers, you might use a feature flag to gradually roll it out to those customers. However, you would also implement a user privilege that restricted access to enterprise customers. That way, when the feature flag code is removed, enterprise customers will continue to be the only people with access, and there’s no risk of accidentally enabling the feature flag for the wrong users.
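The separation between the two checks can be sketched as follows. The user shape here (a plan field and a flags array) is hypothetical; the point is only that the permanent rule and the temporary flag are evaluated independently.

```javascript
// Hypothetical user shape: `plan` drives permanent access control,
// `flags` holds temporary rollout state.
function canUseWhiteLabeling(user) {
  const hasAccess = user.plan === "enterprise";          // permanent rule
  const isReleased = user.flags.includes("whiteLabel");  // temporary flag
  return hasAccess && isReleased;
}

console.log(canUseWhiteLabeling({ plan: "enterprise", flags: ["whiteLabel"] }));
```

When the rollout is finished, deleting the isReleased line removes the flag without touching the access rule.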

Secrets

In some cases, you’ll want to enable a flag on a case-by-case basis, but you won’t be able to attach that privilege to a user. For example, during my authentication transition, I needed to enable the new login button before I was actually logged in.

For these cases, you can use a secret to enable the flag. In client-based applications, the secret can take the form of a special file in the filesystem. For server-based applications, a cookie or other request header will work. That’s what I did for my authentication flag. I programmed the code to look for a secret cookie that could be set only by logging in as an administrator.
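For a server-based application, a cookie check along these lines is one possibility. This is a sketch, not the code from the site described above: the cookie name auth0_preview, the helper parseCookies, and the function isPreviewEnabled are all invented for illustration.

```javascript
// Parses a raw Cookie header ("a=1; b=2") into an object.
function parseCookies(header) {
  const cookies = {};
  for (const pair of (header || "").split(";")) {
    const index = pair.indexOf("=");
    if (index === -1) continue;
    cookies[pair.slice(0, index).trim()] = pair.slice(index + 1).trim();
  }
  return cookies;
}

// `secret` is assumed to come from server configuration, not source code.
function isPreviewEnabled(cookieHeader, secret) {
  if (!secret) return false;  // a missing secret must never enable the flag
  return parseCookies(cookieHeader).auth0_preview === secret;
}

console.log(isPreviewEnabled("auth0_preview=s3cret; theme=dark", "s3cret"));
```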

Secret-based flags are riskier than configuration-based flags. If the secret gets out, anybody can enable the feature. They’re also harder to set up and control. I use them only as a last resort.

Prerequisites

Allies
Collective Code Ownership
Reflective Design

Anybody can use keystones. Feature flags run the risk of growing out of control, so your team needs to pay attention to their design and removal, especially as they multiply. Collective code ownership and reflective design help.

Despite their superficial similarity to privileges that control user access to features, feature flags are meant to be temporary. Don’t use feature flags as a replacement for proper user access control.

Indicators

When you use keystones and feature flags well:

  • Your team can deploy software that includes incomplete code.

  • Releasing is a business decision, not a technical decision.

  • Flag-related code is clean, well-designed, and well-tested.

  • Flags and their code are removed after the corresponding feature is released.

Alternatives and Experiments

Allies
Refactoring
Reflective Design

Feature branches are a common alternative to keystones and feature flags. When someone starts working on a new feature, they create a branch, and they don’t merge that branch back into the rest of the code until the feature is done. This is effective at keeping unfinished changes out of the hands of customers, but significant refactorings tend to cause merge conflicts. This makes it a poor choice for Delivering teams, which rely on refactoring and reflective design to keep costs low.

Keystones are so simple, they don’t leave a lot of room for experimentation. Feature flags, on the other hand, are ripe for exploration. Look for ways to keep your feature flags organized and the design clean. Consider how your flags can provide new business capabilities. For example, feature flags are often used for A/B testing, which involves showing different versions of your software to different users, then making decisions based on the results.

As you experiment, remember that simpler is better. Although keystones may seem like a cheap trick, they’re very effective, and they keep the code clean. It’s easy for feature flags to get out of control. Stick with simple solutions whenever you can.

Further Reading

Martin Fowler goes into more detail about keystones in [Fowler2020a].

Pete Hodgson has a very thorough discussion of feature flags in [Hodgson2017].

Share your thoughts about this excerpt on the AoAD2 mailing list or Discord server. Or come to the weekly book club!

For more excerpts from the book, see the Second Edition home page.

AoAD2 Practice: Continuous Deployment

Continuous Deployment

Audience
Programmers, Operations

Our latest code is in production.

If you use continuous integration, your team has removed most of the risk of releasing. Done correctly, continuous integration means your team is ready to release at any time. You’ve tested your code and exercised your deployment scripts.

Ally
Continuous Integration

One source of risk remains. If you don’t deploy your software to real production servers, it’s possible that your software won’t actually work in production. Differences in environment, traffic, and usage can all result in failures, even in the most carefully tested software.

Continuous deployment resolves this risk. It follows the same principle as continuous integration: by deploying small pieces frequently, you reduce the risk that a big change will cause problems, and you make it easy to find and fix problems when they do occur.

Although continuous deployment is a valuable practice for fluent Delivering teams, it’s optional. If your team is still developing their fluency, focus on the other practices first. Full adoption of continuous integration, including automated deployments to a test environment (which some people call “continuous delivery”), will give you nearly as much benefit.

How to Use Continuous Deployment

Allies
Zero Friction
Continuous Integration
No Bugs
Feature Flags
Build for Operation

Continuous deployment isn’t hard, but it has a lot of preconditions:

  • Create a zero-friction, zero-downtime deploy script that automatically deploys your code.

  • Use continuous integration to keep your code ready to release.

  • Improve quality to the point that your software can be deployed without manual testing.

  • Use feature flags or keystones to decouple deployments from releases.

  • Establish monitoring to alert your team of deployment failures.

Once these preconditions are met, enabling continuous deployment is just a matter of running deploy in your continuous integration script.

Ally
Whole Team

The details of your deploy script will depend on your organization. Your team should include people with operations skills who understand what’s required. If not, ask your operations department for help. If you’re on your own, Continuous Delivery [Humble2010] and The DevOps Handbook [Kim2016] are both useful resources.

Your deploy script must be 100 percent automated. You’ll be deploying every time you integrate, which will be multiple times per day, and could even be multiple times per hour. Manual steps introduce delays and errors.

Detecting Deployment Failures

Your monitoring system should alert you if a deployment fails. At a minimum, this involves monitoring for an increase in errors or a decrease in performance, but you can also look at business metrics such as user sign-up rates. Be sure to program your deploy script to detect errors, too, such as network failures during deploy. When the deploy is complete, have your deploy script tag the deployed commit with “success” or “failure.”
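The error-detection part of a deploy script can be sketched as a wrapper that runs each step and reports which tag to apply. The runDeploy function and its steps argument are hypothetical, and applying the tag to the commit is left out because it depends on your tooling.

```javascript
// Runs each deploy step in order and reports which tag to apply.
// `steps` is a hypothetical list of functions that throw on failure.
function runDeploy(steps) {
  try {
    for (const step of steps) step();
    return { tag: "success", error: null };
  } catch (err) {
    return { tag: "failure", error: err.message };
  }
}

const result = runDeploy([
  () => {},                                        // e.g. copy files
  () => { throw new Error("network timeout"); },   // e.g. restart servers
]);
console.log(result.tag, result.error);
```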

To reduce the impact of failure, you can deploy to a subset of servers, called canary servers, and automatically compare metrics from the old deploy to the new deploy. If they’re substantially different, raise an alert and stop the deploy. For systems with a lot of production servers, you can also have multiple waves of canary servers. For example, you could start by deploying to 10% of servers, then 50%, and finally all.
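The wave-by-wave comparison might be sketched like this. Everything here is an assumption made for illustration: the deployTo callback, the error-rate metric, and the threshold value stand in for whatever your deploy tooling and monitoring system actually provide.

```javascript
// Hypothetical threshold: how much worse the canary error rate may be.
const MAX_ERROR_RATE_INCREASE = 0.01;

// deployTo(fraction) is assumed to deploy to that share of servers and
// return the error rate observed on them.
function canaryDeploy(baselineErrorRate, deployTo, waves = [0.1, 0.5, 1.0]) {
  for (const fraction of waves) {
    const errorRate = deployTo(fraction);
    if (errorRate - baselineErrorRate > MAX_ERROR_RATE_INCREASE) {
      return { ok: false, stoppedAt: fraction };  // raise an alert here
    }
  }
  return { ok: true, stoppedAt: null };
}

console.log(canaryDeploy(0.02, () => 0.02));
```

A real comparison would likely look at several metrics and allow time for them to settle before moving to the next wave.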

Resolving Deployment Failures

One of the advantages of continuous deployment is that it reduces the risk of deployment. Because each deploy represents only a few hours of work, they tend to be low impact. If something does go wrong, the change can be reverted without affecting the rest of the system.

When a deployment does go wrong, immediately “stop the line” and focus the entire team on fixing the issue. Typically, this will involve rolling back the deploy.

Roll back the deploy

Start by restoring the system to its previous working state. This typically involves a rollback, which restores the previous deploy’s code and configuration. To do so, you can keep each deploy in a version control system, or you can just keep a copy of the most recent deploy.

One of the simplest ways to enable rollback is to use blue/green deployment. To do so, create two copies of your production environment, arbitrarily labeled “blue” and “green,” and configure your system to route traffic to one of the two environments. Each deploy toggles back and forth between the two environments, allowing you to roll back by routing traffic to the previous environment.

For example, if “blue” is active, deploy to “green.” When the deploy is complete, stop routing traffic to “blue” and route it to “green” instead. If the deploy fails, rolling back is a simple matter of routing traffic back to “blue.”
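The toggle can be sketched as a small router that remembers which environment receives traffic. The createRouter function is hypothetical; in practice this state usually lives in a load balancer or DNS configuration rather than in application code.

```javascript
// A minimal router that remembers which environment receives traffic.
function createRouter() {
  let active = "blue";
  const other = () => (active === "blue" ? "green" : "blue");
  return {
    active: () => active,
    idle: other,                          // the next deploy target
    switchTraffic() { active = other(); },
  };
}

const router = createRouter();
// Deploy to router.idle(), then cut over:
router.switchTraffic();
// If the deploy turns out to be bad, rolling back is the same operation:
router.switchTraffic();
console.log(router.active());
```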

Occasionally, the rollback will fail. This may indicate a data corruption issue or a configuration problem. Either way, it’s all hands on deck until the problem is solved. Site Reliability Engineering [Beyer2016] has practical guidance about how to respond to such incidents in chapters 12–14.

Fix the deploy

Rolling back the bad deploy will usually solve the immediate production problem, but your team isn’t done yet. You need to fix the underlying issue. The first step is to get your integration branch back into a known-good state. You’re not trying to fix the issue yet; you’re just trying to get your code and production environment back into sync.

Start by reverting the changes in the code repository, so your integration branch matches what’s actually in production. If you use merge commits in git, you can run git revert on the integration commit (reverting a merge commit requires the -m option to select the mainline parent). Then use your normal continuous integration process to integrate and deploy the reverted code.

Deploying the reverted code should proceed without incident because you’re deploying the same code that’s already running. It’s important to do so anyway, because it ensures your next deploy starts from a known-good state. Also, if this second deploy has problems too, it narrows the issue down to a deployment problem, not a problem in your code.

Ally
Incident Analysis

Once you’re back in a known-good state, you can fix the underlying mistake. Create tasks for debugging the problem—usually, the people who deployed it will fix it—and everybody can go back to working normally. After it’s been resolved, schedule an incident analysis session to learn how to prevent this sort of deployment failure from happening in the future.

Alternative: Fix forward

Some teams, rather than rolling back, fix forward. They make a quick fix—possibly by running git revert—and deploy again. The advantage of this approach is that you fix problems using your normal deployment script. Rollback scripts can go out of date, causing them to fail just when you need them the most.

On the other hand, deploy scripts tend to be slow, even if you have an option to disable testing (which isn’t necessarily a good idea). A well-executed rollback script can complete in a few seconds. Fixing forward can take a few minutes. During an outage, those seconds count. For this reason, I tend to prefer rolling back, despite the disadvantages.

Incremental Releases

Ally
Feature Flags

For large or risky changes, run the code in production before you reveal it to users. This is similar to a feature flag, except that you’ll actually exercise the new code. (Feature flags typically prevent the hidden code from running at all.) For additional safety, you can release the feature gradually, enabling a subset of users at a time.

The DevOps Handbook [Kim2016] calls this a dark launch. Chapter 12 has an example of Facebook using this approach to release Facebook Chat. The chat code was loaded onto clients and programmed to send invisible test messages to the backend service, allowing Facebook to load-test the code before rolling it out to customers.
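On the server side, the same idea can be sketched as a wrapper that exercises the new code but always returns the old result. The darkLaunch helper below is hypothetical and handles only synchronous functions; it is not how Facebook implemented its launch.

```javascript
// Runs the new code path so it is exercised in production, but always
// returns the old result, so users never see the new behavior.
function darkLaunch(oldImpl, newImpl, onError) {
  return function (...args) {
    try {
      newImpl(...args);   // result is discarded
    } catch (err) {
      onError(err);       // a failure in hidden code must not reach users
    }
    return oldImpl(...args);
  };
}

const errors = [];
const search = darkLaunch(
  (query) => "old results for " + query,
  (query) => { throw new Error("new index not ready"); },
  (err) => errors.push(err.message)
);
console.log(search("agile"), errors);
```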

Data Migration

Database changes can’t be rolled back—at least, not without risking data loss—so data migration requires special care. It’s similar to performing an incremental release: first you deploy, then you migrate. There are three steps:

  1. Deploy code that understands both the new and old schema. Deploy the data migration code at the same time.

  2. After the deploy is successful, run the data migration code. It can be started manually, or automatically as part of your deploy script.

  3. When the migration is complete, manually remove the code that understands the old schema, then deploy again.

Separating data migration from deployment allows each deploy to fail, and be rolled back, without losing any data. The migration occurs only after the new code has proven to be stable in production. It’s slightly more complicated than migrating data during deployment, but it’s safer, and it allows you to deploy with zero downtime.

Migrations involving large amounts of data require special care, because the production system needs to remain available while the data is migrating. For these sorts of migrations, write your migration code to work incrementally (possibly with a rate limiter, for performance reasons) and have it use both schemas simultaneously. For example, if you’re moving data from one table to another, your code might look at both tables when reading and updating data, but only insert data into the new table.
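Here is a minimal sketch of an incremental, dual-table migration, with two Map objects standing in for the old and new tables. The createStore function and its method names are invented for this example. Moving a row to the new table on update is one possible policy, not the only one.

```javascript
// Two Maps stand in for the old and new tables in this sketch.
function createStore(oldTable, newTable) {
  return {
    // Reads check the new table first, then fall back to the old one.
    read(id) {
      return newTable.has(id) ? newTable.get(id) : oldTable.get(id);
    },
    // Writes always land in the new table and retire the old row.
    write(id, value) {
      newTable.set(id, value);
      oldTable.delete(id);
    },
    // Moves up to batchSize rows per call so the system stays responsive.
    migrateBatch(batchSize) {
      const ids = [...oldTable.keys()].slice(0, batchSize);
      for (const id of ids) {
        if (!newTable.has(id)) newTable.set(id, oldTable.get(id));
        oldTable.delete(id);
      }
      return oldTable.size;  // rows still waiting to be migrated
    },
  };
}

const store = createStore(new Map([[1, "a"], [2, "b"]]), new Map());
while (store.migrateBatch(1) > 0) { /* a rate limiter could pause here */ }
console.log(store.read(1), store.read(2));
```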

Allies
Task Planning
Visual Planning

After the migration is complete, be sure to keep your code clean by removing the outdated code. If the migration needs more than a few minutes, add a reminder to your team’s task plan. For very long migrations, you can add a reminder to your team calendar or schedule a “finish data migration” story into your team’s visual plan.

This three-step migration process applies to any change to external state. In addition to databases, it also includes configuration settings, infrastructure changes, and third-party service changes. Be very careful when external state is involved, because errors are difficult to undo. Smaller, more frequent changes are typically better than big, infrequent changes.

Prerequisites

Allies
Continuous Integration
Feature Flags
No Bugs
Zero Friction
Build for Operation

To use continuous deployment, your team needs a rigorous approach to continuous integration. You need to integrate multiple times per day and create a known-good, deploy-ready build each time. “Deploy-ready,” in this case, means unfinished features are hidden from users and your code doesn’t need manual testing. Finally, your deploy process needs to be completely automated, and you need a way of automatically detecting deployment failures.

Continuous deployment makes sense only when deployments are invisible to users. Practically speaking, that typically means backend systems and web-based frontends. Desktop and mobile frontends, embedded systems, and so forth usually aren’t a good fit for continuous deployment.

Indicators

When your team deploys continuously:

  • Deploying to production becomes a stress-free nonevent.

  • When deployment problems occur, they’re easily resolved.

  • Deploys are unlikely to cause production issues, and when they do, they’re usually quick to fix.

Alternatives and Experiments

The typical alternative to continuous deployment is release-oriented deployment: deploying only when you have something ready to release. Continuous deployment is actually safer and more reliable, once the preconditions are in place, even though it sounds scarier at first.

You don’t have to switch from release-oriented deployment directly to continuous deployment. You can take it slowly, starting out by writing a fully automated deploy script, then automatically deploying to a staging environment as part of continuous integration, and finally moving to continuous deployment.

In terms of experimentation, the core ideas of continuous deployment are to minimize work in progress and speed up the feedback loop (see the “Key Idea: Minimize Work in Progress” sidebar and the “Key Idea: Fast Feedback” sidebar). Anything that you can do to speed up that feedback loop and decrease the time required to deploy is moving in the right direction. For extra points, look for ways to speed up the feedback loop for release ideas, too.

Further Reading

The DevOps Handbook [Kim2016] is a thorough look at all aspects of DevOps, including continuous deployment, with a wealth of case studies and real-world examples.

“Migrating bajillions of database records at Stripe” [Heaton2015] is an interesting and entertaining example of incremental data migration.

AoAD2 Practice: Continuous Integration

Continuous Integration

Audience
Programmers, Operations

We keep our latest code ready to release.

Most software development has a hidden delay between the team saying, “We’re done” and when it's actually ready to release. Sometimes that delay can stretch on for months. It’s the little things: getting everybody’s code to work together, writing a deploy script, pre-populating the database, and so forth.

Continuous integration is a better approach. Teams using continuous integration keep everyone’s code working together and ready to release. The ultimate goal of continuous integration is to make releasing a business decision, not a technical decision. When on-site customers are ready to release, you push a button and release. No fuss, no muss.

Allies
Collective Code Ownership
Refactoring

Continuous integration is also essential for collective code ownership and refactoring. If everybody is making changes to the same code, they need a way to share their work. Continuous integration is the best way to do so.

Continuous Integration Is a Practice, Not a Tool

One of the early adopters of continuous integration was ThoughtWorks, a software development outsourcing firm. They built a tool called “CruiseControl” to automatically run their continuous integration scripts. They called it a continuous integration (CI) server, also known as a CI/CD server or build server.

Since then, the popularity of these tools has exploded. They’re so popular, the tools have taken over from the actual practice. Today, many people think “continuous integration” means using a CI server.

It’s not true. CI servers handle only one small part of continuous integration: they build and merge code on cue. But continuous integration is about much more than running a build. Fundamentally, it’s about being able to release your team’s latest work at will. No tool can do that for you. It requires three things:

Integrate many times per day

Integration means merging together all the code the team has written. Typically, it involves merging everyone’s code into a common branch of your source code repository. That branch goes by a variety of names: “main,” “master,” and “trunk” are common. I use “integration,” because I like clear names, and that’s what the branch is for. But you can use whatever name you like.

Teams practicing continuous integration integrate as often as possible. This is the “continuous” part of continuous integration. People integrate every time they complete a task, before and after every major refactoring, and any time they’re about to switch gears. The elapsed time can be anywhere from a few minutes to a few hours, depending on the work. The more often, the better. Some teams even integrate with every commit.

If you’ve ever experienced a painful multiday merge, integrating so often probably seems foolish. Why go through that pain?

The secret of continuous integration is that it actually reduces the risk of a bad merge. The more often you integrate, the less painful it is. More frequent integrations mean smaller merges, and smaller merges mean less chance of merge conflicts. Teams using continuous integration still have occasional merge conflicts, but they’re rare and easily resolved.

Never break the integration build

When was the last time you spent hours chasing down a bug in your code, only to find that it wasn’t your code at all, but an out-of-date configuration, or somebody else’s code? Conversely, when was the last time you spent hours blaming a problem on your configuration or somebody else’s code, only to find that it was your code all along? To prevent these problems, the integration branch needs to be known-good. It must always build and pass its tests.

Allies
Zero Friction
Test-Driven Development
Fast Reliable Tests

This is actually easier than you might think. You’ll need an automated build with a good suite of tests, but once you have that, guaranteeing a known-good integration branch is just a matter of validating the merged code before promoting it to the integration branch. That way, if the build fails, the integration branch remains in its previous, known-good state.

However, the build must be fast, finishing in less than 10 minutes. If it isn’t, it’s too hard to share code between team members. You can work around a slow build with multistage integration, as I discuss in the “Multistage Integration Builds” section.

Keep the integration branch ready to release

Every integration should get as close to a real release as possible. The goal is to make preparing for release such an ordinary occurrence that, when you actually do release, it’s a nonevent. One team I worked with got to the point that it was releasing multiple times per week. Team members wrote a small mobile app with a big red button. When they were ready to release, they’d go to the local pub, order a round, and push the button.

Allies
Done Done
Build for Operation
Feature Flags

This means that every story includes tasks to update the build and deployment scripts, when needed. Code changes are accompanied by tests. Code quality problems are addressed. Data migrations are scripted. Important but invisible stories such as logging and auditing are prioritized alongside their features. Incomplete work is hidden behind feature flags or keystones.

“Getting as close as possible to a real release” includes running the deployment scripts and seeing them actually work. You don’t need to deploy to production—that’s continuous deployment, a more advanced practice—but you should deploy to a test environment. The same goes for software that isn’t online. If you’re building embedded software, install it to test hardware or a simulator. If you’re building a mobile app, create a submission package. If you’re building a desktop app, build an install package.

Don’t save the grunt work for the end. (See the “Key Idea: Minimize Work in Progress” sidebar.) Take care of it continuously, throughout development. From the very first day, focus on creating a walking skeleton that could be released, if it only had a bit more meat on its bones, and steadily add to it with every story and task.

The Many Flavors of Continuous Integration

Continuous integration is so popular, and so misunderstood, people keep coming up with new terms for different aspects of the underlying idea:

  • CI server. A tool that automatically runs build scripts. Not continuous integration at all.

  • Trunk-based development. Emphasizes the “integration” part of continuous integration. [Hammant2020]

  • Continuous delivery. Emphasizes the “deploy” part of continuous integration. [Humble2010] Commonly thought of as “continuous integration + deploy to test environment.”

  • Continuous deployment. A genuinely new practice. It deploys to production with every integration. Commonly thought of as “continuous delivery + deploy to production.”

Although continuous delivery is often seen as a separate practice from continuous integration, Kent Beck described it as part of continuous integration way back in 2004:

Integrate and build a complete product. If the goal is to burn a CD, burn a CD. If the goal is to deploy a web site, deploy a web site, even if it is to a test environment. Continuous integration should be complete enough that the eventual first deployment of the system is no big deal. [Beck2004] (ch. 7)

The Continuous Integration Dance

When you use continuous integration, every day follows a little choreographed dance:

  1. Sit down at a development workstation and reset it to a known-good state.

  2. Do work.

  3. Integrate (and possibly deploy) at every good opportunity.

  4. When you’re finished, clean up.

Ally
Zero Friction

These steps should all be automated as part of your zero-friction development environment.

For step 1, I make a script called reset_repo, or something similar. With git, the commands look like this (before error handling):

git clean -fdx                       # remove untracked and ignored files
git fetch -p origin                  # get latest code from repo, removing outdated branches
git checkout integration             # switch to integration branch
git reset --hard origin/integration  # reset integration branch to match repo
git checkout -b $PRIVATE_BRANCH      # create a private branch for your work
$BUILD_COMMAND_HERE                  # verify that you’re in a known-good state

During step 2, you'll work normally, including committing and rebasing however your team prefers.

Step 3 is to integrate. You can do so any time the tests are passing. Try to integrate at least every few hours. When you’re ready to integrate, you’ll merge the latest integration branch changes into your code, make sure everything works together, then tell your CI server to test your code and merge it back into the integration branch.

Your integrate script will automate these steps for you. With git, it looks like this (before error handling):

git status --porcelain         # check for uncommitted changes (fail if any)
git pull origin integration    # merge integration branch into local code
$BUILD_COMMAND_HERE            # build, run tests (to check for merge errors)
$CI_COMMAND_HERE               # tell CI server to test and merge code

# The following steps help git resolve merge conflicts
$WAIT_COMMAND_HERE             # wait for CI server to finish
git checkout integration       # check out integration branch
git pull origin integration    # update integration branch from repo
git checkout $PRIVATE_BRANCH   # check out private branch
git merge integration          # merge repo's integration branch changes

The CI command varies according to your CI server, but will typically involve pushing your code to the repository. Be sure to set up your CI server to build and test your code before merging back to the integration branch, not after. That way your integration branch is always in a known-good state. If you don’t have a CI server that can do that, you can use the script in the next section instead.
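
The “fail if any” check on the first line of the integrate script can be sketched as a small function. This is a minimal sketch rather than the book’s script: the name `require_clean_worktree` is an assumption, and it relies only on `git status --porcelain` printing nothing when the working tree is clean.

```shell
# Hypothetical helper (the name is an assumption, not from the book).
# Succeeds only when the working tree in directory $1 (default: current
# directory) has no uncommitted or untracked changes.
require_clean_worktree() {
  local dir changes
  dir=${1:-.}
  # --porcelain prints one line per changed or untracked file, nothing if clean
  changes=$(git -C "$dir" status --porcelain) || return 2
  if [ -n "$changes" ]; then
    echo "Uncommitted changes; commit or discard them before integrating:" >&2
    echo "$changes" >&2
    return 1
  fi
}
```

An integrate script could start with `require_clean_worktree || exit 1` so that nothing is merged or pushed while work is still uncommitted.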

Repeat steps 2 and 3 until you’re done for the day. After you integrate the final time, clean up:

git clean -fdx                       # remove untracked and ignored files
git checkout integration             # switch to integration branch
git branch -d $PRIVATE_BRANCH        # delete private branch
git fetch -p origin                  # get latest code from repo, removing outdated branches
git reset --hard origin/integration  # reset integration branch to match repo

These scripts are only suggestions—feel free to customize them to match your team’s preferences.

Continuous Integration Without a CI Server

It’s surprisingly easy to perform continuous integration without a CI server. In some environments, this may be your best option, as cloud-based CI servers can be woefully underpowered. All you need is an integration machine—a spare development workstation or virtual machine—and a small script.

To start with, program the integrate script you run on your development workstation to push your changes to a private branch. The git command is git push origin HEAD:$PRIVATE_BRANCH.

After the code has been pushed, manually log in to the integration machine and run a second integration script. It should check out the private branch, double-check that nobody else has integrated since you pushed it, run the build and tests, then merge the changes back into the integration branch.

Running the build and tests on a separate integration machine is essential for ensuring a known-good integration branch. It prevents “it worked on my machine” errors. With git, the commands look like this (before error handling):

# Get private branch
git clean -fdx                     # remove untracked and ignored files
git fetch origin                   # get latest code from repo
git checkout $PRIVATE_BRANCH       # check out private branch
git reset --hard origin/$PRIVATE_BRANCH   # reset private branch to match repo

# Check private branch
git merge integration --ff-only    # ensure integration branch has been merged
$BUILD_COMMAND_HERE                # build, run tests

# Merge private branch to integration branch using merge commit
git checkout integration           # check out integration branch
git merge $PRIVATE_BRANCH --no-ff --log=500 -m "INTEGRATE: $MESSAGE"  # merge
git push                           # push changes to repo

# Delete private branch
git branch -d $PRIVATE_BRANCH      # delete private branch locally
git push origin :$PRIVATE_BRANCH   # delete private branch from repo

If the script fails, fix it on your development machine and then integrate again. With this script, failed integrations don’t affect anyone else.
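
One way to make the “nobody else has integrated since you pushed” check explicit is to ask git whether the integration branch is an ancestor of the private branch. The sketch below assumes a helper named `integration_already_merged` (hypothetical) and a default branch name of `integration`. It uses `git merge-base --is-ancestor`, which exits with status 0 when the first commit is an ancestor of the second.

```shell
# Hypothetical helper (name and defaults are assumptions for illustration).
# Succeeds when every commit on the integration branch is already contained
# in the private branch, meaning nobody has integrated since the private
# branch last merged from integration. Run it inside the repository.
integration_already_merged() {
  local private_branch integration_branch
  private_branch=$1
  integration_branch=${2:-integration}
  git merge-base --is-ancestor "$integration_branch" "$private_branch"
}
```

If the check fails, the usual response is to go back to the development workstation, merge the latest integration branch, and integrate again.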

Note that only one person can integrate at a time, so you’ll need some way to control access. If you have a physical integration machine, whoever is sitting at the integration machine wins. If your integration machine is remote, you can configure it to allow only one login at a time.
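
If you would rather have the script enforce the one-at-a-time rule, a lock directory is a simple option. This is a minimal sketch under stated assumptions: the function names and the `LOCK_DIR` path are hypothetical. It relies on `mkdir` failing when the directory already exists, so only one caller can create it.

```shell
# Hypothetical lock helpers; LOCK_DIR is an assumed path, not from the book.
LOCK_DIR=${LOCK_DIR:-/tmp/integration.lock}

# mkdir fails if the directory already exists, so only one caller can
# create it and thereby hold the lock.
acquire_integration_lock() {
  if mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "${USER:-unknown}" >"$LOCK_DIR/owner"   # record who is integrating
    return 0
  fi
  echo "Integration in progress by: $(cat "$LOCK_DIR/owner" 2>/dev/null)" >&2
  return 1
}

# Remove the lock so the next person can integrate.
release_integration_lock() {
  rm -rf "$LOCK_DIR"
}
```

The integration script could call `acquire_integration_lock || exit 1` at the top and `trap release_integration_lock EXIT` right after, so the lock is released even when the build fails.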

This script is meant for synchronous integration, which means you have to wait for the integration to complete before doing other work. (I’ll explain more in a moment.) If you need asynchronous integration, you’re better off using a CI server. Multistage builds can use this script for the synchronous portion, for speed, then hand off to a CI server for the secondary build or deployment.

Synchronous Versus Asynchronous Integration

Allies
Zero Friction
Fast Reliable Tests

Continuous integration works best when you wait for the integration to complete. This is called synchronous integration, and it requires your build and tests to be fast—preferably completing in less than 5 minutes, or 10 minutes at most. Achieving this speed is usually a matter of creating fast, reliable tests.
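
To notice when a build drifts past that budget, the build command can be wrapped in a timer. The sketch below is an illustration, not part of the book’s scripts: `timed_build` and its exit code 3 are assumptions, and the budget is passed in seconds.

```shell
# Hypothetical wrapper; the name and exit codes are assumptions.
# Usage: timed_build MAX_SECONDS COMMAND [ARGS...]
timed_build() {
  local limit start elapsed status
  limit=$1
  shift
  start=$(date +%s)
  "$@"                       # run the real build command
  status=$?
  elapsed=$(( $(date +%s) - start ))
  echo "Build took ${elapsed}s (budget: ${limit}s)" >&2
  [ "$status" -eq 0 ] || return "$status"   # a failing build is still a failure
  if [ "$elapsed" -gt "$limit" ]; then
    echo "Build passed but exceeded its time budget" >&2
    return 3
  fi
}
```

For example, `timed_build 600 ./build.sh` (where `./build.sh` stands in for your own build command) would flag any build that takes longer than 10 minutes, even when it passes.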

If the build takes too long, you’ll have to use asynchronous integration instead. In asynchronous integration, which requires a CI server, you start the integration process, then go do other work while the CI server runs the build. When the build is done, the CI server notifies you of the result.

Asynchronous integration sounds efficient, but it turns out to be problematic in practice. You check in the code, start working on something else, and then half an hour (or more) later, you get a notification that the build failed. Now you have to interrupt your work and go fix the problem. In theory, anyway. More often, it gets set aside until later. You end up with a chunk of work that’s hours or even days out of date, with much more likelihood of merge conflicts.

It’s a particular problem with poorly configured CI servers. Although your CI server should merge code to the integration branch only after the build succeeds, so the integration branch is known-good, some CI servers default to merging the code first, then running the build afterward. If the code breaks the build, then everybody who pulls from the integration branch is blocked.

Combine that with asynchronous integration, and you end up with a situation where people unwittingly check in broken code and then don’t fix it because they assume somebody else broke the build. The situation compounds, with error building on error. I’ve seen teams whose builds remained broken for days on end.

It’s better to make it structurally impossible to break the build by testing the build first. It’s better still to use synchronous integration. When you integrate, wait for the integration to succeed. If it doesn’t, fix the problem immediately.

Multistage Integration Builds

Some teams have sophisticated tests, measuring qualities such as performance, load, or stability, that simply cannot finish in under 10 minutes. For these teams, multistage integration is a good idea.

A multistage integration consists of two separate builds. The normal build, or commit build, contains all the items necessary to demonstrate that the software works: compiling, linting, unit tests, narrow integration tests, and a handful of smoke tests. This build runs synchronously, as usual.

When the commit build succeeds, the integration is considered to be successful, and the code is merged to the integration branch. Then a slower secondary build runs asynchronously. It contains the additional tests that don’t run in a normal build: performance tests, load tests, stability tests, and so forth. It can also include deploying the code to staging or production environments.

If the secondary build fails, the team is notified, and everyone stops what they’re doing to fix the problem. This ensures the team gets back to a known-good build quickly. However, failures in the secondary build should be rare. If they’re not, the commit build should be enhanced to detect those types of problems, so they can be fixed synchronously.
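
The two stages can be sketched as a single function. This is a hedged illustration only: `multistage_integrate`, its arguments, and the result file are assumptions, and a real setup would hand the secondary build to a CI server rather than a background shell job.

```shell
# Hypothetical two-stage runner; the name and arguments are assumptions.
# Each build is a command string run by sh.
# Usage: multistage_integrate COMMIT_BUILD SECONDARY_BUILD RESULT_FILE
multistage_integrate() {
  local commit_build secondary_build result_file
  commit_build=$1
  secondary_build=$2
  result_file=$3

  # Stage 1: the commit build runs synchronously and gates the merge.
  if ! sh -c "$commit_build"; then
    echo "Commit build failed; integration branch left untouched" >&2
    return 1
  fi
  echo "Commit build passed; code can be merged to the integration branch"

  # Stage 2: the slower secondary build runs in the background and records
  # its outcome so the team can be notified.
  (
    if sh -c "$secondary_build"; then
      echo "passed" >"$result_file"
    else
      echo "failed" >"$result_file"
    fi
  ) &
}
```

The function returns as soon as the commit build passes, which mirrors the idea that the integration counts as successful before the secondary build finishes.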

Although a multistage build can be a good idea for a mature codebase with sophisticated testing, most teams I encounter use multistage integration as a workaround for a slow test suite. In the long term, it’s better to improve the test suite instead.

In the short term, introducing a multistage integration can help you transition from asynchronous to synchronous integration. Put your fast tests in the commit build and your slow tests in the secondary build. But don’t stop there. Keep improving your tests, with the goal of eliminating the secondary build and running your integration synchronously.

Pull Requests and Code Reviews

Pull requests aren’t a good fit for continuous integration. They’re too slow. Continuous integration works best when the time between integrations is very short—less than a few hours—and pull requests tend to take a day or two to approve. This makes merge conflicts much more likely, especially for teams using evolutionary design (which I’ll discuss in the “Design” chapter).

Allies
Pair Programming
Mob Programming

Instead, use pairing or mobbing to eliminate the need for code review. Alternatively, if you want to keep code reviews, you can conduct code reviews after integrating, rather than as a pre-integration gate.

Although pull requests don’t work well on teams using continuous integration, they can still work as a coordination mechanism between teams that don’t share ownership.

Questions

You said we should clean up at the end of the day, but what if I have unfinished work and can’t integrate?

Allies
Feature Flags
Test-Driven Development

If you’re using feature flags and practicing test-driven development, you can integrate any time your tests are passing, which should be every few minutes. You shouldn’t ever be in a position where you can’t integrate.

If you’ve gotten stuck, it might be a good idea to delete the unfinished code. If you’ve been integrating frequently, there won’t be much. You’ll do a better job with a fresh start in the morning.

Isn’t synchronous integration a waste of time?

No, not if your build is as fast as it should be. It’s a good opportunity to take a break, clear your head, and think about design, refactoring opportunities, or next steps. In practice, the problems caused by asynchronous integration take more time.

We always seem to run into merge conflicts when we integrate. What are we doing wrong?

One cause of merge conflicts is infrequent integration. The less often you integrate, the more changes you have to merge. Try integrating more often.

Another possibility is that your changes are overlapping with other team members’ work. Try talking more about what you’re working on and coordinating more closely with the people who are working on related code. See the “Making Collective Ownership Work” section for details.

The CI server (or integration machine) constantly fails the build. How can we integrate more reliably?

Ally
Fast Reliable Tests

First, make sure you have reliable tests. Intermittent test failures are the most common reason I see for failed builds. If that isn’t the problem, you might need to merge and test your code locally before integrating. Alternatively, if you have frequent problems with incorrect dependencies, you might need to put more work into reproducible builds, as described in the “Reproducible Builds” section.

Prerequisites

Allies
Zero Friction
Pair Programming
Mob Programming
Test-Driven Development
Fast Reliable Tests

Continuous integration works best with synchronous integration, which requires a zero-friction build that takes less than 10 minutes to complete. Otherwise, you’ll have to use asynchronous integration or multistage integration.

Asynchronous and multistage integration require the use of a CI server, and that server should be configured so that it validates the build before it merges changes to the integration branch. Otherwise, you’re likely to end up with compounding build errors.

Pull requests don’t work well with continuous integration, so another approach to code review is needed. Pairing or mobbing works best.

Continuous integration relies on a build and test suite that thoroughly tests your code, preferably with fast, reliable tests. Test-driven development using narrow, sociable tests is the best way to achieve this.

Indicators

When you integrate continuously:

  • Deploying and releasing is painless.

  • Your team experiences few integration conflicts and confusing integration bugs.

  • Team members can easily synchronize their work.

  • Your team can release with the push of a button, whenever your on-site customers are ready.

Alternatives and Experiments

Allies
Collective Code Ownership
Reflective Design
Refactoring
Feature Flags

Continuous integration is essential for teams using collective code ownership and evolutionary design. Without it, significant refactoring becomes impractical, because it causes too many merge conflicts. That prevents the team from continuously improving the design, which is necessary for long-term success.

The most common alternative to continuous integration is feature branches, which merge from the integration branch on a regular basis, but only integrate to the integration branch when each feature is done. Although feature branches allow you to keep the integration branch ready to release, they usually don’t work well with collective code ownership and evolutionary design, because merges to the integration branch are too infrequent. Feature flags are a better way to keep the integration branch ready to release.

Ally
Continuous Deployment

The experiments I’ve seen around continuous integration involve taking it to further extremes. Some teams integrate on every commit—every few minutes—or even every time the tests pass. The most popular experiment is continuous deployment, which has entered the mainstream, and is discussed later in this book.

Further Reading

Martin Fowler’s article, “Patterns for Managing Source Code Branches,” [Fowler2020b] is an excellent resource for people interested in digging into the differences between feature branches, continuous integration, and other branching strategies.

Continuous Delivery, by Jez Humble and David Farley [Humble2010], is a classic, and rightfully so. It’s a thorough discussion of everything you need for continuous integration, with an emphasis on deployment automation.

Share your thoughts about this excerpt on the AoAD2 mailing list or Discord server. Or come to the weekly book club!

For more excerpts from the book, see the Second Edition home page.

Agile Book Club: Situational Awareness (with Jason Yip)

For a team to be more than just a conglomeration of people, they need to be able to work together effectively. Key to this collaboration is situational awareness: knowing what people are working on and when your assistance is needed. Stand-up meetings and informative workspaces are two ways of achieving situational awareness. Jason Yip joins us for an in-depth discussion.

Jason Yip is a Staff Agile Coach at Spotify, and formerly a Principal Consultant at ThoughtWorks. He specializes in applying Agile and Lean thinking, principles, and practices to the development of software, teams, and organizations. He’s the author of the popular article, “It’s Not Just Standing Up: Patterns for Daily Standup Meetings,” which may be found on martinfowler.com.

Reading:
📖 Stand-Up Meetings
📖 Informative Workspace

🎙 Discussion prompts:

  • What does situational awareness within a team mean to you, and what does it look like when it’s at its best?

  • How do stand-up meetings contribute to effective teamwork? How can they detract from it?

  • What can remote teams do to improve their situational awareness, or teamwork in general?

  • If a team is working as a collection of individuals, rather than a collaborative team, what can managers and coaches do to help?

About the Book Club

The Art of Agile Development Book Club takes place Fridays from 8:00 – 8:45am Pacific. Each session uses an excerpt from the new edition of my book, The Art of Agile Development, as a jumping-off point for a wide-ranging discussion about Agile ideas and practices.

Visit the event page for more information, including an archive of past sessions. For more about the book, visit the Art of Agile Development home page.

Scaling Organizations and Design

I joined the Mob Mentality Show last month for a wide-ranging discussion about organizations and design. We had a great conversation about my experiences with FAST (Fluid Scaling Technology), evolutionary design, and testing without mocks. I think it turned out particularly well.

AoAD2 Practice: Stakeholder Demos

This is an excerpt from The Art of Agile Development, Second Edition. Visit the Second Edition home page for additional excerpts and more!

This excerpt is copyright 2007, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

📖 The full text of this section is available below, courtesy of the Art of Agile Development book club! Join us on Fridays from 8-8:45am Pacific for wide-ranging discussions about Agile. Details here.

Stakeholder Demos

Audience
Product Managers, Whole Team

We keep it real.

Agile teams can produce working software every week, starting from their very first week. This may sound impossible, but it’s not; it’s merely difficult. And the key to learning how to do it well is feedback.

Stakeholder demos are a powerful way of providing your team with the feedback it needs. They’re just what they sound like: a demonstration, to key stakeholders, of what your team has completed recently, along with a way for stakeholders to try the software for themselves.

Feedback Loops

Allies
Incremental Requirements
Real Customer Involvement

Stakeholder demos provide feedback in multiple ways. First, the obvious: stakeholders will tell you what they think of your software.

Although this feedback is valuable, it’s not the most valuable feedback you get from a stakeholder demo. The team’s on-site customers work with stakeholders throughout development, so they should already know what stakeholders want and expect.

So the real feedback provided by stakeholder comments is not the feedback itself, but how surprising that feedback is. If you’re surprised, you’ve learned that you need to work harder to understand your stakeholders.

Another type of feedback is the reactions of the people involved. If team members are proud of their work and stakeholders are happy to see it, that’s a good sign. If team members aren’t proud, or are burned out, or stakeholders are unhappy, something is wrong.

The people who attend are another form of feedback. If there are people attending whom you didn’t consider stakeholders, consider reaching out to them to learn more, especially if they’re active participants. Similarly, if there are people whom you expected to be vitally interested in your work, and they’re not present, it’s a good idea to learn why.

The demo itself is a “rubber meets the road” moment for the team. It gives you feedback about your team’s ability to finish its work. It’s harder to fool yourself into thinking work is done when you can’t demo it to stakeholders.

Ally
Stakeholder Trust

Finally, the demo provides feedback to stakeholders, too. It shows them that your team is accountable: that you’re listening to their needs and making steady progress. This is vital for helping stakeholders trust that your team has their best interests at heart.

The Demo Cadence

Start by conducting a stakeholder demo every week, or every iteration, if you use iterations that are longer than a week. Always conduct the demo at the same time and place. This will help you establish a rhythm, make it easier for people to attend, and show strong momentum right from the start.

Ally
Real Customer Involvement

Unless your team’s work is secret, invite anybody in your company who might be interested. The whole team, key stakeholders, and executive sponsor should attend as often as possible. Include real customers when appropriate. Other teams working nearby and people who are curious about Agile are welcome as well.

If you use iterations, conduct the demo immediately after the iteration ends. I like to have mine first thing the following morning. This will help your team stay disciplined, because you won’t be able to stretch work into the next iteration.

The demo should typically be scheduled for 30 minutes. It can be longer, but your most important stakeholders have a lot of demands on their time, so it’s better to plan short meetings so they can attend regularly. Let their interest and availability guide your decision. Remember that you can always follow up with people after the demo, too.

Ally
Feature Flags

In addition to the demo presentation, provide a way for stakeholders to try the demo on their own. This might take the form of a staging server, or, if you’re using feature flags, special permissions on stakeholders’ accounts.

After you’ve conducted several demos and the excitement of the new work dies down, you’re likely to find that a weekly demo is too frequent for some of your key stakeholders. You can start holding the demo every two weeks instead, or even once a month. Don’t wait longer than that, though. It’s too infrequent for good feedback. Regardless of the frequency of your presentations, continue to share demo software every week or iteration.

How to Conduct a Stakeholder Demo

Anybody on the team can lead the stakeholder demo. The best person to do so is whoever works most closely with stakeholders—typically, the team’s product manager. They’ll speak stakeholders’ language and have the best understanding of their point of view. It also emphasizes how stakeholder needs steer the team’s work.

Product managers often ask developers to lead the demo instead. I see this most often when the product manager doesn’t see themselves as part of the team, or doesn’t feel that they know the software well. Push back on this request. Developers aren’t building the software for the product manager; the whole team, including the product manager, is building the software for stakeholders. The product manager is the face of that effort, so they should lead the demo. Help the product manager be more involved and comfortable by reviewing stories with them as they’re built.

The prepared portion of the demo should be less than 10 minutes. You don’t need to show every detail. As you present, allow interruptions for questions and feedback, but keep an eye on the clock so that you end on time. If you need more time because you’re getting a lot of feedback, that’s a sign that you should conduct demos more often. On the other hand, if you’re having trouble attracting attendees, or they don’t seem interested, conducting demos less often may give you more meat to share.

If you have a particularly large audience, you may need to set some ground rules about questions and interruptions to prevent the demo from taking too long.

Because the meeting is so short, it’s good to start on time, even if some attendees are late. This will send the message that you value attendees’ time. Both the presenter and demo should remain available for further discussion and exploration after the meeting.

Begin the presentation by briefly reminding attendees about the valuable increment the team is currently working on and why it’s the most important use of the team’s time. Set the stage and provide context for people who haven’t been paying full attention. Then provide an overview of the stories the team worked on since the last demo.

Calmly describe problems and how you handled them.

If you’ve made any changes to your plan that stakeholders will care about, explain what happened. Don’t sugarcoat or gloss over problems. Full disclosure will raise your credibility. By neither simplifying nor exaggerating problems, you demonstrate your team’s ability to deal with problems professionally. For example:

Demonstrator: In the past two weeks, we’ve been focusing on adding polish to our flight reservation system. It’s already complete, in that we could release it as-is, but we’ve been adding “delighters” to make it more impressive and usable for our customers.

We finished all the stories we had planned, but we had to change the itinerary visualization, as I’ll show you in a moment. It turned out to be too expensive, so we had to find another solution. It’s not exactly what we had planned, but we’re happy with the result.

After your introduction, go through the stories the team worked on. Rather than literally reading each story, paraphrase them to provide context. It’s okay to combine stories or gloss over details that stakeholders might not be interested in. Then demonstrate the result in the software. Stories without a user interface can be glossed over or just described verbally.

Demonstrator: Our first two stories involved automatically filling in the user’s billing information if they’re logged in. First, I’ll log in with our test user...click “reservations”...and there, you can see that the billing information fills in automatically.

Audience member: What if they change their billing information?

Demonstrator: Then we ask them if they want to save the changed information. (Demonstrates.)

Stakeholders will often have feedback. Most of the time, the feedback will be minor. If it’s a substantial change in direction, think about how you can better engage your stakeholders during development next time, so that you’re not surprised. Either way, make a note of the suggestion and promise to follow up.

Audience member: Does it alert customers when their saved billing information is out of date?

Demonstrator: Not at present, but that’s a good idea. (Makes note.) I’ll look into it and get back to you.

If you come to a story that didn’t work out as planned, provide a straightforward explanation. Don’t be defensive; simply explain what happened.

Demonstrator: Our next story involves the itinerary visualization. As I mentioned, we had to change our plans for this. You may remember that our original plan was to show flight segments with an animated 3D fly-through. Programmers had some concerns about performance, so they did a test, and it turned out that rendering the animation would be a big hit on our cloud costs.

Audience member: Why is it so expensive? (Demonstrator motions to a programmer.)

Programmer: Some mobile devices don’t have the ability to render 3-D animation in the browser, or can’t do it smoothly. So we would have had to do it in the cloud. But cloud GPU time is very expensive. We could have built a cloud version and a client-side version, or maybe cached some of the animations, but we’d need to take a close look at usage stats before we could say how much that would help.

Demonstrator: This was always a nice-to-have, and the increased cloud costs weren’t worth it. We didn’t want to spend extra development time on it either, so we dialed it back to a normal 2D map. None of our competitors have a map of flight segments at all. We didn’t have enough time left over to animate the map, but after seeing the result (demonstrates), we decided that this was a nice, clean look. We’re going to move on rather than spending more time on it.

Once the demo is complete, tell stakeholders how they can run the software themselves. This is a good way of wrapping up if the demo is running long: let the audience know how they can try it for themselves, then ask if anybody would like a private follow-up for more feedback or questions.

Be Prepared

Ally: Done Done

Before the demo, make sure all the stories being demoed are “done done” and that you have a version of the software that includes them. Make sure attendees have a way to try the demo for themselves.

You don’t need to create a polished presentation with glitzy graphics for the demo, but you still need to be prepared. You should be able to present the demo in 5–10 minutes, so that means knowing your material and being concise.

Allies: Purpose, Visual Planning

To prepare, review the stories that have been finished since the last demo and organize them into a coherent narrative. Decide which stories can be combined for the purpose of your explanation. Look at your team’s purpose and visual plan and decide how each set of stories connects to your current valuable increment, your upcoming release, and the team’s overall mission and vision. Create an outline of what you want to say.

Finally, conduct a few rehearsals. You don’t need a script—speaking off the cuff sounds more natural—but you do want to be practiced. Walk through the things you’re planning to demonstrate to make sure everything works the way you expect and all your example data is present. Then practice what you’re going to say. Do this a few times until you’re calm and confident.

Each demo will take less and less preparation and practice. Eventually, it will become second nature, and preparing for it will only take a few minutes.

When Things Go Wrong

Sometimes, things just don’t work out. You won’t have anything to show, or what you do have will be disappointing.

It’s very tempting in this situation to fake the demo. You might be tempted to show a UI that doesn’t have any logic behind it, or purposefully avoid showing an action that has a significant defect.

It’s hard, but you need to be honest about what happened. Be clear about your software’s limitations and what you intend to do about them. Faking progress leads stakeholders to believe you have greater capacity than you actually do. They’ll expect you to continue at the inflated rate, and you’ll steadily fall behind.

Instead, take responsibility as a team (rather than blaming individuals or other groups), try not to be defensive, and let stakeholders know what you’re doing to prevent the same thing from happening again. Here’s an example:

This week, I’m afraid we have nothing to show. We planned to show you live flight tracking, but we underestimated the difficulty of interfacing with the backend airline systems. We expected the data to be cleaner than it is, and we didn’t realize we’d need to build out our own test environment.

We identified these problems early on, and we thought we could work around them. We did, but not in time to finish anything we can show you. We should have replanned around smaller slices of functionality so we could still have something to show. Now we know, and we’ll be more proactive about replanning next time.

We expect similar problems with the airline systems in the future. We’ve had to add more stories to account for the changes. That’s used up most of our buffer. We’re still on target for the go-live marketing date, but we’ll have to cut features if we encounter any other major problems between now and then.

I’m sorry for the bad news and I’m available to answer your questions. I can take some now, and we’ll have more information after we finish revising our plans later this week.

Questions

What do we do if stakeholders keep interrupting and asking questions during the demo?

Questions and interruptions are wonderful. It means stakeholders are engaged and interested.

If you’re getting so many interruptions and questions that you have trouble sticking with the 30-minute time limit, you might need to hold demos more often. Otherwise—especially if it’s just one particularly engaged individual—you can ask them to hold further questions until after the meeting. It’s also okay to plan for meetings longer than 30 minutes, especially in the first month or two.

What do we do if stakeholders keep nitpicking our choices?

Nitpicking is also normal, and a sign of interest, when you start giving demos. Don’t take it too personally. Write the ideas down on cards, as with any story, and prioritize them after the meeting. Resist the temptation to address, prioritize, or begin designing solutions in the meeting. Doing so not only extends the meeting, it also bypasses the discipline of your normal planning practices.

If nitpicking continues after the first month or two, it may be a sign that on-site customers are missing something. Take a closer look at the complaints to see if there’s a deeper problem.

Stakeholders are excited by what they see and want to add a bunch of features. They’re good ideas, but we need to move on to something else. What should we do?

Don’t say “no” during the demo. Don’t say “yes,” either. Simply thank the stakeholders for their suggestions and write them down as stories. After the demo is over, on-site customers should take a close look at the suggestions and their value relative to the team’s purpose. If they don’t fit into the team’s schedule, a team member with product management skills can communicate that back to stakeholders.

What if people don’t come to the demo, or aren’t engaged?

You may be holding the demo too frequently, or taking too long. Try conducting the demo less frequently and practicing speaking concisely. It’s also possible that people don’t see your software as relevant to them. Ask key stakeholders what you can do to make the demo more useful and relevant.

What if we have multiple teams working on the same software?

It might make sense to combine the teams’ work into a single demo. In that case, choose one person to show everyone’s work. This requires cross-team coordination, which is out of the scope of this book, but you should have cross-team coordination in place as part of your scaling approach. See the “Scaling Agility” chapter for ideas.

If it doesn’t make sense to combine the teams’ work into a single demo, then you can continue to have separate demos. Some organizations prefer to schedule them all together in one large meeting, but that doesn’t scale well. Instead, create multiple combined demos—for example, one for customer-facing teams, one for internal administration, one for development support, etc.—and schedule them so that people can pick and choose which they attend.

Prerequisites

Never fake a stakeholder demo by hiding bugs or showing a story that isn’t complete. You’ll just set yourself up for trouble down the line.

Ally: Task Planning

If you can’t demonstrate progress without faking it, it’s a clear sign that your team is in trouble. Slow down and try to figure out what’s going wrong. If you aren’t using iterations, try using them. If you are, see the “Making and Meeting Iteration Commitments” section and ask a mentor for help. The problem may be as simple as trying to do too much in parallel.

Indicators

When your team conducts stakeholder demos well:

  • You generate trust with stakeholders.

  • You learn what stakeholders are most passionate about.

  • The team is confident in its ability to deliver.

  • You’re forthright about problems, which allows your team to prevent them from ballooning out of control.

Alternatives and Experiments

Stakeholder demos are a clear indication of your ability to deliver. Either you have completed stories to demonstrate, or you don’t. Either your stakeholders are satisfied with your work, or they’re not. I’m not aware of any alternatives that provide such valuable feedback.

And it’s feedback that’s the important part of the stakeholder demo. Feedback about your team’s ability to deliver, feedback about your stakeholders’ satisfaction, and also the feedback you get from observing stakeholders’ responses and hearing their questions and comments.

As you experiment with stakeholder demos, be sure to keep that feedback in mind. The demo isn’t just a way of sharing what you’re doing. It’s also a way of learning from your stakeholders. Some teams streamline their demos by creating a brief video recording. It’s a clever idea, and worth trying. But it doesn’t give you as much feedback. Be sure any experiments you try include a way to confirm your ability to complete work and learn from your stakeholders.

Some teams reverse the demo: rather than showing stakeholders what they’ve done, they observe stakeholders as they try the software themselves. This works best in an in-person setting. It also works well when you have multiple in-person teams: you can put them all in the same large room and conduct a “bazaar” or “trade show” style demo, where stakeholders can move from team to team and see their work.[1]

[1] I learned about this approach from Bas Vodde. The “bazaar” approach is loosely based on the “science fair” technique discussed in [Schatz2004].

Share your thoughts about this excerpt on the AoAD2 mailing list or Discord server. Or come to the weekly book club!

For more excerpts from the book, see the Second Edition home page.

AoAD2 Practice: Stakeholder Trust

This is an excerpt from The Art of Agile Development, Second Edition. Visit the Second Edition home page for additional excerpts and more!

This excerpt is copyright 2007, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

📖 The full text of this section is available below, courtesy of the Art of Agile Development book club! Join us on Fridays from 8-8:45am Pacific for wide-ranging discussions about Agile. Details here.

Stakeholder Trust

Audience: Product Managers, Whole Team

We work with our stakeholders effectively and without fear.

I know somebody who worked in a company with two development teams. One was Agile, met its commitments, and delivered regularly. The team next door struggled: it fell behind schedule and didn’t have any working software to show. Yet when the company downsized, they let the Agile team go rather than the other team!

Why? When management looked in on the struggling team, they saw formal diagrams papering the walls and programmers working long hours. When they looked in on the Agile team, they saw people talking, laughing, and going home at five with nothing but rough sketches and charts on the whiteboards.

Like it or not, your team doesn’t exist in a vacuum. Agile can seem strange and different at first. “Are they really working?” outsiders wonder. “It’s noisy and confusing. I don’t want to work that way. If it succeeds, will they force me to do it, too?”

Ironically, the more successful Agile is, the more these worries grow. Alistair Cockburn calls them organizational antibodies. (He credits Ron Holiday with the term.) If left unchecked, organizational antibodies will overcome and dismantle an otherwise successful Agile team.

No matter how effective you are at delivering software, you’re in trouble without the goodwill of your stakeholders and sponsor. Yes, delivering software and meeting technical expectations helps, but the interpersonal skills your team exhibits may be just as important to building trust in your team.

Does this sound unfair or illogical? Surely your ability to deliver high-quality software is all that really matters!

It is unfair. It is illogical. It’s also the way people think. If your stakeholders don’t trust you, they won’t collaborate with your team, which hurts your ability to deliver valuable software. They might even campaign against you.

Don’t wait for your stakeholders to realize how your work can help them. Show them.

Show Some Hustle

Many years ago, I hired a small local moving company to move my belongings from one apartment to another. When the movers arrived, I was impressed to see them hustle—they moved as quickly as possible from the van to the apartment and back. This was particularly unexpected because I was paying them by the hour. There was no advantage for them to move so quickly.

Those movers impressed me. I felt that they were dedicated to meeting my needs and respecting my pocketbook. If I still lived in that city and needed to move again, I would hire them in an instant. They earned my goodwill—and my trust.

Allies: Energized Work, Informative Workspace, Stakeholder Demos, Roadmaps

In the case of a software team, hustle is energetic, productive work. It’s the sense that the team is putting in a fair day’s work for a fair day’s pay. Energized work, an informative workspace, stakeholder demos, and appropriate roadmaps all help convey this feeling of productivity. Perhaps most important, though, is attitude: during work hours, treat work as a welcome priority that deserves your full attention, not a burden to be avoided.

Show Some Empathy

Development teams often have contentious relationships with key business stakeholders. From the perspective of developers, it takes the form of unfair demands and bureaucracy, particularly in the form of imposed deadlines and schedule pressure.

So you might be surprised to learn that, for many of those stakeholders, developers are the ones holding all the cards. Stakeholders are in a scary situation, especially in companies that aren’t in the business of selling software. Take a moment to think about what it might be like:

  • Sponsors, product managers, and key stakeholders’ careers are often on the line. Developers’ careers often aren’t.

  • Developers often earn more than stakeholders, apparently without the hard work and toeing of lines that stakeholders have to put in.

  • Developers often come to work much later than stakeholders. They may leave later, too, but stakeholders don’t see that.

  • To outsiders, developers often don’t seem particularly invested in success. They seem to be more interested in things like learning new technologies, preparing for their next job hop, work/life balance, and office perks like ping-pong tables and free snacks.

  • Experienced stakeholders have a long history of developers failing to deliver what they needed at the time that they needed it.

  • Stakeholders are used to developers responding to questions about progress, estimates, and commitments with everything from condescending arrogance to well-meaning but unhelpful technobabble.

  • Many stakeholders can see that big tech companies deliver software well, but their own company rarely does, and they don’t know why.

I’m not saying developers are bad, or that these perceptions are necessarily true. I’m asking you to think about what success and failure mean to your stakeholders, and to consider whether, from the outside, your team appears to treat stakeholders’ success with the respect it deserves.

Deliver on Commitments

If your stakeholders have worked with software teams before, they probably have plenty of war wounds from slipped schedules, unfixed defects, and wasted money. But at the same time, they probably don’t have software development skills themselves. That puts them in the uncomfortable position of relying on your work, having had poor results before, and being unable to tell if your work is any better.

Meanwhile, your team consumes tens of thousands of dollars every month in salary and support. How do stakeholders know whether you’re spending their money wisely? How do they know that the team is even competent?

Stakeholders may not know how to evaluate your process, but they can evaluate results. Two kinds of results speak particularly clearly to them: working software and delivering on commitments. For some people, that’s what accountability means: you did what you said you would.

Allies: Task Planning, Stakeholder Demos, Forecasting

Furthermore, your commitments make it possible for stakeholders to make commitments to their stakeholders. If you have a track record of reliability, you reduce their anxiety. On the other hand, if you fail to meet a commitment and don’t give them advance warning, it’s easy for them to assume you deliberately left them out of the loop.

Fortunately, Agile teams can make reliable commitments. You can use iteration-based task plans to make a commitment every week, and you can demonstrate that you’ve met that commitment, exactly one week later, with a stakeholder demo. You can also use release trains to create a similar cadence for releases, and steer your plans so you always release precisely on time, as described in the “Predefined Release Dates” section.

This week-in, week-out delivery builds stakeholder trust like nothing I’ve ever seen. It’s extremely powerful. All you have to do is create a plan that you can achieve...and then achieve it. Again and again and again.

Manage Problems

Did I say, “All you have to do”? Silly me. It’s not that easy.

First, you need to plan and execute well (see the “Planning” chapter and the “Ownership” chapter). Second, as the poet said, “The best laid schemes o’ mice an’ men / Gang aft a-gley.”[1]

[1] “To a Mouse,” by renowned Scottish poet Robert Burns. The poem starts, “Wee, sleekit, cow’rin, tim’rous beastie, / O, what a panic’s in thy breastie!” Reminds me of how I felt when asked to integrate a year-old feature branch.

In other words, some releases don’t sail smoothly into port on the last day. What do you do when your best laid plans gang a-gley?

Actually, that’s your chance to shine. Anyone can look good when life goes according to plan. Your true character shows when you deal with unexpected problems.

The first thing to do is to limit your exposure to problems. Work on your hardest, most uncertain stories early in the release. You’ll find problems sooner, and you’ll have more time to fix them.

Allies: Stand-Up Meetings, Task Planning, Slack

When you encounter a problem, start by letting the whole team know about it. Bring it up in the next stand-up meeting at the very latest. This gives the entire team a chance to help solve the problem.

Iterations are also a good way to notice when things aren’t going to plan. Check your progress at every stand-up. If the setback is relatively small, you might be able to absorb it by using some of your iteration slack. Otherwise, you’ll need to revise your plans, as described in the “Making and Meeting Iteration Commitments” section.

When you identify a problem you can’t absorb, let key stakeholders know about it. They’ll appreciate your professionalism even if they don’t like the problem. I usually wait until the stakeholder demo to explain problems that we solved on our own, but bring bigger problems to stakeholders’ attention right away. Team members with political savvy should decide who to talk to and when.

The sooner you disclose a problem, the more time you have to solve it. It reduces panic, too: early on, people are less stressed about deadlines and have more mental energy for problems. Similarly, the sooner your stakeholders know about a problem (and believe me, they’ll find out eventually), the more time they have to work around it. It’s not the existence of problems that makes stakeholders most upset—it’s being blindsided by them.

When you bring a problem to stakeholders’ attention, bring mitigations too, if you can. It’s good to explain the problem, and it’s better to explain what you’re planning to do about it. It can take a lot of courage to have this discussion—but addressing a problem successfully can do wonders for building trust.

Don’t wait to bring up problems just because you don’t have a solution yet. Instead, explain the problem, what you’re doing to come up with mitigations, and when they can expect to hear more.

Beware of the temptation to work overtime or cut slack to make up for lost time. Although this can work for a week or two, it can’t solve systemic problems, and it will create problems of its own if allowed to continue.

Respect Customer Goals

Ally: Team Dynamics

When Agile teams first form, it usually takes individual team members a while to think of themselves as part of a single team. In the beginning, developers and customers often see themselves as separate groups.

New on-site customers tend to be particularly skittish. Being part of a development team feels awkward; they’d rather work in their normal offices with their normal colleagues. Not only that, if on-site customers are unhappy, those colleagues—who often have a direct line to the team’s key stakeholders—will be the first to hear about it.

When forming a new Agile team, make an effort to welcome on-site customers. One particularly effective way to do so is to treat customer goals with respect. This may even mean suppressing, for a time, cynical developer jokes about schedules and suits.

(Being respectful goes both ways, of course, and customers should also suppress their natural tendencies to complain about schedules and argue with estimates. I’m emphasizing customers’ needs here because they play such a big part in stakeholder perceptions.)

Another way for developers to take customer goals seriously is to come up with creative alternatives for meeting those goals. If customers want something that may take a long time or that involves tremendous technical risks, suggest alternate approaches to reach the same underlying goal for less cost. Similarly, if there’s a more impressive way of meeting a goal that customers haven’t considered, bring it up, especially if it’s not too hard.

As the team has these conversations, barriers will be broken and trust will develop. As stakeholders see that, their trust in the team will blossom as well.

You can also build trust directly with stakeholders. Consider this: the next time a stakeholder stops you in the hallway with a request, what would happen if you immediately and cheerfully listened to their request, wrote it down as a story on an index card, and then brought them both to the attention of a product manager for scheduling or further discussion?

This might be a 10-minute interruption for you, but imagine how the stakeholder would feel. You responded to their concern, helped them express it, and took immediate steps to get it into the plan. That’s worth infinitely more to them than firing an email into the black hole of your request tracking system.

Make Stakeholders Look Good

Even if your immediate stakeholders love you, they need to convince their bosses to love you, too. What do your stakeholders need? Think about the situation they’re in, how they’re being evaluated, and what you can do to support them in return.

One possibility is to create a “value book” that business stakeholders can share with their bosses. This is a document, updated regularly, where you write down the value you’ve brought to stakeholders. This helps remind stakeholders what you’ve done for them, and it helps them justify your work to the rest of the organization. For example, “Release X processed 20,000 events in the first two months, reducing error rates by 8%.”

Although this may seem like a marketing exercise—and it is—it’s also a valuable way for the team to stay focused on value. Updating the value book forces team members to reflect on what they’ve done for stakeholders and customers. This helps prevent the team from thinking of itself as a mere factory for delivering stories.

Be Honest

In your enthusiasm to demonstrate progress, be careful not to step over the line. Borderline behavior includes glossing over known defects in a stakeholder demo, taking credit for stories that aren’t 100% complete, and extending an iteration deadline for a few days to finish everything in the iteration plan.

Covering up the truth like this gives stakeholders the impression that you’ve done more than you actually have. They’ll expect you to complete your remaining stories just as quickly, when in fact you haven’t even finished the first set. You’ll build up a backlog of work that looks done, but isn’t. At some point, you’ll have to finish that backlog, and the resulting delay will produce confusion, disappointment, and even anger.

Even scrupulously honest teams can run into this problem. In a desire to look good, teams sometimes sign up for more stories than they can implement well. They get the work done, but they take shortcuts and don’t do enough design and refactoring. The design suffers, defects creep in, and the team finds itself suddenly slowed while they struggle to improve internal quality.

Similarly, don’t yield to the temptation to count partially completed stories toward your capacity. If a story isn’t completely finished, it counts as zero. Don’t take partial credit. There’s an old programming joke: the first 90% of the work takes 90% of the time...and the last 10% of the work takes 90% of the time. Until the story is totally done, it’s impossible to say for certain what percentage has been done.
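The arithmetic behind “it counts as zero” is simple enough to sketch in a few lines. This is only an illustration, not something from the book: the Story class, the point estimates, and the story names are all hypothetical. The one rule it encodes is that a story contributes its full estimate when it is completely finished and nothing otherwise.

```python
from dataclasses import dataclass


@dataclass
class Story:
    name: str
    estimate: int    # hypothetical size, in whatever unit the team uses
    done_done: bool  # True only when the story is completely finished


def capacity(stories: list[Story]) -> int:
    """Sum the estimates of finished stories; unfinished stories count as zero."""
    return sum(story.estimate for story in stories if story.done_done)


# Hypothetical iteration: two finished stories and one that is "almost done".
iteration = [
    Story("autofill billing info", 3, True),
    Story("save changed billing info", 2, True),
    Story("itinerary map", 5, False),  # no partial credit, however close it feels
]

print(capacity(iteration))  # prints 5
```

Note that there is no “percent complete” field at all. Leaving it out is the point: until the story is done, any percentage would be a guess.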

Questions

Why is it our responsibility to create trust? Shouldn’t stakeholders do their part?

You’re only in charge of yourselves. Ideally, stakeholders are working hard to make the relationship work, too, but that’s not under your control.

Isn’t it more important that we be good rather than look good?

Both are important. Do great work and make sure people know it.

You said developers should keep jokes about the schedule to themselves. Isn’t this just the same as telling developers to shut up and meet the schedule, no matter how ridiculous?

Certainly not. Everybody on the team should speak up and tell the truth when they see a problem. However, there’s a big difference between discussing a real problem and simply being cynical.

Remember that customers’ careers are often on the line. They may not be able to tell the difference between a real joke and a complaint disguised as a joke. An inappropriate joke can set their adrenaline pumping just as easily as a real problem.

Prerequisites

Commitments are a powerful tool for building trust, but only if you meet them. Don’t make commitments to stakeholders before you’ve proven your ability to make and meet commitments privately, within the team. See the “Making and Meeting Iteration Commitments” section for details.

Indicators

When your team establishes trust with your organization and stakeholders:

  • Stakeholders believe in your team’s ability to meet their needs.

  • You acknowledge mistakes, challenges, and problems rather than hiding them until they blow up.

  • Everyone involved seeks solutions rather than blame.

Alternatives and Experiments

Stakeholder trust is vital. There are no alternatives.

There are, however, many ways of building trust. This is a topic with a long history, and the only truly new idea Agile brings to the table is the ability, using iterations, to make and meet commitments on a weekly basis. Other than that, feel free to take inspiration from the many existing resources on relationship building and trust.

Further Reading

Trust and Betrayal in the Workplace [Reina2015] is a thorough look at how to establish trust and what to do when it is broken.

The Power of a Positive No: How to Say No and Still Get to Yes [Ury2007] describes how to say no while preserving important relationships. Diana Larsen describes this ability as “probably more important than any amount of negotiating skill in building trust.”

AoAD2 Chapter: Accountability (introduction)

This excerpt is copyright 2007, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Accountability

If Agile teams own their work and their plans, how do their organizations know they’re doing the right thing? How do they know that the team is doing its best possible work, given the resources, information, and people it has?

Organizations may be willing, even eager, for teams to follow an Agile approach, but this doesn’t mean Agile teams have carte blanche authority to do whatever they want. They’re still accountable to the organization. They need to demonstrate that they’re spending the organization’s time and money appropriately.

This chapter has the practices you need to be accountable:

  • The “Stakeholder Trust” practice allows your team to work effectively with stakeholders.

  • The “Stakeholder Demos” practice provides feedback about your team’s progress.

  • The “Forecasting” practice predicts when software will be released.

  • The “Roadmaps” practice shares your team’s progress and plans.

  • The “Management” practice helps teams excel.

Let’s Code JavaScript is Now Free

Ten years ago, I launched a Kickstarter for a screencast about test-driven development in JavaScript. It was enough of a success that I created a subscription screencast devoted to the topic. Over the next six years, I created over 150 hours of content across more than 600 videos.

For the past ten years, that content has only been available to subscribers. Not anymore. Starting today, I’m making it all free.

Check it out at letscodejavascript.com.

Agile Book Club: Evolutionary Design (with Kent Beck)

Software typically becomes more expensive to change over time. That’s a problem for Agile, because if change becomes significantly more expensive over time, the Agile model doesn’t make sense. So instead, Agile teams can use evolutionary design: a way of starting simple and improving your design as you go. With evolutionary design, changes become cheaper over time, not more expensive. It’s essential to long-term Agile success.

For this session, I’m thrilled to be joined by Kent Beck. Kent is the creator of Extreme Programming, the groundbreaking Agile method that introduced evolutionary design, test-driven development, continuous integration, and many other Agile practices to the world. It’s the basis of most of the material in The Art of Agile Development. Kent finds joy in programming and teaching programming, and describes his mission as helping geeks feel safe in the world. You can find his writings about design at tidyfirst.substack.com.

Reading:
📖 Design (introduction)
📖 Incremental Design
📖 Simple Design
📖 Reflective Design

🎙 Discussion prompts:

  • Evolutionary design involves keeping the code as simple as possible, growing the code only in response to new features. But what does “as simple as possible” mean in practice? Would you add code that you know you’ll need in an hour, or a day, or a month?

  • “Code smells” are one way to identify how to evolve a design. What code smells are particularly pungent? How can people learn to develop their sense of smell?

  • How do you decide what to change when evolving a design? Is it in response to new features, or code smells, or both? What does it look like in practice?

  • How does architecture fit into evolutionary design?

About the Book Club

The Art of Agile Development Book Club takes place Fridays from 8:00 – 8:45am Pacific. Each session uses an excerpt from the new edition of my book, The Art of Agile Development, as a jumping-off point for a wide-ranging discussion about Agile ideas and practices.

Visit the event page for more information, including an archive of past sessions. For more about the book, visit the Art of Agile Development home page.

AoAD2 Practice: Build for Operation

This is an excerpt from The Art of Agile Development, Second Edition. Visit the Second Edition home page for additional excerpts and more!

This excerpt is copyright 2007, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

📖 The full text of this section is available below, courtesy of the Art of Agile Development book club! Join us on Fridays from 8-8:45am Pacific for wide-ranging discussions about Agile. Details here.

Build for Operation

Audience
Programmers, Operations

Our software is secure and easy to manage in production.

The fundamental idea behind DevOps is simple: by including people with operations and security skills as part of the team, we make it possible to build operability and security into the software, rather than adding it as an afterthought. This is building for operation.

That’s really all there is to it! Include people with ops and security skills on your team, or at least involve them in your team’s decisions. Have them participate in planning sessions. Create stories for making your software easier to monitor, manage, and secure. Discuss why those stories are important, and prioritize them accordingly.

Don’t save operations and security stories for the end of development. It’s better to keep your software ready to release. (See the “Key Idea: Minimize Work in Progress” sidebar.) As you add more capabilities to your software, expand your operability to match. For example, when you add a feature that requires a new database, add stories for provisioning, securing, monitoring, backing up, and restoring that database as well.

What sort of operations and security needs should you consider? Your teammates should be able to tell you. The following sections will help you get started.

Threat Modeling

Building for operation involves shifting left: thinking about security and operations needs from the beginning of development, not at the end. One way to understand those needs is threat modeling. It’s a security technique, but its analysis is helpful for operations, too.

Allies
Visual Planning
Blind Spot Discovery

Threat modeling is a process of understanding your software system and how it can be compromised. In his book Threat Modeling: Designing for Security [Shostack2014], Adam Shostack described it as a process of answering four questions. It’s a good team activity:

  1. What are you building? Diagram your system architecture: the components of your deployed software, the data flow between components, and the trust or authorization boundaries between them.

  2. What can go wrong? Use simultaneous brainstorming (see the “Work Simultaneously” section) to think of possible ways each component and data flow could be attacked, then dot vote to narrow down your team’s top threats.

  3. What should you do about those things that can go wrong? Brainstorm ways to check or address the top threats, dot vote, and create story cards to do so. Add them to your visual plan.

  4. Did you do a decent job of analysis? Sleep on it, then take a second look at your work to see if you missed anything. Repeat the exercise regularly to incorporate new information and insights. Use blind spot discovery to find gaps in your thinking.

For more details, see [Shostack2014]. It’s written for people without prior security experience, and chapter 1 has everything you need to get started, including a card game for brainstorming threats. (The card game is also available for free online.) For a shorter introduction that includes a well-designed team activity, see Jim Gumbley’s “A Guide to Threat Modelling for Developers.” [Gumbley2020]

Configuration

According to The Twelve-Factor App [Wiggins2017], deployed software is a combination of code and configuration. Configuration, in this case, means the part of your software that’s different for each environment: database connection strings, URLs and secrets for third-party services, and so forth.

When you deploy, you’ll deploy the same code to every environment, whether that’s your local machine, a test environment, or production. But your configuration will change: for example, your test environment will be configured to use a test database, and your production environment will be configured to use your production database.

This definition of “configuration” includes only things that change between environments. Teams often make other things configurable, such as the copyright date in the footer of a website, but those types of configuration should be clearly separated from environment configuration. They’re part of your software’s behavior and should be treated like code, including version controlling it alongside your code. I’ll often use real code for this purpose, such as constants or getters in a Configuration module. (I’ll typically program that module to abstract environment configuration, too.)

Environment configuration, on the other hand, should be isolated from your code. It’s often stored in a separate repository. If you include it in your code repository, which makes sense when your team is responsible for deployments, keep it clearly segregated: for example, source code in a source directory and environment configuration in an environments directory. Then, during deployment, inject the configuration into the deployment environment by setting environment variables, copying files, and so forth. The specifics will depend on your deployment mechanism.
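
As a rough sketch of the Configuration module idea, the following keeps behavioral configuration as a constant in code and reads environment configuration from injected environment variables. Everything named here (createConfiguration, DATABASE_URL, BILLING_SERVICE_URL, the copyright year) is a hypothetical illustration, not something prescribed by the book or by The Twelve-Factor App.

```javascript
// Behavioral configuration: identical in every environment, so it is
// version controlled with the code. (Hypothetical example value.)
const COPYRIGHT_YEAR = 2021;

// Environment configuration: injected at deploy time. The `env` parameter
// defaults to the real environment but can be replaced in tests.
function createConfiguration(env = process.env) {
  function required(name) {
    const value = env[name];
    if (value === undefined || value === "") {
      throw new Error(`Missing required environment variable: ${name}`);
    }
    return value;
  }

  return {
    copyrightYear: COPYRIGHT_YEAR,
    // Getters defer the lookup, so a missing variable fails at first use
    // with a message that names the variable.
    get databaseUrl() { return required("DATABASE_URL"); },
    get billingServiceUrl() { return required("BILLING_SERVICE_URL"); },
  };
}
```

The rest of the code then asks the module for `config.databaseUrl` and never needs to know whether the value came from an environment variable, a file, or a test.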

Ally
Feature Flags

Avoid making your software infinitely configurable. Complicated configuration ends up being a form of code—code that’s written in a particularly lousy programming language, without abstractions or tests. Instead, if you need sophisticated configurability, use feature flags to selectively turn behavior on and off. If you need complex customer-controlled behavior, consider using a plug-in architecture. Both approaches will allow you to code details using a real programming language.

Secrets

Team members shouldn’t have access to secrets.

Secrets—passwords, API keys, and so forth—are a special type of configuration. It’s particularly important they not be part of your source code. In fact, most team members shouldn’t have access to secrets at all. Instead, define a secure procedure for generating, storing, rotating, and auditing secrets. For complex systems, this often involves a secret management service or tool.

If you keep environment configuration in a separate repository, you can control access to secrets by strictly limiting access to the repository. If you keep environment configuration in your code repository, you’ll need to encrypt your secrets “at rest,” which means encrypting any files that contain secrets. Program your deployment script to decrypt the secrets prior to injecting them into the deployment environment.

Speaking of deployment, pay particular attention to how secrets are managed by your build and deployment automation. It’s convenient to hardcode secrets in a deploy script or CI server configuration, but that convenience isn’t worth the risk. Your automation needs the same secure procedures as the rest of your code.

Never write secrets to your logs. Because it’s easy to accidentally do so, consider writing your logging wrapper to look for secret-like data (for example, fields named “password” or “secret”) and redact them. When it finds one, have it trigger an alert for the team to fix the mistake.
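
One possible shape for such a wrapper is sketched below. It is a minimal illustration under stated assumptions: the function names, the "L99" code, and the redaction marker are all made up for this example, and it only matches the field names mentioned above. It also doesn't handle circular object references.

```javascript
// Field names that suggest a secret. Extend to suit your system.
const SECRET_FIELD = /password|secret/i;

// Returns a copy of `value` with secret-like fields replaced, and records
// the name of each redacted field in `found`.
function redactSecrets(value, found) {
  if (Array.isArray(value)) {
    return value.map((item) => redactSecrets(item, found));
  }
  if (value === null || typeof value !== "object") return value;

  const result = {};
  for (const [key, child] of Object.entries(value)) {
    if (SECRET_FIELD.test(key)) {
      result[key] = "[REDACTED]";
      found.push(key);
    } else {
      result[key] = redactSecrets(child, found);
    }
  }
  return result;
}

// `write` is injectable so the wrapper can be tested without real output.
function createRedactingLog(write = (line) => console.log(line)) {
  function output(alert, data) {
    const found = [];
    const safe = redactSecrets(data, found);
    write(JSON.stringify({ alert, ...safe }));
    if (found.length > 0) {
      // Tell the team about the mistake so the calling code gets fixed.
      write(JSON.stringify({
        alert: "action",
        code: "L99", // hypothetical look-up code
        message: "Secret-like field was logged and redacted.",
        fields: found,
      }));
    }
  }
  return {
    action: (data) => output("action", data),
    info: (data) => output("info", data),
  };
}
```

Redacting by field name is only a safety net. A secret stored under an innocuous name, or embedded in a longer string, will still get through.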

Paranoiac Telemetry

No matter how carefully you program your code, it will still fail in production. Even perfect code depends on an imperfect world. External services return unexpected data or—worse—respond very slowly. File systems run out of space. Failovers...fail.

Every time your code interacts with the outside world, program it to assume the world is out to get you. Check every error code. Validate every response. Time out nonresponsive systems. Program retry logic to use exponential back-offs.

When you can safely work around a broken system, do so. When you can’t, fail in a controlled and safe way. Either way, log the issue so your monitoring system can send an alert.
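
To make the timeout and back-off advice concrete, here is a small sketch. The function names, default values, and option names are assumptions chosen for illustration; real systems often add jitter and a cap on the delay, which this omits.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Rejects if `promise` doesn't settle within `ms` milliseconds. Note that
// this stops waiting; it does not cancel the underlying operation.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Calls `operation` until it succeeds or `attempts` is exhausted, doubling
// the wait between tries: baseDelayMs, 2x, 4x, and so on.
async function retryWithBackoff(operation, options = {}) {
  const {
    attempts = 4,
    baseDelayMs = 100,
    timeoutMs = 2000,
    sleep = defaultSleep, // injectable so tests don't have to wait
  } = options;

  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await withTimeout(operation(), timeoutMs);
    } catch (err) {
      lastError = err;
      const isLastAttempt = attempt === attempts - 1;
      if (!isLastAttempt) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  // Fail in a controlled way, keeping the original error for diagnosis.
  throw new Error(
    `Operation failed after ${attempts} attempts: ${lastError.message}`,
    { cause: lastError }
  );
}
```

A caller would wrap each external request, for example `await retryWithBackoff(() => fetchAccount(id))` with a hypothetical `fetchAccount`, and log the final error if it still fails.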

Logging

Nobody wants to be woken up in the middle of the night by a production issue. It happens anyway. When it does, the person on call needs to be able to diagnose and fix the problem with a minimum of fuss.

To make this easier, take a systematic and thoughtful approach to logging. Don’t just spam log statements everywhere in your code. Instead, think about what can go wrong and what people will need to know. Create user stories to address those questions. For example, if you discover a security breach, how will you determine which users and data were affected? If performance worsens, how will you determine what to fix? Prospective analysis (see the “Prospective Analysis” section) and your threat model can help you identify and prioritize these stories.

Use structured logs, routed to a centralized data store, to make your logs easier to search and filter. A structured log outputs data in a machine-readable format, such as JSON. Write your logging wrapper to support logging arbitrary objects. This will allow you to easily include variables that provide important context.

For example, I worked on a system that depended on a service that deprecated its APIs with special response headers. We coded our software to check for the presence of those headers and run this Node.js code:

log.action({
  code: "L16",
  message: "ExampleService API has been deprecated.",
  endpoint,
  headers: response.headers,
});

The output looked like this:

{
  "timestamp": 1620689486003,
  "date": "Mon May 10 2021 23:31:26 GMT",
  "alert": "action",
  "component": "ExampleApp",
  "node": "web.2",
  "correlationId": "b7398565-3b2b-4d80-b8e3-4f41fdb21a98",
  "code": "L16",
  "message": "ExampleService API has been deprecated.",
  "endpoint": "/v2/accounts/:account_id",
  "headers": {
    ⋮
    "x-api-version": "2.22",
    "x-deprecated": true,
    "x-sunset-date": "Wed November 10 2021 00:00:00 GMT"
  }
}

The extra context provided by the log message made the issue easy to diagnose. The endpoint string made it clear exactly which API had been deprecated, and the headers object allowed programmers to understand the details and eliminate the possibility of false positives.

Provide context when throwing exceptions, too. For example, if you have a switch statement that should never run the default case, you might assert that it’s unreachable. But don’t just throw “unreachable code executed.” Throw a detailed exception that will allow your team to troubleshoot the issue. For example: “Unknown user subscription type ‘legacy-036’ encountered while attempting to send onboarding email.”
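
In code, that might look something like the following sketch. The function, the subscription types, and the template names are hypothetical; only the shape of the error message comes from the example above.

```javascript
// Chooses an onboarding email template based on the user's subscription.
function onboardingTemplateFor(user) {
  switch (user.subscriptionType) {
    case "free":
      return "onboarding-free";
    case "premium":
      return "onboarding-premium";
    default:
      // Should be unreachable. If it isn't, say exactly what was found
      // and what the code was trying to do when it found it.
      throw new Error(
        `Unknown user subscription type '${user.subscriptionType}' ` +
        "encountered while attempting to send onboarding email."
      );
  }
}
```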

In the logging example, you’ll see several standard fields. Most of them were added by the logging infrastructure. Consider adding them to your logs as well:

  • Timestamp: Machine-readable version of when the log occurred.

  • Date: A human-readable version of timestamp.

  • Alert: What kind of alert to send. Often called “level” instead. I’ll explain further in a moment.

  • Component: The codebase that generated the error.

  • Node: The specific machine that generated the error.

  • Correlation ID: A unique ID that groups together related logs, often across multiple services. For example, all logs related to a single HTTP request might have the same correlation ID. It’s also called a “request ID.”

  • Code: An arbitrary, unique code for this log message. It never changes. It can be used to search the logs and to look up documentation.

  • Message: A human-readable version of code. Unlike code, it can change.
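
A logging wrapper that adds these standard fields could be sketched roughly as follows. This is not the infrastructure used in the example above; the createLog and forRequest names and the injectable clock and write parameters are assumptions for illustration. The date format also differs slightly, because it uses JavaScript’s built-in toUTCString().

```javascript
// Creates a structured logger for one component running on one node.
// `clock` and `write` are injectable so the logger can be tested.
function createLog({
  component,
  node,
  clock = () => new Date(),
  write = (line) => console.log(line),
}) {
  function output(alert, correlationId, data) {
    const now = clock();
    write(JSON.stringify({
      timestamp: now.getTime(),  // machine-readable
      date: now.toUTCString(),   // human-readable
      alert,
      component,
      node,
      correlationId,
      ...data,                   // code, message, and arbitrary context
    }));
  }

  return {
    // Binds a correlation ID once, so every log for a request shares it.
    forRequest(correlationId) {
      return {
        emergency: (data) => output("emergency", correlationId, data),
        action: (data) => output("action", correlationId, data),
        monitor: (data) => output("monitor", correlationId, data),
        info: (data) => output("info", correlationId, data),
      };
    },
  };
}
```

With this shape, request-handling code would create its logger once per request, for example `const log = appLog.forRequest(requestId)`, and then call `log.action({ code, message, ... })` as in the earlier example.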

Accompany your logs with documentation that provides a brief explanation of each alert and, more importantly, what to do about it. This will often be part of your runbook, which is a set of procedures and processes for a particular piece of software. For example:

L16: ExampleService API has been deprecated.

Explanation: ExampleService, our billing provider, has made plans to remove an API we use.

Alert: Action. Our application will stop working when they remove the API, so it’s vital that we address it, but they provide a grace period, so we don’t need to address it immediately.

Actions:

  • The code probably needs to be upgraded to the new API. If it does, the headers object, which contains the headers returned by ExampleService, should contain an x-deprecated header and an x-sunset-date header.

  • The endpoint field shows the specific API that caused the alert, but other endpoints we use could also be affected.

  • The urgency of the upgrade depends on when the API will be retired, which is shown by the x-sunset-date header. You can verify it by checking ExampleService’s documentation at example.com.

  • You’ll probably want to disable this alert until the API is upgraded. Be careful not to accidentally disable alerts for other APIs at the same time, and consider having the alert automatically re-enable, so the upgrade isn’t forgotten.

Describe what the reader needs to know, not what they need to do.

Note the casual, nondefinitive tone of the “Actions” section. That’s intentional. Detailed procedures can result in a “responsibility-authority double bind,” in which people are afraid to change a procedure even though it’s not appropriate for their situation. [Woods2010] (ch. 8) By describing what the reader needs to know, not what they need to do, authority for decision making is put back in the hands of the reader, allowing them to adapt the advice to the situation at hand.

Metrics and Observability

In addition to logging, your code should also measure items of interest, called metrics. Most metrics will be technical, such as the number of requests your application receives, the time it takes to respond to requests, memory usage, and so forth. But some will be business oriented, such as the number and value of customer purchases.

Metrics are typically accumulated for a period of time, then reported. This can be done inside your application and reported via a log message, or it can be done by sending events to a metric aggregation tool.
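
The first option, accumulating inside the application and reporting via a log message, might be sketched like this. The metric names, the "M01" code, and the `report` callback are assumptions made for the example.

```javascript
// Accumulates request metrics in memory and reports them on flush().
// `report` receives one summary object; in practice it might be a call
// to the team's logging wrapper.
function createRequestMetrics(report) {
  let count = 0;
  let totalMs = 0;
  let maxMs = 0;

  return {
    recordRequest(durationMs) {
      count++;
      totalMs += durationMs;
      maxMs = Math.max(maxMs, durationMs);
    },

    // Reports the current period's totals, then starts a new period.
    flush() {
      report({
        code: "M01", // hypothetical look-up code
        message: "Request metrics.",
        requestCount: count,
        averageResponseMs: count === 0 ? 0 : totalMs / count,
        maxResponseMs: maxMs,
      });
      count = 0;
      totalMs = 0;
      maxMs = 0;
    },
  };
}
```

The application would call `recordRequest()` as each response completes and call `flush()` on a timer, such as once a minute.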

Take a thoughtful approach to observability.

Together, your logs and metrics create observability: the ability to understand how your system is behaving, both from a technical perspective and a business perspective. As with your logs, take a thoughtful approach to observability. Talk to your stakeholders. What observability do you need from an operations perspective? From a security perspective? From a business perspective? From a support perspective? Create user stories to address those needs.

Monitoring and Alerting

Monitoring detects when your logs and metrics need attention. When they do, the monitoring tool will send an alert—an email, chat notification, or even a text message or phone call—so somebody can take care of it. In some cases, the alerting will be done by a separate service.

The decisions about what to alert and what not to alert can get complicated, so your monitoring tool may be configurable via a proper programming language. If it is, be sure to treat that configuration with the respect due a real program. Store it in version control, pay attention to design, and write tests, if you can.

The types of alerts your monitoring tool sends will depend on your organization, but they typically fall into four categories:

  • Emergency: Something’s on fire, or about to be, and a human needs to wake up and fix it now.

  • Action: Something important needs attention, but it’s not urgent enough to wake anybody up.

  • Monitor: Something is unusual, and a human should take a second look. (Use these sparingly.)

  • Info: Doesn’t require anyone’s attention, but is useful for observability.

You’ll configure your monitoring tool to perform some alerts based on your logs. Although the cleanest route would be to program your logs to use the preceding terms, most logging libraries use FATAL, ERROR, WARN, INFO, and DEBUG instead. Although they technically have different meanings, you can map them directly to the preceding levels: use FATAL for Emergency, ERROR for Action, WARN for Monitor, and INFO for...Info. Don’t use DEBUG at all—it just adds noise.

It’s okay to use DEBUG logs during development, but don’t check them in. Some teams program their continuous integration script to automatically fail the build if it detects DEBUG logs.
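
If your logging wrapper uses the alert names from the preceding list, the mapping to a library’s levels can live in one place. The sketch below assumes a hypothetical `levelFor` helper and deliberately leaves DEBUG out.

```javascript
// Maps the alert categories above onto conventional logging levels.
const LEVEL_FOR_ALERT = new Map([
  ["emergency", "FATAL"],
  ["action", "ERROR"],
  ["monitor", "WARN"],
  ["info", "INFO"],
]);

function levelFor(alert) {
  const level = LEVEL_FOR_ALERT.get(alert);
  if (level === undefined) {
    const known = [...LEVEL_FOR_ALERT.keys()].join(", ");
    throw new Error(`Unknown alert type '${alert}'; expected one of: ${known}`);
  }
  return level;
}
```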

Other alerts will be triggered based on metrics. Those alerts will usually be generated by your monitoring tool, not your application code. Be sure to give each one a look-up code, a human-readable message, and documentation in your runbook, just like your log messages.

Ally
Continuous Deployment

When possible, I prefer to program alerting decisions into my application code, where they trigger alerts via logs, rather than into my monitoring configuration. That allows team members to program the “smarts” of alerting using code they’re familiar with. The downside is that changing alerts requires redeploying your software, so this approach works best when your team uses continuous deployment.

Beware of alert fatigue. Remember: an “emergency” alert wakes somebody up. It doesn’t take many false alarms for people to stop paying attention to alerts, especially “monitor” alerts. Every alert needs to be acted upon: either to address the alert or to prevent a false alarm from occurring again. If an alert is consistently inappropriate for its level—such as an “emergency” that can actually be taken care of tomorrow, or a “monitor” alert that never indicates a real issue—consider downgrading it, or look for ways to make it smarter.

Either address the alert or prevent a false alarm from occurring again.

Similarly, an alert that consistently involves a rote response, with no real thinking, is a candidate for automating away. Convert the rote response into code. This particularly applies to “monitor” alerts, which can become a dumping ground for noisy trivia. It’s okay to create a “monitor” alert to help you understand the behavior of your system, but once you have that understanding, upgrade it to “action” or “emergency” by making the alert more selective.

To help keep alerts under control, be sure to include programmers in your on-call rotation. It will help them understand which alerts are important and which aren’t, and that will lead to code that makes better alerting decisions.

Some organizations have dedicated Operations teams who manage systems and take care of on-call duty. This works best when the Development team manages its own systems first. Before they hand off responsibility, they have to demonstrate a certain level of stability. For more information, see [Kim2016] (ch. 15).

Write tests for your logs and alerts. They’re just as important to the success of your software as user-facing capabilities. These tests can be difficult to write, because they often involve global state, but that’s a fixable design problem, as the “Control Global State” section discusses. Episodes 11–15 of [Shore2020b] demonstrate how to do so.
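
As one illustration of the idea (not the approach shown in those episodes), the sketch below passes the log in as a parameter so a test can substitute a recording double, which avoids global state. The checkForDeprecation and createSpyLog names are hypothetical; the log fields mirror the earlier deprecation example.

```javascript
const { deepStrictEqual } = require("node:assert");

// Code under test: logs an "action" alert when a response carries a
// deprecation header.
function checkForDeprecation(response, endpoint, log) {
  if (response.headers["x-deprecated"] === undefined) return;
  log.action({
    code: "L16",
    message: "ExampleService API has been deprecated.",
    endpoint,
    headers: response.headers,
  });
}

// Test double that records log calls instead of writing them anywhere.
function createSpyLog() {
  const calls = [];
  return {
    calls,
    action: (data) => { calls.push({ alert: "action", ...data }); },
  };
}

function testLogsDeprecatedApis() {
  const log = createSpyLog();
  const headers = { "x-deprecated": true };
  checkForDeprecation({ headers }, "/v2/accounts/:account_id", log);
  deepStrictEqual(log.calls, [{
    alert: "action",
    code: "L16",
    message: "ExampleService API has been deprecated.",
    endpoint: "/v2/accounts/:account_id",
    headers,
  }]);
}

function testIgnoresApisThatAreNotDeprecated() {
  const log = createSpyLog();
  checkForDeprecation({ headers: {} }, "/v2/accounts/:account_id", log);
  deepStrictEqual(log.calls, []);
}

testLogsDeprecatedApis();
testIgnoresApisThatAreNotDeprecated();
```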

Questions

How do we make time for all this?

As reviewer Sarah Horan Van Treese said, you don’t have time not to. “In my experience, teams working on software that isn’t ‘built for operation’ typically waste an enormous amount of time on things that could be avoided altogether or at least diagnosed and resolved in minutes with better observability in place.” Take care of it now, or waste more time firefighting later.

Allies
Visual Planning
No Bugs
Slack

Operational and security needs can be scheduled with stories, just like everything else. To make sure those stories get prioritized, make sure that people with operations and security skills are part of your planning discussions.

Treat alerts like bugs: they should be unexpected and taken seriously when they happen. Every alert should be addressed immediately. This applies to false alarms, too: when you’re alerted for something that isn’t an issue, improve the alert so it’s less likely to cry wolf.

Quality-of-life improvements, such as tweaking an alert’s sensitivity or improving a log message, can be addressed with your team’s slack.

What if we don’t have anyone with ops or security skills on our team?

Take the time to reach out to your Operations department early in development. Let them know that you’d like their input on operational requirements and alerting, and ask how you can set up a regular check-in to get their feedback. I find that ops folks are pleasantly surprised when dev teams ask them to get involved before anything catches fire. They’re usually eager to help.

I have less experience with involving security folks, partly because they’re less common than ops people, but the general approach is the same: reach out early in development and discuss how you can build security in, rather than adding it as an afterthought.

Prerequisites

Ally
Whole Team

Building for operation can be done by any team. For best results, you’ll need a team that includes people with operations and security skills, or good relationships with people who have those skills, and a management culture that values long-term thinking.

Indicators

When you build for operation:

  • Your team has considered and addressed potential security threats.

  • Alerts are targeted and relevant.

  • When production issues occur, your organization is prepared to respond.

  • Your software is resilient and stable.

Alternatives and Experiments

This practice is ultimately about acknowledging that “working software” involves running it in production, not just coding it, and deliberately building your software to accommodate production needs. The best way I know to do so is to involve people with operations and security expertise as part of the team.

You’re welcome to experiment with other approaches. Companies often have limited operations and security staff, so it can be difficult to put people with those skills on every team. Checklists and automated enforcement tools are a common substitute. In my experience, they’re a pale imitation. A better approach is to provide training to develop team members’ skills.

Before you experiment, try to work with at least one team that takes a true DevOps approach, with skilled people embedded in the team. That way, you’ll know how your experiments compare.

Further Reading

The Twelve-Factor App [Wiggins2017] is a nice, concise introduction to operations needs, with solid guidance about how to address them.

The DevOps Handbook [Kim2016] is an in-depth guide to DevOps. Part IV, “The Technical Practices of Feedback,” discusses material that’s similar to this practice.

The Phoenix Project [Kim2013] is a novel about a beleaguered IT executive learning to introduce DevOps to his organization. Although it’s not strictly about building for operation, it’s a fun read and a good way to learn more about the DevOps mindset.

AoAD2 Chapter: DevOps (introduction)

This is an excerpt from The Art of Agile Development, Second Edition. Visit the Second Edition home page for additional excerpts and more!

This excerpt is copyright 2007, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

DevOps

When I first started programming, my job was clear: build software and hand it off for release. After the handoff, a mysterious process would get the software into the hands of customers. First, it involved shipping CDs; later, it involved a distant department, called “Operations,” that seemed obsessed with bird calls. (awk! grep! perl!) Either way, it was no concern of mine.

This continued even after I started practicing Agile. Although Agile teams are meant to be cross-functional, operations were handled by other people—people I never met and rarely even knew the names of. I knew this wasn’t in the spirit of Agile, but the companies I worked with had strong walls between development and operations. Secretly, I was glad.

Fortunately, others in the Agile community weren’t so complacent. They worked to break down the walls between development and operations, and later, the walls separating security as well. The movement came to be known as DevOps. It’s also called DevSecOps.

As with so many things in the Agile ecosystem, the term “DevOps” has been distorted by well-meaning people making incorrect assumptions...and less well-meaning companies trying to make a quick buck. Here, I’m using it in the original sense of the word: close collaboration between development, operations, and security.

Some people extend DevOps to even more domains, with terms such as DevSecBizOps, DevSecBizDataOps, or even Dev<Everything>Ops. Of course, that brings us full circle to cross-functional, autonomous teams who have all the skills they need to be successful. Or, you know...Agile.

By breaking down the walls between development, operations, and security, DevOps allows your team to create software that’s safer, more reliable, and easier to manage in production. This chapter has four practices to help you do so:

  • The “Build for Operation” practice creates software that’s secure and easy to manage in production.

  • The “Feature Flags” practice allows your team to deploy software that’s incomplete.

  • The “Continuous Deployment” practice reduces the risk and cost of production deployment.

  • The “Evolutionary System Architecture” practice keeps your system simple, maintainable, and flexible.

Agile Book Club: Task Planning & Capacity

Iterations. Sprints. Estimates. Capacity. These are all at the heart of how teams plan their day-to-day work, and that’s what we talk about in this book club session.

Reading:
📖 Task Planning
📖 Capacity
📖 Slack
📖 “Done Done”

🎙 Discussion prompts:

  • There are two main approaches to task planning on Agile teams: time boxes (Sprints and iterations) and continuous flow (also known as Kanban). What tradeoffs do you see between the two?

  • What role do you think estimation should play in task planning?

  • How can teams improve their capacity?

  • Slack and capacity form a “clever little feedback loop” which uses teams’ weaknesses to make them stronger. What are some ways teams can use their slack to improve their capability?

May
24
2022
Zero-Friction Development (Austin, TX)

I’m appearing as a guest speaker at CloudBees Connect in Austin, Texas on May 24th.

CloudBees Connect is a localized gathering of DevOps practitioners and leaders joining together to discuss trends and advancements defining the future of software delivery. I’ll be speaking about zero-friction development:

You know that feeling you get when you start a new codebase? It’s a feeling of endless possibility and productivity. You have an idea, you write some code, you see it come to life. Amazing.

But then reality comes crashing down. The code becomes difficult to maintain. You have to painstakingly vet every change, spend interminable hours looking for bugs, and bang your head against brick walls.

Zero-friction development is about achieving the smooth flow, enjoyment, and productivity you see in a brand-new codebase, no matter how old your code may be. James Shore explains how to make it happen.

CloudBees Connect is a free, all-day event. My talk will be from 9:50-10:30am. Find more information and register here.