AoAD2 Practice: Continuous Deployment

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Continuous Deployment

Audience
Programmers, Operations

Our latest code is in production.

If you use continuous integration, your team has removed most of the risk of releasing. Done correctly, continuous integration means that the team is ready to release at any time. You’ve tested your code and exercised your deployment scripts.

One source of risk remains. If you don’t deploy your software to real production servers, it’s possible that your software won’t actually work in production. Differences in environment, traffic, and usage can all result in failures, even in the most carefully tested software.

Continuous deployment resolves this risk. It follows the same principle as continuous integration: by deploying small pieces frequently, we reduce the risk that a big change will cause problems, and we make it easy to find and fix problems when they do occur.

How to Use Continuous Deployment

Allies
Zero Friction
Continuous Integration
No Bugs
Feature Toggles
Build for Operation

Continuous deployment isn’t hard, but it has a lot of preconditions:

  • Create a zero-friction deploy script that automatically deploys your code.

  • Use continuous integration to keep your code ready to release.

  • Improve quality to the point that your software can be deployed without manual testing.

  • Use feature toggles or keystones to decouple deployments from releases.

  • Establish monitoring to alert your team of deployment failures.

Once these preconditions are met, enabling continuous deployment is just a matter of running your deploy script from your continuous integration script.
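
For illustration, here’s a minimal sketch of that final step, written in Python. The build, test, and deploy helper scripts are placeholders for whatever zero-friction automation your team already has.

    # integrate.py: illustrative sketch of a continuous integration script whose
    # last step is a deploy. The helper scripts are placeholders.
    import subprocess

    def run(command):
        print(f"> {command}")
        subprocess.run(command, shell=True, check=True)   # stop on any failure

    if __name__ == "__main__":
        run("./build.sh")               # compile and package the code
        run("./test.sh")                # run the full automated test suite
        run("./deploy.sh production")   # the only new step: deploy every known-good build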

Ally
Whole Team

The details of your deploy script will depend on your organization. It can be as simple as running git push or rsync. More often, it involves creating a deployment image. Each deployment is a combination of code, which is the same for all environments, and configuration, which is unique to each environment, so you’ll also need a way to securely inject the appropriate configuration for the target environment. Your team should include people with operations skills who understand what’s required. If it doesn’t, ask your operations department for help.

Your deploy script must be 100% automated. You’ll be deploying every time you integrate, which will be multiple times per day, and could even be multiple times per hour. Manual steps introduce delays and errors.
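
Here’s a sketch of what such a script might look like. It assumes a container-based deployment; the registry address, the deploy-tool command, and the per-environment encrypted config files are stand-ins for whatever your infrastructure actually provides.

    # deploy.py: illustrative deploy script sketch. "deploy-tool" and the registry
    # address are assumptions for the example, not a prescription.
    import subprocess
    import sys

    def run(command):
        print(f"> {command}")
        subprocess.run(command, shell=True, check=True)

    def deploy(environment, commit):
        # The code is packaged once and is identical for every environment.
        image = f"registry.example.com/myapp:{commit}"
        run(f"docker build --tag {image} .")
        run(f"docker push {image}")

        # Configuration is unique to each environment and injected at deploy time,
        # here from an encrypted per-environment config file.
        config_file = f"config/{environment}.env.encrypted"
        run(f"deploy-tool rollout --image {image} --env-file {config_file}")

    if __name__ == "__main__":
        deploy(environment=sys.argv[1], commit=sys.argv[2])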

Detecting Deployment Failures

Your monitoring system should alert you if a deployment fails. At a minimum, this involves monitoring for an increase in errors or a decrease in performance, but you can also look at business metrics such as user sign-up rates.

To reduce the impact of failure, deploy to a subset of servers, called canary servers, and automatically compare metrics from the old deploy to the new deploy. If they’re substantially different, the script raises an alert and stops the deploy.

For systems with a lot of production servers, you can also have multiple waves of canary servers. For example, you could start by deploying to 10% of servers, then 50%, and finally all.

When the deploy is complete, regardless of the outcome, have your deploy script tag the deployed commit with “success” or “failure.”
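
To make that concrete, here’s a sketch that combines canary waves, a metric comparison, and the success/failure tag. The metrics-tool and deploy-tool commands, the wave sizes, and the error-rate threshold are illustrative assumptions; your monitoring system will have its own API.

    # canary_deploy.py: illustrative sketch of deploying in waves, comparing error
    # rates against the previous deploy, and tagging the commit with the result.
    import subprocess
    import sys
    import time

    WAVES = [0.10, 0.50, 1.00]    # deploy to 10% of servers, then 50%, then all
    TOLERANCE = 1.25              # alert if errors rise more than 25% over the old deploy

    def run(command, capture=False):
        print(f"> {command}")
        return subprocess.run(command, shell=True, check=True,
                              capture_output=capture, text=True)

    def error_rate(deployment):
        # Hypothetical CLI that queries your monitoring system.
        return float(run(f"metrics-tool error-rate --deployment {deployment}",
                         capture=True).stdout)

    def deploy(commit):
        baseline = error_rate("current")
        try:
            for wave in WAVES:
                run(f"deploy-tool rollout --commit {commit} --fraction {wave}")
                time.sleep(300)   # let metrics accumulate before judging this wave
                if error_rate("candidate") > baseline * TOLERANCE:
                    raise RuntimeError(f"error rate spiked at {wave:.0%} rollout")
        except Exception as failure:
            run(f"git tag deploy-failure-{commit} {commit}")
            print(f"DEPLOY FAILED: {failure}", file=sys.stderr)
            raise
        run(f"git tag deploy-success-{commit} {commit}")

    if __name__ == "__main__":
        deploy(sys.argv[1])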

Resolving Deployment Failures

One of the advantages of continuous deployment is that it reduces the risk of deployment. Because each deploy represents only a few hours of work, it tends to be low impact. If something does go wrong, the change can be reverted without affecting the rest of the system.

When a deployment does go wrong, immediately “stop the line” and focus the entire team on fixing the issue. Typically, this will involve rolling back the deploy.

Roll back the deploy

Start by restoring the system to its previous, working state. This typically means performing a rollback: restoring the code and configuration of the previous deploy. To do so, you can keep each deploy in a version control system, or just keep a copy of the most recent deploy.

One of the simplest ways to enable rollback is to use blue/green deployment. To do so, create two copies of your production environment, arbitrarily labelled “blue” and “green,” and configure your system to route traffic to one of the two environments. Each deploy toggles back and forth between the two environments, allowing you to roll back by routing traffic to the previous environment.

For example, if “blue” is active, deploy to “green.” When the deploy is complete, stop routing traffic to “blue” and route it to “green” instead. If the deploy fails, rolling back is a simple matter of routing traffic back to “blue.”
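
The mechanics can be that simple. In the sketch below, deploy-tool and router-tool are stand-ins for whatever your infrastructure and load balancer provide; the point is that rolling back is just re-routing traffic.

    # blue_green_deploy.py: minimal blue/green sketch with hypothetical commands.
    import subprocess
    import sys

    def run(command, capture=False):
        print(f"> {command}")
        return subprocess.run(command, shell=True, check=True,
                              capture_output=capture, text=True)

    def active_environment():
        # Ask the router which environment currently receives traffic.
        return run("router-tool active-environment", capture=True).stdout.strip()

    def deploy(commit):
        active = active_environment()                     # e.g., "blue"
        idle = "green" if active == "blue" else "blue"    # deploy to the idle one

        run(f"deploy-tool rollout --env {idle} --commit {commit}")
        run(f"router-tool route-traffic --to {idle}")     # switch traffic to the new deploy
        print(f"Deployed {commit} to {idle}. To roll back, route traffic back to {active}.")

    if __name__ == "__main__":
        deploy(sys.argv[1])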

Occasionally, the rollback will fail. This may indicate a data corruption issue or a configuration problem. Either way, it’s all hands on deck until the problem is solved. Site Reliability Engineering [Beyer et al. 2016] has practical guidance about how to respond to such incidents in chapters 12-14.

Fix the deploy

Rolling back the bad deploy will usually solve the immediate production problem, but your team isn’t done yet. You need to fix the underlying problem. The first step is to get your integration branch back into a known-good state. You’re not trying to fix the problem yet; you’re just trying to get your code and production environment back in sync.

Start by reverting the changes in the code repository, so your integration branch matches what’s actually in production. If you use merge commits in git, you can just run git revert on the integration commit. Whatever you do, use your normal continuous integration process to integrate and deploy the reverted code.
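
For example, here’s a sketch of that revert, assuming the integration branch is named main and you pass in the bad merge commit:

    # revert_integration.py: sketch of getting the integration branch back in sync
    # with production. The branch name "main" is an assumption.
    import subprocess
    import sys

    def run(command):
        print(f"> {command}")
        subprocess.run(command, shell=True, check=True)

    def revert_integration(bad_merge_commit):
        run("git checkout main")
        run("git pull")
        # --mainline 1 reverts the merge relative to its first parent (the
        # integration branch), undoing everything the bad integration brought in.
        run(f"git revert --mainline 1 --no-edit {bad_merge_commit}")
        # Pushing runs your normal continuous integration process, which tests
        # and deploys the reverted code.
        run("git push")

    if __name__ == "__main__":
        revert_integration(sys.argv[1])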

Deploying the reverted code should proceed without incident, because you’re deploying the same code that’s already running. It’s important to do so anyway, because it ensures your next deploy starts from a known-good state. And if this second deploy has problems too, it narrows the issue down to a deployment problem rather than a problem in your code.

Ally
Incident Analysis

Once you’re back in a known-good state, you can fix the underlying mistake. Create tasks for debugging the problem—usually, the people who deployed it will fix it—and everybody can go back to working normally. After it’s been resolved, schedule an incident analysis session to determine how to prevent this sort of deployment failure from happening in the future.

Alternative: Fix forward

Some teams, rather than rolling back, fix forward. They make a quick fix—possibly by running git revert—and deploy again. The advantage of this approach is that you fix problems using your normal deployment script, so it’s well tested. Rollback scripts, which don’t get used often, can go out of date, causing them to fail just when you need them most.

On the other hand, deploy scripts tend to be slow, even if you have an option to disable testing (which isn’t necessarily a good idea). A well-executed rollback script can complete in a few seconds. Fixing forward can take a few minutes. During an outage, those seconds count. For this reason, I tend to prefer rolling back, despite the disadvantages.

Incremental Releases

For large or risky changes, consider running the code in production, without revealing it to users, before you release. This will allow you to validate its performance and stability. (In comparison, a typical feature toggle prevents the hidden code from running at all.) For additional safety, the release can be performed gradually, enabling a subset of users at a time.
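
One way to picture the gradual part is a percentage-based toggle that hashes the user ID, so the same users stay enabled as the rollout percentage grows. This sketch is illustrative; a real feature-toggle system would read the percentage from configuration rather than a hard-coded table.

    # gradual_release.py: illustrative percentage-based rollout check.
    import hashlib

    def rollout_percentage(feature):
        # Stand-in for reading the current rollout percentage from configuration.
        return {"new_chat": 5}.get(feature, 0)   # e.g., 5% of users today

    def is_enabled(feature, user_id):
        # Hashing keeps each user's experience stable as the percentage increases.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < rollout_percentage(feature)

    if __name__ == "__main__":
        for user in ["alice", "bob", "carol"]:
            print(user, "sees the new feature:", is_enabled("new_chat", user))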

The DevOps Handbook [Kim et al. 2016] calls this a dark launch. Chapter 13 has an example of Facebook using this approach to release Facebook Chat:

As part of [Facebook Chat’s] dark launch process, every Facebook user session, which runs JavaScript in the user browser, had a test harness loaded into it—the chat UI elements were hidden, but the browser client would send invisible test chat messages to the back-end chat service that was already in production, enabling them to simulate production-like loads throughout the entire project, allowing them to find and fix performance problems long before the customer release.

By doing this, every Facebook user was part of a massive load testing program, which enabled the team to gain confidence that their systems could handle realistic production-like loads. ...When the launch day of Facebook Chat arrived, it was surprisingly successful and uneventful, seeming to scale effortlessly from zero to seventy million users overnight. During the release, they incrementally enabled the chat functionality to ever-larger segments of the customer population—first to all internal Facebook employees, then to 1% of the customer population, then to 5%, and so forth. As Letuchy wrote, “The secret for going from zero to seventy million users overnight is to avoid doing it all in one fell swoop.”

Data Migration

Database changes can’t be rolled back—at least, not without risking data loss—so data migration requires special care. It’s similar to performing an incremental release: first you deploy, then you migrate. There are three steps:

  1. Deploy code that understands both the new and old schema. Deploy the data migration code at the same time.

  2. After the deploy is successful, run the data migration code. It can be started manually, or automatically as part of your deploy script.

  3. When the migration is complete, manually remove the code that understands the old schema, then deploy again.

Separating data migration from deployment allows each deploy to fail, and be rolled back, without losing any data. The data migration only happens once the new code has proven to be stable in production.
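
Here’s a sketch of what step 2 might look like as a small script. The app-admin migration command is a placeholder, and the check relies on the success/failure tags described earlier.

    # run_migration.py: sketch of step 2, run only after the new code has proven
    # stable in production. The "app-admin" command is a placeholder.
    import subprocess
    import sys

    def run(command, capture=False):
        print(f"> {command}")
        return subprocess.run(command, shell=True, check=True,
                              capture_output=capture, text=True)

    def latest_deploy_succeeded():
        # Relies on the deploy script tagging each deployed commit (see earlier).
        tags = run("git tag --points-at HEAD", capture=True).stdout.split()
        return any(tag.startswith("deploy-success") for tag in tags)

    if __name__ == "__main__":
        if not latest_deploy_succeeded():
            sys.exit("Refusing to migrate: the latest deploy isn't marked successful.")
        # Step 2: run the migration code that shipped with step 1's deploy.
        run("app-admin run-migration add-customer-preferences-table")
        # Step 3 happens later, by hand: remove the code that understands the old
        # schema, then integrate and deploy again as usual.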

Migrations involving large amounts of data require special care, because the production system needs to remain available while the data is migrating. For these sorts of migrations, write your migration code to work incrementally—possibly with a rate limiter, for performance reasons—and program your code to work with both schemas simultaneously. For example, if you’re moving data from one table to another, your code might look at both tables when reading and updating data, but only insert data into the new table.
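
Here’s a sketch of that dual-schema approach, using invented table names and sqlite3 as a stand-in for your real database:

    # customer_preferences.py: sketch of code that works with both schemas while an
    # incremental migration runs. Table names are invented for the illustration.
    import sqlite3

    def read_preferences(db, customer_id):
        # Read from the new table first; fall back to the old table for rows the
        # migration hasn't copied yet.
        row = db.execute("SELECT preferences FROM customer_preferences "
                         "WHERE customer_id = ?", (customer_id,)).fetchone()
        if row is None:
            row = db.execute("SELECT preferences FROM legacy_settings "
                             "WHERE customer_id = ?", (customer_id,)).fetchone()
        return row[0] if row else None

    def save_preferences(db, customer_id, preferences):
        # Writes only touch the new table, so the old table gradually becomes
        # read-only and can be dropped once the migration finishes.
        db.execute("INSERT OR REPLACE INTO customer_preferences (customer_id, preferences) "
                   "VALUES (?, ?)", (customer_id, preferences))
        db.commit()

    if __name__ == "__main__":
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE legacy_settings (customer_id TEXT PRIMARY KEY, preferences TEXT)")
        db.execute("CREATE TABLE customer_preferences (customer_id TEXT PRIMARY KEY, preferences TEXT)")
        db.execute("INSERT INTO legacy_settings VALUES ('alice', 'old preferences')")
        print(read_preferences(db, "alice"))       # falls back to the old table
        save_preferences(db, "alice", "new preferences")
        print(read_preferences(db, "alice"))       # now served from the new table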

After the migration is complete, be sure to remove the outdated code. This will keep your code clean. For long migrations, you can add a reminder to your team calendar or schedule a “finish data migration” story into your team’s plan.

This process applies to any changes to external state. In addition to databases, it also includes configuration settings, infrastructure changes, and third-party service changes. Be very careful when external state is involved, because errors are difficult to undo. Smaller, more frequent changes are typically better than big, infrequent changes.

Prerequisites

Allies
Zero Friction
Continuous Integration
No Bugs
Feature Toggles
Build for Operation

To use continuous deployment, your team needs a rigorous approach to continuous integration. You need to integrate multiple times per day and create a known-good, deploy-ready build each time. “Deploy-ready,” in this case, means your team produces code that can be safely deployed without manual testing, and you’ll need to use feature toggles or keystones to hide incomplete code. Finally, your deploy process needs to be completely automated, and you need a way of automatically detecting deployment failures.

Indicators

When your team deploys continuously:

  • Deploying to production becomes a stress-free non-event.

  • When deployment problems occur, they’re easily resolved.

  • Deploys are less likely to cause production issues, and when they do, they’re quicker to fix.

Alternatives and Experiments

The typical alternative to continuous deployment is release-oriented deployment: only deploying when you have something ready to release. Continuous deployment is actually safer and more reliable, once the preconditions are in place, even though it sounds scarier at first.

You don’t have to switch from release-oriented deployment directly to continuous deployment. You can take it slowly, starting out by writing a fully-automated deploy script, then automatically deploying to a staging environment as part of continuous integration, and finally moving to continuous deployment.

In terms of experimentation, the core ideas of continuous deployment are to minimize work in progress and speed up the feedback loop (see “Key Idea: Minimize Work in Progress” on p.XX and “Key Idea: Fast Feedback” on p.XX). Anything that you can do to speed up that feedback loop and decrease the time required to deploy is moving in the right direction. For extra points, look for ways to speed up the feedback loop for release ideas, too.

Further Reading

The DevOps Handbook [Kim et al. 2016] is a thorough look at all aspects of DevOps, including continuous deployment, with a wealth of case studies and real-world examples.
