AoAD2 Practice: Build for Operation

This is a pre-release excerpt of The Art of Agile Development, Second Edition, to be published by O’Reilly in 2021. Visit the Second Edition home page for information about the open development process, additional excerpts, and more.

Your feedback is appreciated! To share your thoughts, join the AoAD2 open review mailing list.

This excerpt is copyright 2007, 2020, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Build for Operation

Audience
Programmers, Operations

Our software is secure and easy to manage in production.

The fundamental idea behind DevOps is simple: by including people with operations and security skills as part of the team, we make it possible to build operability and security into the software, rather than adding it as an afterthought. This is building for operation.

That’s really all there is to it! Include people with ops and security skills on your team, or at least involve them in your team’s decisions. Have them participate in planning sessions. Create stories for making your software easier to monitor, manage, and secure. Discuss why those stories are important, and prioritize them accordingly.

Don’t save operations and security stories for the end of development.

Don’t save operations and security stories for the end of development. It’s better to keep your software ready to release. (See “Key Idea: Minimize Work in Progress” on p.XX.) As you add more capabilities to your software, expand your operability to match. For example, when you add a feature that requires a new database, add stories for provisioning, monitoring, backing up, and restoring that database as well.

What sort of operations and security needs should you consider? Your teammates should be able to tell you. But the following sections will help you get started.

Configuration

According to The Twelve-Factor App, deployed software is a combination of code and configuration. [Wiggins 2017] Configuration, in this case, means the part of your software that’s different for each environment: database connection strings, URLs and secrets for third-party services, and so forth.

When you deploy, you’ll deploy the same code to every environment, whether that’s your local machine, a test environment, or production. But your configuration will change: for example, your test environment will be configured to use a test database, and your production environment will be configured to use your production database.

This definition of “configuration” only includes the things that change between environments. Teams often make other things configurable, such as the copyright date in the footer of a website, but those types of configuration should be clearly separated from environment configuration. They’re part of the code’s behavior and should be treated like code, including checking them into version control. I’ll often use real code, in the form of a Configuration module, for this purpose. (I’ll typically program it to abstract environment configuration, too.)

Environment configuration, on the other hand, should never be checked into version control, other than what’s needed for your local rundev script. (See “Automate Everything” on p.XX.) If you check it in, it’s too easy to accidentally deploy it to the wrong environment. You could end up attaching your test environment to your production database, or vice versa. Instead, leave the configuration out, and program your code to fail fast (see “Fail Fast” on p.XX) if it can’t find the configuration it’s looking for.

Rather than checking your configuration into version control, inject it during deployment. The specifics will depend on your deployment mechanism, but setting environment variables is a common choice.
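
To illustrate, here’s a sketch of a simple Configuration module for a Node.js codebase. It reads each environment-specific value from an environment variable and fails fast when one is missing. The variable names are made up; use whatever your environments need.

// config.js: a minimal sketch; the variable names are hypothetical.
function requireEnv(name) {
  const value = process.env[name];
  if (value === undefined || value === "") {
    // Fail fast: better to crash at startup than to run against the wrong database.
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

module.exports = {
  databaseUrl: requireEnv("DATABASE_URL"),
  billingServiceUrl: requireEnv("BILLING_SERVICE_URL"),
};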

Configuration files are another popular choice for configuration, but they’re risky, at least for environment configuration. It’s too easy to accidentally check them into source control.

Allies
Feature Toggles
Test-Driven Development
Continuous Deployment

Avoid making your software infinitely configurable. Complicated configuration ends up being a form of code—code that’s written in a particularly lousy programming language, without abstractions or tests. Instead, if you need sophisticated configurability, use feature toggles to selectively turn behavior on and off. If you need complex customer-controlled behavior, consider using a plug-in architecture. Both approaches will allow you to code the details using a real programming language.

Environment configuration needs to be stored somewhere, of course, and that can involve a version control system, but it’s typically best if it’s a separate repository with tight security controls.

Secrets

Team members shouldn’t have access to secrets.

Secrets—passwords, API keys, and so forth—are a special type of configuration. It’s particularly important that they not be checked in to version control. In fact, team members shouldn’t have access to secrets at all. Instead, define a secure procedure for generating, storing, rotating, and auditing secrets. For complex systems, this often involves a secret management tool. For smaller teams and organizations, a clear policy and limited access to production environments can be enough.

When feasible, write your code to not store secrets in memory. That way, if an attacker compromises your software, they won’t get permanent access to additional systems. Instead, use the secret to log in—you’ll typically get a temporary access token in exchange—then erase the secret, including the mechanism that provided it. For example, if the secret was configured with an environment variable, erase the environment variable.
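
For example, this sketch shows what that flow can look like in Node.js. It assumes a made-up billing service that exchanges a long-lived API secret for a short-lived access token, uses Node 18’s built-in fetch, and assumes the response shape:

// A sketch only; the service, URL, variable name, and response shape are hypothetical.
async function authenticateToBillingService() {
  const secret = process.env.BILLING_API_SECRET;
  if (secret === undefined) throw new Error("Missing BILLING_API_SECRET");

  // Exchange the long-lived secret for a short-lived access token.
  const response = await fetch("https://billing.example.com/v2/token", {
    method: "POST",
    headers: { authorization: `Bearer ${secret}` },
  });
  if (!response.ok) throw new Error(`Token exchange failed: ${response.status}`);
  const { accessToken } = await response.json();   // assumed response shape

  // Erase the secret and the mechanism that provided it.
  delete process.env.BILLING_API_SECRET;

  return accessToken;
}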

Pay particular attention to how secrets are managed by your build and deployment automation. It’s convenient to hardcode secrets in a deploy script or CI server configuration, but that convenience isn’t worth the risk. Your automation needs the same secure procedures as the rest of your code.

Never write secrets to your logs. Because it’s easy to accidentally do so, consider writing your logging wrapper so that it looks for secret-like data (for example, fields named “password” or “secret”) and redacts them. When it finds one, have it trigger an alert for the team to fix their mistake.
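
One way to approach that redaction looks something like this sketch. The field-name heuristic is just a starting point; tune it to match the kinds of secrets your system handles:

// A sketch of log redaction; the field-name heuristic is an assumption.
const SECRET_LIKE = /password|secret|token|api[-_]?key/i;

function redact(data) {
  let foundSecret = false;
  const cleaned = {};
  for (const [key, value] of Object.entries(data)) {
    if (SECRET_LIKE.test(key)) {
      cleaned[key] = "<redacted>";
      foundSecret = true;
    } else if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      const nested = redact(value);
      cleaned[key] = nested.cleaned;
      foundSecret = foundSecret || nested.foundSecret;
    } else {
      cleaned[key] = value;
    }
  }
  // The logging wrapper logs `cleaned` and, when foundSecret is true, also
  // emits an alert so the team fixes the code that tried to log a secret.
  return { cleaned, foundSecret };
}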

Privacy

Although compliance with privacy regulations should be part of your product plans, it often has ramifications that on-site customers aren’t aware of. In particular, the EU’s General Data Protection Regulation (GDPR) includes a “right to be forgotten,” which affects logging and database design.

Often, it’s best to store personally-identifiable information (PII) in a central location, along with a pseudonymous ID and unique encryption key. Program other parts of your system to reference the PII by the ID. When your code generates additional personal data, use the key to encrypt it. Later, the ID and encryption key can be deleted, or reset, removing access to the PII in one step.
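
That approach might be sketched like this, using Node’s built-in crypto module. The record structure is made up, and the central storage itself is omitted:

// A sketch of encrypting PII with a per-person key; central storage is omitted.
const crypto = require("node:crypto");

// One central record per person: a pseudonymous ID plus that person's own key.
function registerPerson() {
  return {
    personId: crypto.randomUUID(),
    encryptionKey: crypto.randomBytes(32),   // AES-256 key
  };
}

// Other parts of the system store only the personId and this ciphertext.
function encryptPersonalData(encryptionKey, plaintext) {
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv("aes-256-gcm", encryptionKey, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { iv, ciphertext, authTag: cipher.getAuthTag() };
}

// "Right to be forgotten": delete the central record, and every copy of that
// person's encrypted data becomes unreadable in one step.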

Logs require special attention. Be careful to consider which data is being stored by your logs, and replace PII with pseudonymous IDs or encrypted data.

Paranoic Telemetry

No matter how carefully you program your code, it will still fail in production. That’s because even perfect code depends on an imperfect world. External services return unexpected data or—worse—respond very slowly. File systems run out of space. Failovers... fail.

Every time your code interacts with the outside world, program it to assume the world is out to get you. Check every error code. Validate every response. Time out non-responsive systems. Program retry logic to use exponential back-offs.

When you can safely work around a broken system, do so. When you can’t, fail in a controlled and safe way. Either way, log the issue so your monitoring system can send an alert.
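
As a sketch, a single external call might combine those defenses like this. The URL, alert code, and retry limits are made up, it relies on Node 18’s built-in fetch and AbortSignal.timeout, and the log.monitor call stands in for the structured logging wrapper described in the next section:

// A sketch of a paranoid call to an external service; details are assumptions.
const log = { monitor: (data) => console.log(JSON.stringify(data)) };   // stand-in

async function fetchAccount(accountId) {
  const MAX_ATTEMPTS = 5;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      const response = await fetch(`https://billing.example.com/v2/accounts/${accountId}`, {
        signal: AbortSignal.timeout(3000),    // time out non-responsive systems
      });
      if (!response.ok) throw new Error(`Unexpected status: ${response.status}`);

      const account = await response.json();
      if (typeof account.id !== "string") throw new Error("Malformed account response");
      return account;
    } catch (error) {
      log.monitor({ code: "B12", message: "Billing service call failed", accountId, attempt, error: error.message });
      if (attempt === MAX_ATTEMPTS) throw error;    // controlled failure; the caller decides how to degrade
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** attempt));   // exponential back-off
    }
  }
}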

Logging

Nobody wants to be woken up in the middle of the night by a production issue. It happens anyway. When it does, the person on call needs to be able to diagnose and fix the problem with a minimum of fuss.

To make this easier, take a systematic and thoughtful approach to logging. Don’t just spam log statements everywhere in your code. Instead, think about what can go wrong and what people will need to know. Create user stories to address those questions. For example, if you discover a security breach, how will you determine which users and data were affected? If performance worsens, how will you determine what to fix? Prospective analysis (see “Prospective Analysis” on p.XX) can help you identify and prioritize these stories.

Use structured logs, routed to a centralized data store, to make your logs easier to search and filter. A structured log outputs data in a machine-readable format, such as JSON. Write your logging wrapper so that it can log arbitrary objects. This will allow you to easily include variables that provide important context.

For example, a third-party billing service set special headers when it deprecated its APIs. A system using that service checked for the presence of those headers. When it found them, it ran this Node.js code:

log.action({
  code: "L16",
  message: "ExampleService API has been deprecated.",
  endpoint,
  headers: response.headers,
});

The output looked like this:

{
  "timestamp": 16206894860033,
  "date": "Mon May 10 2021 23:31:26 GMT",
  "alert": "action",
  "component": "ExampleApp",
  "node": "web.2",
  "correlationId": "b7398565-3b2b-4d80-b8e3-4f41fdb21a98",
  "code": "L16",
  "message": "ExampleService API has been deprecated.",
  "endpoint": "/v2/accounts/:account_id",
  "headers": {
    ⋮
    "x-api-version": "2.22",
    "x-deprecated": true,
    "x-sunset-date": "Wed November 10 2021 00:00:00 GMT"
  }
}

The extra context provided by the log message made the issue easy to diagnose. The endpoint string made it clear exactly which API had been deprecated, and the headers object allowed programmers to understand the details and eliminate the possibility of false positives.

Provide context when throwing exceptions, too. For example, if you have a switch statement that should never run the default case, you might assert that it’s unreachable. But don’t just throw “unreachable code executed.” Throw a detailed exception that will allow your team to troubleshoot the issue. For example: “Unknown user subscription type 'legacy-036' encountered while attempting to send onboarding email.”
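
In code, that might look like this sketch; the subscription types and the email task are made up:

// A sketch; the subscription types and email task are hypothetical.
function onboardingEmailTemplate(subscription) {
  switch (subscription.type) {
    case "free": return "welcome-free";
    case "paid": return "welcome-paid";
    default:
      // Include the context the troubleshooter will need, not just "unreachable code."
      throw new Error(
        `Unknown user subscription type '${subscription.type}' ` +
        `encountered while attempting to send onboarding email (user ${subscription.userId}).`
      );
  }
}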

In the logging example, you’ll see several standard fields. Most of them are added by the logging infrastructure. Consider adding them to your logs as well; a sketch of a wrapper that adds them follows the list:

  • Timestamp: Machine-readable version of when the log occurred.

  • Date: A human-readable version of timestamp.

  • Alert: What kind of alert to send. Often called “level” instead. I’ll explain further when I discuss alerting.

  • Component: The codebase that generated the error.

  • Node: The specific machine that generated the error.

  • Correlation ID: A unique ID that groups together related logs, often across multiple services. For example, all logs related to a single HTTP request might have the same correlation ID. It’s also called a “request ID.”

  • Code: An arbitrary, unique code for this log message. It never changes. It can be used to filter or search the logs and to look up documentation.

  • Message: A human-readable version of code. Unlike code, it can change.
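
Here’s a sketch of a wrapper that adds those fields. The component name, hostname lookup, and the way the correlation ID arrives are all placeholders; adapt them to your infrastructure:

// log.js: a sketch of a structured logging wrapper; details will vary.
const os = require("node:os");

function emit(alert, data, correlationId) {
  const entry = {
    timestamp: Date.now(),
    date: new Date().toUTCString(),
    alert,                            // "emergency", "action", "monitor", or "info"
    component: "ExampleApp",          // assumption: hardcoded per codebase
    node: os.hostname(),
    correlationId,                    // assumption: passed in by the request handler
    ...data,                          // code, message, and any contextual fields
  };
  console.log(JSON.stringify(entry)); // route stdout to your centralized log store
}

module.exports = {
  emergency: (data, correlationId) => emit("emergency", data, correlationId),
  action: (data, correlationId) => emit("action", data, correlationId),
  monitor: (data, correlationId) => emit("monitor", data, correlationId),
  info: (data, correlationId) => emit("info", data, correlationId),
};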

Accompany your logs with documentation that provides a brief explanation of each alert and, more importantly, what to do about it. This will often be part of your runbook, which is a set of procedures and processes for a particular piece of software. For example:

L16: ExampleService API has been deprecated.

Explanation: ExampleService, our billing provider, has made plans to remove an API that we use.

Alert: Action. Our application will stop working when they remove the API, so we need to address it, but they provide a grace period, so we don’t need to address it immediately.

Actions:

  1. Look at the endpoint field to determine which API is affected. Use the x-sunset-date header and ExampleService’s documentation to determine when the API will be removed. Check to see if any other endpoints we use are affected.

  2. Create and prioritize a story (or stories) to upgrade to the current version of the API.

  3. Disable this alert, but only for the affected endpoints. Program it to automatically re-enable when the story is expected to be complete. If the story isn’t complete by then, the alert will remind us to double-check its priority.

Metrics

In addition to logging, your code should also measure items of interest, called metrics. Most metrics will be technical, such as the number of requests your application receives, the time it takes to respond to requests, memory usage, and so forth. But you can also collect business metrics, such as the number and value of customer purchases.

Metrics are typically accumulated for a period of time, then reported. This can be done inside your application, and reported via a log message, or it can be done by sending events to a metric aggregation tool.
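
The in-application approach might be sketched like this, reporting through the logging wrapper sketched earlier. The metric names, alert code, and flush interval are made up:

// metrics.js: a sketch of in-process metric accumulation, reported via the log.
const log = require("./log");   // the structured logging wrapper sketched earlier

const counts = new Map();

function increment(name, amount = 1) {
  counts.set(name, (counts.get(name) ?? 0) + amount);
}

// Report the accumulated counts once a minute, then start over. A metric
// aggregation tool can consume these log entries downstream.
setInterval(() => {
  log.info({ code: "M01", message: "Metrics report", metrics: Object.fromEntries(counts) });
  counts.clear();
}, 60_000);

// Usage (the metric names are hypothetical):
//   metrics.increment("httpRequests");
//   metrics.increment("purchaseValueCents", order.totalCents);

module.exports = { increment };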

As with your logs, collect metrics based on an actual identified need. Discuss what your team and stakeholders need to know, then create user stories to address those needs.

Monitoring and Alerting

Monitoring detects when something about your logs and metrics needs attention. When it does, the monitoring tool will send an alert—an email, chat notification, or even text message—so that somebody can take care of it. In some cases, the alerting will be done by a separate service, such as PagerDuty.

The decisions about what to alert and what not to alert can get complicated, so your monitoring tool may be configurable via a proper programming language. If it is, be sure to treat that configuration with the respect due a real program. Store it in version control, pay attention to design, and write tests, if you can.

The types of alerts your monitoring tool sends will depend on your organization, but they typically fall into four categories:

  • Emergency: Something’s on fire, or about to be, and a human needs to wake up and fix it now.

  • Action: Something important needs attention, but it’s not urgent enough to wake anybody up.

  • Monitor: Something is unusual, and a human should take a second look.

  • Info: Doesn’t require anyone’s attention.

You’ll configure your monitoring tool to send some alerts based on your logs. Although the cleanest route would be to program your logs to use the above terms, most logging libraries use FATAL, ERROR, WARN, INFO, and DEBUG instead. They technically have different meanings, but you can map them directly to the above levels: use FATAL for Emergency, ERROR for Action, WARN for Monitor, and INFO for... Info. Don’t use DEBUG at all—it just adds noise.

It’s okay to use DEBUG logs during development, but don’t check them in. Some teams program their continuous integration script to automatically fail the build if it detects DEBUG logs.
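
Such a check might be sketched like this. It assumes the code lives under src/ and that debug logging looks like log.debug(...):

// check-debug-logs.js: a sketch of a CI check; adjust paths and patterns to taste.
const fs = require("node:fs");
const path = require("node:path");

function sourceFiles(dir) {
  return fs.readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) return sourceFiles(fullPath);
    return entry.name.endsWith(".js") ? [fullPath] : [];
  });
}

const offenders = sourceFiles("src").filter((file) =>
  fs.readFileSync(file, "utf8").includes("log.debug(")
);

if (offenders.length > 0) {
  console.error(`DEBUG logs checked in:\n  ${offenders.join("\n  ")}`);
  process.exit(1);   // fail the build
}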

Other alerts will be triggered based on metrics. Those alerts will usually be generated by your monitoring tool, not your application code. Be sure to give each one a look-up code, a human-readable message, and documentation in your runbook, just like your log messages.

Ally
Continuous Deployment

When possible, I prefer to program alerting decisions into my application code, triggering alerts via logs, not into my monitoring configuration. That allows the team to program the “smarts” of alerting using code they’re familiar with. The downside is that changing alerts requires redeploying your software, so this approach works best when your team uses continuous deployment.

Either address the alert, or prevent a false alarm from occurring again.

Beware of alert fatigue. An “Emergency” alert wakes somebody up. It doesn’t take many false alarms for people to stop paying attention to alerts, even (or especially) lowly “Monitor” alerts. Every alert needs to be acted upon: either to address the alert, or to prevent a false alarm from occurring again. If an alert is consistently inappropriate for its level—such as an “emergency” that can actually be taken care of tomorrow, or a “monitor” alert which never indicates a real issue—consider downgrading it, or look for ways to make it smarter.

To help keep alerts under control, be sure to include programmers in your on-call rotation. It will help them understand which alerts are important and which aren’t, and that will lead to code which makes better alerting decisions.

Some organizations require development teams to manage their own systems before handing responsibility off to Operations. Before handing off, the team has to demonstrate a certain level of stability. If stability degrades after the handoff, Operations can “hand back” responsibility to the development team. For more information, see [Kim et al. 2016] (ch. 15).

Write tests for your logs and alerts. They’re just as important to the success of your software as user-facing capabilities. These tests can be difficult to write, because they often involve global state, but that’s a fixable design problem, as “Control Global State” on p.XX discusses. You can use infrastructure wrappers to isolate the global state and make it testable. For example, I use a nullable SystemClock wrapper to control the current date, and a nullable Log wrapper to see what has been written to the log. Episodes 11-15 of [Shore 2020b] demonstrate how to create such wrappers.
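
To give you a flavor of the approach, here’s a sketch of a nullable Log wrapper and a test against it. It’s loosely modeled on the pattern demonstrated in [Shore 2020b]; the method names and structure are placeholders, not the episodes’ exact code:

// A sketch of a nullable Log wrapper; names and structure are assumptions.
const assert = require("node:assert");

class Log {
  static create() { return new Log(console); }
  static createNull() { return new Log({ log: () => {} }); }   // writes nowhere

  constructor(stdout) {
    this._stdout = stdout;
    this._trackers = [];
  }

  trackOutput() {
    const tracker = { data: [] };
    this._trackers.push(tracker);
    return tracker;
  }

  action(data) {
    const entry = { alert: "action", ...data };
    this._stdout.log(JSON.stringify(entry));
    this._trackers.forEach((tracker) => tracker.data.push(entry));
  }
}

// The test sees exactly what was logged, without touching global state.
const log = Log.createNull();
const output = log.trackOutput();

log.action({ code: "L16", message: "ExampleService API has been deprecated." });

assert.deepStrictEqual(output.data, [
  { alert: "action", code: "L16", message: "ExampleService API has been deprecated." },
]);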

Questions

How do we make time for all this?

Allies
Visual Planning
No Bugs
Slack

Operational and security needs can be scheduled with stories, just like everything else. To make sure those stories get prioritized, make sure that people with operations and security skills are part of your planning discussions.

Treat alerts like bugs: they should be unexpected and taken seriously when they happen. Every alert should be addressed immediately. This applies to false alarms, too: when you’re alerted for something that isn’t an issue, improve the alert so it’s less likely to cry wolf.

Quality-of-life improvements, such as tweaking an alert’s sensitivity or improving a log message, can be addressed with your team’s slack.

What if we don’t have anyone with ops or security skills on our team?

Take the time to reach out to your Operations department early in development. Let them know that you’d like their input on operational requirements and alerting, and ask how you can set up a regular check-in to get their feedback. I find that ops folks are pleasantly surprised when dev teams ask them to get involved before anything catches fire. They’re usually eager to help.

I have less experience with involving security folks, partly because they’re less common than ops people, but the general approach is the same: reach out early in development and discuss how you can build security in, rather than adding it as an afterthought.

Prerequisites

Ally
Whole Team

Building for operation can theoretically be done by any team. For best results, though, you’ll need a team that includes people with operations and security skills, or good relationships with people who have those skills, and a management culture that values long-term thinking.

Indicators

When you build for operation:

  • Alerts are targeted and relevant.

  • When production issues occur, your company is prepared to respond quickly and calmly.

  • Your software is resilient and stable.

Alternatives and Experiments

This practice is ultimately about acknowledging that “working software” involves running it in production, not just coding it, and deliberately building your software to accommodate production needs. The best way I know to do so is to involve people with operations and security expertise as part of the team.

You’re welcome to experiment with other approaches. Companies often have limited operations and security staff, so it can be difficult to put people with those skills on every team. Checklists and automated enforcement tools are a common substitute. In my experience, they’re a pale imitation. A better approach is to provide training to develop team members’ operations and security skills.

Before you experiment, try to work with at least one team which takes a true DevOps approach, with skilled people embedded in the team. That way, you’ll know how your experiments compare.

Further Reading

The Twelve-Factor App [Wiggins 2017] is a nice, concise introduction to operations needs, with solid guidance about how to address them.

The DevOps Handbook [Kim et al. 2016] is an in-depth guide to DevOps. Part IV, “The Technical Practices of Feedback,” discusses material that’s similar to this practice.

The Phoenix Project [Kim et al. 2013] is a novel about a beleaguered IT executive learning to introduce DevOps to his organization. Although it’s not strictly about building for operation, it’s a fun read and a good way to learn more about the DevOps mindset.

XXX The Unicorn Project?
