James Shore: AoAD2 Practice: Build for Operation

AoAD2 Practice: Build for Operation

Book cover for “The Art of Agile Development, Second Edition.” by James Shore.

This is an excerpt from The Art of Agile Development, Second Edition. Visit the Second Edition home page for additional excerpts and more!

This excerpt is copyright 2007, 2021 by James Shore and Shane Warden. Although you are welcome to share this link, do not distribute or republish the content without James Shore’s express written permission.

Build for Operation

Audience: Programmers, Operations

Our software is secure and easy to manage in production.

The fundamental idea behind DevOps is simple: by including people with operations and security skills as part of the team, we make it possible to build operability and security into the software, rather than adding it as an afterthought. This is building for operation.

That’s really all there is to it! Include people with ops and security skills on your team, or at least involve them in your team’s decisions. Have them participate in planning sessions. Create stories for making your software easier to monitor, manage, and secure. Discuss why those stories are important, and prioritize them accordingly.

Don’t save operations and security stories for the end of development.

Don’t save operations and security stories for the end of development. It’s better to keep your software ready to release. (See the “Key Idea: Minimize Work in Progress” sidebar.) As you add more capabilities to your software, expand your operability to match. For example, when you add a feature that requires a new database, add stories for provisioning, securing, monitoring, backing up, and restoring that database as well.

What sort of operations and security needs should you consider? Your teammates should be able to tell you. The following sections will help you get started.

Threat Modeling

Building for operation involves shifting left: thinking about security and operations needs from the beginning of development, not at the end. One way to understand those needs is threat modeling. It’s a security technique, but its analysis is helpful for operations, too.

Allies: Visual Planning; Blind Spot Discovery

Threat modeling is a process of understanding your software system and how it can be compromised. In his book Threat Modeling: Designing for Security, [Shostack2014] Adam Shostack described it as a process of answering four questions. It’s a good team activity:

What are you building? Diagram your system architecture: the components of your deployed software, the data flow between components, and the trust or authorization boundaries between them.
What can go wrong? Use simultaneous brainstorming (see the “Work Simultaneously” section) to think of possible ways each component and data flow could be attacked, then dot vote to narrow down your team’s top threats.
What should you do about those things that can go wrong? Brainstorm ways to check or address the top threats, dot vote, and create story cards to do so. Add them to your visual plan.
Did you do a decent job of analysis? Sleep on it, then take a second look at your work to see if you missed anything. Repeat the exercise regularly to incorporate new information and insights. Use blind spot discovery to find gaps in your thinking.

For more details, see [Shostack2014]. It’s written for people without prior security experience, and chapter 1 has everything you need to get started, including a card game for brainstorming threats. (The card game is also available for free online.) For a shorter introduction that includes a well-designed team activity, see Jim Gumbley’s “A Guide to Threat Modelling for Developers.” [Gumbley2020]

Configuration

According to The Twelve-Factor App [Wiggins2017], deployed software is a combination of code and configuration. Configuration, in this case, means the part of your software that’s different for each environment: database connection strings, URLs and secrets for third-party services, and so forth.

When you deploy, you’ll deploy the same code to every environment, whether that’s your local machine, a test environment, or production. But your configuration will change: for example, your test environment will be configured to use a test database, and your production environment will be configured to use your production database.

This definition of “configuration” includes only things that change between environments. Teams often make other things configurable, such as the copyright date in the footer of a website, but those types of configuration should be clearly separated from environment configuration. They’re part of your software’s behavior and should be treated like code, including version controlling it alongside your code. I’ll often use real code for this purpose, such as constants or getters in a Configuration module. (I’ll typically program that module to abstract environment configuration, too.)

Environment configuration, on the other hand, should be isolated from your code. It’s often stored in a separate repository. If you include it in your code repository, which makes sense when your team is responsible for deployments, keep it clearly segregated: for example, source code in a source directory and environment configuration in an environments directory. Then, during deployment, inject the configuration into the deployment environment by setting environment variables, copying files, and so forth. The specifics will depend on your deployment mechanism.

Ally: Feature Flags

Avoid making your software infinitely configurable. Complicated configuration ends up being a form of code—code that’s written in a particularly lousy programming language, without abstractions or tests. Instead, if you need sophisticated configurability, use feature flags to selectively turn behavior on and off. If you need complex customer-controlled behavior, consider using a plug-in architecture. Both approaches will allow you to code details using a real programming language.

Secrets

Team members shouldn’t have access to secrets.

Secrets—passwords, API keys, and so forth—are a special type of configuration. It’s particularly important they not be part of your source code. In fact, most team members shouldn’t have access to secrets at all. Instead, define a secure procedure for generating, storing, rotating, and auditing secrets. For complex systems, this often involves a secret management service or tool.

If you keep environment configuration in a separate repository, you can control access to secrets by strictly limiting access to the repository. If you keep environment configuration in your code repository, you’ll need to encrypt your secrets “at rest,” which means encrypting any files that contain secrets. Program your deployment script to decrypt the secrets prior to injecting them into the deployment environment.

Speaking of deployment, pay particular attention to how secrets are managed by your build and deployment automation. It’s convenient to hardcode secrets in a deploy script or CI server configuration, but that convenience isn’t worth the risk. Your automation needs the same secure procedures as the rest of your code.

Never write secrets to your logs. Because it’s easy to accidentally do so, consider writing your logging wrapper to look for secret-like data (for example, fields named “password” or “secret”) and redact them. When it finds one, have it trigger an alert for the team to fix the mistake.

Paranoiac Telemetry

No matter how carefully you program your code, it will still fail in production. Even perfect code depends on an imperfect world. External services return unexpected data or—worse—respond very slowly. File systems run out of space. Failovers...fail.

Every time your code interacts with the outside world, program it to assume the world is out to get you. Check every error code. Validate every response. Timeout nonresponsive systems. Program retry logic to use exponential back-offs.

When you can safely work around a broken system, do so. When you can’t, fail in a controlled and safe way. Either way, log the issue so your monitoring system can send an alert.

Logging

Nobody wants to be woken up in the middle of the night by a production issue. It happens anyway. When it does, the person on call needs to be able to diagnose and fix the problem with a minimum of fuss.

To make this easier, take a systematic and thoughtful approach to logging. Don’t just spam log statements everywhere in your code. Instead, think about what can go wrong and what people will need to know. Create user stories to address those questions. For example, if you discover a security breach, how will you determine which users and data were affected? If performance worsens, how will you determine what to fix? Prospective analysis (see the “Prospective Analysis” section) and your threat model can help you identify and prioritize these stories.

Use structured logs, routed to a centralized data store, to make your logs easier to search and filter. A structured log outputs data in a machine-readable format, such as JSON. Write your logging wrapper to support logging arbitrary objects. This will allow you to easily include variables that provide important context.

For example, I worked on a system that depended on a service that deprecated its APIs with special response headers. We coded our software to check for the presence of those headers and run this Node.js code:

log.action({
  code: "L16",
  message: "ExampleService API has been deprecated.",
  endpoint,
  headers: response.headers,
});

The output looked like this:

{
  "timestamp": 16206894860033,
  "date": "Mon May 10 2021 23:31:26 GMT",
  "alert": "action",
  "component": "ExampleApp",
  "node": "web.2",
  "correlationId": "b7398565-3b2b-4d80-b8e3-4f41fdb21a98",
  "code": "L16",
  "message": "ExampleService API has been deprecated.",
  "endpoint": "/v2/accounts/:account_id",
  "headers": {
    ⋮
    "x-api-version": "2.22",
    "x-deprecated": true,
    "x-sunset-date": "Wed November 10 2021 00:00:00 GMT"
  }
}

The extra context provided by the log message made the issue easy to diagnose. The endpoint string made it clear exactly which API had been deprecated, and the headers object allowed programmers to understand the details and eliminate the possibility of false positives.

Provide context when throwing exceptions, too. For example, if you have a switch statement that should never run the default case, you might assert that it’s unreachable. But don’t just throw “unreachable code executed.” Throw a detailed exception that will allow your team to troubleshoot the issue. For example: “Unknown user subscription type ‘legacy-036’ encountered while attempting to send onboarding email.”

In the logging example, you’ll see several standard fields. Most of them were added by the logging infrastructure. Consider adding them to your logs as well:

Timestamp: Machine-readable version of when the log occurred.
Date: A human-readable version of timestamp.
Alert: What kind of alert to send. Often called “level” instead. I’ll explain further in a moment.
Component: The codebase that generated the error.
Node: The specific machine that generated the error.
Correlation ID: A unique ID that groups together related logs, often across multiple services. For example, all logs related to a single HTTP request might have the same correlation ID. It’s also called a “request ID.”
Code: An arbitrary, unique code for this log message. It never changes. It can be used to search the logs and to look up documentation.
Message: A human-readable version of code. Unlike code, it can change.

Accompany your logs with documentation that provides a brief explanation of each alert and, more importantly, what to do about it. This will often be part of your runbook, which is a set of procedures and processes for a particular piece of software. For example:

L16: ExampleService API has been deprecated.

Explanation: ExampleService, our billing provider, has made plans to remove an API we use.

Alert: Action. Our application will stop working when they remove the API, so it’s vital that we address it, but they provide a grace period, so we don’t need to address it immediately.

Actions:

The code probably needs to be upgraded to the new API. If it does, the headers object, which contains the headers returned by ExampleService, should contain an x-deprecated header and an x-sunset-date header.

The endpoint field shows the specific API that caused the alert, but other endpoints we use could also be affected.

The urgency of the upgrade depends on when the API will be retired, which is shown by the x-sunset-date header. You can verify it by checking ExampleService’s documentation at example.com.

You’ll probably want to disable this alert until the API is upgraded. Be careful not to accidentally disable alerts for other APIs at the same time, and consider having the alert automatically re-enable, so the upgrade isn’t forgotten.

Describe what the reader needs to know, not what they need to do.

Note the casual, nondefinitive tone of the “Actions” section. That’s intentional. Detailed procedures can result in a “responsibility-authority double bind,” in which people are afraid to change a procedure even though it’s not appropriate for their situation. [Woods2010] (ch. 8) By describing what the reader needs to know, not what they need to do, authority for decision making is put back in the hands of the reader, allowing them to adapt the advice to the situation at hand.

Metrics and Observability

In addition to logging, your code should also measure items of interest, called metrics. Most metrics will be technical, such as the number of requests your application receives, the time it takes to respond to requests, memory usage, and so forth. But some will be business oriented, such as the number and value of customer purchases.

Metrics are typically accumulated for a period of time, then reported. This can be done inside your application and reported via a log message, or it can be done by sending events to a metric aggregation tool.

Take a thoughtful approach to observability.

Together, your logs and metrics create observability: the ability to understand how your system is behaving, both from a technical perspective and a business perspective. As with your logs, take a thoughtful approach to observability. Talk to your stakeholders. What observability do you need from an operations perspective? From a security perspective? From a business perspective? From a support perspective? Create user stories to address those needs.

Monitoring and Alerting

Monitoring detects when your logs and metrics need attention. When they do, the monitoring tool will send an alert—an email, chat notification, or even a text message or phone call—so somebody can take care of it. In some cases, the alerting will be done by a separate service.

The decisions about what to alert and what not to alert can get complicated, so your monitoring tool may be configurable via a proper programming language. If it is, be sure to treat that configuration with the respect due a real program. Store it in version control, pay attention to design, and write tests, if you can.

The types of alerts your monitoring tool sends will depend on your organization, but they typically fall into four categories:

Emergency: Something’s on fire, or about to be, and a human needs to wake up and fix it now.
Action: Something important needs attention, but it’s not urgent enough to wake anybody up.
Monitor: Something is unusual, and a human should take a second look. (Use these sparingly.)
Info: Doesn’t require anyone’s attention, but is useful for observability.

You’ll configure your monitoring tool to perform some alerts based on your logs. Although the cleanest route would be to program your logs to use the preceding terms, most logging libraries use FATAL, ERROR, WARN, INFO, and DEBUG instead. Although they technically have different meanings, you can map them directly to the preceding levels: use FATAL for Emergency, ERROR for Action, WARN for Monitor, and INFO for...Info. Don’t use DEBUG at all—it just adds noise.

It’s okay to use DEBUG logs during development, but don’t check them in. Some teams program their continuous integration script to automatically fail the build if it detects DEBUG logs.

Other alerts will be triggered based on metrics. Those alerts will usually be generated by your monitoring tool, not your application code. Be sure to give each one a look-up code, a human-readable message, and documentation in your runbook, just like your log messages.

Ally: Continuous Deployment

When possible, I prefer to program alerting decisions into my application code, triggering alerts via logs, not into my monitoring configuration. That allows team members to program the “smarts” of alerting using code they’re familiar with. The downside is that changing alerts requires redeploying your software, so this approach works best when your team uses continuous deployment.

Beware of alert fatigue. Remember: an “emergency” alert wakes somebody up. It doesn’t take many false alarms for people to stop paying attention to alerts, especially “monitor” alerts. Every alert needs to be acted upon: either to address the alert or to prevent a false alarm from occurring again. If an alert is consistently inappropriate for its level—such as an “emergency” that can actually be taken care of tomorrow, or a “monitor” alert that never indicates a real issue—consider downgrading it, or look for ways to make it smarter.

Either address the alert or prevent a false alarm from occurring again.

Similarly, an alert that consistently involves a rote response, with no real thinking, is a candidate for automating away. Convert the rote response into code. This particularly applies to “monitor” alerts, which can become a dumping ground for noisy trivia. It’s okay to create a “monitor” alert to help you understand the behavior of your system, but once you have that understanding, upgrade it to “action” or “emergency” by making the alert more selective.

To help keep alerts under control, be sure to include programmers in your on-call rotation. It will help them understand which alerts are important and which aren’t, and that will lead to code that makes better alerting decisions.

Some organizations have dedicated Operations teams who manage systems and take care of on-call duty. This works best when the Development team manages its own systems first. Before they hand off responsibility, they have to demonstrate a certain level of stability. For more information, see [Kim2016] (ch. 15).

Write tests for your logs and alerts. They’re just as important to the success of your software as user-facing capabilities. These tests can be difficult to write, because they often involve global state, but that’s a fixable design problem, as the “Control Global State” section discusses. Episodes 11–15 of [Shore2020b] demonstrate how to do so.

Cargo Cult: The DevOps Team

It’s your first week on the job when you stick your head into your boss’s office. “Hey Waldo, do you have a moment? I have a quick question.” He nods genially and motions you to take a seat.

You make yourself comfortable. “Who are the DevOps people on our team? I have an issue I need to discuss with someone, but I haven’t been able to figure out how we do DevOps.”

“On the team?” Waldo raises an eyebrow. “No, we have a separate DevOps team, of course. ProdOps and DataOps, too. DevOps takes care of test environments and CI tooling, ProdOps handles the production environment, and DataOps is in charge of databases and schema. Shouldn’t you know this? Your last company must have been pretty small to mix everyone together.”

“No, we...” You shake your head. “Long story. It was a different kind of DevOps. Anyway, can I talk to the ProdOps folks? It’s possible for the API I’m using to throw a ‘Disk Full’ exception, and I want to coordinate with them on the appropriate response.”

Waldo sucks air through his teeth. “Ooooh, I wouldn’t do that. They’re kind of prickly, and super busy. Everything has to go through their ticketing system, and they hate it when we waste their time. Anyway, when will ‘Disk Full’ ever happen? Just log the exception and return null. That’s how we do things here. We’ve got a deadline to meet.”

You take a breath to argue, but Waldo holds up his hand. “Look, I know it’s different than what you’re used to, but we’re a big company. You can’t just go around interrupting people with trivialities. Things have to be done properly.”

Questions

How do we make time for all this?

As reviewer Sarah Horan Van Treese said, you don’t have time not to. “In my experience, teams working on software that isn’t ‘built for operation’ typically waste an enormous amount of time on things that could be avoided altogether or at least diagnosed and resolved in minutes with better observability in place.” Take care of it now, or waste more time firefighting later.

Allies: Visual Planning; No Bugs; Slack

Operational and security needs can be scheduled with stories, just like everything else. To make sure those stories get prioritized, make sure that people with operations and security skills are part of your planning discussions.

Treat alerts like bugs: they should be unexpected and taken seriously when they happen. Every alert should be addressed immediately. This applies to false alarms, too: when you’re alerted for something that isn’t an issue, improve the alert so it’s less likely to cry wolf.

Quality-of-life improvements, such as tweaking an alert’s sensitivity or improving a log message, can be addressed with your team’s slack.

What if we don’t have anyone with ops or security skills on our team?

Take the time to reach out to your Operations department early in development. Let them know that you’d like their input on operational requirements and alerting, and ask how you can set up a regular check-in to get their feedback. I find that ops folks are pleasantly surprised when dev teams ask them to get involved before anything catches fire. They’re usually eager to help.

I have less experience with involving security folks, partly because they’re less common than ops people, but the general approach is the same: reach out early in development and discuss how you can build security in, rather than adding it as an afterthought.

Prerequisites

Ally: Whole Team

Building for operation can be done by any team. For best results, you’ll need a team that includes people with operations and security skills, or good relationships with people who have those skills, and a management culture that values long-term thinking.

Indicators

When you build for operation:

Your team has considered and addressed potential security threats.
Alerts are targeted and relevant.
When production issues occur, your organization is prepared to respond.
Your software is resilient and stable.

Alternatives and Experiments

This practice is ultimately about acknowledging that “working software” involves running it in production, not just coding it, and deliberately building your software to accommodate production needs. The best way I know to do so is to involve people with operations and security expertise as part of the team.

You’re welcome to experiment with other approaches. Companies often have limited operations and security staff, so it can be difficult to put people with those skills on every team. Checklists and automated enforcement tools are a common substitute. In my experience, they’re a pale imitation. A better approach is to provide training to develop team members’ skills.

Before you experiment, try to work with at least one team that takes a true DevOps approach, with skilled people embedded in the team. That way, you’ll know how your experiments compare.

The Art of Agile^SM