James Shore: Alternatives to Acceptance Testing

February 28, 2010

Updated 3 Mar 2010: Clarified "Customer Examples" section.

My essay on the problems with acceptance testing caused a bit of a furor. (Gojko Adzic, George Dinwiddie, Ron Jeffries.) The biggest criticism was that, without acceptance testing, how do you know that the software continues to work? The classic approach, manual regression testing, is a huge burden that only increases over time.

It's true. Manual regression testing is not a good idea. I'm not recommending it. But that's only one of the alternatives to automated acceptance testing.

Every software development practice has a cost and a benefit. The trick in designing your ideal method is to choose families of practices that combine to maximize benefits and minimize costs. When I said, "acceptance testing tools cost more than they're worth. I no longer use it or recommend it," I meant that I've found alternatives that provide the same benefit for lower cost.

This is what I do instead.

My Goal

When it comes to testing, my goal is to eliminate defects. At least the ones that matter. (Netscape 4.01 users, you're on your own.) And I'd much rather prevent defects than find and fix them days or weeks later.

I think of defects as coming from four sources: programmer errors, design errors, requirements errors, and systemic errors. When trying to eliminate defects, I look for practices that address these four causes.

Programmer errors occur when a programmer knows what to program but doesn't do it right. His algorithm might be wrong, he might have made a typo, or he may have made some other mistake while translating his ideas into code.

Design errors create breeding grounds for bugs. According to Barry Boehm (PDF), 20% of the modules in a program are typically responsible for 80% of the errors. They're not necessarily the result of an outright mistake, though. Often a module starts out well but gets crufty as the program grows.

Requirements errors occur when a programmer creates code that does exactly what she intends, but her intention was wrong. Somehow she misunderstood what needed to be done. Or, perhaps, no one knew what needed to be done. Either way, the code works as intended, but it doesn't do the right thing.

Systemic errors make mistakes easy. The team's work habits include a blind spot that lets subtle defects escape. In this environment, everybody thinks they're doing the right thing--programmers are confident that they're expressing their intent, and customers are confident that programmers are doing the right thing--yet defects occur anyway. Security holes are a common example.

I use the following techniques to eliminate these sources of errors. Although there are a lot of practices here, nearly all of them are practices my teams already use. I get them for free. Also, a long list doesn't translate to high cost. My experience with acceptance testing is that it's high cost, even though it doesn't take much time to describe. These practices, despite taking a lot of time to describe, are low cost. Many of them, such as TDD and fixing bugs promptly, actually decrease costs.

Finally, and most importantly, I can't take credit for these practices. Most of them are standard XP practices. All I've done is tweak the package a bit so I get the benefits of acceptance testing without the cost.

1. Preventing Programmer Errors

Test-Driven Development (TDD) is my defect-elimination workhorse. Not only does it improve the quality of my code, it gives me a comprehensive regression suite that I can use to detect future errors. My criterion for TDD is that I need to continue to write tests until I'm confident that the code does what I think it does and that the tests will alert me to improper changes in the future.

A lot of people think that TDD is just for unit testing. I don't use it that way. As I use TDD, I write three kinds of tests:

Unit tests, which are focused on a single class or set of interrelated classes. These tests don't cross process boundaries, don't involve I/O, and should run at a rate of hundreds per second. The number of tests scales linearly with the size of the code. Typically, 30-50% of my codebase is unit test code.

Focused integration tests, which test specific integration points in my program. For example, if I have a library that provides a database abstraction, I'll write tests that prove that I can connect to a database, perform queries, insert rows, and so forth. These tests run at a rate of tens per second. The number of tests scales proportionally to the number of ways my code talks to the outside world.

End-to-end integration tests, which test all layers of the application, from UI to database. Typically, the test will act like a user of the program, entering text and clicking buttons (or simulating clicks in the presentation layer), and then check the result by looking at the database or checking the program's output. These tests are very slow--seconds or even minutes per test--and tend to break when the program changes. I keep these to the minimum necessary for me to be confident that all of the pieces of the program fit together properly. Legacy systems written without TDD require a lot more end-to-end tests, but my goal is to replace them with unit tests and focused integration tests.
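
To make the difference between the first two kinds of tests concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `batch_total` function, the `CustomerStore` class, and the use of an in-memory SQLite database as a stand-in for a real database server are not from any actual project.

```python
import sqlite3
import unittest


def batch_total(charges):
    """Pure logic (hypothetical): sum a list of charges given in cents."""
    return sum(charges)


class CustomerStore:
    """Hypothetical database abstraction, used only for this sketch."""

    def __init__(self, connection):
        self._db = connection
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS customers (name TEXT NOT NULL)"
        )

    def add(self, name):
        self._db.execute("INSERT INTO customers (name) VALUES (?)", (name,))

    def names(self):
        rows = self._db.execute("SELECT name FROM customers ORDER BY name")
        return [row[0] for row in rows]


class BatchTotalUnitTest(unittest.TestCase):
    """Unit test: no I/O and no process boundary, so it runs very fast."""

    def test_sums_charges(self):
        self.assertEqual(batch_total([50, 99, 25]), 174)

    def test_empty_batch_is_zero(self):
        self.assertEqual(batch_total([]), 0)


class CustomerStoreIntegrationTest(unittest.TestCase):
    """Focused integration test: exercises one integration point, the database."""

    def setUp(self):
        # Assumption: an in-memory SQLite database stands in for the real one.
        self.connection = sqlite3.connect(":memory:")
        self.store = CustomerStore(self.connection)

    def tearDown(self):
        self.connection.close()

    def test_inserted_rows_can_be_queried(self):
        self.store.add("Fred")
        self.store.add("Alice")
        self.assertEqual(self.store.names(), ["Alice", "Fred"])
```

Saved as a module, these can be run with `python -m unittest`. The unit test touches nothing outside the process. The integration test proves only that the abstraction can insert and query rows, not that the whole application works.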

In addition to TDD, Pair Programming helps prevent programmer error through continuous code review. Two brains are better than one. Pairing also improves design quality, which reduces design errors.

I also implement policies that enable Energized Work. Tired, burned-out programmers make dumb mistakes. I use sensible work hours and encourage people to go home on time.

2. Preventing Design Errors

The techniques for eliminating programming errors also benefit the design, but I use additional practices to continually improve the design.

Simple Design emphasizes creating small, focused designs that only solve the problems we're facing today. Keeping the designs focused makes them easier to understand, reducing the likelihood of creating a bug breeding ground.

Incremental Design and Architecture works hand-in-hand with Simple Design by enabling us to grow our initial, simple designs as our needs become more complex. The two practices together are sometimes called Evolutionary Design.

Refactoring is an essential tool that allows us to change existing designs. Not only does it enable Incremental Design and Architecture, we can use it to improve poor quality designs. Even the best design gets crufty over time. I build Slack into every iteration and use the extra time to refactor away technical debt.

And finally, when my teams find a bug, we fix it right away--or decide that it isn't ever worth fixing. Not only do bugs get more expensive to fix over time, they're probably in the 20% of the code that causes 80% of the defects. After writing a test to reproduce the bug and fixing it, we look for ways to refactor the code so similar bugs are less likely in the future.

3. Preventing Requirements Errors

Acceptance tests are mostly about requirements. Since I don't use acceptance tests, I need lots of collaboration instead.

Whole Team

First, a Whole Team is essential. My teams are cross-functional, co-located, and have the resources necessary to succeed. In particular, my teams include customers and testers on-site as part of the team.

"On-site customers" is a broad term that includes product managers, product owners, domain experts, business analysts, interaction designers, and others. Essentially, these are the people on the team who understand and generate the software's requirements. Sometimes they make requirements up out of whole cloth, but more often they supplement their work with lots of stakeholder feedback.

When programmers need to understand requirements, they simply turn their head and ask the on-site customers. For anything that can't be explained easily, the customers provide concrete examples, cunningly called Customer Examples, to illustrate the point.

Customer Examples

Examples turn out to be a very powerful form of communication, and it's the legacy of Fit that's provided me with the most value. Typically, when you ask an on-site customer to explain something, she will try to explain the general principle. For example, she might say, "when a customer buys a ringtone, we either debit his pre-paid account or charge his credit card, but if he's using a credit card and the ringtone costs less than one dollar, we batch up the charges until the end of the month or until his total expenses are more than twenty dollars." Not only is this hard to understand, customers often leave out important details.

Instead, I have the customers illustrate their descriptions with concrete examples. I'll say, "Okay, so we have a customer named Fred, and he's paying with a credit card. He buys a ringtone that costs fifty cents. He's already purchased 12 dollars of merchandise this month. How much do we charge his card for?" Once the dam has broken, the customers are able to provide additional examples in the same vein, and we keep going until we have examples for all of the edge cases we need.

Examples aren't just for domain rules. I also ask for work-flow examples and UI examples, such as screen mock-ups. Any non-trivial requirement can benefit from a concrete example. They can be informal, such as a discussion at a whiteboard, or formal, such as a PowerPoint mockup. I let the on-site customers decide how much formality they need--they often keep some of the examples for later reference.

The programmers use these examples to guide their work. Sometimes they'll use the examples directly in their tests. Domain rule examples are most likely to map directly to programmer concepts, particularly if the team uses Domain-Driven Design, a Ubiquitous Language, and Whole Value.
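
As a rough illustration of a domain rule example mapping directly onto a test, here is a Python sketch of the Fred example. It rests on guesses that a real team would confirm with the on-site customers: that "total expenses" means everything purchased this month, that Fred's earlier 12 dollars has already been charged (so nothing is waiting in the batch), and that crossing the twenty-dollar mark charges the whole batch. The function name, its parameters, and the non-Fred examples are hypothetical. Pre-paid accounts and the end-of-month charge are left out.

```python
import unittest

SMALL_PURCHASE_LIMIT_CENTS = 100  # purchases under one dollar are batched
BATCH_FLUSH_LIMIT_CENTS = 2000    # batch is charged once spending exceeds $20


def amount_to_charge_now(price_cents, spent_this_month_cents, batched_cents=0):
    """Return how many cents to charge the credit card right now.

    Assumed interpretation of the rule, not a confirmed one.
    """
    if price_cents >= SMALL_PURCHASE_LIMIT_CENTS:
        return price_cents  # not a small purchase: charge it immediately
    new_total = spent_this_month_cents + price_cents
    if new_total > BATCH_FLUSH_LIMIT_CENTS:
        return batched_cents + price_cents  # limit crossed: charge the batch
    return 0  # stays in the batch for now


class RingtoneChargeExamples(unittest.TestCase):
    def test_fred_buys_fifty_cent_ringtone_after_twelve_dollars(self):
        # The customer's example: fifty cents, 12 dollars already purchased.
        self.assertEqual(amount_to_charge_now(50, 1200), 0)

    def test_ringtone_of_a_dollar_or_more_is_charged_immediately(self):
        # Hypothetical follow-up example.
        self.assertEqual(amount_to_charge_now(150, 1200), 150)

    def test_reaching_exactly_twenty_dollars_still_batches(self):
        # Hypothetical edge case: the rule says "more than" twenty dollars.
        self.assertEqual(amount_to_charge_now(50, 1950), 0)

    def test_passing_twenty_dollars_charges_the_whole_batch(self):
        # Hypothetical edge case: 1.50 already batched, then the limit is crossed.
        self.assertEqual(amount_to_charge_now(50, 1980, batched_cents=150), 200)
```

Each guess listed above is exactly the kind of detail the general description leaves out, and each is a question to take back to the customers as another example.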

More often, though, the examples are too high-level to be used directly in the tests. Workflow examples, in particular, often correspond to the end-to-end integration tests I prefer to avoid. Instead, programmers use the examples as a guide, writing a multitude of more focused, programmer-centric tests as they use TDD. (As they're working, they're likely to do a few manual walk-throughs of the program to make sure things come together as expected, but these "tests" aren't saved or repeated in any way. They're just part of the conscientious, paranoid programmer's workflow.)

The programmers stop when they're confident that the software reflects the customers' goals, as illustrated with the examples. Because they use TDD, they've also created a sophisticated suite of regression tests that are far more complete and thorough than the original examples. Even though the examples themselves aren't coded as tests, the regression suite will fail if the software breaks.

There's one problem with this approach: it doesn't close the feedback loop. Programmers work from customers' examples, and talk with the customers when they have questions, but how do they confirm that they really understood what the customers wanted? That's where Customer Review of the completed software comes in.

Aside: Customer Confidence

If tests are for the programmers' confidence, what gives the customers confidence? This was one of the hardest lessons for me to learn. I would show customers passing Fit tests, and they would say, "How do I know that the program is really doing what this says? How do I know you didn't just program it to give the right answer for these examples?" It was so frustrating! We were essentially being accused of fraud. I wanted to shout, "Just trust us!" But, of course, that was the problem. They didn't.

The question of customer confidence isn't one that any tool can resolve. It all comes down to interpersonal relations. Customers who don't trust the programmers aren't going to be assuaged by a test run. Customers who do trust the programmers don't need a test run to bolster that confidence.

In my experience, a reliable track record of shipping defect-free software is what improves customer confidence. Nothing more; nothing less. (Sorry.)

Customer Review

The check on the programmers' interpretation of requirements is continuous customer review. As programmers finish parts of the system, they pair with on-site customers to walk through the new functionality. This review often reveals minor flaws--sometimes purely cosmetic--that nobody thought to mention. Sometimes customers realize that what they asked for wasn't quite right. Reviews happen as soon as programmers have something to show, which exposes errors in time to be corrected. In addition, stories aren't marked "done done" until the on-site customers agree they're done.

Bring Testers Forward

Although examples help communicate complex requirements, customers typically have trouble thinking of edge cases. Programmers can help them spot holes, but I've experienced even better results from bringing testers into the mix.

In my experience, some testers are business-oriented testers: they're very interested in getting business requirements right. These testers work with the customers to uncover all the nit-picky details the customers would otherwise miss. They'll often prompt the customers to provide examples of edge cases during requirements discussions.

Other testers are more technically-oriented. They're interested in test automation and non-functional requirements. These testers act as technical investigators for the team. They create the massive testbeds that look at issues such as scalability, reliability, and performance.

4. Preventing Systemic Errors

If everyone does their job perfectly, the previous practices yield a system with no defects. Too bad perfection is so hard to come by. The team is sure to have blind spots: subtle areas where they don't do everything they need to, but they don't know it.

Escaped defects are a clear signal of problems in paradise. Although defects are inevitable--TDD alone has programmers finding and fixing defects every few minutes--defects that are found by end-users have "escaped." Every escaped defect is an indication of a flaw somewhere in the process.

Of course, we don't really want our end-users to be our beta testers. (Not usually, anyway.) That's where exploratory testing comes in.

Exploratory Testing

Exploratory Testing is a technique for finding surprising defects. Testers use their training, experience, and intuition to form hypotheses about where defects are likely to be lurking in the software, then they use a fast feedback loop to iteratively generate, execute, and refine test plans that expose those defects. It appears similar to ad-hoc testing to an untrained observer, but it's far more rigorous.

Some teams use exploratory testing to check the quality of their software. After a story's been coded, the testers do some exploratory testing, the team fixes bugs, and the cycle repeats. Once the testers don't find any more bugs, the story is done.

I do it differently. If the team is preventing defects like they should be, their code should be defect-free without additional testing. Instead, I use exploratory testing as a check on the quality of the process, not the software. Any defects found through exploratory testing are considered "escaped defects," indicating a flaw in the process. Only completed stories go through exploratory testing. In fact, once the team has demonstrated that it produces nearly zero defects, they release at-will, without a separate testing phase.

Fix the Process

Every escaped defect, whether found in exploratory testing or found by end-users, is an indication of a flaw in the process. When we find one, we fix our process.

The first thing is to analyze the defect, write a test that reproduces it, and fix it. While we're fixing it, we look at the design of the code and see if it needs improvement, too.

Next, we conduct Root-Cause Analysis. We ask ourselves, "What about our process allowed this defect to escape?" We continue asking "why" until we get to the root cause. Once we find it, we make changes to prevent that entire category of defects from happening again.

Sometimes we can prevent defects by changing the design of our system so that type of defect is impossible. For example, if we find a defect that's caused by a mismatch between UI field lengths and database field lengths, we might change our build to automatically generate the UI field lengths from database metadata.

When we can't make defects impossible, we try to catch them automatically, typically by improving our build or test suite. For example, we might create a test that looks at all of our UI field lengths and checks each one against the database.
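
A check like that might look something like the following Python sketch. It is only an illustration under stated assumptions: the `UI_FIELD_LENGTHS` table, the `customers` schema, and the helper names are hypothetical, and SQLite's `PRAGMA table_info` stands in for whatever metadata query the real database offers. (SQLite records a declared type such as `VARCHAR(40)` but does not enforce the length, so it serves here purely as a metadata source.)

```python
import re
import sqlite3

# Hypothetical UI metadata: (table, column) -> maximum length the form allows.
UI_FIELD_LENGTHS = {
    ("customers", "name"): 40,
    ("customers", "email"): 120,
}


def declared_column_lengths(connection, table):
    """Read declared lengths such as VARCHAR(40) from the schema metadata."""
    lengths = {}
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    # The table name comes from our own metadata, not from user input.
    for _cid, name, declared_type, *_rest in connection.execute(
        f"PRAGMA table_info({table})"
    ):
        match = re.search(r"\((\d+)\)", declared_type)
        if match:
            lengths[name] = int(match.group(1))
    return lengths


def field_length_mismatches(connection, ui_lengths):
    """Return (table, column, ui_length, db_length) for every disagreement."""
    mismatches = []
    for (table, column), ui_length in sorted(ui_lengths.items()):
        db_length = declared_column_lengths(connection, table).get(column)
        if db_length != ui_length:
            mismatches.append((table, column, ui_length, db_length))
    return mismatches


def demo_connection():
    """Build a throwaway schema whose email column disagrees with the UI."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE customers (name VARCHAR(40), email VARCHAR(100))"
    )
    return connection
```

In a test suite, the assertion would simply be that `field_length_mismatches` returns an empty list. With the demo schema above it reports the `email` field, because the UI allows 120 characters while the column is declared as 100.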

And lastly, we might change our process to make defects less likely. For example, we might add an item to our "done done" checklist that reminds us to double-check any field lengths we changed.

'Tude

I also encourage an attitude among my teams... a bit of eliteness, even snobbiness. It goes like this: "Bugs happen to other people."

In a lot of teams, bugs are a way of life. But when you have the attitude that defects are inevitable, you aren't motivated to prevent them. They're just a nuisance to be fixed and forgotten as quickly as possible.

All of the practices I've described here take discipline and rigor. They're not necessarily difficult, but they break down if people are sloppy or don't care about their work. A culture of "No Bugs" helps the team maintain the discipline required, as do pair programming, a co-located team, and collective ownership.

The Complete Package

So that's how I achieve low defects without using acceptance tests. As I said in my previous essay, I've found acceptance testing to be high cost and low value. This set of practices, though long, is mostly free for teams using XP. For those teams, it costs less than acceptance testing and does a better job of preventing defects.

It's not easy, by any means. To achieve the value I'm describing here, you have to be rigorous in your approach to TDD and you have to commit wholeheartedly to having a cross-functional, co-located team. It's really based on a rigorous application of XP. If you can't do that, then perhaps automated acceptance tests are your best alternative.

But if you can do this instead, I think you'll find that it gives you better results. It has for me.