James Shore: The Art of Agile Development: Version Control

October 21, 2010

in 99 words

To support collective ownership, use a concurrent model of version control. Support time travel by storing tools, libraries, documentation, and everything else related to the project in version control. (On-site customers' files, too.) Keep the entire project in a single repository.

Avoid long-lived branches, particularly for customized versions; they'll cripple your ability to deliver on a timely schedule. Instead, use configuration files and build scripts to support multiple configurations.

Keep your repository clean: never check in broken code. All versions should build and pass all tests. "Iteration" versions are ready for stakeholders; "release" versions are production-ready.

Full Text

Version Control

Audience: Programmers

We keep all of our project artifacts in a single, authoritative place.

To work as a team, you need some way to coordinate your source code, tests, and other important project artifacts. A version control system provides a central repository that helps coordinate changes to files and also provides a history of changes.

A project without version control may have snippets of code scattered among developer machines, networked drives, and even removable media. The build process may involve one or more people scrambling to find the latest versions of several files, trying to put them in the right places, and only succeeding through the application of copious caffeine, pizza, and stress.

Ally: Continuous Integration

A project with version control uses the version control system to mediate changes. It's an orderly process in which developers get the latest code from the server, do their work, run all of the tests to confirm that their code works, then check in their changes. This process, called continuous integration, occurs several times a day for each pair.

If you aren't familiar with the basics of version control, start learning now. Learning to use a version control system effectively may take a few days, but the benefits are so great that it is well worth the effort.

Version Control Terminology

Different version control systems use different terminology. Here are the terms I use throughout this book:

Repository

The repository is the master storage for all of your files and their history. It's typically stored on the version control server. Each stand-alone project should have its own repository.

Sandbox

Also known as a working copy, a sandbox is what team members work out of on their local development machines. (Don't ever put a sandbox on a shared drive. If other people want to develop, they can make their own sandbox.) The sandbox contains a copy of all the files in the repository from a particular point in time.

Check out

To create a sandbox, check out a copy of the repository. In some version control systems, this term means "update and lock".

Update

Update your sandbox to get the latest changes from the repository. You can also update to a particular point in the past.

Lock

A lock prevents anybody from editing a file but you.

Check in or commit

Check in the files in your sandbox to save them into the repository.

Revert

Revert your sandbox to throw away your changes and return to the point of your last update. This is handy when you've broken your local build and can't figure out how to get it working again. Sometimes reverting is faster than debugging, especially if you have checked in recently.

Tip or head

The tip of the repository contains the latest changes that have been checked in. When you update your sandbox, you get the files at the tip. (This changes somewhat when you use branches.)

Tag or label

A tag marks a particular time in the history of the repository, allowing you to easily access it again.

Roll back

Roll back a check-in to remove it from the tip of the repository. The mechanism for doing so varies depending on the version control system you use.

Branch

A branch occurs when you split the repository into distinct "alternate histories," a process known as branching. All the files exist in each branch, and you can edit files in one branch independently of all other branches.

Merge

A merge is the process of combining multiple changes and resolving any conflicts. If two programmers change a file separately and both check it in, the second programmer will need to merge in the first person's changes.

Concurrent Editing

If multiple developers modify the same file without using version control, they're likely to accidentally overwrite each other's changes. To avoid this pain, some developers turn to a locking model of version control: when they work on a file, they lock it to prevent anyone else from making changes. The files in their sandboxes are read-only until locked. If you have to check out a file in order to work on it, then you're using a locking model.

Ally: Collective Code Ownership

While this approach solves the problem of accidentally overwriting changes, it can cause other, more serious problems. A locking model makes it difficult to make changes. Team members have to carefully coordinate who is working on which file, and that stifles their ability to refactor and make other beneficial changes. To get around this, teams often turn to strong code ownership, the worst of the code ownership models because only one person has the authority to modify a particular file. Collective code ownership is a better approach, but it's very hard to do if you use file locking.

Instead, use a concurrent model of version control. This model allows two people to edit the same file simultaneously. The version control system automatically merges their changes—nothing gets overwritten accidentally. If two people edit the exact same lines of code, the version control system prompts them to merge the two lines manually.
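As a rough illustration of how an automatic merge decides what to keep, here is a toy three-way merge in Python. It is a sketch, not how any particular version control system works. The name merge_lines is made up for this example, and the code assumes both programmers only edited lines in place, so all three versions have the same number of lines. Real tools also cope with inserted and deleted lines.

```python
def merge_lines(base, mine, theirs):
    """Toy three-way merge of three equal-length lists of lines.

    base is the version both programmers started from; mine and theirs
    are their edited copies. Returns (merged, conflicts), where conflicts
    lists the indexes of lines that both sides changed differently.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place edits")

    merged = []
    conflicts = []
    for index, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t:
            merged.append(m)       # same result on both sides
        elif m == b:
            merged.append(t)       # only they changed this line
        elif t == b:
            merged.append(m)       # only I changed this line
        else:
            merged.append(None)    # both changed it: needs a manual merge
            conflicts.append(index)
    return merged, conflicts


base = ["total = 0", "for x in items:", "    total += x"]
mine = ["total = 0.0", "for x in items:", "    total += x"]
theirs = ["total = 0", "for item in items:", "    total += x"]
merged, conflicts = merge_lines(base, mine, theirs)
print(merged)
print(conflicts)
```

Changes to different lines combine without anyone's work being lost. Only a line that both people changed in different ways ends up in conflicts, and that is the case where the tool asks a person to decide.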

Allies: Continuous Integration; Ten-Minute Build

Automatic merges may seem risky. They would be risky if it weren't for continuous integration and the automated build. Continuous integration reduces the scope of merges to a manageable level, and the build, with its comprehensive test suite, confirms that merges work properly.

Time Travel

One of the most powerful uses of a version control system is the ability to go back in time. You can update your sandbox with all the files from a particular point in the past.

This allows you to use diff debugging. When you find a challenging bug that you can't debug normally, go back in time to an old version of the code when the bug didn't exist. Then go forward and backwards until you isolate the exact check-in that introduced the bug. You can review the changes in that check-in alone to get insight into the cause of the bug. With continuous integration, the number of changes will be small.

A powerful technique for diff debugging is the use of the binary chop, in which you cut the possible number of changes in half with each test. If you know version 500 doesn't have the bug and your current version, 700, does, then check version 600. If the bug is present in version 600, it was introduced somewhere between version 501 and 600, so now check version 550. If the bug is not present in version 600, it was introduced somewhere between version 601 and 700, so check version 650. By applying a bit of logic and halving the search space each time, you can quickly isolate the exact version.

A bit of clever code could even automate this search with an automated test.¹

¹Thanks to Andreas Kö for demonstrating this.
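One way such an automated binary chop might look is sketched below in Python. The names find_first_bad_version and has_bug are hypothetical. In practice, has_bug would update a sandbox to the given version and run an automated test that detects the bug; here it is a stand-in function so the example runs on its own. The sketch assumes the bug is absent in the known-good version, present in the known-bad version, and stays present once introduced.

```python
from typing import Callable


def find_first_bad_version(good: int, bad: int,
                           has_bug: Callable[[int], bool]) -> int:
    """Return the earliest version in which has_bug(version) is true.

    Assumes has_bug(good) is False, has_bug(bad) is True, and that the
    bug never disappears again between good and bad.
    """
    if good >= bad:
        raise ValueError("the good version must come before the bad one")

    while bad - good > 1:
        middle = (good + bad) // 2
        if has_bug(middle):
            bad = middle     # bug already present: look earlier
        else:
            good = middle    # bug not here yet: look later
    return bad


# Stand-in for "update the sandbox to this version and run the test".
# For illustration, pretend the bug arrived in version 537.
checked = []

def has_bug(version: int) -> bool:
    checked.append(version)
    return version >= 537


print(find_first_bad_version(500, 700, has_bug))
print(checked)
```

Each pass through the loop halves the range between the last known-good and first known-bad versions, so 200 check-ins need only about eight builds rather than 200.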

Time travel is also useful for reproducing bugs. If somebody reports a bug and you can't reproduce it, try using the same version of the code that the reporter is using. If you can reproduce the behavior in the old version but not in the current version, especially with a unit test, you can be confident that the bug is and will remain fixed.

Whole Project

It should be obvious that you should store your source code in version control. It's less obvious that you should store everything else in there, too. Although most version control systems allow you to go back in time, it doesn't do you any good unless you can build the exact version you had at that time. Storing the whole project in version control—including the build system—gives you the ability to re-create old versions of the project in full.

As much as possible, keep all of your tools, libraries, documentation, and everything else related to the project in version control. Tools and libraries are particularly important. If you leave them out, at some point you'll update one of them, and then you'll no longer be able to go back to a time before the update. Or, if you do, you'll have to painstakingly remember which version of the tool you used to use and manually replace it.

For similar reasons, store the whole project in a single repository. Although it may seem natural to split the project into multiple repositories—perhaps one for each deliverable, or one for source code and one for documentation—this approach increases the opportunities for things to get out of sync.

Perform your update and commit actions on the whole tree as well. Typically, this means updating or committing from the top-level directory. It may be tempting to commit only the directory you've been working in, but that leaves you vulnerable to the possibility of having your sandbox split across two separate versions.

Leave generated code out of the repository.

The only project-related artifact I don't keep in version control is generated code. Your automated build should re-create generated code automatically.

There is one remaining exception to what belongs in version control: code you plan to throw away. Spike solutions (see Spike Solutions in Chapter 9), experiments, and research projects may remain unintegrated, unless they produce concrete documentation or other artifacts that will be useful for the project. Check in only the useful pieces of the experiment. Discard the rest.

Customers and Version Control

Customer data should go in the repository, too. That includes documentation, notes on requirements (see Incremental Requirements in Chapter 9), technical writing such as manuals, and customer tests (see Customer Tests in Chapter 9).

When I mention this to programmers, they worry that the version control system will be too complex for customers to use. Don't underestimate your customers. While it's true that some version control systems are very complex, most have user-friendly interfaces. For example, the TortoiseSVN Windows client for the open-source Subversion version control system is particularly nice.

Ally: Sit Together

Even if your version control system is somewhat arcane, you can always create a pair of simple shell scripts or batch files—one for update and one for commit—and teach your customers how to run them. If you sit together, you can always help your customers when they need something more sophisticated, such as time travel or merging.

Keep It Clean

Always check in code that builds and passes all tests.

One of the most important ideas in XP is that you keep the code clean and ready to ship. It starts with your sandbox. Although you have to break the build in your sandbox in order to make progress, confine it to your sandbox. Never check in code that breaks the build. This allows anybody to update at any time without worrying about breaking their build—and that, in turn, allows everyone to work smoothly and share changes easily.

You can even minimize broken builds in your sandbox. With good test-driven development, you're never more than five minutes away from a working build.

Because your build automatically creates a release, any code that builds is theoretically ready to release. In practice, the code may be clean but the software itself won't be ready for the outside world. Stories will be half done, user interface elements will be missing, and some things won't entirely work.

By the end of each iteration, you will have finished up all of these loose ends. Each story will be "done done," and you will deploy the software to stakeholders as part of your iteration demo. This software represents a genuine increment of value for your organization. Make sure you can return to it at any time by tagging the tip of the repository. I usually name mine "Iteration X," where X is the number of iterations we have conducted.

Not every end-of-iteration release to stakeholders gets released to customers. Although it contains completed stories, it may not have enough to warrant a release. When you conduct an actual release, add another tag to the end-of-iteration build to mark the release. I usually name mine "Release Y," where Y is the number of releases we have conducted.

Although your build should theoretically work from any sandbox, save yourself potential headaches by performing release builds from a new, pristine sandbox. The first time you spend an hour discovering that you've broken a release build by accidentally including an old file, you'll resolve never to do it again.

To summarize, your code goes through four levels of completion:

Broken. This only happens in your sandbox.
Builds and passes all tests. All versions in your repository are at least at this level.
Ready to demo to stakeholders. Any version marked with the "Iteration X" tag is ready for stakeholders to try.
Ready to release to real users and customers. Any version marked with the "Release Y" tag is production-ready.

Single Codebase

One of the most devastating mistakes a team can make is to duplicate their codebase. It's easy to do. First, a customer innocently requests a customized version of your software. To deliver this version quickly, it seems simple to duplicate the codebase, make the changes, and ship it. Yet that copy-and-paste customization doubles the number of lines of code that you need to maintain.

Duplicating your codebase will cripple your ability to deliver.

I've seen this cripple a team's ability to deliver working software on a timely schedule. It's nearly impossible to recombine a duplicated codebase without heroic and immediate action. That one click doesn't just lead to technical debt; it leads to indentured servitude.

Unfortunately, version control systems actually make this mistake easier to make. Most of these systems provide the option to branch your code—that is, to split the repository into two separate lines of development. This is essentially the same thing as duplicating your codebase.

Branches have their uses, but using them to provide multiple customized versions of your software is risky. Although version control systems provide mechanisms for keeping multiple branches synchronized, doing so is tedious work that steadily becomes more difficult over time. Instead, design your code to support multiple configurations. Use a plug-in architecture or a configuration file, or factor out a common library or framework. Top it off with a build and delivery process that creates multiple versions.
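To make the configuration-file option concrete, here is a minimal Python sketch of one codebase serving several customers. Everything in it is invented for illustration: the CustomerConfig class, the setting names, and the feature names. It assumes each customer's settings live in a small JSON file kept in the repository, which the build would package with that customer's version.

```python
import json
from dataclasses import dataclass

# Settings every customer gets unless their file overrides them.
DEFAULTS = {"logo": "default.png", "enabled_features": ["reports"]}


@dataclass(frozen=True)
class CustomerConfig:
    name: str
    logo: str
    enabled_features: frozenset


def load_config(text: str) -> CustomerConfig:
    """Build a configuration from JSON text, filling in the defaults."""
    settings = {**DEFAULTS, **json.loads(text)}
    return CustomerConfig(
        name=settings["name"],
        logo=settings["logo"],
        enabled_features=frozenset(settings["enabled_features"]),
    )


def main_menu(config: CustomerConfig) -> list:
    """The same code path for everyone; only the configuration differs."""
    items = ["Home"]
    if "reports" in config.enabled_features:
        items.append("Reports")
    if "audit_log" in config.enabled_features:
        items.append("Audit Log")
    return items


# In a real project these strings would be per-customer files in the repository.
acme = load_config('{"name": "Acme", "enabled_features": ["reports", "audit_log"]}')
basic = load_config('{"name": "Basic"}')
print(main_menu(acme))
print(main_menu(basic))
```

A customization request then becomes a new configuration file and perhaps a new feature flag, not a second copy of the code to maintain.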

Appropriate Uses of Branches

Branches work best when they are short-lived or when you use them for small numbers of changes. If you support old versions of your software, a branch for each version is the best place to put bug fixes and minor enhancements for those versions.

Some teams create a branch in preparation for a release. Half of the team continues to perform new work, and the other half attempts to stabilize the old version. In XP, your code shouldn't require stabilization, so it's more useful to create such a branch at the point of release, not in preparation for release.

To eliminate the need for a branch entirely, automatically migrate your customers and users to the latest version every time you release.

Branches can also be useful for continuous integration and other code management tasks. These private branches live for less than a day. You don't need private branches to successfully practice XP, but if you're familiar with this approach, feel free to use it.

Questions

Which version control system should I use?

There are plenty of options. In the open source realm, Subversion is popular and particularly good when combined with the TortoiseSVN front end. Of the proprietary options, Perforce gets good reviews, although I haven't tried it myself.

Avoid Visual SourceSafe (VSS). VSS is a popular choice for Microsoft teams, but it has numerous flaws and problems with repository corruption—an unacceptable defect in a version control system.

Your organization may already provide a recommended version control system. If it meets your needs, use it. Otherwise, maintaining your own version control system isn't much work and requires little of a server besides disk space.

Should we really keep all our tools and libraries in version control?

Yes, as much as possible. If you install tools and libraries manually, two undesirable things will happen. First, whenever you make an update, everyone will have to manually update their computer. Second, at some point in the future you'll want to build an earlier version and you'll spend several hours struggling to remember which versions of which tools you need to install.

Some teams address these concerns by creating a "tools and libraries" document and putting it in source control, but it's a pain to keep such a document up-to-date. Keeping your tools and libraries in source control is a simpler, more effective method.

Some tools and libraries require special installation, particularly on Windows, which makes this strategy more difficult. They don't all need installation, though—some just come with an installer because it's a cultural expectation. See if you can use them without installing them, and try to avoid those that you can't easily use without special configuration.

For tools that require installation, I put their install packages in version control, but I don't install them automatically in the build script. The same is true for tools that are useful but not necessary for the build, such as IDEs and diff tools.

How can we store our database in version control?

Ally: Ten-Minute Build

Rather than store the database itself in version control, set up your build to initialize your database schema and migrate between versions. Store the scripts to do so in version control.
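A minimal sketch of what such a build step might look like follows, using Python's built-in sqlite3 module. The MIGRATIONS list, the table, and the function names are assumptions made up for this example. A real project would more likely keep each migration in its own numbered script file and would target its own DBMS. The sketch records the schema version in SQLite's user_version pragma so that it applies only the migrations a given database hasn't seen yet.

```python
import sqlite3

# Ordered schema changes. In a real project these would be numbered
# script files checked in alongside the code that depends on them.
MIGRATIONS = [
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE customers ADD COLUMN email TEXT",
]


def schema_version(conn: sqlite3.Connection) -> int:
    """Read the schema version stored in the database (0 when new)."""
    return conn.execute("PRAGMA user_version").fetchone()[0]


def migrate(conn: sqlite3.Connection) -> int:
    """Apply any migrations not yet applied and return the new version."""
    current = schema_version(conn)
    for number, statement in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(statement)
        # PRAGMA values can't be bound as parameters; number is an int.
        conn.execute(f"PRAGMA user_version = {number}")
        conn.commit()
    return schema_version(conn)


conn = sqlite3.connect(":memory:")   # a fresh, empty database
print(migrate(conn))                 # builds the schema from scratch
print(migrate(conn))                 # running again changes nothing
```

Because the migrations travel with the code, updating a sandbox to an old version also gives you the scripts that build the matching schema.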

How much of our core platform should we include in version control?

In order for time travel to work, you need to be able to exactly reproduce your build environment for any point in the past. In theory, everything required to build should be in version control, including your compiler, language framework, and even database management system (DBMS) and operating system (OS). Unfortunately, this isn't always practical. I include as much as I can, but I don't usually include my DBMS or operating system.

Some teams keep an image of their entire OS and installed software in version control. This is an intriguing idea, but I haven't tried it.

With so many things in version control, how can I update as quickly as I need to?

Slow updates may be a sign of a poor-quality version control system. The speed of better systems depends on the number of files that have changed, not the total number of files in the system.

One way to make your updates faster is to be selective about what parts of your tools and libraries you include. Rather than including the entire distribution—documentation, source code, and all—include only the bare minimum needed to build. Many tools only need a handful of files to execute. Include distribution package files in case someone needs more details in the future.

How should we integrate source code from other projects? We have read-only access to their repositories.

If you don't intend to change their code and you plan on updating infrequently, you can manually copy their source code into your repository.

If you have more sophisticated needs, many version control systems will allow you to integrate with other repositories. Your system will automatically fetch their latest changes when you update. It will even merge your changes to their source code with their updates. Check your version control system's documentation for more details.

Be cautious of making local changes to third-party source code; this is essentially a branch, and it incurs the same synchronization challenges and maintenance overhead that any long-lived branch does. If you find yourself making modifications beyond vendor-supplied configuration files, consider pushing those changes upstream, back to the vendor, as soon as possible.

We sometimes share code with other teams and departments. Should we give them access to our repository?

Certainly. You may wish to provide read-only access unless you have well-defined ways of coordinating changes from other teams.

Results

With good version control practices, you can easily coordinate changes with other members of the team. You can easily reproduce old versions of your software when you need to. Long after your project has finished, your organization can recover your code and rebuild it when they need to.

Contraindications

You should always use some form of version control, even on small one-person projects. Version control will act as a backup and protect you when you make sweeping changes.

Allies: Ten-Minute Build; Test-Driven Development; Continuous Integration

Concurrent editing, on the other hand, can be dangerous if an automatic merge fails and goes undetected. Be sure you have a decent build if you allow concurrent edits. Concurrent editing is also safer and easier if you practice continuous integration and have good tests.

Alternatives

There is no practical alternative to version control.

You may choose to use file locking rather than concurrent editing. Unfortunately, this approach makes refactoring and collective code ownership very difficult, if not impossible. You can alleviate this somewhat by keeping a list of proposed refactorings and scheduling them, but the added overhead is likely to discourage people from suggesting significant refactorings.
