
Versioning your Data and Scripts

What is Version Control?

We’ll start by exploring how version control can be used to keep track of what one person did and when. Even if you aren’t collaborating with other people, it is better than a scenario like this:

mydocument.txt
mydocumentversion2.txt
mydocumentwithrevision.txt
mydocumentfinal.txt

The system used for naming files may be more or less systematic. Adding dates makes it slightly easier to follow when changes were made:

mydocument2016-01-06.txt
mydocument2016-01-08.txt

Some word processors let us deal with this a little better through features such as Microsoft Word’s “Track Changes” or Google Docs’ version history, which avoid creating a new file for every “save”.

Version control systems start with a base version of the document and then save just the changes you made at each step of the way by taking a so-called “snapshot”. A snapshot records information about when it was taken, but also about what changes occurred between different snapshots. You decide when these snapshots are collected, and this allows you to ‘rewind’ your file to an older version.
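To make the idea of snapshots concrete, here is a minimal Python sketch. It is not how Git or any particular tool stores history; the names `History`, `Snapshot`, `diff` and `apply` are hypothetical and exist only for this illustration. It assumes a document is a list of lines, stores only the edits between consecutive versions, and rebuilds an older version by replaying those edits onto the base.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from difflib import SequenceMatcher


@dataclass
class Snapshot:
    taken_at: datetime  # when the snapshot was taken
    message: str        # a short note describing the change
    changes: list       # (start, end, new_lines) edits against the previous version


def diff(old, new):
    """Return only the edits needed to turn `old` into `new`."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply(old, changes):
    """Replay a list of edits onto `old` and return the resulting lines."""
    result = list(old)
    # Work from the end of the document so earlier line numbers stay valid.
    for start, end, new_lines in reversed(changes):
        result[start:end] = new_lines
    return result


class History:
    """Hypothetical container: a base document plus a list of snapshots."""

    def __init__(self, base):
        self.base = list(base)
        self.snapshots = []

    def version(self, n):
        """Rebuild the document as it was after the first `n` snapshots."""
        lines = self.base
        for snap in self.snapshots[:n]:
            lines = apply(lines, snap.changes)
        return list(lines)

    def snapshot(self, new_lines, message):
        """Record the edits between the latest version and `new_lines`."""
        current = self.version(len(self.snapshots))
        self.snapshots.append(
            Snapshot(datetime.now(timezone.utc), message, diff(current, new_lines))
        )


history = History(["Dear reader,", "This is a draft."])
history.snapshot(
    ["Dear reader,", "This is a revised draft.", "Regards"],
    "Revise body, add sign-off",
)
history.snapshot(
    ["Dear colleague,", "This is a revised draft.", "Regards"],
    "Change greeting",
)

# 'Rewind' to the state after the first snapshot.
print(history.version(1))
```

Calling `history.version(0)` returns the base document, and `history.version(2)` returns the latest version; nothing is ever overwritten, so every earlier state remains recoverable.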

Once you think of changes as separate from the document itself, you can then think about “playing back” different sets of changes onto the base document and getting different versions of the document. For example, two users can make independent sets of changes based on the same document.

If there aren’t conflicts, you can even play two sets of changes onto the same base document.
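The following Python sketch illustrates playing two independent sets of changes onto the same base document. It is a simplified illustration under stated assumptions, not the merge algorithm of any real version control system: documents are lists of lines, the helper names (`changes_between`, `touches`, `merge`) are hypothetical, and two edits are treated as conflicting whenever they overlap or touch the same region of the base.

```python
from difflib import SequenceMatcher


def changes_between(base, edited):
    """Return the edits (start, end, new_lines) that turn `base` into `edited`."""
    matcher = SequenceMatcher(a=base, b=edited, autojunk=False)
    return [(i1, i2, edited[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def touches(a, b):
    """True when two edits overlap or sit directly next to each other."""
    return a[0] <= b[1] and b[0] <= a[1]


def merge(base, version_a, version_b):
    """Play both sets of changes onto `base`; raise ValueError on a conflict."""
    edits_a = changes_between(base, version_a)
    # An identical change made by both users only needs to be applied once.
    edits_b = [e for e in changes_between(base, version_b) if e not in edits_a]

    for a in edits_a:
        for b in edits_b:
            if touches(a, b):
                raise ValueError(f"conflict near base line {max(a[0], b[0]) + 1}")

    merged = list(base)
    # Apply edits from the end of the document so earlier line numbers stay valid.
    for start, end, new_lines in sorted(edits_a + edits_b,
                                        key=lambda e: e[0], reverse=True):
        merged[start:end] = new_lines
    return merged


base = ["Title", "Intro", "Methods", "Results", "Conclusion"]
# One user rewrites the title; another appends a section at the end.
user_one = ["Title: Version Control", "Intro", "Methods", "Results", "Conclusion"]
user_two = ["Title", "Intro", "Methods", "Results", "Conclusion", "References"]

merged = merge(base, user_one, user_two)
print(merged)
```

If both users had instead changed the same line (for example, each renaming "Methods" differently), `merge` would raise a `ValueError` rather than guess which change should win. Real tools likewise stop at that point and ask a person to resolve the conflict.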

A Version Control system is a tool that keeps track of these changes for us and helps us version and merge our files.

Why use Version Control?

The 2 main reasons to use version control are to:

Though version control was originally designed for dealing with code (.R, .pl, .py), there are many benefits to using it with text files too (.txt, .csv, .tsv). For substantial work such as articles, books, or dissertations, version control makes a lot of sense.

Though not all of these benefits will be covered in this lesson, version controlling your document allows you to:

Note: Different Version Control systems handle different non-text files differently. In most cases Word documents, graphics files, data objects from R or STATA, etc., can be included but most tools have limited capabilities for these. Along these lines, it is considered best practice to not include large files/binaries in general; add-ons exist that will work better for storing large files.

Version control is particularly useful for facilitating collaboration. One of the original motivations behind version control systems was to allow different people to work on large projects together, and in the case of Git, to manage the Linux kernel source code.

Benefits of collaborating with Version Control include:

Why Not use Dropbox or Google Drive?

Dropbox, Google Drive and other services offer some form of version control in their systems. There are times when this may be sufficient for your needs. However, there are a number of advantages to using a version control system like Git:

What are Git and GitHub?

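Though often used synonymously, Git and GitHub are two different things: Git is the version control system itself, while GitHub is a web-based service that hosts Git repositories.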

Although GitHub’s focus is primarily on source code, other projects are increasingly making use of version control platforms like GitHub to manage the workflows of journal publishing, open textbooks, other humanities projects, and teaching materials.

Becoming familiar with GitHub will be useful not only for version controlling your own documents but will also make it easier to contribute to and draw upon other projects which use GitHub.

In this lesson the focus will be on gaining an understanding of the basic aims and principles of Version Control by uploading and version controlling a plain text document using GitHub. This lesson will not cover everything but will provide a starting point for using Version Control (Git/GitHub or other).

