Skip to the content.

Versioning your Data and Scripts

Learning Objectives

Setup

Download the workshop sample files zipped folder and unzip it.

What is Version Control?

Version control can be used to keep track of versions of a piece of work that either a single person is working on, or a shared document. It is designed to avoid a situation like the one noted below.

mydocument.txt
mydocument_v2.txt
mydocument_v3_rev-BHP.txt
mydocument_v8_Final?.txt

Some tools let us deal with this a bit better without creating a brand new file for every “save”. For example, using Microsoft Word’s “Track Changes” allows us to highlight the changes that are being made within the document. This can be helpful for when collaborating on a document so other people can see the changes being made. Alternatively, DropBox and Google Docs have a built-in “version history” feature which keeps older versions of documents (up to a certain point in time) so that you can revert back to if need be.

Version control systems on the other hand start with a base version of the document and then save only the changes you made at each step of the way by taking a so-called “snapshot”. A snapshot records information about when a file was saved, and all the differences between the current document and the previous version. The user (you) decides when these snapshots are collected, and this allows one to ‘rewind’ your file to an older version.

Version control is also very useful when collaborating. For example, two users can make independent sets of changes based on the same document and each have separate snapshots documenting their changes.

If there aren’t any conflicts (i.e updates to the same line), the two sets of changes can be “merged” back into the same base document.

Version Control Systems and Hosts

There are a lot of different version control systems available. These systems enable you to track changes locally or remotely (easy for collaborations), and there are hosts available for remote management of your “repositories”.

In this class we will be focusing on Git. Git is usually used for version control on a local computer and you do not need internet access to use it (internet access will be needed to download Git). The local version control setup with Git (or other version control systems) can be connected to an online setup that hosts repositories for sharing and collaboration.

GitHub is currently the most popular host of open source projects by number of projects and number of users. But other hosts exist, including SourceForge, BitBucket, and Gitlab, to name a few.

Why use Version Control?

The two main reasons to use version control are to:

Though version control was originally designed for dealing with code for large collaborative projects, there are many benefits to using it in other projects with text files too (.txt, .csv, .tsv). Some examples of projects making use of version control systems like Git: writing manuscripts, books or dissertations, and for collaboratively developing as well as distributing teaching materials (e.g. the Github repository for this class).

Note: Different Version Control systems handle different non-text files differently. In most cases Word documents, graphics files, data objects from R or STATA, etc., can be included but most tools have limited capabilities for saving version information for these.

Why Not use Dropbox or Google Drive?

Dropbox, Google Drive and other services offer some form of version control in their systems. There are times when this may be sufficient for your needs. However there are a number of advantages to using a version control system like Git, e.g. facilitating sharing/reproducibility and collaborations. Benefits of collaborating with Version Control include: