Versioning your Data and Scripts
Setting up for today’s class
For this quick hands-on session we will be using a Graphical User Interface (GUI) to work with Git. Let’s start by:
- Downloading and installing GitKraken.
- Creating an account for yourself on GitHub. Please select the free/academic account, as this option has more long-term flexibility.
- Downloading the zipped folder of workshop sample files and unzipping it.
What is Version Control?
Version control keeps track of the versions of a piece of work, whether it is a document that a single person is working on or one shared among several people. It is designed to avoid a situation like the one below.
mydocument.txt
mydocument_v2.txt
mydocument_v3_rev-BHP.txt
mydocument_v8_Final?.txt
Some tools let us deal with this a bit better without creating a new file for every “save”, such as Microsoft Word’s “Track Changes” or the “version history” features in Dropbox and Google Docs.
Version control systems start with a base version of the document and then save just the changes you made at each step of the way by taking a so-called “snapshot”. A snapshot records when the document was saved and all the changes between the current document and the previous version. The user (you) decides when these snapshots are taken, which allows you to “rewind” your file to an older version.
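The idea of a snapshot can be sketched in a few lines of Python. This is only a toy model of the description above, not how Git stores data internally; the function name `take_snapshot`, the dictionary keys, and the sample lines are all made up for illustration.

```python
import difflib
from datetime import datetime, timezone


def take_snapshot(previous_lines, current_lines, message):
    """Record when the snapshot was taken and which lines changed.

    Hypothetical helper for illustration only; real version control
    systems use their own storage formats.
    """
    changes = list(
        difflib.unified_diff(
            previous_lines,
            current_lines,
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )
    return {
        "when": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "changes": changes,
    }


# Two versions of the same (made-up) document, one list entry per line.
base = ["Introduction", "Methods", "Results"]
edited = ["Introduction", "Methods (revised)", "Results"]

snapshot = take_snapshot(base, edited, "Revise the methods heading")
for line in snapshot["changes"]:
    print(line)
```

Lines starting with `-` in the printed output were removed and lines starting with `+` were added, which is the same convention most version control tools use when they show you what changed.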
Snapshots also support collaboration: two users can make independent sets of changes based on the same document and end up with two separate snapshots documenting those changes.
If there are no conflicts (i.e., updates to the same line), the two sets of changes can be “merged” back into the same base document.
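The merge rule described above can be illustrated with a small Python sketch. It assumes, for simplicity, that both users only edit existing lines and never add or delete any; real tools such as Git handle far more cases. The function name `merge_line_edits` and the sample documents are hypothetical.

```python
def merge_line_edits(base, ours, theirs):
    """Merge two sets of in-place line edits made against the same base.

    Simplified sketch: every version must have the same number of lines.
    Returns the merged lines and the 1-based numbers of conflicting lines.
    """
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")

    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:
            # Both versions agree (unchanged, or the same edit twice).
            merged.append(o)
        elif o == b:
            # Only the second user changed this line.
            merged.append(t)
        elif t == b:
            # Only the first user changed this line.
            merged.append(o)
        else:
            # Both users changed the same line differently: a conflict.
            conflicts.append(number)
            merged.append(b)
    return merged, conflicts


base = ["Title", "Aim: tbd", "Data: tbd"]
user_a = ["Title", "Aim: measure growth", "Data: tbd"]
user_b = ["Title", "Aim: tbd", "Data: survey.csv"]

merged, conflicts = merge_line_edits(base, user_a, user_b)
print(merged)
print(conflicts)
```

Because the two users edited different lines, both changes end up in the merged document and the list of conflicts is empty. If both had rewritten the "Aim" line differently, that line number would be reported as a conflict for a person to resolve.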
Version Control Systems and Hosts
There are many different version control systems available. These systems enable you to track changes locally or remotely (the latter makes collaboration easier), and there are hosts available for remote management of your “repositories”.
In this class we will be focusing on Git. Git is usually used for version control on a local computer, and you do not need internet access to use it (although you will need it to download Git). A local version control setup with Git (or another version control system) can be connected to an online host that stores repositories for sharing and collaboration.
GitHub is currently the most popular host of open source projects by number of projects and number of users. Other hosts exist, including SourceForge, Bitbucket, and GitLab, to name a few.
Why use Version Control?
The two main reasons to use version control are to:
- Manage text/data/code effectively
- Collaborate efficiently
Though version control was originally designed for managing code in large collaborative projects, it also has many benefits for other projects that use text files (.txt, .csv, .tsv). Some examples of projects that make use of version control systems like Git: writing manuscripts, books, or dissertations, and collaboratively developing and distributing teaching materials (e.g. the GitHub repository for this class).
Note: Different version control systems handle non-text files differently. In most cases Word documents, graphics files, data objects from R or Stata, etc., can be included, but most tools have limited capabilities for saving version information for these.
Why Not use Dropbox or Google Drive?
Dropbox, Google Drive, and other services offer some form of version control in their systems. There are times when this may be sufficient for your needs. However, there are a number of advantages to using a version control system like Git, e.g. facilitating sharing, reproducibility, and collaboration. Benefits of collaborating with version control include:
- Supports both text and programming languages, and gives the user much more control over how code is represented and disseminated
- Allows comments on every modification, making it easier to revert to an older version
- Allows you and others to navigate the history of a document readily
- Ensures that changes across multiple documents are coordinated and saved together (easy conflict resolution)
- Materials used in these lessons are derived from Daniel van Strien’s “An Introduction to Version Control Using GitHub Desktop,” Programming Historian (17 June 2016). The Programming Historian, ISSN 2397-2068, is released under the Creative Commons Attribution license (CC BY 4.0).
- Materials are also derived from Software Carpentry instructional material. These materials are also licensed under the Creative Commons Attribution license (CC BY 4.0).