Skip to the content.

Learning Objectives

Moving files on and off the cluster

A common question is how do I get my data on to the cluster?

Depending on the analysis you are performing, the data you want could be located:

Next, we will introduce you to commands that help you copy files on and off of the cluster, in addition to some helpful ways of working with files that currently reside on the cluster.

Downloading external data using wget or curl

wget

You will sometimes find yourself wanting to download data from a website. There are two comparable commands that you can use to accomplish this task. The first one is wget and the most common syntax for using it is:

wget [http://www.example.com/data]

Let’s move to our scratch space and try to do this in order to download the reference genome for E. coli from NCBI:

cd_scratch
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2/GCA_000005845.2_ASM584v2_genomic.fna.gz

Note: If you lose your connection during the download process and would like to resume it midway through, the -c will continue the download where it left off:

wget -c [http://www.example.com/data]

curl

A common alternative to wget is curl and many purposes they are extremely similiar and which you decide to use is a matter of personal preference. The general syntax is:

curl -L -O [http://www.example.com/data]

The -O option will use the filename given on the website as its filename to write to. Alternatively, if you wanted to name it something different, you can use the -o option and then follow it with the preferred name like:

curl -L -o preferred_name  [http://www.example.com/data]

The -L option tells curl to follow any redirections the HTML address gives to the data. For example, webpage might redirect from it’s “Download” button to a different page when the data is actually stored. This -L option tells curl to follow this redirection. In some cases it is needed.

Let’s use curl to download the E. coli gene annotations from NCBI:

curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/005/845/GCA_000005845.2_ASM584v2/GCA_000005845.2_ASM584v2_genomic.gff.gz

Note: You can also do this exact same task with wget. We are just demonstrating curl in order to give you an option.

Lastly, if you connection gets lost midway through a transfer, you can use the -C option followed by - to resume the download where it left off. For example:

curl -C - -L -O [http://www.example.com/data]

curl versus wget

For many purposes curl and wget are similar, but there are some small differences:

1) In curl you can use the -O option multiple times to carry out multiple downloads simulatenously.

curl -L -O [http://www.example.com/data_file_1] -O [http://www.example.com/data_file_2]

2) In wget you can recursively download a directory (meaning that you also download all subdirectories) with the -r. Typically this isn’t super useful because the source will typically pack this up all into a compressed package, but nonetheless it is something that wget can do that curl cannot do.

In general, curl has a bit more options and flexibility than wget but the vast majority, if not all, of those options are far beyond the scope of this module and for this module it comes down to a personal preference. That being said, as you download data and tools from various sources you may see different developers having different preferences. This is mostly a primer for understanding both of the cases that you will likely run into.

Checking file transfer success with md5sum

When you are copying files between two locations and you want to ensure the copying went smoothly or are interested to see if two files are the same. Checksums can be thought of as an alphanumeric fingerprint for a file and they are used to ensure that two files are the same – in other words, you can ensure that the file you downloaded did not get truncated or corrupted during the download. It is common for people/insitutions to provide a list of md5sums for files that are availible to download. md5sum is one common checksum. Importantly, it is theorectically possible that two different files have the same md5sum, but it is practically nearly impossible. The syntax for checking the md5sum of a file is:

md5sum <file>

Let’s check the md5sum for the E. coli reference genome that we downloaded.

md5sum GCA_000005845.2_ASM584v2_genomic.fna.gz

This should hopefully return:

7e69874199f23fd21b060dc0b2b72321  GCA_000005845.2_ASM584v2_genomic.fna.gz

Now, we can cross-reference this md5sum with the md5sum that NCBI provides for the file:

7e69874199f23fd21b060dc0b2b72321  ./GCA_000005845.2_ASM584v2_genomic.fna.gz

If these match you can have confidence that the file download was complete.


Exercise

  1. Check the md5sum for the GCA_000005845.2_ASM584v2_genomic.gff.gz. Does it match the md5sum on the NCBI website for this file?

Copying files to and from the cluster

It is really important that when you are transferring data using scp or rsync (discussed next), that you utilize the transfer nodes for your file transfer. The transfer nodes have the address of:

username@transfer.rc.hms.harvard.edu

scp

Similar to the cp command to copy there is a command that allows you to securely copy files between computers. The command is called scp and allows files to be copied to, from, or between different hosts. It uses ssh for data transfer and provides the same authentication and same level of security as ssh.

# DO NOT RUN
# Usage example for scp to copy a file on O2 to your local machine
scp username@transfer.rc.hms.harvard.edu:/path/to/file_on_O2 /path/to/directory/local_machine 

Let’s try copying over the E. coli reference genome that we downloaded from NCBI from our scratch space on O2 to our local machine.

First open up a new terminal window, that is not connected to O2.

# Be sure to replace `<username>` twice with your O2 username and `<first_letter_of_username>` with the first letter of your username
scp username@transfer.rc.hms.harvard.edu:/n/scratch/users/<first_letter_of_username>/<username>/GCA_000005845.2_ASM584v2_genomic.fna.gz  .

You will be prompted for your password and silently prompted by approval for Duo. Now see that the file has transferred over:

$ ls GCA_000005845.2_ASM584v2_genomic.fna.gz

NOTE: Windows users may encounter a permissions error when using scp to copy over locally. We are not sure how to troubleshoot this, but will update materials as we obtain more information.

From local machine to the O2 cluster using scp

If you wanted to do the reverse and upload the file to our home directory, you will need to indicate the transfer node in your destination argument as show below:

### DO NOT RUN
scp GCA_000005845.2_ASM584v2_genomic.fna.gz username@transfer.rc.hms.harvard.edu:~

rsync

rsync is used to copy or synchronize data between directories. This can be on/off the cluster or within the cluster filesystem. It has many advantages over cp, scp etc.

Salient Features of rsync

When copying over large datasets to or from a remote machine, rsync works similarly to scp as shown below. This time we are going to copy the reference genome back onto the cluster using rsync.

# DO NOT RUN
# Usage example for ryncs to copy a file on local machine to O2 using rsync
rsync -av - e ssh /path/to/directory/local_machine username@transfer.rc.hms.harvard.edu:/path/to/file_on_O2

Let’s first change the name of our file so we can distinguish it from the file that is already present on O2. In your Terminal window that you were using with the scp example above, type in:

mv GCA_000005845.2_ASM584v2_genomic.fna.gz renamed_GCA_000005845.2_ASM584v2_genomic.fna.gz
# Be sure to replace <username> with your username
rsync -av -e ssh renamed_GCA_000005845.2_ASM584v2_genomic.fna.gz <username>@transfer.rc.hms.harvard.edu:~

You will again be prompted for your password and Duo approval. We should now be able to see that the newly named reference genome has been uploaded to O2 by using ls in the Terminal window that is connected to O2:

ls ~

From O2 cluster to a local machine using rsync

If you wanted to do the reverse and download the file to our home directory, you will need to indicate the transfer node in your destination argument as show below:

# Be sure to replace `<username>` twice with your O2 username and `<first_letter_of_username>` with the first letter of your username
rsync -av -e ssh <username>@transfer.rc.hms.harvard.edu:/n/scratch/users/<first_letter_of_username>/<username>/GCA_000005845.2_ASM584v2_genomic.gff.gz .

More helpful information and examples using rsync can be found at this link

A file transfer alternative without using the command line interface

If you have a few fairly small files to move on/off the cluster, you can also use a program like Filezilla to copy over files. It is not recommended for larger files or big datasets as it can timeout and cause issues. Note that here the transfer nodes are also required.

Securely sharing data across clusters with Globus

Perhaps you are working on a project with a collaborator who works in Europe and you want to securely share your data. This task can be quite difficult without one of you distributing your username and password to the other, which would be a security risk, or posting the data publicly where others could download it and thus the data transfer isn’t private. This is where Globus comes in. Globus is a file sharing tool for sharing data with between users on different computing clusters. One example of this is that it is sometimes used with retrieving sequencing data from sequencing cores.

At Harvard, Globus can be used for:

HMS-RC’s information page on Globus can be found here and for the purposes of this module we won’t dive anymore into it, but it is important to know that this tool exists, particularly when collaborating with researchers who don’t have HMS creditials.

O2 shared reference files (iGenome)

Another O2 quality-of-life feature is that HMS-RC has a directory called /n/shared_db/igenome/03032016/ which holds reference genomes, alignment indexes, annotation files and much more for many common model organisms. Therefore, oftentimes you don’t even need to download many of the reference files that you will need for your analysis.


Exercise

  1. Navigate to the /n/shared_db/igenome/03032016/ directory and see if you can find the organism that you study there.

Let’s explore the human reference by:

cd /n/shared_db/igenome/03032016/Homo_sapiens/UCSC/hg38/

ls -l

We can we see that there are two directories:

total 0
drwxrwxr-x 2 ld32 ritg 4096 Aug 18  2015 Annotation
drwxrwxr-x 2 ld32 ritg 4096 Oct 16  2019 Sequence

Let’s further go into the Seqeunce directory:

cd Sequence

ls -l

From here we can see that this hold directories containing indices for Bowtie, Bowtie2 and STAR aligners, as well as the whole genome reference sequence.

total 0
drwxrwxr-x 2 ld32 ritg 4096 Aug 18  2015 AbundantSequences
drwxrwxr-x 2 ld32 ritg 4096 Aug 18  2015 Bowtie2Index
drwxrwxr-x 2 ld32 ritg 4096 Aug 18  2015 BowtieIndex
drwxrwxr-x 2 ld32 ritg 4096 Aug 18  2015 BWAIndex
drwxrwxr-x 2 ld32 ritg 4096 May 14  2020 Chromosomes
drwxrwxr-x 2 ld32 ritg 4096 Oct 16  2019 star2.5.4a
drwxrwxr-x 2 ld32 ritg 4096 Jul 12  2020 WholeGenomeFasta

We can look inside of the WholeGenomeFasta:

ls -l WholeGenomeFasta

And see:

total 3087891
-rwxrwxr-x 1 ld32  ritg      32798 Aug 18  2015 genome.dict
-rwxrwxr-x 1 ld32  ritg 3161924569 Aug 18  2015 genome.fa
-rw-rw-r-- 1 ak150 ritg       7802 Jul 12  2020 genome.fa.fai
-rwxrwxr-x 1 ld32  ritg      34045 Aug 18  2015 GenomeSize.xml

The genome FASTA file is here genome.fa, along with a dictionary (genome.dict), FASTA index (genome.fa.fai) and genome size XML (GenomeSize.xml).

Let’s take a brief look at the Bowtie2Index directory:

ls -l Bowtie2Index

The contents of this directory might seem mostly normal at first glance:

total 4068535
-rwxrwxr-x 1 ld32 ritg 982504890 Aug 18  2015 genome.1.bt2
-rwxrwxr-x 1 ld32 ritg 733719120 Aug 18  2015 genome.2.bt2
-rwxrwxr-x 1 ld32 ritg     10880 Aug 18  2015 genome.3.bt2
-rwxrwxr-x 1 ld32 ritg 733719113 Aug 18  2015 genome.4.bt2
lrwxrwxrwx 1 ld32 ritg        29 Aug 18  2015 genome.fa -> ../WholeGenomeFasta/genome.fa
-rwxrwxr-x 1 ld32 ritg 982504890 Aug 18  2015 genome.rev.1.bt2
-rwxrwxr-x 1 ld32 ritg 733719120 Aug 18  2015 genome.rev.2.bt2

It is mostly just indices for Bowtie2 (files ending in .bt2), but there is a particular file of interest here:

lrwxrwxrwx 1 ld32 ritg        29 Aug 18  2015 genome.fa -> ../WholeGenomeFasta/genome.fa

This is the first place we have now seen a symbolic link. genome.fa is the link name and it is pointing, with a relative path, to the reference genome FASTA file that we were just looking at (../WholeGenomeFasta/genome.fa). We are going to talk more about symbolic links, sometimes called symlinks, in the next section.

Sometimes you will have files or folders containing files on a computing cluster that you would like to have in a few places. A few examples where this might happen:

1) The software package you are using wants you to have a specific directory structure for the data, but you do not want to move the data.

2) There is a large file that is used as input to many different analyses and you would rather not have duplicate copies of it.

Both of these are good places to implement symbolic links. Symbolic links can be thought of as arrows for computer, pointing to where a file or folder is located, as we saw in the above example with the human reference genome. Symbolic links require negliable amounts of space and thus mean you can use them to point to large files elsewhere on the computing system. If these symbolic links get deleted or broken, they have no impact on the data they they were pointing to. Additionally, they can be named anything, so it can be helpful if a software package is looking for a specific name for an input file.

Let’s go ahead and make our own symbolic link. First, navigate to our home directory:

cd ~

The syntax for making a symbolic link is:

ln -s </path/to/file_to_be_linked_to> </path/to/link_name>

Assume we want to make a symbolic link to our compressed E. coli reference annotation that we had downloaded to the scratch space and we are going to call it E_coli.gff.gz. We could do that:

ln -s /n/scratch/users/${USER:0:1}/${USER}/GCA_000005845.2_ASM584v2_genomic.gff.gz E_coli.gff.gz

When you now view this directory with ls -l, it will display the link like:

E_coli.gff.gz -> /n/scratch/users/<first_letter_of_username>/<username>/GCA_000005845.2_ASM584v2_genomic.gff.gz

If you want to keep the original file name as the link name you can use just the path to where the symbolic link will be created and not include a new <link_name>. Since we are putting the link in our current directory, we can use ., but it is not necessary.

Importantly, if the original file or folder is deleted or moved, the symbolic link will become broken. It is common on many distributions for symbolic links to blink if they becomes broken.

NOTE: The -s option is necessacary for creating a symbolic link. Without the -s option, a hard link is created and modifications to the linked file will be carried over to the original. Generally, speaking hard links are typically not very common.

Sometimes, you will no longer need a symbolic link – maybe the original file has been deleted or moved, rendering the link broken, or you are restructuring your directory. You can remove a symbolic link just as you would remove a file or directory, and it will not delete the original file:

rm E_coli.gff.gz

You can check to see for yourself that the original file was not deleted:

ls /n/scratch/users/${USER:0:1}/${USER}/GCA_000005845.2_ASM584v2_genomic.gff.gz

Note: to remove a symbolic link to a directory, you will need to use rm -r <symbolic_link>. Be very careful that you are in the directory that contains the symbolic link and not in the directory that contains the original directory, or you will delete the entire contents of the original directory!


Next Lesson »

Back to Schedule


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.