Approximate time: 30 minutes
Learning Objectives
- Create a job submission script to run Salmon on all samples in the dataset
Running Salmon on multiple samples
In class we talked in depth about how the Salmon algorithm works, and provided the command required to run Salmon on a single sample. In this lesson we walk through the steps required to run Salmon efficiently on all samples in the dataset. With FastQC we could use one command and provide all files at once with a wildcard (*). Salmon is different: each Salmon command quantifies a single sample, so we need to run it once per file.
Rather than typing out the Salmon command six times, we will use a for loop to iterate over all FASTQ files in our dataset (inside the raw_data directory). Furthermore, rather than running this for loop interactively, we will put it inside a text file and create a job submission script.
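The for loop pattern can be tried on its own before Salmon is involved. The sketch below is only an illustration: it creates a throwaway directory of empty placeholder files (the names sampleA.fq, sampleB.fq and sampleC.fq are hypothetical, not the workshop data) and prints each path the loop visits.

```shell
# Hypothetical stand-in for a directory of FASTQ files (empty placeholders)
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA.fq" "$demo_dir/sampleB.fq" "$demo_dir/sampleC.fq"

# The wildcard expands to one path per file; the loop body runs once for each
count=0
for fq in "$demo_dir"/*.fq
do
    echo "would process: $fq"
    count=$((count + 1))
done
echo "files visited: $count"

# Remove the throwaway directory
rm -r "$demo_dir"
```

Replacing the echo line with a real command is what turns this pattern into a per-sample run.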
Create a job submission script to run Salmon in serial
Let’s start by moving to our scripts directory within ‘rnaseq’ and opening up a text file in vim:
$ cd ~/rnaseq/scripts/
$ vim salmon_all_samples.sbatch
Begin the script starting with the shebang line.
#!/bin/bash
Exercise 1
- Add the Slurm directives (i.e. #SBATCH) to request specific resources for our job. The resources we need are listed below.
NOTE: Helpful resources include:
- Your job will use the shared partition
- Request 6 cores to take advantage of Salmon’s multi-threading capabilities
- Request 12 hours of runtime
- Request 8G of memory
- Give your job the name salmon_in_serial
- A standard output file
- A standard error file
- Add an email and request to be notified when the job is complete
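Once you have attempted the exercise, you may want to compare your header against the sketch below. It is one possible layout, not the only valid one: the output and error file names and the email address are placeholders assumed for illustration, and most directives also have long or short equivalents.

```shell
#!/bin/bash

#SBATCH -p shared                   # partition name
#SBATCH -c 6                        # number of cores
#SBATCH -t 0-12:00                  # runtime in D-HH:MM format
#SBATCH --mem 8G                    # memory for the job
#SBATCH --job-name salmon_in_serial # name of the job
#SBATCH -o %j.out                   # standard output file (%j is replaced by the job ID)
#SBATCH -e %j.err                   # standard error file
#SBATCH --mail-type=END             # send an email when the job ends
#SBATCH --mail-user=your_email@example.com  # placeholder address, replace with your own
```

Because every directive starts with #, bash treats these lines as comments; only Slurm reads them when the script is submitted with sbatch.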
Now that we have the resources requested, we can begin to add the commands into our shell script.
Exercise 2
- Add a line of code required to load the Salmon module
- Add a line of code to change directories to where the Salmon results will be output (be sure to use a full path here).
Add comments to your script liberally, wherever you feel it’s needed.
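The directory change matters because the Salmon command in the loop writes its output to a relative path, so results land wherever the script is sitting at that point. A minimal sketch of a defensive directory change follows; the results path shown is a hypothetical example, so substitute the full path you chose.

```shell
# Hypothetical results directory; replace with your own full path
results_dir="$HOME/rnaseq/results/salmon"

# Create the directory if it does not exist yet
mkdir -p "$results_dir"

# Move into it, and stop the script if that fails so that
# output is never written to an unexpected location
cd "$results_dir" || exit 1
```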
The last piece of the shell script is the for loop code provided below. Copy and paste this into your script.
for fq in ~/rnaseq/raw_data/*.fq
do
    # create a prefix for the output directory
    samplename=$(basename "$fq" .fq)

    # run salmon
    salmon quant -i /n/holylfs05/LABS/hsph_bioinfo/Everyone/Workshops/Intro_to_rnaseq/indicies/salmon_index \
        -l A \
        -r "$fq" \
        -o "${samplename}.salmon" \
        --seqBias \
        --useVBOpt \
        --validateMappings
done
Note that our for loop is iterating over all FASTQ files in the raw_data directory. For each file, a prefix is generated to name the output directory, and then the Salmon command is run with the same parameters as used in the single sample run.
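To see how the prefix is derived, the naming step can be run in isolation. The path below is a hypothetical example and does not need to exist, since basename only manipulates the text of the path.

```shell
# Hypothetical input path; basename works on the string, not the file itself
fq=/hypothetical/path/sampleA.fq

# Strip the leading directories and the trailing .fq suffix
samplename=$(basename "$fq" .fq)

# This is the name Salmon's -o option would receive in the loop
echo "${samplename}.salmon"   # prints: sampleA.salmon
```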
Exercise 3
- Add two additional parameters (as described below) to the current Salmon command (remember to use “\” if splitting one command across multiple lines):
  - -p: specifies the number of processors or cores we would like to use for multi-threading. What value will you provide here, knowing what we asked for in our Slurm directives?
  - --numBootstraps: specifies computation of bootstrapped abundance estimates. Bootstraps are required for isoform-level differential expression analysis, where they are used to estimate technical variance. Here, you can set the value to 30.
NOTE: --numBootstraps is necessary if performing isoform-level differential expression analysis with Sleuth, but not for gene-level differential expression analysis. Due to the statistical procedure required to assign reads to gene isoforms, in addition to the random processes underlying RNA-Seq, there will be technical variability in the isoform-level abundance estimates output from the pseudo-alignment tool [2, 3]. Therefore, we would need technical replicates to distinguish technical variability from biological variability for gene isoforms. The bootstraps estimate technical variation per gene by calculating the abundance estimates for all genes using a different sub-sample of reads during each round of bootstrapping. The variation in the abundance estimates across the rounds of bootstrapping is then used to estimate the technical variance for each gene.
- Save and close the script. This script is now ready to run.
$ sbatch salmon_all_samples.sbatch
- After you have confirmed that the script runs as expected, copy and paste your final script to a txt file and submit that as part of your assignment.
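As an aside on the -p parameter from Exercise 3: rather than typing the core count a second time, a script can read it from the environment. The sketch below assumes the cores were requested with -c or --cpus-per-task, in which case Slurm sets the SLURM_CPUS_PER_TASK variable inside the job; the variable name threads is chosen here for illustration.

```shell
# Use the core count Slurm allocated; fall back to 1 when the variable
# is unset (for example when testing outside of a Slurm job)
threads="${SLURM_CPUS_PER_TASK:-1}"

echo "Salmon would be given: -p $threads"
```

Passing "$threads" to -p keeps the Salmon command in step with the Slurm directives if the core request is changed later.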
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.