Arrays in Slurm
When we are working on large data sets, our minds often drift back to an old Simpsons episode. Bart is in France being taught to pick grapes. They show him a detailed technique, he does it successfully, and then they tell him to do it again and again and again. We've all been there.
A pipeline or process may seem quick and easy when you have 1-3 samples, but daunting when you have 50. When scaling up, you need to consider file overwriting, computational resources, and time.
One easy way to scale up is to use the array feature in Slurm.
What is a job array?
The O2 wiki (hosted on Atlassian) says this about job arrays: “Job arrays can be leveraged to quickly submit a number of similar jobs. For example, you can use job arrays to start multiple instances of the same program on different input files, or with different input parameters. A job array is technically one job, but with multiple tasks.”
sbatch vs. sh
So far we have run all of our scripts as sh script.sh, which runs the script while we wait on the command line. However, for jobs that will take a very long time this is less than ideal because:
- You have to wait for the script to finish to get the command line back and run other tasks.
- If you get disconnected from the cluster, the job will automatically quit.
Running a job with sbatch immediately gives you the command line back, and the job is not dependent on you staying connected to the cluster. We will not cover the basics of sbatch here, but to learn how to write these scripts come to our module Shell tips and tricks on O2!
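As a quick illustration (a minimal sketch; my_script.sh stands in for any script of your own):

```bash
# Runs in your current session; you get the prompt back only when it
# finishes, and the job dies if you disconnect
sh my_script.sh

# Hands the script to the Slurm scheduler and returns immediately,
# printing something like: Submitted batch job 12345
sbatch my_script.sh
```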
Running an array
The tasks in a job array run simultaneously rather than one at a time, which makes arrays very fast! They are also very simple to run:
```bash
sbatch --array=1-10 my_script.sh
```
This will run my_script.sh 10 times, with the task IDs 1,2,3,4,5,6,7,8,9,10. We can specify any task IDs we want:
```bash
sbatch --array=1,7,12 my_script.sh
```
This will run my_script.sh 3 times, with the task IDs 1, 7, and 12.
Of course, we don't want to run the same job on the same input files over and over; that would be pointless. We can use the task IDs within our script to specify different input or output files. In bash, Slurm makes the task ID available as the special variable ${SLURM_ARRAY_TASK_ID}.
How can I use ${SLURM_ARRAY_TASK_ID}?
The value of ${SLURM_ARRAY_TASK_ID} is simply the task ID. If we run:

```bash
sbatch --array=1,7 my_script.sh
```

This will start two tasks, one where ${SLURM_ARRAY_TASK_ID} is 1 and one where it is 7.
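You can see this for yourself with a tiny script (a minimal sketch; echo_task.sh is a hypothetical name):

```bash
#!/bin/bash
# echo_task.sh: print which array task is running
echo "This is task ${SLURM_ARRAY_TASK_ID}"
```

Submitting it with sbatch --array=1,7 echo_task.sh produces one output file per task (by default named slurm-<jobID>_<taskID>.out), one containing "This is task 1" and the other "This is task 7".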
There are several ways we can use this. If we plan ahead and name our files with these numbers (e.g., sample_1.fastq, sample_2.fastq) we can directly refer to these files in our script: sample_${SLURM_ARRAY_TASK_ID}.fastq
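For example, here is a minimal sketch assuming files named sample_1.fastq, sample_2.fastq, and so on exist in the working directory (the wc command is just a stand-in for your real processing step):

```bash
#!/bin/bash
# Each array task processes the file whose number matches its task ID
wc -l sample_${SLURM_ARRAY_TASK_ID}.fastq
```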
However, using the task ID for input file names is often not a great idea, as it forces you to strip away most of the descriptive information you might otherwise put in those names.
Instead, we can keep our sample names in a separate file and use awk to pull out the name matching each task ID.
Here is our complete list of long sample names, found in our file samples.txt:

```
DMSO_control_day1_rep1
DMSO_control_day1_rep2
DMSO_control_day2_rep1
DMSO_control_day2_rep2
DMSO_KO_day1_rep1
DMSO_KO_day1_rep2
DMSO_KO_day2_rep1
DMSO_KO_day2_rep2
Drug_control_day1_rep1
Drug_control_day1_rep2
Drug_control_day2_rep1
Drug_control_day2_rep2
Drug_KO_day1_rep1
Drug_KO_day1_rep2
Drug_KO_day2_rep1
Drug_KO_day2_rep2
```
If we renamed all of these to 1-16, we would lose a lot of information that may be helpful to have on hand. If these are all SAM files and we want to convert them to BAM files, our script could look like this:
```bash
#!/bin/bash
# Grab the sample name on the line of samples.txt matching this task's ID
file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)
samtools view -S -b ${file}.sam > ${file}.bam
```
Since we have sixteen samples, we would run this as:

```bash
sbatch --array=1-16 my_script.sh
```
So what is this script doing? The line file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt) pulls the line of samples.txt whose line number (awk's NR) matches the task ID. We assign that sample name to a variable called ${file} and use it to build our command.
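To see what a single task does, you can mimic a task ID by hand outside of Slurm (here 5, which corresponds to the fifth line of samples.txt):

```bash
SLURM_ARRAY_TASK_ID=5
file=$(awk -v awkvar="${SLURM_ARRAY_TASK_ID}" 'NR==awkvar' samples.txt)
echo ${file}
# DMSO_KO_day1_rep1
# ...so that task would run:
# samtools view -S -b DMSO_KO_day1_rep1.sam > DMSO_KO_day1_rep1.bam
```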
We will come back to this awk one-liner in our Needle in a Haystack module!
Enjoy Slurm!
This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.