Bulk RNA-seq Data Analysis using High-Performance Computing (bulk RNA-seq Part I – FASTQ to counts)
- Understand the necessity for, and use of, the command line interface (bash) and HPC for analyzing high-throughput sequencing data.
- Understand best practices for designing an RNA-seq experiment and analyzing the resulting data.
- FileZilla Client (make sure you get ‘FileZilla Client’)
- A plain text editor such as Sublime Text
- These materials focus on the use of local computational resources at Harvard, which are only accessible to Harvard affiliates
- Non-Harvard folks can download the data and set up to work on their local clusters (with the help of local system administrators)
Instructions for Harvard researchers with access to HMS-RC’s O2 cluster
To run through the code in the lessons below, you will need to be logged into O2 and working on a compute node (i.e. your command prompt should have the word `compute` in it).
- Log in using `ssh ecommonsID@o2.hms.harvard.edu` and enter your password.
- Once you are on the login node, use `srun --pty -p interactive -t 0-2:30 --mem 1G /bin/bash` to get on a compute node, or use the command specified in the lesson.
- Proceed only once your command prompt has the word `compute` in it.
- If you log out between lessons (using the `exit` command twice), please follow the first two steps above to log back in and get on a compute node when you resume the self-learning.
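The check described in the steps above (is the session on a compute node yet?) can be sketched as a small shell helper. This is only an illustrative sketch: the function name `on_compute_node` is hypothetical, and it assumes that the host name of a compute node contains the word "compute", in the same way the prompt does. If your prompt is configured differently, rely on the prompt itself instead.

```bash
#!/bin/sh
# Hypothetical helper, not part of the workshop materials.
# Assumption: compute-node host names contain the word "compute",
# mirroring what the command prompt shows.
on_compute_node() {
    # Use the first argument if given, otherwise the current host name.
    _host="${1:-${HOSTNAME:-$(hostname 2>/dev/null)}}"
    case "$_host" in
        *compute*) return 0 ;;
        *) return 1 ;;
    esac
}

if on_compute_node; then
    echo "On a compute node: OK to run the lesson code."
else
    echo "Not on a compute node. From the login node, run:"
    echo "  srun --pty -p interactive -t 0-2:30 --mem 1G /bin/bash"
fi
```

Running this right after logging in should print the reminder with the `srun` command from the steps above; running it again after `srun` starts an interactive session should report that it is safe to proceed, provided the host-name assumption holds.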
- Introduction to RNA-seq
- Shell basics review
- Working in an HPC environment - Review
- Best Practices in Research Data Management (RDM)
- Project Organization (using Data Management best practices)
- Quality Control of Sequence Data: Running FASTQC
- Experimental design considerations
- Quality Control of Sequence Data: Running FASTQC on multiple samples
- Quality Control of Sequence Data: Evaluating FASTQC reports
- Sequence Alignment Theory
- Quantifying expression using alignment-free methods (Salmon on multiple samples)
- QC with Alignment Data
- Documenting Steps in the Workflow with MultiQC
- Troubleshooting RNA-seq Data Analysis
- Automating the RNA-seq workflow
- Experimental design (one possible solution)
- FASTQC sbatch script
- FASTQC sbatch script .out file
- FASTQC sbatch script .err file
- sbatch script to run salmon for all samples
- Automation Script
Building on this workshop
- Introduction to R workshop materials
- Bulk RNA-seq Part II (differential gene expression analysis) materials
- Video about statistics behind salmon quantification
- Advanced bash for working on O2:
- Obtaining reference genomes or transcriptomes
These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.