Searching and redirection

Learning objectives

Search for characters or patterns in a text file using the grep command
Write to and append a file using output redirection
Use the pipe (|) character to chain together commands

Approximate time: 60 minutes

Searching files with `grep` command

We went over how to search within a file using less. We can also search within files without even opening them, using grep. Simply put grep is a command-line utility for searching plain-text data sets for lines matching a pattern or regular expression (regex).

Why the word “grep”? It is a shortened form of globally search for a regular expression and print matching lines (g/re/p).

The syntax for grep is as follows: grep search_term filename. The pattern that we want to search is specified in search_term slot, and the file we want to search within is specified in the filename slot. Let’s give it a try by searching the FASTQ files in the raw_fastq directory.

FASTQ files contain the sequencing reads (nucleotide sequences) output from a sequencing facility. Each sequencing read in a FASTQ file is associated with four lines, with the first line (header line) always starting with an @ symbol. A whole FASTQ record for a single read should appear similar to the following:

@HWI-ST330:304:H045HADXX:1:1101:1111:61397
CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
+
B?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################

More information about the FASTQ file format

Line Description

1 Read name preceded by ‘@’

2 The actual DNA sequence

3 Read name (same as line 1) preceded by a ‘+’ or just a ‘+’ sign

4 String of characters which represent the quality score of each nucleotide in line 2; must have same number of characters as line 2

You can find more information about FASTQ files in this lesson from our RNA-seq workshop.

Line	Description
1	Read name preceded by ‘@’
2	The actual DNA sequence
3	Read name (same as line 1) preceded by a ‘+’ or just a ‘+’ sign
4	String of characters which represent the quality score of each nucleotide in line 2; must have same number of characters as line 2

Suppose we want to see how many reads in our file Mov10_oe_1.subset.fq contain “bad” data, i.e. reads with 10 consecutive Ns (NNNNNNNNNN).

$ cd ~/unix_lesson/raw_fastq

$ grep NNNNNNNNNN Mov10_oe_1.subset.fq

We get back a lot of reads or lines of text!

What if we wanted to see the whole FASTQ record for each of these reads? We would need to modify the default behavior of grep and specify some argument/options. To look for all available options for the grep command, we can type grep --help (or man grep).

Looks like the -B and -A arguments for grep will be useful to return the matched line plus one before (-B 1) and two lines after (-A 2). Since each record is four lines, using these arguments will return the whole record. Within the whole record, the second line will be the actual sequence that has the pattern we searched for.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq

@HWI-ST330:304:H045HADXX:1:1101:1111:61397
CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
+
@?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################
--
@HWI-ST330:304:H045HADXX:1:1101:1106:89824
CACAAATCGGCTCAGGAGGCTTGTAGAAAAGCTCAGCTTGACANNNNNNNNNNNNNNNNNGNGNACGAAACNNNNGNNNNNNNNNNNNNNNNNNNGTTGG
+
?@@DDDDDB1@?:E?;3A:1?9?E9?<?DGCDGBBDBF@;8DF#########################################################

Exercises

Search for the sequence CTCAATGAGCCA in Mov10_oe_1.subset.fq. How many sequences do you find?
In addition to finding the sequence, how can you modify the command so that your search also returns the name of the sequence?
If you want to search for that sequence in all Mov10 replicate fastq files, what command would you use?

Answers

Question 1
grep CTCAATGAGCCA Mov10_oe_1.subset.fq
The output returns 5 sequences.

Question 2
grep -B 1 CTCAATGAGCCA Mov10_oe_1.subset.fq

Question 3
grep CTCAATGAGCCA Mov10*
The output returns 5 sequences for Mov10_oe_1 and 3 sequences for Mov10_oe_2.

More about searching text files with `grep`

Group separators (`--`), and how to remove them

You will notice that when we use the -B and/or -A arguments with the grep command, the output has some additional lines with dashes (--), these dashes work to separate your returned “groups” of lines and are referred to as “group separators”. This might be problematic if you are trying to maintain the FASTQ file structure or if you simply do not want them in your output. Using the argument --no-group-separator with grep will disable this behavior:

$ grep -B 1 -A 2 --no-group-separator NNNNNNNNNN Mov10_oe_1.subset.fq

Now your output should be returned as:

@HWI-ST330:304:H045HADXX:1:1101:1111:61397
CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
+
@?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################
@HWI-ST330:304:H045HADXX:1:1101:1106:89824
CACAAATCGGCTCAGGAGGCTTGTAGAAAAGCTCAGCTTGACANNNNNNNNNNNNNNNNNGNGNACGAAACNNNNGNNNNNNNNNNNNNNNNNNNGTTGG
+
?@@DDDDDB1@?:E?;3A:1?9?E9?<?DGCDGBBDBF@;8DF#########################################################

Which line number has a match?

Another useful option when using grep is the -n option, which will print out the line number from the file for the match. Adding this option to our previous command would work like this:

$ grep -B 1 -A 2 --no-group-separator -n NNNNNNNNNN Mov10_oe_1.subset.fq

This would return the output:

861213-@HWI-ST330:304:H045HADXX:1:1101:1111:61397
861214:CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNNCGAGGCCCTGGGGTAGAGGGNNNNNNNNNNNNNNGATCTTGG
861215-+
861216-@?@DDDDDDHHH?GH:?FCBGGB@C?DBEGIIIIAEF;FCGGI#########################################################
861953-@HWI-ST330:304:H045HADXX:1:1101:1106:89824
861954:CACAAATCGGCTCAGGAGGCTTGTAGAAAAGCTCAGCTTGACANNNNNNNNNNNNNNNNNGNGNACGAAACNNNNGNNNNNNNNNNNNNNNNNNNGTTGG
861955-+
861956-?@@DDDDDB1@?:E?;3A:1?9?E9?<?DGCDGBBDBF@;8DF#########################################################

A small thing you should note is that when using the -n option, lines that have a : after the line number correspond to the lines with the match (e.g 861214:CACTTGTAAGGGCAGGCCCCCTTCACCCTCCCGCTCCTGGGGGANNNNNNNNNNANNN...), while lines with a - after the line number are the surrounding lines retrieved when using the -A and/or -B options (e.g. 861213-@HWI-ST330:304:H045HADXX:1:1101:1111:61397).

Only returning lines that DO NOT match

One last grep option you might find quite useful is the -v option, which does an inverted match. This will return everything that does not match the pattern. In order to demonstrate this let’s first view a smaller file that you have.

$ cat ~/unix_lesson/other/Mov10_rnaseq_metadata.txt

This is the metadata file for the FASTQ data we are looking at. This should return:

sample	celltype
OE.1	Mov10_oe
OE.2	Mov10_oe
OE.3	Mov10_oe
IR.1	normal
IR.2	normal
IR.3	normal

Now, let’s consider the case that we didn’t want to output the “normal” cell type. We can use the -v option in grep like this:

$ grep -v normal ~/unix_lesson/other/Mov10_rnaseq_metadata.txt 

This will return all of lines that don’t have “normal” in the line.

sample	celltype
OE.1	Mov10_oe
OE.2	Mov10_oe
OE.3	Mov10_oe

Redirection

When we use grep, the matching lines print to the Terminal (also called Standard Output or “stdout”). If the result of the grep search is a few lines, we can view them easily, but if the output is very long, the lines will just keep printing and we won’t be able to see anything except the last few lines. You have experienced this when you searched for the pattern NNNNNNNNNN. How can we capture them instead?

We can do that with something called “redirection”. The idea is that we’re redirecting the output from the Terminal (all the stuff that went whizzing by) to something else. In this case, we want to save it to a file, so that we can look at it later.

Redirecting with “>”

The redirection command for writing something to file is >.

Let’s try it out and put all the sequences that contain ‘NNNNNNNNNN’ from the Mov10_oe_1.subset.fq into another file called bad_reads.txt.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq > bad_reads.txt

The prompt will go away for a little bit and then you will get it back, but nothing will be printed on the Terminal. But, you should have a new file called bad_reads.txt!

$ ls -l

Take a look at the file and see if it contains what you think it should.

NOTE: If we already had a file named bad_reads.txt in our directory, it would have overwritten it without any warning!

Redirecting (and appending) with “»”

The redirection command for appending something to an existing file is >>.

If we use >>, it will append to the existing content in a file, rather than overwrite it. This can be useful for saving more than one search. For example, the following command will append the bad reads from Mov10_oe_2 to the bad_reads.txt file that we just generated.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_2.subset.fq >> bad_reads.txt

$ ls -l

Did the size of the bad_reads.txt file change?

Since our bad_reads.txt file isn’t a raw_fastq file, we should move it to a different location within our directory. Let’s move it to the other folder using the command mv.

$ mv bad_reads.txt ../other/

Redirecting with pipes “|” (or piping)

The redirection command for using the output of a command as input for a different command is |.

The pipe key (|) is very likely not something you use very often (it is on the same key as the back slash (\), right above the key).

What | does is take the output from one command, e.g. the output from grep that went whizzing by and runs it through the command specified after it. When it was all whizzing by before, we wished we could just take a look at it! Maybe we could use less instead of the rapid scroll. Well, it turns out that we can! We can pipe the output grep command to less to slowly scroll through, or to head to just see the first few lines.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq | less

Now we can use the arrows to scroll up and down and use q to get out.

Or we could just take a glance to see what the output looks like.

$ grep -B 1 -A 2 NNNNNNNNNN Mov10_oe_1.subset.fq | head -n 5

Another thing we can also do is count the number of lines output by grep.

The wc command stands for word count. This command counts the number of lines, words and characters in the text input given to it. The -l argument will only count the number of lines instead of counting everything.

$ grep NNNNNNNNNN Mov10_oe_1.subset.fq | wc -l

Try it out without the -l to see the full output.

Tip - Similar to grep, you can type wc --help or man wc to see all options.

About Pipes:

The pipe is a very important/powerful concept in Shell
You can string along as many commands together as you like

The philosophy behind the three redirection operators (>, >>, |) you have learned so far is that none of them by themselves do a lot. BUT when you start chaining them together, you can do some really powerful things really efficiently.

To be able to use the shell effectively, becoming proficient in the use of the pipe and redirection operators is essential.

Practice with searching and piping/redirection

Let’s use the new commands in our toolkit and a few new ones to examine the “gene annotation” file, chr1-hg19_genes.gtf. We will be using this file to find the genomic coordinates of all known exons on chromosome 1.

$ cd ~/unix_lesson/reference_data/

Introduction to the GTF file format

Let’s explore our chr1-hg19_genes.gtf file a bit. What information does it contain?

$ less chr1-hg19_genes.gtf

chr1    unknown exon    14362   14829   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    14970   15038   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    15796   15947   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    16607   16765   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";
chr1    unknown exon    16858   17055   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "NR_024540"; tss_id "TSS7245";

The GTF (Gene Transfer Format) file is a tab-delimited file with information arranged in a very specific manner, usually for NGS analysis. That information is specifically about all the various entities or features found in a genome; in this case, on chromosome 1.

For more information on this file format, check out the Ensembl site.

The columns in the GTF file contain the genomic coordinates (location) of gene features (exon, start_codon, stop_codon, CDS) along with the associated gene_names, transcript_ids and protein_ids (p_id).

Given our understanding of splice isoforms, we know that a given exon can be part of 2 or more different transcripts generated from the same gene. In a GTF file, this exon will be represented multiple times, once for each transcript (or splice isoform). For example,

$ grep PLEKHN1 chr1-hg19_genes.gtf | head -n 5

This search returns two different transcripts of the same gene, NM_001160184 and NM_032129, that contain the same exon.

Two new commands!

Before our practice, let’s learn two new commands.

cut is a command that extracts columns from files.

We will use cut with the -f argument to specify which specific fields or columns from the dataset we want to extract. Let’s say we want to get the 1st column (chromosome number) and the 4th column (starting genomic position) from chr1-hg19_genes.gtf file, we can say:

$ cut -f 1,4 chr1-hg19_genes.gtf  | head

The cut command assumes our data columns are separated by tabs (i.e. tab-delimited). The chr1-hg19_genes.gtf is a tab-delimited file, so the default cut command works for us. However, data can be separated by other types of delimiters like “,” or “;”. If your data is not tab delimited, there is an argument you can add to your cut command, -d to specify the delimiter (e.g. -d "," with a .csv file).

sort is a command used to sort the contents of a file in a particular order. It has arguments that let you pick which column to sort by (-k), what kind of sorting you want to do (numeric n) and also if the result of the sorting should only return unique (-u) values. These are just 2 of the many features of the sort command.

Let’s do a quick test of how the -u argument returns only unique lines (and remove duplicates).

$ cut -f 1,4 chr1-hg19_genes.gtf | wc -l

How many lines are returned to you?

Click here to check your output

Your command should have returned 76,767 lines.

Now apply the sort -u command before counting the lines.

$ cut -f 1,4 chr1-hg19_genes.gtf | sort -u | wc -l

How many lines do you see now?

Click here to check your output

Your command should have returned 27,852 lines.

Exercises

Now that we know what type of information is inside the GTF file, let’s use the commands we have learned so far to answer a simple question about our data: how many unique exons are present on chromosome 1 using chr1-hg19_genes.gtf?

To determine the number of unique exons on chromosome 1, we are going to perform a series of steps as shown below. In this exercise, you need to figure out the command line for each step.

Extract only the genomic coordinates of exon features
Subset dataset to only keep genomic coordinates
Remove duplicate exons
Count the total number of exons

Your end goal is to have a single line of code, wherein you have strung together multiple commands using the pipe operator. But, we recommend that you do it in a stepwise manner as detailed below.

1. Extract only the genomic coordinates of exon features

We only want the exons (not CDS or start_codon features), so let’s use grep to search for the word “exon”. You should do sanity check on the first few lines of the output of grep by piping the result to the head command. Report the command you have at this stage.

2. Subset the extracted information from step 1 to only keep genomic coordinates

We will define the uniqueness of an exon by its genomic coordinates, both start and end. Therefore, from the step 1 output, we need to keep 4 columns (chr, start, stop, and strand) to find the total number of unique exons. The column numbers you want are 1, 4, 5, and 7.

You can use cut to extract those columns from the output of step 1. Report the command you have at this stage.

At this point, the first few lines should look like this:

chr1	14362	14829	-
chr1	14970	15038	-
chr1	15796	15947	-
chr1	16607	16765	-
chr1	16858	17055	-

3. Remove duplicate exons

Now, we need to remove those exons that show up multiple times for different transcripts. We can use the sort command with the -u option. Report the command you have at this stage.

Do you see a change in how the sorting has changed? By default the sort command will sort and what you can’t see here is that it has removed the duplicates. We will use step 4 to check if this step worked.

4. Count the total number of exons

First, check how many lines we would have without using sort -u by piping the output to wc -l.

Now, to count how many unique exons are on chromosome 1, we will add back the sort -u and pipe the output to wc -l. Do you observe a difference in number of lines?

Report the command you have at this stage and the number of lines you see with and without the sort -u.

Answers

Question 1
grep exon chr1-hg19_genes.gtf | head

Question 2
grep exon chr1-hg19_genes.gtf | cut -f 1,4,5,7 | head

Question 3
grep exon chr1-hg19_genes.gtf | cut -f 1,4,5,7 | sort -u | head

Summary!

For what we did in one line of code in the exercise at the end, we could have done it in multiple steps by saving the output of each command to a new file. But that would not be as efficient as using pipes! All we needed was the number of unique exons, and the intermediate output was not useful. Avoiding the storage of the data from each of the intermediate steps prevents clutter, and helps us avoid wasting precious storage space!

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The materials used in this lesson were derived from work that is Copyright © Data Carpentry (http://datacarpentry.org/). All Data Carpentry instructional material is made available under the Creative Commons Attribution license (CC BY 4.0).
Adapted from the lesson by Tracy Teal. Contributors: Paul Wilson, Milad Fatenejad, Sasha Wood, and Radhika Khetani for Software Carpentry (http://software-carpentry.org/)