Learning Objectives

In this lesson, we will:

Utilize grep for searching for through files
Implement regular expressions within grep

grep

grep, short for Global Regular Expression Print, is a Unix command used to search files for characters that match a specified pattern, referred to as a string. In this lesson, we will demonstrate the simple use of grep to search for strings of characters within a file. Alternatively to providing the exact characters we want to search, we can also use regular expressions. Regular expressions (sometimes referred to as regex) are a string of characters that can be used as a pattern to match against.

Getting Started

Before we get started, let’s take a briefly look at the catch.txt file in a less buffer in order to get an idea of what the file looks like:

less catch.txt

In here, you can see that we have a variety of case differences and misspellings. These differences are not exhaustive, but they will be helpful in exploring how regular expressions are implemented in grep.

In its simplest use, the grep command only requires the pattern you are searching for followed by the file name. Let’s say our pattern is CAT, then our command would be:

grep CAT catch.txt

Flags for modifying the function of `grep`

There are additional flags for the grep command that are very useful and allow you to modify the output that is rerieved. For example, adding -c will count the number of matching lines rather than printing them all out to screen:

grep -c CAT catch.txt

There is a -E option when using grep that allows the user to use what is considered “extended regular expressons”. We won’t use too many of these types of regular expressions, but some do need this option. If you want to make it a habit to always use the -E option when using regular expressions in grep it is a bit more explicit. In this lesson, we will always include the -E option when using regular expressions.

To learn more about grep and its usage, you can type man grep or grep --help into the terminal.

Quotations

When using grep it is usually not required to put your search term in quotes. However, if you would like to use grep to do certain types of searches, it is best to wrap your search term in double quotations to avoid any issues with edge cases. Let’s briefly discuss the differences:

No quotation

If you are using grep to search and have whitespace (space or tabs) in your search, grep will treat the expression before the whitespace as the search term and the expression after the whitespace(s) as a file(s). As a result, if your search term doesn’t have whitespace it doesn’t matter if you put quotations, but if it does, then it won’t behave the way you’d like it to behave.

Single quotations

So grep doesn’t ever “see” quotation marks, but rather quotation marks are interpreted by bash first and then the result is passed to grep. The big advantage of using quotation marks, single or double, when using grep is that it allows you to use search expressions with whitespace in them. However, within bash, single-quotation marks (') are intepreted literally, meaning that the expression within the quotation marks will be interpreted by bash EXACTLY the way it is written. Notably, bash variables within single-quotations are NOT expanded. What we mean by this is that if you were to have a variable named at that holds AT:

at=AT

If you used grep while using single quotes like:

grep 'C${at}CH' catch.txt

It would only return:

C${at}CH

This is because it searches for the term without expanding (replacing the bash variable with what it stands for) the ${at} variable.

Double Quotations

Double quotations are typically the most useful because they allow the user to search for whitespace AND allows for bash to expand variables, so that now:

grep "C${at}CH" catch.txt

Returns:

CATCH

Additionally, if you would like to be able to literally search something that looked like a bash variable, you can do this just by adding a \ before the ${variable} to “escape” it from bash expansion. For example:

grep "C\${at}CH" catch.txt

Will return:

C${at}CH

`grep` with Regular Expressions

Ranges

Now that we have gotten some basics on the grep command out of the way, let’s start implementing some regular expressions into our grep commands.

A range of acceptable characters can be given to grep with []. Square brackets can be used to notate a range of acceptable characters in a position. For example:

grep -E "[BPL]ATCH" catch.txt

Will return:

PATCH
BATCH

It would have also returned LATCH had it been in the file, but it wasn’t.

You can also use - to denote a range of characters like:

grep -E "[A-Z]ATCH" catch.txt

Which will return every match that has an uppercase A through Z in it followed by “ATCH”:

PATCH
BATCH
CATCH
CAATCH
CAAATCH
CAAAATCH

You can also merge different ranges together by putting them right after each other or separating them by a | (in this case | stands for “or” and is not a pipe):

grep -E "[A-Za-z]ATCH" catch.txt
# OR
grep -E "[A-Z|a-z]ATCH" catch.txt

This will return:

PATCH
BATCH
CATCH
pATCH
bATCH
cATCH
CAATCH
CAAATCH
CAAAATCH

In fact, regular expression ranges generally follow the ASCII alphabet, (but your local character encoding may vary) so:

grep -E "[0-z]ATCH" catch.txt

Will return:

PATCH
BATCH
CATCH
pATCH
bATCH
cATCH
2ATCH
:ATCH
^ATCH
CAATCH
CAAATCH
CAAAATCH

However, it is important to also note that the ASCII alphabet has a few characters between numbers and uppercase letters such as : and >, so you would also match :ATCH and >ATCH (if it was in the file), repectively. There are also a few symbols between upper and lowercase letters such as ^ and ], which match ^ATCH and ]ATCH (if it was in the file), respectively.

Thus, if you would only want to search for numbers, uppercase letters and lowercase letters, but NOT these characters in between, you would need to modify the range:

grep -E "[0-9A-Za-z]ATCH" catch.txt

Which will return:

PATCH
BATCH
CATCH
pATCH
bATCH
cATCH
2ATCH
CAATCH
CAAATCH
CAAAATCH

You can also note that since these characters follow the ASCII character encoding order, [Z-A] will give you an error telling you that it is an invalid range because Z comes after A, thus you can’t search from Z forward to A.

# THIS WILL PRODUCE AN ERROR
grep -E "[Z-A]ATCH" catch.txt

Another trick with ranges is the use of ^ within [] functions as a “not” function. For example:

grep -E "[^C]ATCH" catch.txt

Will return:

PATCH
BATCH
pATCH
bATCH
cATCH
2ATCH
:ATCH
^ATCH
CAATCH
CAAATCH
CAAAATCH

This will match anything ending in ATCH except a string containing CATCH.

IMPORTANT NOTE: ^ has a different function when used outside of the [] that is discussed below in anchoring.

Bioinformatics Example

The FASTQ file format is the de facto file format for sequence reads generated from next-generation sequencing technologies. This file format evolved from FASTA in that it contains sequence data, but also contains quality information. Similar to FASTA, the FASTQ file begins with a header line. The difference is that the FASTQ header is denoted by a @ character. For a single record (sequence read), there are four lines, each of which are described below:

Line	Description
1	Always begins with ‘@’, followed by information about the read
2	The actual DNA sequence
3	Always begins with a ‘+’, and sometimes the same info as in line 1
4	Has a string of characters representing the quality scores; must have same number of characters as line 2

Let’s search our Mov10_oe_1.subset.fq FASTQ file for sequences that match “TGGGCTAATG”. What command would we use to do this and how many matches do we get?

Click here for the answer

grep "TGGGCTAATG" Mov10_oe_1.subset.fq
We see that we get 4 matchs.

Now let’s further refine our search to only have results that match A, T or G preceeding the “TGGGCTAATG” in Mov10_oe_1.subset.fq. How would we do this and how many matches do we get now?

Click here for the answer

grep "[ATG]TGGGCTAATG" Mov10_oe_1.subset.fq
We only get 1 match now.

Special Characters

Period (.)

The . matches any character except new line. Notably, it also does not match no character. This is similar to the behavior of the wildcard ? in bash. For example:

grep -E ".ATCH" catch.txt

Will return:

PATCH
BATCH
CATCH
pATCH
bATCH
cATCH
2ATCH
:ATCH
^ATCH
CAATCH
CAAATCH
CAAAATCH

But this result will not include ATCH.

Quantifiers

Asterisk (*)

The * matches the preceeding character any number of times including zero times. For example:

grep -E "CA*TCH" catch.txt

Will return:

CATCH
CTCH
CAATCH
CAAATCH
CAAAATCH

Question Mark (?)

The ? denotes that the previous character is optional, in the following example:

grep -E "CA?TCH" catch.txt

Will return:

CATCH
CTCH

Since the “A” is optional, it will only match CATCH or CTCH, but not anything else, including COTCH which was also in our file.

Curly Brackets ({})

The {INTEGER} matches the preceeding character the number of times equal to INTEGER. For example:

grep -E "CA{3}TCH" catch.txt

Will return only:

CAAATCH

NOTE: This is one of the cases that needs the -E option, otherwise it won’t return anything. Alternatively, you can also escape the curly brackets and then you don’t need the -E option.
grep "CA\{3\}TCH" catch.txt

Bioinformatics Example

Within the Mov10_oe_1.subset.fq FASTQ files, places in the sequencing where the sequencer could not assign a base are given the base call of N. If there are too many consectuive Ns, it can likely be an indication of a poor sequencing read. Use grep to extract the lines that have 10 or more consecutive Ns.

Click here for the answer

grep -E "N{10}" Mov10_oe_1.subset.fq

Plus (+)

The + matches one or more occurrances of the preceeding character. For example:

grep -E "CA+TCH" catch.txt

Will return:

CATCH
CAATCH
CAAATCH
CAAAATCH

Anchors

Anchors are really useful tools in regular expressions because they specify if a pattern has to be found at the beginning or end of a line.

Carrot (^)

The ^ character anchors the search criteria to the beginning of the line. For example:

grep -E "^CAT" catch.txt

Will return:

CATCH
CAT

Importantly, it won’t return BOBCAT, which is also in the file, because that line doesn’t start with CAT.

REMINDER: ^ within [] functions acts as “not”!

Dollar Sign ($)

The $ character anchors the search criteria to the end of the line. For example:

grep -E "CAT$" catch.txt

Will return:

CAT
BOBCAT

This won’t match CATCH because the line doesn’t end with CAT.

Literal matches

One problem you will likely run into with these above special characters is that you may want to match one. For example, you may want to match . or ? and this is what the escape, \, is for. For example:

grep -E "C\?TCH" catch.txt

Will return:

C?TCH

It will not return CATCH or COTCH or others like C?TCH would do.

Whitespace and new lines

You can search for a tab with \t, a space with \s and a newline with \n. For example:

grep -E "CA\tTCH" catch.txt

Will return:

CA  TCH

Examples of Combining Special Characters

Much of the power from regular expression comes from how you can combine them to match the pattern you want. Below are a few examples of such:

1) If you want to find any line that starts with uppercase letters A-G, then you could do:

grep -E "^[A-G]" catch.txt

Which will return:

BATCH
CATCH
CTCH
CAATCH
CAAATCH
CAAAATCH
ATCH
CAT
BOBCAT
C?TCH
CA	TCH
C${at}CH
COTCH

2) Perhaps you want to find all lines ending with CA followed by any character except T, then you could do:

grep -E "CA[^T]$" catch.txt

This will return:

TAXICAB
TINCAN

3) We could be interersted in finding lines that start with C and end with CH with anything, including nothing, in between.

grep -E "^C.*CH$" catch.txt

This will return:

CATCH
CTCH
CAATCH
CAAATCH
CAAAATCH
C?TCH
CA	TCH
C${at}CH
COTCH

Exercises

1) Use grep to find all matches in catch.txt that start with “B” and have a “T” anywhere in the string after the “B”.

Click here for the answer

grep -E "^B.*T" catch.txt

2) Use grep to find all matches in catch.txt that don’t start with “C” and don’t end with “H”

Click here for the answer

grep -E "^[^C].*[^H]$" catch.txt

3) Use grep to find all matches in catch.txt that have atleast two “A”s in them

Click here for the answer

grep -E "A.*A" catch.txt

Take Home Message

Regex for bioinformatic applications is generally used in combination with grep, sed or awk and even other programming languages to pull specific information out of large files. You DO NOT need to memorize all of the syntax here. Instead merely bookmark it and use it as a resource for writing commands moving forward. You can also check out our bonus lesson on string manupulation!! As you write more shell commands you will become familiar with more common regex ([], ^, *).

Additional Resources

https://hbctraining.github.io/In-depth-NGS-Data-Analysis-Course/sessionVI/lessons/extra_bash_tools.html#regular-expressions-regex-in-bash-

Next Lesson »

Back to Schedule

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Learning Objectives

grep

Getting Started

Flags for modifying the function of grep

Quotations

No quotation

Single quotations

Double Quotations

grep with Regular Expressions

Ranges

Bioinformatics Example

Special Characters

Period (.)

Quantifiers

Asterisk (*)

Question Mark (?)

Curly Brackets ({})

Bioinformatics Example

Plus (+)

Anchors

Carrot (^)

Dollar Sign ($)

Literal matches

Whitespace and new lines

Examples of Combining Special Characters

Exercises

Take Home Message

Additional Resources

Flags for modifying the function of `grep`

`grep` with Regular Expressions