Overview


Setting up


# Change directories
$ cd ~/unix_lesson

# Copy over a file
$ cp /n/groups/hbctraining/ngs-data-analysis-longcourse/unix_lesson/bicycle.txt .

Regular expressions (regex) in bash

“A regular expression, regex or regexp (sometimes called a rational expression) is a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings.” -Wikipedia

“The specific syntax rules vary depending on the specific implementation, programming language, or library in use. Additionally, the functionality of regex implementations can vary between versions of languages.” -Wikipedia

Below is a small subset of characters that can be used for pattern generation in bash.

Special Characters:

Examples:

above examples excerpted from Wikipedia

Non printable characters:


Reintroducing grep (global regular expression print)

As we saw in session I, grep parses its input line by line and, by default, prints every line that matches the pattern of interest. The pattern can be a regular expression (regex).

grep usage:

cat file | grep pattern

OR

grep pattern file

grep common options:


Examples of grep usage

$ grep -c bicycle bicycle.txt

$ grep "bicycle bicycle" bicycle.txt 

$ grep ^bicycle bicycle.txt
$ grep ^Bicycle bicycle.txt 

$ grep yeah$ bicycle.txt

$ grep "[SJ]" bicycle.txt

$ grep "^[SJ]" bicycle.txt
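
The anchors and character classes used above can be tried on a tiny throwaway file. This is only a sketch: the sample lines and the variable names (`sample`, `count`, and so on) are made up for illustration and are not the contents of bicycle.txt.

```shell
# Build a small sample file; these lines are invented for this example.
sample=$(mktemp)
printf '%s\n' \
  'Bicycle bicycle bicycle' \
  'I want to ride my bicycle' \
  'Superman or Jaws' \
  'bicycle races, yeah' > "$sample"

# -c counts matching LINES, not individual occurrences of the word.
count=$(grep -c 'bicycle' "$sample")

# ^ anchors the pattern to the start of the line; matching is case-sensitive.
starts_lower=$(grep '^bicycle' "$sample")

# $ anchors the pattern to the end of the line.
ends_yeah=$(grep 'yeah$' "$sample")

# [SJ] matches one character that is either S or J.
# Quoting keeps the shell from treating [SJ] as a filename glob.
has_s_or_j=$(grep '[SJ]' "$sample")

echo "lines containing bicycle: $count"
echo "starts with bicycle: $starts_lower"
echo "ends with yeah: $ends_yeah"
echo "contains S or J: $has_s_or_j"

rm -f "$sample"
```

Note that the first sample line contains the word three times but is counted once, because `-c` reports lines rather than matches.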

Introducing sed

sed reads a stream from stdin (or a file), matches patterns against it, and writes the edited text to stdout (“Think amped-up Windows Find & Replace”).

sed usage:

cat file | sed 'command'

OR

sed 'command' file

sed common options:

Examples of sed usage

$ sed '1,2d' bicycle.txt

$ sed 's/Superman/Batman/' bicycle.txt 

$ sed 's/bicycle/car/' bicycle.txt 
$ sed 's/bicycle/car/g' bicycle.txt 

$ sed 's/.icycle/car/g' bicycle.txt

$ sed 's/bi*/car/g' bicycle.txt

$ sed 's/bicycle/tri*cycle/g' bicycle.txt | sed 's/tri*cycle/tricycle/g'   ## does this work?
$ sed 's/bicycle/tri*cycle/g' bicycle.txt | sed 's/tri\*cycle/tricycle/g'

$ sed 's/\s/\t/g' bicycle.txt
$ sed 's/\s/\\t/g' bicycle.txt

$ sed 's/\s//g' bicycle.txt
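
To see what the `g` flag and the escaped `*` are doing without relying on the contents of bicycle.txt, the sketch below runs the same kinds of substitutions on short made-up strings. The input text and variable names are illustrative assumptions only.

```shell
# Made-up input line for illustration (not taken from bicycle.txt).
line='bicycle bicycle bicycle'

# Without the g flag only the first match on each line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/bicycle/car/')

# With the g flag every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/bicycle/car/g')

# An unescaped * means "zero or more of the preceding character", so the
# pattern tri*cycle matches trcycle, tricycle or triicycle, but NOT the
# literal text tri*cycle. The input therefore passes through unchanged.
unescaped=$(printf '%s\n' 'tri*cycle' | sed 's/tri*cycle/tricycle/g')

# Escaping the star as \* makes it match a literal asterisk.
escaped=$(printf '%s\n' 'tri*cycle' | sed 's/tri\*cycle/tricycle/g')

# [[:space:]] is the POSIX character class for whitespace. \s and \t are
# not part of POSIX sed and may behave differently outside GNU sed.
no_space=$(printf '%s\n' "$line" | sed 's/[[:space:]]//g')

echo "$first_only"
echo "$all"
echo "$unescaped"
echo "$escaped"
echo "$no_space"
```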

Reintroducing awk

awk is a command/scripting language that splits text into records and fields, which can then be selected and displayed as a kind of ad hoc database. With awk you can perform many manipulations on these fields or records before they are displayed.

awk usage:

cat file | awk 'command'

OR

awk 'command' file

awk concepts:

Fields:

Fields are separated by white space by default, or you can specify a field separator (FS). The fields are denoted $1, $2, …, while $0 refers to the entire line. If FS is set to the empty string, gawk (and some other awk implementations) split the input line into one field per character.
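
A minimal sketch of field access, using a made-up record rather than any file from this lesson. `NF` is the standard awk variable holding the number of fields on the current line; the shell variable names are hypothetical.

```shell
# Made-up whitespace-separated record for illustration.
record='chr1 100 200 geneA'

# Default splitting on whitespace: $2 is the second field.
second=$(printf '%s\n' "$record" | awk '{print $2}')

# $0 is the whole line and NF is the number of fields on it.
summary=$(printf '%s\n' "$record" | awk '{print NF ": " $0}')

# -F sets the field separator (FS); here the line is split on colons.
colon_field=$(printf '%s\n' 'a:b:c' | awk -F ':' '{print $3}')

echo "$second"
echo "$summary"
echo "$colon_field"
```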

awk has some useful built-in variables (more exist, and the set varies by platform)

awk also supports more complex statements; some examples are below:

Please note that awk is a language on its own, and we will only be looking at some examples of its usage.

Examples of awk usage

$ awk '{print $3}' reference_data/chr1-hg19_genes.gtf | head

$ awk '{print $3 | "sort -u"}' reference_data/chr1-hg19_genes.gtf 

$ awk '{OFS = "\t" ; if ($3 == "stop_codon") print $1,$4,$5,$3,$10}' reference_data/chr1-hg19_genes.gtf | head
$ awk '{OFS = "\t" ; if ($3 == "stop_codon") print $1,$4,$5,$3,$10}' reference_data/chr1-hg19_genes.gtf | sed 's/"//g' | sed 's/;//g' | head

$ awk -F "\t" '{print $10}' reference_data/chr1-hg19_genes.gtf | head
$ awk -F "\t" '{print $9}' reference_data/chr1-hg19_genes.gtf | head

# head other/bad-reads.count.summary
$ awk -F ":" 'NR > 1 {sum += $2} END {print sum}' other/bad-reads.count.summary

# head ../rnaseq/results/counts/Mov10_featurecounts.Rmatrix.txt
$ awk 'NR > 1 {sum += $2} END {print sum}' ../rnaseq/results/counts/Mov10_featurecounts.Rmatrix.txt
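
The `NR > 1 {sum += $2} END {print sum}` idiom above can be checked on a small table where the answer is easy to verify by hand. The table below is invented for this sketch and does not come from the lesson's count files. The second command sets `OFS` in a `BEGIN` block, which runs once before any input is read, as an alternative to assigning it on every line.

```shell
# Made-up tab-separated table with a header row; values are illustrative only.
table=$(mktemp)
printf 'gene\tcount\n' >  "$table"
printf 'geneA\t10\n'   >> "$table"
printf 'geneB\t25\n'   >> "$table"
printf 'geneC\t7\n'    >> "$table"

# NR > 1 skips the header line; sum accumulates column 2;
# the END block runs once after the last line has been read.
total=$(awk -F '\t' 'NR > 1 {sum += $2} END {print sum}' "$table")

# A pattern before the action selects rows: here, data rows with count >= 10.
# OFS is the separator awk places between comma-separated print arguments.
big=$(awk -F '\t' 'BEGIN {OFS = "\t"} NR > 1 && $2 >= 10 {print $2, $1}' "$table")

echo "$total"
echo "$big"

rm -f "$table"
```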

These materials are adapted from training materials generated by FAS Research Computing at Harvard University.