
Learning Objectives

In this lesson, we will:

What is awk?

If you have ever looked up how to do a particular string manipulation in bash on Stack Overflow or Biostars, then you have probably seen someone give an awk command as a potential solution.

awk is an interpreted programming language designed for text processing. It is typically used as a data extraction and reporting tool, and it was especially designed to support one-liner programs, so you will often see the phrase “awk one-liner”. awk was created at Bell Labs in the 1970s, and its name comes from the surnames of its authors: Alfred Aho, Peter Weinberger, and Brian Kernighan. Because the name comes from initials, you will often see it written as AWK. awk shares a common history with sed and even grep, dating back to ed. As a result, some of the syntax and functionality may feel familiar.

I already know grep and sed, why should I learn awk?

awk can be seen as an intermediate step between grep and sed on one side and more sophisticated approaches on the other.

The Enlightened Ones say that...

You should never use C if you can do it with a script;
You should never use a script if you can do it with awk;
Never use awk if you can do it with sed;
Never use sed if you can do it with grep.

Text source

This is best understood if we start with grep and work our way up. We will use these tools on a complex file we have been given, animal_observations.txt.

This file came to be when a park ranger named Parker asked rangers at other parks to make monthly observations of the animals they saw that day. All of the other rangers sent Parker comma separated lists and he collated them into the following file:

Date	Yellowstone	Yosemite	Acadia	Glacier
1/15/01	bison,elk,coyote	mountainlion,coyote	seal,beaver,bobcat	couger,grizzlybear,elk
2/15/01	pronghorn	blackbear,deer	moose,hare	otter,deer,mountainlion
3/15/01	cougar,grizzlybear	fox,coyote,deer	deer,skunk	beaver,elk,lynx
4/15/01	moose,bison	bobcat,coyote	blackbear,deer	mink,wolf
5/15/01	coyote,deer	blackbear,marmot	otter,fox	deer,blackbear
6/15/01	pronghorn	coyote,deer	mink,deer	bighornsheep,deer,otter
7/15/01	cougar,grizzlybear	fox,coyote,deer	seal,porpoise,deer	beaver,otter
8/15/01	moose,bison	bobcat,coyote	hare,fox	lynx,coyote
9/15/01	blackbear,lynx,coyote	coyote,deer	seal,porpoise,deer	elk,deer
10/15/01	beaver,bison,wolf	marmot,coyote	coyote,seal,skunk	mink,wolf
11/15/01	bison,elk,coyote	marmot,fox	deer,skunk	moose,blackbear
12/15/01	crane,beaver,blackbear	mountainlion,coyote	mink,deer	bighornsheep,beaver
1/15/02	moose,bison	coyote,deer	coyote,seal,skunk	couger,grizzlybear,elk
2/15/02	cougar,grizzlybear	marmot,fox	otter,fox	mountaingoat,deer,elk
3/15/02	beaver,bison,wolf	blackbear,deer	moose,hare	mountainlion,bighornsheep
4/15/02	pronghorn	fox,coyote,deer	deer,skunk	couger,grizzlybear,elk
5/15/02	coyote,deer	blackbear,marmot	hare,fox	mink,wolf
6/15/02	crane,beaver,blackbear	bobcat,coyote	seal,porpoise,deer	elk,deer
7/15/02	bison,elk,coyote	marmot,fox	coyote,seal,skunk	couger,grizzlybear,elk
8/15/02	cougar,grizzlybear	blackbear,marmot	blackbear,deer	mountaingoat,deer,elk
9/15/02	moose,bison	coyote,deer	hare,fox	elk,deer
10/15/02	beaver,bison,wolf	mountainlion,coyote	deer,skunk	bighornsheep,beaver
11/15/02	moose,bison	blackbear,marmot	mink,deer	couger,grizzlybear,elk
12/15/02	coyote,deer	fox,coyote,deer	moose,hare	moose,blackbear

We see the date of observation and then the animals observed at each of the 4 parks. Each column is separated by a tab. You can find Parker’s file in your advanced_shell directory; it is called animal_observations.txt.

So let’s say that we want to know how many dates a cougar was observed at any of the parks. We can easily use grep for that:

grep "cougar" animal_observations.txt

When we do that 4 lines pop up, so 4 dates. We could also pipe this output to wc -l to get a count:

grep "cougar" animal_observations.txt | wc -l

There seemed to be more instances of cougar though. Four seems low compared to what we saw when glancing at the document. If we look at the document again, we can see that the park ranger from Glacier National Park cannot spell and put “couger” instead of “cougar”. Come on man!

Replacing those is not something grep can do, but we can use sed instead!

sed 's/couger/cougar/g' animal_observations.txt > animal_observations_edited.txt

We are telling sed to replace all versions of “couger” with “cougar” and output the results to a new file called animal_observations_edited.txt. If we re-run our grep command:

grep "cougar" animal_observations_edited.txt

We can see that we now have 9 lines (dates) instead of 4.

So far, so good. But let’s now say that we want to know how many times a coyote was observed at Yosemite Park (ignoring all other parks) without editing our file…

While this is possible with grep it is actually easier to do with awk!

Basics of awk

Before we dive too deeply into awk we need to define two terms that awk will use a lot: a field is a column of data, and a record is a row of data.

For our first awk command let’s mimic what we just did with grep. To pull all instances of coyote from animal_observations_edited.txt using awk:

awk '/coyote/' animal_observations_edited.txt

Here '/coyote/' is the pattern we want to match, and since we have not told awk anything else, it performs its default behavior, which is to print the matched lines.

But we only care about coyotes from Yosemite Park! How do we do that?

awk '$3 ~ /coyote/' animal_observations_edited.txt

Let’s break this down!

When we run this command, the output is messy: the default behavior of awk is to print each matching line in full, and the lines in Parker’s file are long. It is hard to even check whether the command did the right thing. However, we can ask awk to print only the date and the Yosemite column (columns 1 and 3):

awk '$3 ~ /coyote/ {print $1,$3}' animal_observations_edited.txt

This shows a great feature of awk: pairing a pattern with an action. The print command within the {} will ONLY be executed for lines where the pattern in front of it matches.

We now know basic awk syntax:

awk ' /pattern/ {action} ' file1 file2 ... fileN
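
As a minimal sketch of how the pattern and action parts combine, the snippet below runs three forms of the command on a made-up, tab-separated sample. The sample rows and the shell variable names are illustrative assumptions, not Parker’s file.

```shell
# Hypothetical three-record sample: an id, then two tab-separated columns.
sample=$(printf 'd1\tfox,coyote\tdeer\nd2\tmarmot\tcoyote\nd3\tcoyote,deer\tseal\n')

# Pattern only: print every record that contains "coyote" anywhere.
pattern_only=$(printf '%s\n' "$sample" | awk '/coyote/')

# Pattern on one field: print records whose 2nd field contains "coyote".
field_pattern=$(printf '%s\n' "$sample" | awk '$2 ~ /coyote/')

# Pattern plus action: for those records, print only fields 1 and 2.
pattern_action=$(printf '%s\n' "$sample" | awk '$2 ~ /coyote/ {print $1,$2}')

printf '%s\n' "$pattern_action"
```

The first form keeps all three records, while the other two keep only d1 and d3, because d2 has coyote in its third field rather than its second.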

A few things to note before you try it yourself!


Exercise

Can you print all of the times a seal was observed in Acadia Park? Did you print it the messy or neat way?

Answer:
Messy way:
awk '$4 ~ /seal/' animal_observations_edited.txt
Neat way:
awk '$4 ~ /seal/ {print $1,$4}' animal_observations_edited.txt

Were seals ever observed in any of the other parks? Note that || functions as “or” in awk.

Answer:
Some options:
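
One possible option, sketched on three rows copied from animal_observations_edited.txt (the `sample` variable is only a stand-in for the file): test the three other park columns with ||, and compare against the Acadia column. If the first command prints nothing, no seal was recorded outside Acadia in these rows.

```shell
# Three records copied from animal_observations_edited.txt (tab-separated).
sample=$(printf '1/15/01\tbison,elk,coyote\tmountainlion,coyote\tseal,beaver,bobcat\tcougar,grizzlybear,elk\n7/15/01\tcougar,grizzlybear\tfox,coyote,deer\tseal,porpoise,deer\tbeaver,otter\n8/15/01\tmoose,bison\tbobcat,coyote\thare,fox\tlynx,coyote\n')

# Seal in Yellowstone ($2) OR Yosemite ($3) OR Glacier ($5)?
other_parks=$(printf '%s\n' "$sample" | awk '$2 ~ /seal/ || $3 ~ /seal/ || $5 ~ /seal/ {print $1}')

# Seal in Acadia ($4), for comparison.
acadia=$(printf '%s\n' "$sample" | awk '$4 ~ /seal/ {print $1}')

echo "other parks: ${other_parks:-none}"
echo "Acadia dates:"
printf '%s\n' "$acadia"
```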
Either way, seals are only ever observed in Acadia.

Before we move on, it is sometimes helpful to know that regular text can be added to awk print commands. For example we can modify our earlier command to be:

awk '$3 ~ /coyote/ {print "On this date,", $1 ", coyotes were observed in Yosemite Park"}' animal_observations_edited.txt

awk predefined variables

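Before we continue our awk journey we want to introduce you to some of the awk predefined variables. Although there are more than just the ones we cover, these are the most helpful to start: NR (the number of the current record), NF (the number of fields in the current record), and FS (the field separator). More can be found in the awk documentation.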

NR is particularly useful for skipping records (i.e., rows). For example, if we only care about coyotes observed in 2002 and not 2001, we can skip records 1-13 of animal_observations_edited.txt (the header plus the twelve 2001 rows).

awk 'NR>13 && $3 ~ /coyote/ {print $1,$3}' animal_observations_edited.txt

Because we have given two patterns to match (record greater than 13 and column 3 containing the string coyote) we need to put && in between them to note that we need both fulfilled. If we wanted either of the two patterns to match (i.e. record is greater than 13 OR the string coyote is present in field 3) we could use || to signify “or”, as we did above.

You have probably already noticed that Parker’s file contains both comma-separated and tab-separated values. This is no problem for awk if we set the FS variable, which is what the -F option does. Let’s use both FS and NF to print the total number of kinds of animals observed across all the parks on each date. Note that we will not delete duplicates (i.e., if coyotes are observed in both Yosemite and Acadia we will consider it to be 2 instead of 1).

awk -F '[[:blank:],]' '{print NF}' animal_observations_edited.txt

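This is more complex than anything else we have done so let’s break it down: -F sets the field separator (FS), here to the bracket expression [[:blank:],], which matches a blank (space or tab) or a comma, and {print NF} prints the number of fields in each record. There are two issues with the output. The first is that the date is counted as a field, so we subtract 1 from NF: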

awk -F '[[:blank:],]' '{print NF-1}' animal_observations_edited.txt
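
To see what the separator is doing, here is a small sketch on one record copied from the file. It prints NF with the default separator and then with the combined blank-or-comma separator. The `row` variable is only a stand-in for a single line of the file.

```shell
# One record from animal_observations_edited.txt (tab-separated columns,
# comma-separated animals within each column).
row=$(printf '2/15/01\tpronghorn\tblackbear,deer\tmoose,hare\totter,deer,mountainlion\n')

# Default separator (whitespace): the date plus 4 park columns.
default_nf=$(printf '%s\n' "$row" | awk '{print NF}')

# Blank-or-comma separator: the date plus every individual animal.
split_nf=$(printf '%s\n' "$row" | awk -F '[[:blank:],]' '{print NF}')

# Subtract 1 so the date is not counted as an animal.
animals=$(printf '%s\n' "$row" | awk -F '[[:blank:],]' '{print NF-1}')

echo "$default_nf $split_nf $animals"   # prints: 5 9 8
```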

Exercise

The second issue is that we don’t want to include the first record (row) as this is our header and not representative of any animals. How would you modify the command to skip the first record?

Answer:
awk -F '[[:blank:],]' 'NR>1 {print NF-1}' animal_observations_edited.txt

Piping different separators

We can do more advanced commands with our separators by piping awk commands. For example, we can pull lines where coyote is the SECOND animal listed for Yosemite park.

Before we do that let’s take a step back. You may be wondering why on earth we need this kind of command. While something like this may not be particularly useful for Parker’s data, this kind of command is key for looking at some complex NGS files!

For example take a look at this GFF3 file

chr3	ENSEMBL	five_prime_UTR	50252100	50252137	.	+	.	ID=UTR5:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=2;exon_id=ENSE00003567505.1;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
chr3	ENSEMBL	three_prime_UTR	50257691	50257714	.	+	.	ID=UTR3:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=8;exon_id=ENSE00003524043.1;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
chr3	ENSEMBL	three_prime_UTR	50258368	50259339	.	+	.	ID=UTR3:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=9;exon_id=ENSE00001349779.3;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
chr3	ENSEMBL	gene	50227436	50227490	.	+	.	ID=ENSG00000275334.1;gene_id=ENSG00000275334.1;gene_type=miRNA;gene_name=MIR5787;level=3;hgnc_id=HGNC:49930
chr3	ENSEMBL	gene	52560570	52560707	.	+	.	ID=ENSG00000221518.1;gene_id=ENSG00000221518.1;gene_type=snRNA;gene_name=RNU6ATAC16P;level=3;hgnc_id=HGNC:46915
chr3	ENSEMBL	transcript	52560570	52560707	.	+	.	ID=ENST00000408591.1;Parent=ENSG00000221518.1;gene_id=ENSG00000221518.1;transcript_id=ENST00000408591.1;gene_type=snRNA;gene_name=RNU6ATAC16P;transcript_type=snRNA;transcript_name=RNU6ATAC16P-201;level=3;transcript_support_level=NA;hgnc_id=HGNC:46915;tag=basic,Ensembl_canonical

We can see that all columns are tab-delimited but column 9 has a bunch of ; separated items. This type of command would be useful for something like pulling out all lines where gene_type is snRNA. In fact, all of the commands we are teaching today are useful on one or another NGS-related document (VCF, GFF3, GTF, BED, etc). We are using Parker’s data instead because we can use ALL of these types of commands on his dataset.

Returning to our original task, pulling lines where coyote is the SECOND animal listed for Yosemite park. We can do it like this:

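awk '{ print $3 }' animal_observations_edited.txt | awk -F "," '$2 == "coyote"'

Let’s break this command up: the first awk command prints only the Yosemite column ($3), and the second splits that column on commas (-F ",") and keeps the entries whose second item is coyote.

You might have noticed that here we used == "coyote" instead of ~ /coyote/. This is because we want the entire field to be exactly coyote (== "coyote") rather than merely containing the string coyote (~ /coyote/). Note that $2 ~ "coyote" would not do this: awk treats a quoted string after ~ as a regular expression, so it behaves just like $2 ~ /coyote/.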
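
As a small sketch of the difference between comparing and matching, the snippet below applies both kinds of test to a made-up list. The entries in `list` are illustrative, and `coyotepup` is invented purely to force a partial match. The == operator compares the whole field, while ~ succeeds if the pattern matches anywhere in the field, whether it is written as /coyote/ or as "coyote".

```shell
# Hypothetical comma-separated entries; "coyotepup" is invented for the demo.
list=$(printf 'fox,coyote,deer\nfox,coyotepup\nbobcat,coyote\n')

# Exact comparison: the 2nd item must be exactly "coyote".
exact=$(printf '%s\n' "$list" | awk -F "," '$2 == "coyote"')

# Regex match: the 2nd item only needs to contain "coyote".
regex=$(printf '%s\n' "$list" | awk -F "," '$2 ~ /coyote/')

# A quoted string after ~ is still treated as a regular expression.
quoted=$(printf '%s\n' "$list" | awk -F "," '$2 ~ "coyote"')

printf 'exact:\n%s\nregex:\n%s\n' "$exact" "$regex"
```

On Parker’s file the two tests happen to give the same result, because no second animal merely contains coyote as part of a longer name.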


Exercise

What command would you give to print all of the observation dates that took place in May?

Answer:
awk '{ print $1 }' animal_observations_edited.txt | awk -F "/" '$1 ~ "5"'

Counting

One of the best features of awk is that it can count up how many times a string occurs in a column. Let’s use this to see how many times each set of animal observations occurs in Yellowstone park.

awk ' { counter[$2] += 1 } END { for (animalgroup in counter){ print animalgroup, counter[animalgroup] } }' animal_observations_edited.txt

This command is complex and contains new syntax, so let’s go through it bit by bit:
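
As a sketch of the pieces, here is the same command written across several lines with comments, run on the date and Yellowstone entries of the first six records from the table above. The `sort` at the end is an addition: awk does not guarantee the order in which for (key in array) visits the keys, so sorting makes the output reproducible.

```shell
# Date and Yellowstone column for the first six records of Parker's file.
sample=$(printf '1/15/01\tbison,elk,coyote\n2/15/01\tpronghorn\n3/15/01\tcougar,grizzlybear\n4/15/01\tmoose,bison\n5/15/01\tcoyote,deer\n6/15/01\tpronghorn\n')

counts=$(printf '%s\n' "$sample" | awk '
    # Runs once per record: use field 2 as the key of the array "counter"
    # and add 1 to its value (unset entries start at 0).
    { counter[$2] += 1 }

    # Runs once after the last record: loop over every key and print
    # the key followed by its count.
    END {
        for (animalgroup in counter) {
            print animalgroup, counter[animalgroup]
        }
    }
' | sort)

printf '%s\n' "$counts"
```

In this six-record sample, pronghorn is the only entry that appears twice.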

Now that we understand our command, let’s run it!

It works! We can see that “moose,bison” is the most commonly observed group of animals at Yellowstone! How Thrilling!

Exercise

  1. What was the most commonly observed group of animals at Glacier National Park?
Answer:
awk ' { counter[$5] += 1 } END { for (animalgroup in counter){ print animalgroup, counter[animalgroup] } }' animal_observations_edited.txt

cougar,grizzlybear,elk is the most commonly observed group!
  2. Our code also counts the header text (Yellowstone or Glacier) as if it were an observation. How can you modify the code so that the header is ignored?
Answer:
For Yellowstone:
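
One possible answer, as a sketch: put NR>1 in front of the counting block so that the header record is never counted. It is shown here for column 2 on a three-line stand-in for the file. Change the column number to apply it to another park.

```shell
# Header plus two records (date and column 2 only), standing in for the file.
sample=$(printf 'Date\tYellowstone\n1/15/01\tbison,elk,coyote\n2/15/01\tpronghorn\n')

# NR>1 restricts the counting block to records after the header.
counts=$(printf '%s\n' "$sample" | awk 'NR>1 { counter[$2] += 1 } END { for (animalgroup in counter){ print animalgroup, counter[animalgroup] } }' | sort)

printf '%s\n' "$counts"
```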

Bioinformatic Application

Counting can be a great way to summarize different annotation files (GFF3, GTF, etc). This is especially true when working with new files that have been generated by other people. Here is the GFF3 file we showed above but slightly edited.

chr3	entrez	five_prime_UTR	50252100	50252137	.	+	.	ID=UTR5:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=2;exon_id=ENSE00003567505.1;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
chr3	ENSEMBL	three_prime_UTR	50257691	50257714	.	+	.	ID=UTR3:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=8;exon_id=ENSE00003524043.1;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
chr3	entrez	three_prime_UTR	50258368	50259339	.	+	.	ID=UTR3:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2;transcript_type=protein_coding;transcript_name=GNAI2-201;exon_number=9;exon_id=ENSE00001349779.3;level=3;protein_id=ENSP00000266027.6;transcript_support_level=2;hgnc_id=HGNC:4385;tag=basic,CCDS;ccdsid=CCDS63644.1;havana_gene=OTTHUMG00000156940.2
chr3	ENSEMBL	gene	50227436	50227490	.	+	.	ID=ENSG00000275334.1;gene_id=ENSG00000275334.1;gene_type=miRNA;gene_name=MIR5787;level=3;hgnc_id=HGNC:49930
chr3	entrez	gene	52560570	52560707	.	+	.	ID=ENSG00000221518.1;gene_id=ENSG00000221518.1;gene_type=snRNA;gene_name=RNU6ATAC16P;level=3;hgnc_id=HGNC:46915
chr3	ENSEMBL	transcript	52560570	52560707	.	+	.	ID=ENST00000408591.1;Parent=ENSG00000221518.1;gene_id=ENSG00000221518.1;transcript_id=ENST00000408591.1;gene_type=snRNA;gene_name=RNU6ATAC16P;transcript_type=snRNA;transcript_name=RNU6ATAC16P-201;level=3;transcript_support_level=NA;hgnc_id=HGNC:46915;tag=basic,Ensembl_canonical

The second column tells us where the annotation comes from and the third column tells us what kind of feature it is. Both of these columns can be useful to summarize when you are starting to work with a new GFF3 file.

# DO NOT RUN THIS CODE
awk ' { counter[$2] += 1 } END { for (source in counter){ print source, counter[source] } }' my_gtf.gtf
# DO NOT RUN THIS CODE
awk ' { counter[$3] += 1 } END { for (feature in counter){ print feature, counter[feature] } }' my_gtf.gtf

Exercise

How might you edit the above commands to count the number of each gene_type? Hint: We already know you can pipe multiple awk commands in shell to get to what you want (see above). Reminder that when you pipe, the file name needs to go with the first part of the pipe!

You can test your code out with the file hg38_subset.gff in the advanced_shell folder.

Answer:
awk '{print $9}' hg38_subset.gff | awk -F ";" '{print $5}' | awk -F "=" ' { counter[$2] += 1 } END { for (type in counter){ print type, counter[type] } }'
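
Note that this answer relies on gene_type being the fifth ;-separated item in column 9. That holds for the UTR and transcript lines shown earlier, but in the gene lines it is the third item. As a hedged alternative sketch, the command below searches column 9 for the gene_type= key wherever it sits. The sample rows are abbreviated stand-ins for the GFF3 lines above (attribute lists shortened), not the contents of hg38_subset.gff.

```shell
# Abbreviated GFF3-like records: 9 tab-separated columns, shortened attributes.
gff=$(printf 'chr3\tENSEMBL\tfive_prime_UTR\t50252100\t50252137\t.\t+\t.\tID=UTR5:ENST00000266027.9;Parent=ENST00000266027.9;gene_id=ENSG00000114353.17;transcript_id=ENST00000266027.9;gene_type=protein_coding;gene_name=GNAI2\nchr3\tENSEMBL\tgene\t50227436\t50227490\t.\t+\t.\tID=ENSG00000275334.1;gene_id=ENSG00000275334.1;gene_type=miRNA;gene_name=MIR5787\nchr3\tENSEMBL\tgene\t52560570\t52560707\t.\t+\t.\tID=ENSG00000221518.1;gene_id=ENSG00000221518.1;gene_type=snRNA;gene_name=RNU6ATAC16P\nchr3\tENSEMBL\ttranscript\t52560570\t52560707\t.\t+\t.\tID=ENST00000408591.1;Parent=ENSG00000221518.1;gene_id=ENSG00000221518.1;transcript_id=ENST00000408591.1;gene_type=snRNA;gene_name=RNU6ATAC16P\n')

counts=$(printf '%s\n' "$gff" | awk -F "\t" '
    {
        # Split the attribute column into its ;-separated key=value items.
        n = split($9, attrs, ";")
        for (i = 1; i <= n; i++) {
            # Keep only the item that starts with the key gene_type.
            if (index(attrs[i], "gene_type=") == 1) {
                counter[substr(attrs[i], length("gene_type=") + 1)] += 1
            }
        }
    }
    END { for (type in counter) print type, counter[type] }
' | sort)

printf '%s\n' "$counts"
```

This does the work in a single awk call using split(), index(), and substr(), at the cost of being longer than the piped version.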

Parsing awk code written by other people

We have gone through some simple examples here, but there will likely come a time where you end up searching the web for a more complex application of awk. Let’s take a look at some code and see if we can tell what it does.

### DO NOT RUN ###
awk 'NR>=20&&NR<=80' input.txt
### DO NOT RUN ###
awk 'NR > 1 && NF == 4' data.txt
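
As a quick way to convince yourself what these two commands do, the sketch below runs them on generated stand-in input rather than on input.txt and data.txt, which are placeholders. The first keeps records 20 through 80, and the second keeps records after the header that have exactly four fields.

```shell
# 100 numbered lines stand in for input.txt.
range_count=$(seq 1 100 | awk 'NR>=20&&NR<=80' | wc -l | tr -d ' ')

# A header plus three records with 4, 3 and 4 whitespace-separated fields
# stand in for data.txt.
four_fields=$(printf 'a b c d\n1 2 3 4\n5 6 7\n8 9 10 11\n' | awk 'NR > 1 && NF == 4')

echo "$range_count"          # 61 lines: records 20 to 80 inclusive
printf '%s\n' "$four_fields"
```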

Take a look at test.vcf to see if you can understand this one!

### DO NOT RUN ###
awk '$1 == "chr5" && $7 == "PASS" { print }' data.vcf

A super useful awk one liner you have seen before!

If you came to the Accelerate with Automation module you have already seen this code! This is an incredibly useful awk command to keep in your back pocket.

### DO NOT RUN ###

for ((i=1; i<=10; i+=1))
do
    sam=$(awk -v awkvar="${i}" 'NR==awkvar' samples.txt)
    samtools view -S -b "${sam}.sam" > "${sam}.bam"
done

This actually combines a number of basic and intermediate shell topics such as variables, for loops, and awk!

With our new awk expertise let’s take a look at that awk command alone!

### DO NOT RUN ###
awk -v awkvar="${i}" 'NR==awkvar' samples.txt

We have not encountered -v yet. The correct syntax is -v var=val which assigns the value val to the variable var, before execution of the program begins. So what we are doing is creating our own variable within our awk program, calling it awkvar and assigning it the value of ${i} which will be a number between 1 and 10 (see for loop above). ${i} and thus awkvar will be different for each loop.

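Then NR==awkvar acts as the pattern: it is true only for the record whose number, held in the predefined variable NR (the number of records read so far, i.e. the line number), equals awkvar, which in turn equals ${i}. Because no action is given, awk prints that one line.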

Here is what samples.txt looks like

DMSO_control_day1_rep1
DMSO_control_day1_rep2
DMSO_control_day2_rep1
DMSO_control_day2_rep2
DMSO_KO_day1_rep1
DMSO_KO_day1_rep2
.......
Drug_KO_day2_rep1
Drug_KO_day2_rep2

When ${i} is equal to 3 what will our awk command spit out? Why?
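
One way to check your answer, as a sketch: recreate the first six sample names shown above (standing in for samples.txt) and run the command with i set to 3.

```shell
# The first six lines of samples.txt as shown above.
samples=$(printf 'DMSO_control_day1_rep1\nDMSO_control_day1_rep2\nDMSO_control_day2_rep1\nDMSO_control_day2_rep2\nDMSO_KO_day1_rep1\nDMSO_KO_day1_rep2\n')

i=3
# -v copies the shell value of i into the awk variable awkvar;
# NR==awkvar is true only for the record whose number is 3.
sam=$(printf '%s\n' "$samples" | awk -v awkvar="${i}" 'NR==awkvar')

echo "$sam"   # DMSO_control_day2_rep1
```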

With our new expertise, we can not only write our own awk commands but we can understand commands that others have written. Go forth and awk!

Additional cool awk commands

For these commands we will return to ecosystems.txt

BEGIN

The BEGIN command will execute an awk action once, before any input is read. This can be particularly useful if you want to give a header to output that does not already have one.

awk 'BEGIN {print "new_header"} NR>1 {print $1}' ecosystems.txt

In this case we have told awk that we want to have new_header printed before anything, then NR>1 is telling awk to skip the old header and finally we are printing the first column of ecosystems.txt with {print $1}.
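
A minimal sketch of this on made-up data, since the contents of ecosystems.txt are not shown here. The column names and values below are assumptions for illustration only.

```shell
# Hypothetical stand-in for ecosystems.txt: a header and two records.
eco=$(printf 'ecosystem\theight(cm)\nforest\t180\ntundra\t12\n')

# BEGIN runs before any input is read; NR>1 then skips the old header.
out=$(printf '%s\n' "$eco" | awk 'BEGIN {print "new_header"} NR>1 {print $1}')

printf '%s\n' "$out"
```

The old header line (ecosystem) is dropped and new_header takes its place at the top of the output.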

END

We already had some experience with END above. As the counterpart to the BEGIN command, the END command tells awk to run an action once, after the last record has been read. We will first demonstrate how it works by adding a new record:

awk '{print $1} END {print "new_record"}' ecosystems.txt

As you can see, this has simply added a new record to the end of the output. Furthermore, you can chain multiple END commands together to continuously add records if you wish, like:

awk '{print $1} END {print "new_record"} END {print "newer_record"}' ecosystems.txt

This is equivalent to separating your print commands with a ;:

awk '{print $1} END {print "new_record"; print "newer_record"}' ecosystems.txt
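
To check that the two forms really are equivalent, here is a small sketch that runs both on made-up data and compares the results. The contents of `eco` are assumed for illustration and are not the real ecosystems.txt.

```shell
# Hypothetical stand-in for ecosystems.txt: a header and two records.
eco=$(printf 'ecosystem\theight(cm)\nforest\t180\ntundra\t12\n')

# Two separate END blocks: awk runs them in the order they are written.
two_blocks=$(printf '%s\n' "$eco" | awk '{print $1} END {print "new_record"} END {print "newer_record"}')

# One END block with the two print statements separated by a semicolon.
one_block=$(printf '%s\n' "$eco" | awk '{print $1} END {print "new_record"; print "newer_record"}')

[ "$two_blocks" = "$one_block" ] && echo "same output"
printf '%s\n' "$one_block"
```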

if statements

Since awk is its own fully-fledged programming language, it also has conditional statements. A common time you might want to use an if statement in awk is when you have a file with tens or even hundreds of fields and you want to figure out which field has the column header of interest. Another is when you are writing a script for broad use, where the order of the input columns may not always be the same, and you need to find which column has a certain header. To do that:

awk 'NR==1 {for (i=1; i<=NF; i=i+1) {if ($i == "height(cm)")  print i}}' ecosystems.txt

We can break this code down a bit:
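
As a sketch of that breakdown, here is the command spread over several commented lines and run on a made-up header. The column names in `eco` are assumptions, since ecosystems.txt is not shown here, and -F "\t" is added so that only tabs separate fields.

```shell
# Hypothetical stand-in for ecosystems.txt; only the header matters here.
eco=$(printf 'ecosystem\tregion\theight(cm)\nforest\tnorth\t180\n')

col=$(printf '%s\n' "$eco" | awk -F "\t" '
    # NR==1 (two equals signs) restricts the block to the header record.
    NR==1 {
        # Walk over every field of the header, from 1 up to NF.
        for (i = 1; i <= NF; i = i + 1) {
            # Print the field number when the header text matches exactly.
            if ($i == "height(cm)") print i
        }
    }
')

echo "$col"   # 3
```

Note the comparison NR==1: a single = would assign 1 to NR on every record instead of testing it.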

Additional Resources


This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.