Wednesday, July 10, 2013

Count number of reads in a SAM file above or below a mapping quality score with awk

Another newbie use of awk. This is a one-line program that counts the reads in a SAM file that pass a mapping quality score threshold (mapping quality is column 5). It can also easily be modified to count lines on some other condition.

In this example, we test whether the value of column 5 in each row is greater than or equal to 20 (a 99% probability that the read was mapped correctly) and increment a counter variable for every row that passes. The test can be swapped for any other condition you want to count.
$ awk 'BEGIN{i=0} $5>=20 {i=1+i} END{print i}' inputsam.sam
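One thing to watch for: if the SAM file still has its header, the lines starting with "@" are tested too, and a header line with text in column 5 can pass the comparison because awk falls back to comparing strings. Below is a rough sketch of a variant that skips header lines and reports both the "above" and "below" counts. The function name count_mapq and its argument order are made up for this post, and it assumes a tab-separated SAM file.

```shell
# count_mapq is a hypothetical helper name, not part of awk or samtools.
# Usage: count_mapq THRESHOLD FILE.sam
count_mapq() {
    awk -F'\t' -v min="$1" '
        /^@/ { next }              # skip SAM header lines
        $5 >= min { above++ }      # mapping quality at or above the threshold
        $5 <  min { below++ }      # mapping quality below the threshold
        END { printf "above=%d below=%d\n", above, below }
    ' "$2"
}
```

After defining the function in your shell, call it as: count_mapq 20 inputsam.sam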


awk also has logical operators for "and" (&&) and "or" (||). For example, to count reads with a mapping quality from 20 to 30 inclusive:
$ awk 'BEGIN{i=0} $5>=20 && $5<=30 {i=1+i} END{print i}' inputsam.sam
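
The "or" operator works the same way. As a sketch, this counts reads whose mapping quality is either below 20 or equal to 255, the value the SAM format uses when mapping quality is not available. The function name count_low_or_missing is invented for this post, the threshold of 20 is just the one used above, and header lines are skipped on the assumption that the file may still contain them.

```shell
# count_low_or_missing is a hypothetical helper name used only for this sketch.
# Usage: count_low_or_missing FILE.sam
count_low_or_missing() {
    awk -F'\t' '
        /^@/ { next }                   # skip SAM header lines
        $5 < 20 || $5 == 255 { n++ }    # low quality OR quality not available
        END { print n + 0 }             # n + 0 prints 0 when nothing matched
    ' "$1"
}
```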