Monday, October 7, 2013

Using awk: bitwise operations and string manipulation

I learned a few new things trying to reformat a sam file with awk. I also learned that what I was attempting to do wouldn't ultimately solve my problem, but I wanted to keep what I learned about bitwise operations and other string manipulations in awk here.

To test for the presence of a certain bit, I had to use GNU awk (gawk) since the awk on my system didn't have these functions, and I used the and() function (see here for bitwise operation documentation; see here and here for additional sam bit flag information).

I used substring() for string manipulation and length() to access elements from the end of the string (credit goes to here and here).

I used printf when I wanted to avoid generating a newline character (credit goes to here).

The original goal was to trim restriction sites from RAD reads and quality scores, but this doesn't account for the CIGAR string or the alignment coordinate so it ultimately doesn't work. Here is the gawk script in its entirety (with some comments):

#!/bin/bash
#Skip the header lines
gawk '(substr($0, 1, 1) != "@") {
        #Test if the bit flag has the 4th bit set (to indicate a read on the reverse strand using the and() function.  
        if (and($2,0x10)) {
                #Don't adjust the first 9 columns.
                for (i=1; i<=9; i++) {printf "%s\t" , $i}
                #Trim 5 bases from the end of the string and the quality score
                printf "%s\t", substr($10, 1, length($10)-5)
                printf "%s\t", substr($11, 1, length($11)-5)
                #Print any remaining columns; NF is a special variable containing the number of fields. 
                for (i=12; i<NF; i++) {printf "%s\t", $i}
                printf "%s\n", $NF
        }
        #Else it is not a reverse strand read; trim 5 bases from the start.
        else {
                for (i=1; i<=9; i++) {printf "%s\t" , $i}
                printf "%s\t", substr($10, 6)
                printf "%s\t", substr($11, 6) 
                for (i=12; i<NF; i++) {printf "%s\t", $i}
                printf "%s\n", $NF
        }
}
#Include the header lines.
(substr($0, 1, 1) == "@") {print $0}
' input.sam

No comments:

Post a Comment