Wednesday, October 9, 2013

Using awk: removing restriction site sequence from reads in fastq format

I had a fastq file in which the sequence reads contained the restriction enzyme site from RAD-seq reads. In order to use them with reads that had the site trimmed off already, I needed to get rid of the RE site.

The fastq file was of the format:
@1_1101_11092_1965_1
CCGGCACAGGCCTTGGAGGCGCTGGAGGCACACGANGTGGCACGTTGATCTGTGTGTGAGGCGCCGCGGGTTTGTGAATGTGGACCGGACTGCTTCTGCTGCT
+
HIHHIIHGGH;FHIIIIGIGIIIIIBGHEIHHHED#,,;?CC@BBBBBCCECCC:?5?>CBBB@B9BBBB08?@??3:@@:CC>@?BB39<@C@CAC::>@AC
@1_1101_11396_1963_1
CCGGCGACTCATAGGCAGTGGCTTGGTTAAGGGAANGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCGATTGTCACTGCTTATGGACCCGAA
+
HJJIJJJJJJJJJJJJJJGIJJJJJJGIIJEIIHH#-;BEFDDDDDDDDDDDDDDDBDDDDDD<BB>CDEDEEDDDDDD@DBDCDEDDDDDDDDCDCDDDDBD

I used the substring() function of awk to get rid of the first 5 characters (corresponding to the RE site) in each sequence or quality line as so:
#!/bin/bash
awk '(substr($0, 1, 1) != "@") && ($0 != "+") {print substr($0, 6)}
     (substr($0, 1, 1) == "@") || ($0 == "+") {print $0}' input.fastq > output.fastq

Output sans RE site:
@1_1101_11092_1965_1
ACAGGCCTTGGAGGCGCTGGAGGCACACGANGTGGCACGTTGATCTGTGTGTGAGGCGCCGCGGGTTTGTGAATGTGGACCGGACTGCTTCTGCTGCT
+
IHGGH;FHIIIIGIGIIIIIBGHEIHHHED#,,;?CC@BBBBBCCECCC:?5?>CBBB@B9BBBB08?@??3:@@:CC>@?BB39<@C@CAC::>@AC
@1_1101_11396_1963_1
GACTCATAGGCAGTGGCTTGGTTAAGGGAANGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCGATTGTCACTGCTTATGGACCCGAA
+
JJJJJJJJJJJJJGIJJJJJJGIIJEIIHH#-;BEFDDDDDDDDDDDDDDDBDDDDDD<BB>CDEDEEDDDDDD@DBDCDEDDDDDDDDCDCDDDDBD

No comments:

Post a Comment