Using awk: removing restriction site sequence from reads in fastq format
I had a fastq file in which the sequence reads contained the restriction enzyme site from RAD-seq reads. In order to use them with reads that had the site trimmed off already, I needed to get rid of the RE site.
The fastq file was of the format:
I used the substring() function of awk to get rid of the first 5 characters (corresponding to the RE site) in each sequence or quality line as so:
Output sans RE site:
The fastq file was of the format:
@1_1101_11092_1965_1 CCGGCACAGGCCTTGGAGGCGCTGGAGGCACACGANGTGGCACGTTGATCTGTGTGTGAGGCGCCGCGGGTTTGTGAATGTGGACCGGACTGCTTCTGCTGCT + HIHHIIHGGH;FHIIIIGIGIIIIIBGHEIHHHED#,,;?CC@BBBBBCCECCC:?5?>CBBB@B9BBBB08?@??3:@@:CC>@?BB39<@C@CAC::>@AC @1_1101_11396_1963_1 CCGGCGACTCATAGGCAGTGGCTTGGTTAAGGGAANGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCGATTGTCACTGCTTATGGACCCGAA + HJJIJJJJJJJJJJJJJJGIJJJJJJGIIJEIIHH#-;BEFDDDDDDDDDDDDDDDBDDDDDD<BB>CDEDEEDDDDDD@DBDCDEDDDDDDDDCDCDDDDBD
I used the substring() function of awk to get rid of the first 5 characters (corresponding to the RE site) in each sequence or quality line as so:
#!/bin/bash awk '(substr($0, 1, 1) != "@") && ($0 != "+") {print substr($0, 6)} (substr($0, 1, 1) == "@") || ($0 == "+") {print $0}' input.fastq > output.fastq
Output sans RE site:
@1_1101_11092_1965_1 ACAGGCCTTGGAGGCGCTGGAGGCACACGANGTGGCACGTTGATCTGTGTGTGAGGCGCCGCGGGTTTGTGAATGTGGACCGGACTGCTTCTGCTGCT + IHGGH;FHIIIIGIGIIIIIBGHEIHHHED#,,;?CC@BBBBBCCECCC:?5?>CBBB@B9BBBB08?@??3:@@:CC>@?BB39<@C@CAC::>@AC @1_1101_11396_1963_1 GACTCATAGGCAGTGGCTTGGTTAAGGGAANGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCGATTGTCACTGCTTATGGACCCGAA + JJJJJJJJJJJJJGIJJJJJJGIIJEIIHH#-;BEFDDDDDDDDDDDDDDDBDDDDDD<BB>CDEDEEDDDDDD@DBDCDEDDDDDDDDCDCDDDDBD
Comments
Post a Comment