Wednesday, August 21, 2013

awk print every other line (or every Nth line) in fasta file

This awk one-liner isn't very general, but it does one job: it pulls every other sequence record out of a .fasta file. To take every Nth record instead, change the divisor in the modulo test (i%2 becomes i%N). It only works on .fasta files in which each sequence sits on a single line rather than being wrapped across several.

Here it is in its one-liner form:
awk 'BEGIN{i=0} (substr($0,1,1) == ">") { if (i%2 == 0) {print $0; getline; print $0} i++}' test.fa
And it makes a bit more sense when formatted:
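awk 'BEGIN{i=0} (substr($0,1,1) == ">") {   # runs only on header lines
 if (i%2 == 0) {   # keep records 1, 3, 5, ...; use i%N for every Nth
  print $0         # header
  getline          # advance to the sequence line
  print $0         # sequence
 }
 i++               # count every header seen
}' test.fa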
This assumes the .fasta file is of the format:
>SequenceID1
ATGACTA
>SequenceID2
AGGCATG
and the sequence string is contained entirely on one line.
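
If the sequences are wrapped over several lines, the getline approach above only grabs the first line of each sequence. Here is a rough sketch of one way around that: set a flag on each header line and print every line while the flag is on. The function name every_nth_record and the variables n, count and keep are my own for illustration, and it assumes N is at least 1 (N=0 would be a division by zero).

```shell
# Hypothetical helper: print records 1, N+1, 2N+1, ... from a FASTA file,
# including records whose sequence spans several lines.
every_nth_record() {
  # $1 = N (step, assumed >= 1), $2 = path to the FASTA file
  awk -v n="$1" '
    /^>/ { keep = (count % n == 0); count++ }  # header: decide whether to keep this record
    keep                                       # print header and sequence lines of kept records
  ' "$2"
}
```

Called as every_nth_record 2 test.fa it should give the same output as the one-liner on unwrapped files, and every_nth_record 3 test.fa takes every third record.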
