Sunday, July 21, 2013

Extracting sequence information from Stacks samples

TL;DR: On the Stacks mailing list, Ryan Waples posted what looks to be a useful bit of Python code for extracting sequence information from Stacks output. With his permission, I've reposted some of the dialogue and Waples' code here.

And for the full post:

An email passed through on the Stacks listserve from user named Scott Hotaling who had requested the following:

"What I'm after is essentially the X.haplotypes.tsv file that populations outputs with the full sequence context instead of only the variable position(s). I understand this information could be determined manually via catalog.snps.tsv file. But, it would be great to have the full sequence information as an output option of populations."

In response, Ryan Waples replied with the following:


You can try to use the attached python script.   It is not a polished release but I think it should be able to get something close to what you want.

It looks in the catalog and individual files and constructs full fasta seqeunces from the individual haplotypes and the catalog's consensus.  I haven't used the data from this myself, so please double check this is doing what you expect.

You can use it with a command line as follows:
python   stacks_path   catalog_name   individual_name  subset_file   FASTA_file

stacks_path  =  path to the stacks data you want to use, must contain the catalog and individual files [/stacks/data/]
catalog name = name of the catalog [batch_3.catalog]
individual name = name of ind [PHOOD_2131]
subset_file  = A list of catalog IDs, one per line, A fasta entry will be generated for each haplotype of the individual at these loci [./subset.txt]
FASTA_file =  the output. [./output.fa]

For 'consensus' loci, just a single sequence will be printed, not two copies.  This script doesn't check the 'model' calls in the tags files of the individuals, and I think it would spit out more than two seqeunces if more than two haplotypes were seen in an individual.

Good luck,


It may be of interest to do something similar with our data in the future, so I wanted to keep the code available. It also employs a few practices that I'd like to learn from. Thanks again to Ryan Waples for allowing it to be reposted here. Here is his code:

#!/usr/bin/env python
# Python 2.7
# v1.1
# 5/21/13
# Ryan Waples

# Command line outline
# > python   stacks_path   catalog_name individual_name  subset_file   FASTA_file

# Usage example
# > python ./ /stacks/data/ batch_3.catalog   PHOOD_2131  ./subset.txt  ./output.fa

# Inputs
 # A set of stacks catalog and individual files (*.tags.tsv, *.snp.tsv, *.alleles.tsv)
  # specified by a [path] + [catalog_name] + [individual_name]
 # A 'whitelist' of tags.
  # specefied by a [filename]
  # tags_subset_file  format - one tag_id per line.
   # 1
   # 43
   # 12395
   # 6549
   # etc..

# Outputs 
 # An FASTA file 
  # For each white listed tag:
   # Each haplotype present in the catalog will generate one FASTA sequence entry.
  # FASTA header: 
   # >[TAG#]_[ALLELE]_[SNP1_pos]-[SNP2_pos]-[SNP3_pos]-[...]

import sys

def stacks_to_fasta_matches(stacks_path, catalog_name, individual_name, subset_file, out_file):
 Takes a whitelist of catalog tags and prints each observed haplotype in the supplied indiviudal
 into a sequence to a FASTA file.
 Function also returns a dictionary of FASTA entries.
 print "\n"
 if not stacks_path.endswith('/'):
  stacks_path = stacks_path + "/"
 print ("\n".join(['stacks_path ->\t' + stacks_path, 
      'catalog_name ->\t' + catalog_name, 
      "individual_name ->\t" + individual_name,
      'subset_file ->\t' + subset_file, 
      'out_file ->\t'+ out_file]))
 print "\n" 
 #used variables
 tags_to_keep  = set()  # strings
 seq_of_tag  = dict() # strings
 snp_pos_of_tag = dict() # lists of ints
 alleles_of_tag = dict() # lists of strings
 ind_ids_of_cat_id = dict() # set of strings
 FASTA_dict  = dict() # strings

 with open(subset_file, "r") as SUBSET:
  for line in SUBSET:
   tag = line.strip()

 with open("{0}{1}{2}".format(stacks_path, catalog_name, ".tags.tsv" ), "r") as TAGS:
  for line in TAGS:
   columns = line.strip().split("\t")
   id = columns[2] 
   seq = columns[9]
   seq_type = columns[6]
   if (id in tags_to_keep and seq_type == 'consensus'):
    seq_of_tag[id] = seq

 with open("{0}{1}{2}".format(stacks_path, catalog_name, ".snps.tsv" ), "r") as SNPS: 
  for line in SNPS:
   columns = line.strip().split("\t")
   id = columns[2]
   pos = columns[3]
   if id in tags_to_keep and id in seq_of_tag:
    snp_pos_of_tag.setdefault(id, [])

 with open("{0}{1}{2}".format(stacks_path, individual_name, ".matches.tsv" ), "r") as MATCHES:
  for line in MATCHES:
   columns = line.strip().split("\t")
   cat_id = columns[2]
   ind_id = columns[4]
   allele = columns[5]
   if (cat_id in tags_to_keep and cat_id in seq_of_tag):
    alleles_of_tag.setdefault(cat_id, [])
    ind_ids_of_cat_id.setdefault(cat_id, set())
 with open(out_file, "w") as FASTA:
  for cat_id in sorted(alleles_of_tag.keys(), key=int):
   base_seq = seq_of_tag[cat_id]
   haplotype = base_seq[:]
   for allele in alleles_of_tag[cat_id]:
    snp_index = 0
     for snp in sorted(snp_pos_of_tag[cat_id]):
      haplotype = haplotype[:snp] + allele[snp_index] + haplotype[snp+1:]
      snp_index += 1
    except KeyError: # happens on 'consensus' genotypes
     haplotype = haplotype
     snp_positions = "_".join([str(x) for x in snp_pos_of_tag[cat_id]] )
    except KeyError:
     snp_positions = "none"
    ind_id = "_".join(ind_ids_of_cat_id[cat_id])
    header = ">cat:{}|ind:{}|allele:{}|pos:{}".format(cat_id, ind_id, allele, snp_positions)
    FASTA_dict[header] = haplotype
    header_line = header + "\n"
    seq_line = haplotype + "\n"

 return FASTA_dict 

def main(argv=None):
 if argv is None:
  argv = sys.argv[1:]

if __name__ == "__main__":

No comments:

Post a Comment