Extracting sequence information from Stacks samples
TL;DR: On the Stacks mailing list, Ryan Waples posted what looks to be a useful bit of Python code for extracting sequence information from Stacks output. With his permission, I've reposted some of the dialogue and Waples' code here.
And for the full post:
An email passed through on the Stacks listserve from user named Scott Hotaling who had requested the following:
"What I'm after is essentially the X.haplotypes.tsv file that populations outputs with the full sequence context instead of only the variable position(s). I understand this information could be determined manually via catalog.snps.tsv file. But, it would be great to have the full sequence information as an output option of populations."
In response, Ryan Waples replied with the following:
Scott,
You can try to use the attached python script. It is not a polished release but I think it should be able to get something close to what you want.
It looks in the catalog and individual files and constructs full fasta seqeunces from the individual haplotypes and the catalog's consensus. I haven't used the data from this myself, so please double check this is doing what you expect.
You can use it with a command line as follows:
python stacks_to_fasta_matches.py stacks_path catalog_name individual_name subset_file FASTA_file
stacks_path = path to the stacks data you want to use, must contain the catalog and individual files [/stacks/data/]
catalog name = name of the catalog [batch_3.catalog]
individual name = name of ind [PHOOD_2131]
subset_file = A list of catalog IDs, one per line, A fasta entry will be generated for each haplotype of the individual at these loci [./subset.txt]
FASTA_file = the output. [./output.fa]
For 'consensus' loci, just a single sequence will be printed, not two copies. This script doesn't check the 'model' calls in the tags files of the individuals, and I think it would spit out more than two seqeunces if more than two haplotypes were seen in an individual.
Good luck,
-Ryan"
It may be of interest to do something similar with our data in the future, so I wanted to keep the code available. It also employs a few practices that I'd like to learn from. Thanks again to Ryan Waples for allowing it to be reposted here. Here is his code:
#!/usr/bin/env python # Python 2.7 # stacks_to_fasta_matches.py v1.1 # 5/21/13 # Ryan Waples # Command line outline # > python stacks_to_fasta_matches.py stacks_path catalog_name individual_name subset_file FASTA_file # Usage example # > python ./stacks_to_fasta_matches.py /stacks/data/ batch_3.catalog PHOOD_2131 ./subset.txt ./output.fa # Inputs # A set of stacks catalog and individual files (*.tags.tsv, *.snp.tsv, *.alleles.tsv) # specified by a [path] + [catalog_name] + [individual_name] # A 'whitelist' of tags. # specefied by a [filename] # tags_subset_file format - one tag_id per line. # 1 # 43 # 12395 # 6549 # etc.. # Outputs # An FASTA file # For each white listed tag: # Each haplotype present in the catalog will generate one FASTA sequence entry. # FASTA header: # >[TAG#]_[ALLELE]_[SNP1_pos]-[SNP2_pos]-[SNP3_pos]-[...] import sys def stacks_to_fasta_matches(stacks_path, catalog_name, individual_name, subset_file, out_file): """ Takes a whitelist of catalog tags and prints each observed haplotype in the supplied indiviudal into a sequence to a FASTA file. Function also returns a dictionary of FASTA entries. """ print "\n" if not stacks_path.endswith('/'): stacks_path = stacks_path + "/" print ("\n".join(['stacks_path ->\t' + stacks_path, 'catalog_name ->\t' + catalog_name, "individual_name ->\t" + individual_name, 'subset_file ->\t' + subset_file, 'out_file ->\t'+ out_file])) print "\n" #used variables tags_to_keep = set() # strings seq_of_tag = dict() # strings snp_pos_of_tag = dict() # lists of ints alleles_of_tag = dict() # lists of strings ind_ids_of_cat_id = dict() # set of strings FASTA_dict = dict() # strings with open(subset_file, "r") as SUBSET: for line in SUBSET: tag = line.strip() tags_to_keep.add(tag) with open("{0}{1}{2}".format(stacks_path, catalog_name, ".tags.tsv" ), "r") as TAGS: for line in TAGS: columns = line.strip().split("\t") id = columns[2] seq = columns[9] seq_type = columns[6] if (id in tags_to_keep and seq_type == 'consensus'): seq_of_tag[id] = seq with open("{0}{1}{2}".format(stacks_path, catalog_name, ".snps.tsv" ), "r") as SNPS: for line in SNPS: columns = line.strip().split("\t") id = columns[2] pos = columns[3] if id in tags_to_keep and id in seq_of_tag: snp_pos_of_tag.setdefault(id, []) snp_pos_of_tag[id].append(int(pos)) with open("{0}{1}{2}".format(stacks_path, individual_name, ".matches.tsv" ), "r") as MATCHES: for line in MATCHES: columns = line.strip().split("\t") cat_id = columns[2] ind_id = columns[4] allele = columns[5] if (cat_id in tags_to_keep and cat_id in seq_of_tag): alleles_of_tag.setdefault(cat_id, []) alleles_of_tag[cat_id].append(allele) ind_ids_of_cat_id.setdefault(cat_id, set()) ind_ids_of_cat_id[cat_id].add(ind_id) with open(out_file, "w") as FASTA: for cat_id in sorted(alleles_of_tag.keys(), key=int): base_seq = seq_of_tag[cat_id] haplotype = base_seq[:] for allele in alleles_of_tag[cat_id]: snp_index = 0 try: for snp in sorted(snp_pos_of_tag[cat_id]): haplotype = haplotype[:snp] + allele[snp_index] + haplotype[snp+1:] snp_index += 1 except KeyError: # happens on 'consensus' genotypes haplotype = haplotype try: snp_positions = "_".join([str(x) for x in snp_pos_of_tag[cat_id]] ) except KeyError: snp_positions = "none" ind_id = "_".join(ind_ids_of_cat_id[cat_id]) header = ">cat:{}|ind:{}|allele:{}|pos:{}".format(cat_id, ind_id, allele, snp_positions) FASTA_dict[header] = haplotype header_line = header + "\n" seq_line = haplotype + "\n" FASTA.write(header_line) FASTA.write(seq_line) return FASTA_dict def main(argv=None): if argv is None: argv = sys.argv[1:] stacks_to_fasta_matches(*argv) if __name__ == "__main__": main(sys.argv[1:])
Comments
Post a Comment