Querying NCBI dbSNP for rsID mergers with Python (a.k.a. where is RsMergeArch.bcp.gz?)

One of our current projects is a systems genetics project that involves the interrelation of multiple genetic datasets containing human genetic variants. The different experiments were performed and analyzed relative to different reference genomes and dbSNP builds, so one of the first steps is making sure which variants correspond in each dataset. Since different builds of dbSNP were used, the rsIDs don't always have a one to one correspondence; some rsIDs are merged and others lost. So, the goal here was to update the rsIDs from the older build to the rsIDs used in the newer build.

It looks like, at one point in the history of dbSNP, flat files of rsID mergers were hosted via FTP in a file called RsMergeArch.bcp.gz (e.g., https://www.biostars.org/p/128601/ and https://www.ncbi.nlm.nih.gov/books/NBK44496/#Schema.where_do_i_find_a_list_of_merged ). Unfortunately, I was unable to find such a file for dbSNP 150. However, through the web interface, you can indeed search for "mergedrs" at Entrez SNP (https://www.ncbi.nlm.nih.gov/snp) and get the merger information. Moreover, the "Chromosome Report" output can be downloaded as a flatfile that contains most of the information necessary for the mapping of rsIDs. Given all of that, it seemed the best way to get and process the updates would be by programmatically interacting with the database.

Fortunately, Biopython.Entrez enables exactly this (http://biopython.org/DIST/docs/api/Bio.Entrez-module.html). If you're using anaconda, something like the following should get you biopython:
conda install biopython
Then you can use something like the following to fetch those records (a bit more documentation regarding efetch can be found at https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch ):
from Bio import Entrez
import sys

Entrez.email = 'yourEmail@institution.com'

handle = Entrez.efetch(db="snp", id="rs140337953, rs116440577, rs150021059", rettype = 'chr', retmode = 'text')
sys.stdout.write(handle.read()) handle.close()
The output printed to stdout here is the same content that you'd get if you ran the search with the SNPs, e.g. https://www.ncbi.nlm.nih.gov/snp/?term=rs140337953+or+rs116440577+or+rs150021059 , and then sent it to file with chromosome report format.

You could write the records to a file, or parse the record string, and go on your merry way. Of note, there are a number of reports that the XML parsers within Bio.Entrez are unable to parse the dbSNP XML (and they also didn't work for me), so parsing the XML output of the fetch may not work; I ended up having to parse the text output. This blog here posts a more sophisticated example of querying and parsing dbSNP: http://www.danielecook.com/getting_snp_dat/ .

On an unrelated note, one of our colleagues keeps a bird feeder, and the messy birds planted a nice corn plant to make us feel a little closer to the field. Thanks, birds.


Popular Posts