In silico restriction digest and retrieval of restriction site associated DNA

My initial goal was to use Biopieces to perform an in silico restriction digest of an entire genome and retrieve the DNA sequence flanking the restriction sites. Biopieces performed well for scanning the genome and identifying site locations with rescan_seq, but digest_seq did not scale to a whole chromosome: it consumed over 16 GB of RAM on a chromosome of ~70 Mbp.

Just counting the number of restriction sites was fairly rapid (this requires Biopieces, a fasta file with the reference genome, and a restriction enzyme from the rescan_seq list):
#transliterate_seq replaces N's with A's so that the sequence can be properly scanned for restriction sites.
#Without converting, rescan_seq is a bit greedy and will consider a string of N's a possible restriction site.
#rescan_seq in this example scans for the FseI sites.
read_fasta -i sbi1.fasta | transliterate_seq -s 'Nn' -r 'aa' | rescan_seq -r FseI -o | awk ' BEGIN{i=0} $1=="RE_COUNT:" {i=i+$2} END{print i} '
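As a rough cross-check on that count, the fasta can also be searched directly for the FseI recognition sequence (GGCCGGCC) outside of Biopieces. The sketch below is only that, a sketch: it assumes Biopython is installed and counts overlapping matches, which may not agree exactly with how rescan_seq reports sites.
#Minimal cross-check: count FseI recognition sites (GGCCGGCC) directly in a fasta file.
#Assumes Biopython; steps one base at a time so overlapping sites are not missed.
import sys
from Bio import SeqIO

SITE = "GGCCGGCC"  #FseI recognition sequence, 8 bp

total = 0
for record in SeqIO.parse(sys.argv[1], "fasta"):
    seq = str(record.seq).upper()
    pos = seq.find(SITE)
    while pos != -1:
        total += 1
        pos = seq.find(SITE, pos + 1)
print("FseI sites: %d" % total)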
Getting the sequence itself was more challenging because of digest_seq's memory issues. Ultimately I put together a pipeline to extract the sequence, albeit very slowly at this point (~1 hour to retrieve the sequence for 40k+ RAD tags from 20k+ RE sites).
The pipeline requires Biopieces, a fasta file containing the reference genome (in my case the Sorghum bicolor reference genome build Sbi1), and a restriction enzyme from the rescan_seq list.
#transliterate_seq replaces N's with A's so that the sequence can be properly scanned for restriction sites.
#Without converting, rescan_seq is a bit greedy and will consider a string of N's a possible restriction site.
#rescan_seq in this example scans for the FseI sites.
#We use awk to avoid capturing the very long "SEQ:" field that is output by rescan_seq since we don't need it.
read_fasta -i sbi1.fasta | transliterate_seq -s 'Nn' -r 'aa' | rescan_seq -r FseI | awk ' $1!="SEQ:" {print;} ' > FseIout.txt
At this point, FseIout.txt contains the locations of the restriction site matches in the genome. I wrote a Python script to take this output and retrieve the restriction site associated DNA. It's quite slow, and I think the bottleneck is the system call to get_genome_seq made for each match:
#!/usr/bin/python
# -*- coding: utf-8 -*-
#Ryan McCormick 07/12/13
#Texas A&M University
#For use in conjunction with a Biopieces pipeline
#Version 0.1

import sys
from subprocess import check_output

#Modify these variables based on genome name, desired length of RAD tag, and RE site.
str_genome = "Sbi1"
int_len = 10
int_lenREsite = 8
###

try:
    li_tsv = [line.strip() for line in open(sys.argv[1])]  # Open target file, strip newline characters from lines, define as list.
except IndexError:  # Check if arguments were given
    sys.stdout.write("No arguments received. Please check your input:\n")
    sys.stdout.write("\t$ python.py input.tsv\n")
    sys.exit()
except IOError:  # Check if the file could not be opened.
    sys.stdout.write("Cannot open target file. Please check your input:\n")
    sys.stdout.write("\t$ python.py input.tsv\n")
    sys.exit()

lili_allContigs = []
li_contig = []

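#Split the Biopieces stream into records; Biopieces separates records with '---' lines.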
for i in range(len(li_tsv)):
    if li_tsv[i] == "---":
        lili_allContigs.append(li_contig)
        li_contig = []
    else:
        li_contig.append(li_tsv[i])

int_numREsites = 0
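#Record fields are accessed by line index: index 0 holds the RE name, index 2 the semicolon-separated match positions, and index 4 the sequence name.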
for i in range(len(lili_allContigs)):
    str_chr = (lili_allContigs[i][4].split(" "))[1]
    str_RE = (lili_allContigs[i][0].split(" "))[1]
    try:
        li_matches = ((lili_allContigs[i][2].split(" "))[1]).split(";")
    except IndexError:
        li_matches = []
    sys.stderr.write("On %s\n" % str_chr)
    for j in range(len(li_matches)):
        if j % 1000 == 0:
            sys.stderr.write("On site %s\n" % j)
        int_beg = int(li_matches[j]) + 1 - int_len
        if int_beg <= 1:
            int_beg = 1
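        #Ask get_genome_seq for a window of (int_len + int_len + int_lenREsite) bases starting at int_beg.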
        str_argGen = "-g" + str_genome
        str_argChr = "-c" + str_chr
        str_argBeg = "-b%s" % int_beg
        str_argLen = "-l%s" % (int_len + int_len + int_lenREsite)
        output = check_output(["get_genome_seq", str_argGen, str_argChr, str_argBeg, str_argLen])
        li_output = output.split("\n")
        str_seq = (li_output[1].split(" "))[1]
        str_IDChr = (li_output[4].split(" "))[1]
        #Get upstream of the RE site
        str_IDBegU = str(int_beg)
        str_IDEndU = str(int_beg + int_len + int_lenREsite)
        str_fastaID = ">" + str_IDChr + "_" + str_IDBegU + "_" + str_IDEndU + "_" + str_RE + "_U"
        sys.stdout.write(str_fastaID + "\n")
        sys.stdout.write(str_seq[0:int_len + int_lenREsite] + "\n")
        #Get downstream of the RE site
        str_IDBegD = str(int_beg + int_len)
        str_IDEndD = str(int_beg + int_len + int_len + int_lenREsite)
        str_fastaID = ">" + str_IDChr + "_" + str_IDBegD + "_" + str_IDEndD + "_" + str_RE + "_D"
        sys.stdout.write(str_fastaID + "\n")
        sys.stdout.write(str_seq[int_len:] + "\n")
        int_numREsites = int_numREsites + 1
sys.stderr.write("Number of restriction sites: %s\n" % int_numREsites)
To use it, the reference genome has to be indexed with Biopieces:
#Index the reference genome with Biopieces so that we can search it with get_genome_seq
read_fasta -i sbi1.fasta | format_genome -f fasta -g Sbi1 -x
#Then run the Python script. This takes a very long time in the current implementation (~1 hour for 22k RE sites). 
time python python.py FseIout.txt > FseIRADtags.fa
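Since the per-site calls to get_genome_seq are likely the bottleneck, an obvious next step would be to load the reference into memory once and slice the flanks out directly. Below is a rough sketch of that idea rather than a drop-in replacement: it assumes Biopython is installed, that the sequence names in FseIout.txt match the fasta headers, and it reuses the coordinate conventions of the script above; I haven't validated it against the Biopieces output.
#Sketch: retrieve RAD tag flanks by slicing an in-memory genome instead of calling
#get_genome_seq once per site. Assumes Biopython; mirrors the coordinates used above.
import sys
from Bio import SeqIO

int_len = 10        #flank length on each side of the RE site
int_lenREsite = 8   #length of the FseI recognition site

#Load the whole reference into memory once.
genome = {rec.id: str(rec.seq) for rec in SeqIO.parse("sbi1.fasta", "fasta")}

def write_tags(str_chr, int_match, str_RE, out=sys.stdout):
    #Write the upstream and downstream tags for one restriction site match position.
    seq = genome[str_chr]
    int_beg = max(int_match + 1 - int_len, 1)  #1-based start, clamped like the script above
    str_seq = seq[int_beg - 1 : int_beg - 1 + int_len + int_len + int_lenREsite]
    out.write(">%s_%d_%d_%s_U\n" % (str_chr, int_beg, int_beg + int_len + int_lenREsite, str_RE))
    out.write(str_seq[0:int_len + int_lenREsite] + "\n")
    out.write(">%s_%d_%d_%s_D\n" % (str_chr, int_beg + int_len, int_beg + int_len + int_len + int_lenREsite, str_RE))
    out.write(str_seq[int_len:] + "\n")
Looping over the parsed FseIout.txt records and calling write_tags for each match position would then replace the check_output call in the inner loop.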
