Friday, March 7, 2014

GATK Runtime Error: Somehow the requested coordinate is not covered by the read

This error sometimes crops up for me and others in the Unified Genotyper from GATK v 2.8. As this post on the GATK forums indicates, none of us can reliably reproduce it so nothing can really be done at this point. The reads at loci causing the crash all look normal, and subsetting the loci that causes the crash in one instance doesn't reproduce the crash when run alone. With hopes it will automagically go away in the imminent GATK v 3 release.

In the meantime, here's a hacked together method to get variants called for the entire genome. It's just a bash loop that loops through all of the genomic intervals in a file until they successfully complete. The output .vcf files can then be merged back to one file.

First you need an interval file. I just generated mine using awk and the .vcf file produced by a Unified Genotyper run that crashed:
awk '{ 
        if (substr($0, 1, 8) == "##contig") {
                split($0, stringArray, "=")
                split(stringArray[3], subStringArray, ",")
                print subStringArray[1]
        }
        if (substr($0, 1, 2) != "##") {
                exit 0
        }
}' crashedUGoutput.vcf > intervalFile.intervals
Which should make an interval file with something like (depending on how your contigs are named):
chromosome_1
chromosome_2
Then you can do something like the following loop to process each interval until it succeeds:
INPUTFILES=$INPUTPATH*.bam

for file in $INPUTFILES
do
        inputString="${inputString} -I ${file}"
done

INITIALVARS=${FILEID}_rawVariants

mkdir tmpVCF
rm ./tmpVCF/log.txt
rm ./tmpVCF/*
while read interval
do
        fail=1
        while [ $fail != 0 ]
        do
                echo "Attempting UG on $interval"
                echo "Attempting UG on $interval" >> ./tmpVCF/log.txt
                java -Xmx${MEMORY} -jar ${GATKPATH} \
                -T UnifiedGenotyper \
                -R ${REFERENCE} \
                ${inputString} \
                --genotyping_mode DISCOVERY \
                --genotype_likelihoods_model BOTH \
                -stand_emit_conf 30 \
                -stand_call_conf 30 \
                -nct ${NUMTHREADS} \
                -L $interval \
                -o ./tmpVCF/${INITIALVARS}_$interval.vcf
                fail=$?
        done
done < intervalFile.intervals
Following this, you'll have a .vcf file for each interval stored in the ./tmpVCF directory, and they can be merged back into one .vcf.

No comments:

Post a Comment