Saturday, March 29, 2014

Keystone Symposia on Big Data in Biology

The Llama and I just got back from the 2014 Keystone Symposia on Big Data in Biology. The meeting was enjoyable, and it was nice to see what the cancer biology folks are doing to manage their data deluge. The organizers were Lincoln Stein, Doreen Ware, and Michael Schatz.

The major points of interest that I took away were, in no particular order:

  1. General need/interest in the community for the adoption of standardized file formats or (more importantly) standardized file interface application programming interfaces (APIs).
  2. The scale of genomic analyses is causing a shift to colocated computing (e.g. cloud computing).
  3. General need/interest in the community for "standardized" analysis pipelines or "recipes" since methodological differences in analyses are causing reproducibility problems.
  4. The community acknowledges that we can rapidly generate large amounts of data, but we're barely keeping our heads above water for storage and analysis, and we're still pretty bad at translating the data into actionable information.
  5. Different variant callers give different results. The GATK is generally considered one of the best-performing programs for calling SNPs, but the jury is still out for indel calling. A DREAM challenge for variant callers is coming soon and may help with benchmarking (https://www.synapse.org/#!Synapse:syn312572/wiki/60874).
  6. General interest in the community for new models of reference genomes. Instead of monolithic strings representing the "one true reference", reference genomes would be modeled as graphs to represent pangenomes.
  7. At the poster session, we learned of a method for prioritizing candidate genes found under a QTL interval, and we hope it will be published soon so we can use it.
  8. At the poster session, we learned of a method for mathematically modeling sequence strings and scoring them based on a training set (i.e. scoring them for transcription factor binding sites); we hope to use this one soon too once it's published.
  9. iPlant may be a useful cloud computing resource; we want to look into it further.
  10. We learned some general "good" practices for analyzing data at scale that may be applicable to our pipelines.
  11. At this scale of analysis, implementation matters a great deal; implementations in languages like R, Perl, and Python suffer in performance relative to implementations in C.
  12. BWA (specifically BWA-MEM) is generally accepted in the community as the de facto standard read mapper.
  13. There is a disconnect between what biologists think it takes and what it really takes to manage and analyze the data; biologists frequently underestimate the resources (time, effort, and hardware) required for the analyses they propose.
  14. The IBM Computational Biology group could make a tremendous splash once their text-based learning method using Watson is released (it basically reads the available literature and provides researchers with potential leads).

Friday, March 7, 2014

GATK Runtime Error: Somehow the requested coordinate is not covered by the read

This error sometimes crops up for me and others in the Unified Genotyper from GATK v2.8. As this post on the GATK forums indicates, none of us can reliably reproduce it, so nothing can really be done at this point. The reads at the loci causing the crash all look normal, and subsetting to a locus that caused the crash in one run doesn't reproduce the crash when run alone. Here's hoping it automagically goes away in the imminent GATK v3 release.

In the meantime, here's a hacked-together method to get variants called for the entire genome. It's a bash loop that iterates over the genomic intervals listed in a file, retrying each one until it completes successfully. The per-interval .vcf output files can then be merged back into one file.

First you need an interval file. I just generated mine using awk and the .vcf file produced by a Unified Genotyper run that crashed:
awk '{
        # Header lines like ##contig=<ID=chromosome_1,length=...> carry the
        # contig names; split on "=" and then on "," to pull out the ID.
        if (substr($0, 1, 8) == "##contig") {
                split($0, stringArray, "=")
                split(stringArray[3], subStringArray, ",")
                print subStringArray[1]
        }
        # Quit as soon as we are past the ## header lines.
        if (substr($0, 1, 2) != "##") {
                exit 0
        }
}' crashedUGoutput.vcf > intervalFile.intervals
That should produce an interval file with one contig name per line, something like this (depending on how your contigs are named):
chromosome_1
chromosome_2
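As an aside, if you don't have a crashed run's .vcf handy, a similar list can be pulled from other places; for example, assuming your reference FASTA has been indexed with samtools faidx, the first column of the .fai index holds the contig names:
cut -f1 ${REFERENCE}.fai > intervalFile.intervals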
Then you can do something like the following loop to process each interval until it succeeds:
# INPUTPATH, FILEID, MEMORY, GATKPATH, REFERENCE, and NUMTHREADS are assumed
# to be set earlier in the script.
INPUTFILES=$INPUTPATH*.bam

# Build up a "-I file1.bam -I file2.bam ..." argument string for GATK.
for file in $INPUTFILES
do
        inputString="${inputString} -I ${file}"
done

INITIALVARS=${FILEID}_rawVariants

# Start with a clean working directory (-p and -f keep reruns from erroring out).
mkdir -p tmpVCF
rm -f ./tmpVCF/*
# For each interval, rerun the Unified Genotyper until it exits successfully.
# NB: an interval that fails deterministically will loop forever; watch the log.
while read interval
do
        fail=1
        while [ $fail -ne 0 ]
        do
                echo "Attempting UG on $interval"
                echo "Attempting UG on $interval" >> ./tmpVCF/log.txt
                java -Xmx${MEMORY} -jar ${GATKPATH} \
                -T UnifiedGenotyper \
                -R ${REFERENCE} \
                ${inputString} \
                --genotyping_mode DISCOVERY \
                --genotype_likelihoods_model BOTH \
                -stand_emit_conf 30 \
                -stand_call_conf 30 \
                -nct ${NUMTHREADS} \
                -L $interval \
                -o ./tmpVCF/${INITIALVARS}_$interval.vcf
                # Capture the exit status; non-zero means this attempt crashed.
                fail=$?
        done
done < intervalFile.intervals
Following this, you'll have a .vcf file for each interval stored in the ./tmpVCF directory, and they can be merged back into one .vcf.
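For the merge itself, one option is GATK's CombineVariants walker. Here's a minimal sketch, assuming the same samples appear in every per-interval file; the flag names below match the GATK 2.x documentation, so double-check them against your version:
# Build a "--variant file.vcf" argument for each per-interval output file.
variantString=""
for vcf in ./tmpVCF/${INITIALVARS}_*.vcf
do
        variantString="${variantString} --variant ${vcf}"
done

java -Xmx${MEMORY} -jar ${GATKPATH} \
-T CombineVariants \
-R ${REFERENCE} \
${variantString} \
--assumeIdenticalSamples \
-o ${INITIALVARS}_merged.vcf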