Saturday, March 29, 2014

Keystone Symposia on Big Data in Biology

The Llama and I just got back from the 2014 Keystone Symposia on Big Data in Biology. The meeting was enjoyable, and it was nice to see what the cancer biology folks are doing to manage their data deluge. The organizers were Lincoln Stein, Doreen Ware, and Michael Schatz.

The major points of interest that I took away were, in no particular order:

  1. General need/interest in the community in the adoption of standardized file formats or (more importantly) standardized file interface application programming interfaces (APIs).
  2. The scale of genomic analyses are causing a shift to colocated computing (e.g. cloud computing).
  3. General need/interest in the community for "standardized" analysis pipelines or "recipes" since methodological differences in analyses are causing reproducibility problems.
  4. The community acknowledges that we can rapidly generate large amounts of data, but we're barely keeping our heads above water for storage and analysis, and we're still pretty bad at translating the data into actionable information.
  5. Different variant callers give different results. The GATK is generally considered one of the best performing programs for calling SNPs, but the jury is still out for indel calling. A DREAM competition is coming soon for variant callers that may help with benchmarking (https://www.synapse.org/#!Synapse:syn312572/wiki/60874).
  6. General interest in the community for new models of reference genomes. Instead of monolithic strings representing the "one true reference", reference genomes would be modeled as graphs to represent pangenomes.
  7. At the poster session, we learned of a method for prioritizing candidate genes that are found under a QTL interval, and we hope it will be published on soon so we can use it.
  8. At the poster session, we learned of a method for mathematically modeling sequence strings and scoring them based on a training set (i.e. scoring them for transcription factor binding sites); we hope to use this one soon too once it's published.
  9. iPlant may be a useful cloud computing resource that we want to look further.
  10. We learned some of the general "good" practices for analysis of data at scale which may be applicable to our pipelines.
  11. At this scale of analysis, implementation matters a great deal; implementations in languages like R, Perl, and Python suffer in performance relative to implementations in C.
  12. BWA (specifically BWA mem) is generally accepted in the community as the de facto read mapping algorithm.
  13. There is a disconnect between what biologists think it takes and what it really takes to manage and analyze the data; biologists frequently underestimate the resources (time, effort, and hardware) required for the analyses they propose.
  14. The IBM Computational Biology group could make a tremendous splash once their text based learning method using Watson is released (it basically reads the available literature and provides researchers with potential leads).

No comments:

Post a Comment