Friday, February 13, 2015

Postmortem: Resolution of Genetic Map Expansion Caused by Excess Heterozygosity in Plant Recombinant Inbred Populations

The Llama and I just got a manuscript through review on the RIG workflow, and it's a good time for some postmortems. This postmortem is about Truong et al. (2014) Resolution of Genetic Map Expansion Caused by Excess Heterozygosity in Plant Recombinant Inbred Populations.

The story for this one starts when the Llama and I received about 6 lanes worth of sequence data for an experimental sorghum cross composed of 400+ individuals. We got the sequence data processed to genotypes using an early iteration of the RIG workflow (at that time it had a weirder more awesomer moniker: the LERRG workflow), and we were trying to use R/qtl to construct a genetic map from the 10,000+ genetic variants that had been called. However, regardless of what we tried, we kept getting massive genetic maps that were well over 100% larger than what we expected. We even posted a question to the R/qtl forums.

As hashed out in that forum thread, it turns out that something generates what manifests as tight double crossovers in the genotype data. This phenomenon seems to occur with multiple types of genotyping technologies, so it's unclear whether these are biologically real or artifacts of the technology. Removing these tight double crossovers shrinks the map to a more reasonable size, but we were still left with an unexpectedly large map. Nearing the end of our rope, we continued to scour the literature for causes of genetic map expansion. We came across a publication by Knox and Ellis (2002) titled "Excess Heterozygosity Contributes to Genetic Map Expansion in Pea Recombinant Inbred Populations"; our population also displayed excess levels of heterozygosity relative to a Mendelian model. While the Knox and Ellis paper didn't give a conclusive explanation for why excess heterozygosity caused genetic map expansion, it provided enough precedent to pursue the idea further.
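For anyone who hasn't run into tight double crossovers before, here is a minimal Python sketch of the pattern. It is an illustration, not what we ran and not how R/qtl handles it: it assumes one individual's calls are ordered along a chromosome, it only catches the single-marker case (a call that disagrees with two flanking calls that agree with each other), and it ignores the distance between markers. The function names and the "A"/"B"/"H" coding are made up for the example.

```python
def flag_tight_double_crossovers(genotypes):
    """Return indices of calls that differ from two agreeing flanking calls.

    genotypes: calls for ONE individual, ordered along a chromosome,
    e.g. "A", "B", "H" (heterozygous), or None for a missing call.
    (Hypothetical coding, chosen only for this sketch.)
    """
    # Skip missing calls so they don't hide the real flanking markers.
    called = [(i, g) for i, g in enumerate(genotypes) if g is not None]
    flagged = []
    # Slide a window of three consecutive called markers along the chromosome.
    for (_, left), (idx, mid), (_, right) in zip(called, called[1:], called[2:]):
        if left == right and mid != left:
            flagged.append(idx)
    return flagged


def mask_flagged(genotypes, flagged):
    """Return a copy of the calls with the flagged positions set to missing."""
    flagged = set(flagged)
    return [None if i in flagged else g for i, g in enumerate(genotypes)]


if __name__ == "__main__":
    calls = ["A", "A", "B", "A", "A", None, "B", "B"]
    suspects = flag_tight_double_crossovers(calls)
    print(suspects)
    print(mask_flagged(calls, suspects))
```

Setting the flagged calls to missing rather than deleting the markers keeps the marker order intact, so the map estimation just treats those positions as unobserved.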

The Llama went back to the old Haldane and Waddington models and some newer models on differential zygotic viability, and she cranked out a general solution for the genotypic frequencies expected for a recombinant inbred line population that hadn't yet gone to fixation and that had differential fitness of heterozygotes. I chucked it into the right spot in the R/qtl C code for genetic map estimation, and it worked out just fine. Or at least that's how we would have liked for it to have gone. The ride wasn't quite so smooth.

Getting things working involved a large number of false starts, because either the general solution wasn't quite right or the implementation wasn't quite right. It became something of an obsession to make it work, and we'd spend hours after we got home at night testing, fixing, and rebuilding. We eventually got it worked out in a San Francisco airport on our way home from the Keystone Symposia on Big Data in Biology: we had a general solution and an implementation that worked for the test cases.

As predicted by Knox and Ellis, the excess heterozygosity did indeed expand the genetic map. However, it turned out that excess heterozygosity doesn't always expand the map; it depends on the generation interval and the amount of heterozygosity. Ultimately, the take-home message is that tight double crossovers are a major source of genetic map expansion, and that segregation distortion can also lead to genetic map expansion.
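To make "excess heterozygosity relative to a Mendelian model" a bit more concrete, here is a toy single-locus sketch in Python. It is not the general solution from the paper (that one deals with genotype probabilities for pairs of markers); it only assumes repeated selfing starting from a fully heterozygous F1, plus a hypothetical viability parameter `het_fitness` for heterozygotes relative to homozygotes. With `het_fitness = 1.0` you get the Mendelian expectation, where heterozygosity halves each generation.

```python
def expected_heterozygosity(generations_of_selfing, het_fitness=1.0):
    """Expected heterozygote frequency at one locus after selfing from an F1.

    generations_of_selfing: rounds of selfing after the F1 (1 gives the F2).
    het_fitness: hypothetical viability of heterozygotes relative to
        homozygotes; 1.0 is the plain Mendelian case (an assumption of
        this sketch, not a parameter taken from the paper).
    """
    h = 1.0  # the F1 is heterozygous at every segregating locus
    for _ in range(generations_of_selfing):
        # A selfed heterozygote gives 1/2 heterozygous offspring;
        # selfed homozygotes give only homozygotes.
        het = 0.5 * h * het_fitness  # heterozygotes, weighted by viability
        hom = 1.0 - 0.5 * h          # homozygotes, viability fixed at 1
        h = het / (het + hom)        # renormalize after selection
    return h


if __name__ == "__main__":
    for w in (1.0, 1.5, 2.0):
        freqs = [round(expected_heterozygosity(g, w), 4) for g in range(1, 7)]
        print(f"het_fitness={w}: {freqs}")
```

Any `het_fitness` above 1 leaves more heterozygotes at a given generation than the Mendelian model predicts, which is the kind of departure we were seeing in our population.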

We wrote the manuscript up, and after a couple of rapid rejects from PLOS Genetics and Genetics, G3: Genes|Genomes|Genetics sent it off for review. The G3 reviewers and editor were fair and informed. One reviewer suggested that we do a simulation study to demonstrate the method's efficacy, and, in hindsight, we're grateful for that suggestion.

What worked: 
Test cases - The only way we got the method working was that we finally made test cases with known output for the expected genotype probabilities given a generation interval and heterozygosity levels. Having this made it much easier to diagnose problems.
G3 - The associate editor and reviewers gave the manuscript a fair and complete review. They picked out reasonable weaknesses, and made useful suggestions that improved the paper.
R/qtl - This work almost certainly could not have been done if Karl Broman (and colleagues) had not made the R/qtl code base available on GitHub. R/qtl may not be the prettiest code base around, but it was sufficiently documented that we could find our way around and bolt our pieces onto it. R/qtl has been chugging along for more than 10 years, and we hope it chugs along for many more.

What didn't work:
Not making test cases earlier - We probably could have saved ourselves a few weeks of consternation had we set up test cases from the start. Guess what we did in the beginning: we took the implementation, tossed the 10,000+ markers in, and estimated the whole map to see if the size changed. When that didn't work, we tried it for individual chromosomes. Then we tried it for subsets of individual chromosomes. Then we tried it for pairs of markers. It wasn't until around the time of the Big Data conference that we wised up and got to the root of the output: the genotype probabilities. Only then, when we knew the expected probabilities for a given generation interval and heterozygote advantage, were we able to properly debug the issue.