Codex Technicanum ("We have no idea what that means.")<br />
<br />
<b>Querying NCBI dbSNP for rsID mergers with Python (a.k.a. where is RsMergeArch.bcp.gz?)</b> (2017-07-23)<br />
<br />
One of our current projects is a systems genetics project that interrelates multiple datasets of human genetic variants. The different experiments were performed and analyzed against different reference genomes and dbSNP builds, so one of the first steps is determining which variants correspond across datasets. Because different builds of dbSNP were used, the rsIDs don't always have a one-to-one correspondence; some rsIDs are merged and others retired. So, the goal here was to update the rsIDs from the older build to those used in the newer build.<br />
<br />
It looks like, at one point in dbSNP's history, flat files of rsID mergers were hosted via FTP in a file called RsMergeArch.bcp.gz (e.g., <a href="https://www.biostars.org/p/128601/" target="_blank">https://www.biostars.org/p/128601/</a> and <a href="https://www.ncbi.nlm.nih.gov/books/NBK44496/#Schema.where_do_i_find_a_list_of_merged" target="_blank">https://www.ncbi.nlm.nih.gov/books/NBK44496/#Schema.where_do_i_find_a_list_of_merged</a> ). Unfortunately, I was unable to find such a file for dbSNP 150. However, through the web interface you can indeed search for "mergedrs" at Entrez SNP (<a href="https://www.ncbi.nlm.nih.gov/snp" target="_blank">https://www.ncbi.nlm.nih.gov/snp</a>) and get the merger information. Moreover, the "Chromosome Report" output can be downloaded as a flat file that contains most of the information necessary for mapping rsIDs. Given all of that, the best way to get and process the updates seemed to be to interact with the database programmatically.<br />
<br />
Fortunately, Biopython's Bio.Entrez module enables exactly this (<a href="http://biopython.org/DIST/docs/api/Bio.Entrez-module.html" target="_blank">http://biopython.org/DIST/docs/api/Bio.Entrez-module.html</a>). If you're using Anaconda, something like the following should get you Biopython:
<pre class="brush: bash">
conda install biopython
</pre>
Then you can use something like the following to fetch those records (a bit more documentation regarding efetch can be found at <a href="https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch" target="_blank">https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch</a> ):
<pre class="brush: python">
from Bio import Entrez
import sys
Entrez.email = 'yourEmail@institution.com'
handle = Entrez.efetch(db="snp", id="rs140337953, rs116440577, rs150021059", rettype = 'chr', retmode = 'text')
sys.stdout.write(handle.read())
handle.close()
</pre>
The output printed to stdout here is the same content you'd get if you ran the search with those SNPs in the web interface, e.g. <a href="https://www.ncbi.nlm.nih.gov/snp/?term=rs140337953+or+rs116440577+or+rs150021059" target="_blank">https://www.ncbi.nlm.nih.gov/snp/?term=rs140337953+or+rs116440577+or+rs150021059</a> , and then downloaded the results to a file in Chromosome Report format.<br />
<br />
You could write the records to a file, or parse the record string, and go on your merry way. Of note, there are a number of reports that the XML parsers within Bio.Entrez are unable to parse the dbSNP XML (and they didn't work for me either), so parsing the XML output of the fetch may not work; I ended up parsing the text output instead. This blog post presents a more sophisticated example of querying and parsing dbSNP: <a href="http://www.danielecook.com/getting_snp_dat/" target="_blank">http://www.danielecook.com/getting_snp_dat/</a> .<br />
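Once you've extracted old-to-new rsID pairs from the report (however you end up parsing them), one wrinkle worth handling is chained mergers: an rsID merged into an rsID that was itself later merged. A minimal sketch of resolving such chains with an iterated lookup; the <span style="font-family: "courier new" , "courier" , monospace;">resolve_rsid</span> helper and the example merger pairs are hypothetical, not part of Biopython:

```python
def resolve_rsid(rsid, merge_map):
    """Follow rsID mergers until reaching an ID that was never merged.

    merge_map maps an old rsID to the rsID it was merged into.
    """
    seen = set()
    while rsid in merge_map:
        if rsid in seen:
            break  # guard against a cycle in the map
        seen.add(rsid)
        rsid = merge_map[rsid]
    return rsid

# Hypothetical merger pairs, as parsed from the chromosome report:
merges = {"rs123": "rs456", "rs456": "rs789"}
print(resolve_rsid("rs123", merges))  # rs789 (follows rs123 -> rs456 -> rs789)
print(resolve_rsid("rs999", merges))  # rs999 (never merged; returned unchanged)
```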
<br />
On an unrelated note, one of our colleagues keeps a bird feeder, and the messy birds planted a nice corn plant to make us feel a little closer to the field. Thanks, birds.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-bd2EvQSyc7g/WXTGL99qAQI/AAAAAAAACtc/qacO-6HLGFIx45u6AtM2Q_mQ-pjAeGD8wCLcBGAs/s1600/CornFeeder.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="950" data-original-width="534" height="400" src="https://2.bp.blogspot.com/-bd2EvQSyc7g/WXTGL99qAQI/AAAAAAAACtc/qacO-6HLGFIx45u6AtM2Q_mQ-pjAeGD8wCLcBGAs/s400/CornFeeder.jpg" width="223" /></a></div>
<br />
<br />
<b>Watterson estimator calculation (theta or $\theta_w$) under infinite-sites assumption</b> (2016-11-14)<br />
<br />
Continuing on from <a href="http://codextechnicanum.blogspot.com/2016/11/nucleotide-diversity-pi-or-pi.html" target="_blank">the previous post</a>, I'm still working on making sense of population genetics metrics. Today's is the <a href="https://en.wikipedia.org/wiki/Watterson_estimator" target="_blank">Watterson estimator</a>, frequently denoted $\theta_w$. Wikipedia defines it as<br />
<br />
$\hat{\theta}_w = \frac{K}{a_n}$<br />
<br />
where $K$ is the number of segregating sites (e.g., SNPs) and $a_n$ is the $(n-1)$th <a href="https://en.wikipedia.org/wiki/Harmonic_number" target="_blank">harmonic number</a>.<br />
<br />
To check my calculations this time around, I'm using <a href="http://popgenome.weebly.com/" target="_blank">PopGenome</a> (whose calculations of $\pi$ also agree with the <a href="https://vcftools.github.io/index.html" target="_blank">vcftools</a> calculations from the previous post).<br />
<br />
So, for a single SNP in a sample of 52 diploid individuals (104 chromosomes), the calculation should be<br />
<br />
$\hat{\theta}_w = \frac{1}{a_{104}}$<br />
<br />
To get the harmonic number, you can use a <a href="https://www.math.utah.edu/~carlson/teaching/calculus/harmonic.html" target="_blank">calculator</a> or just implement it yourself. Note that the number is calculated for $(n - 1)$. A quick example of calculating the harmonic number, written in R:<br />
<br />
<pre class="brush: r">harmonicNumber = 0
numChromosomes = 104
for (i in 1:(numChromosomes - 1)) {
  harmonicNumber = harmonicNumber + 1.0 / i
}
print(harmonicNumber)
</pre>
<br />
For 104 chromosomes, the harmonic number used in the denominator is 5.216791.
So, for three windows where the first contains 0 SNPs, the second 1 SNP, and the third 2 SNPs, PopGenome reports the following values for $\theta_w$:<br />
<br />
<pre class="brush: plain">> slide@theta_Watterson
          pop 1
[1,]         NA
[2,]  0.1916887
[3,]  0.3833774
</pre>
<br />
And we can calculate those same values from $\frac{1}{5.216791} = 0.1916887$ and $\frac{2}{5.216791} = 0.3833774$.<br />
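The same arithmetic is easy to reproduce in a few lines of Python (a sketch of the calculation above, not PopGenome's code; the <span style="font-family: "courier new" , "courier" , monospace;">watterson_theta</span> helper is my own):

```python
def watterson_theta(num_segregating_sites, num_chromosomes):
    # a_n is the (n - 1)th harmonic number for n sampled chromosomes
    a_n = sum(1.0 / i for i in range(1, num_chromosomes))
    return num_segregating_sites / a_n

# Windows containing 0, 1, and 2 SNPs across 104 chromosomes:
for k in (0, 1, 2):
    print(watterson_theta(k, 104))  # 0.0, then ~0.1916887, then ~0.3833774
```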
<br />
Dividing that by the size of the window (i.e., to get a per-bp estimate) seems to be what folks do when calculating sliding-window averages of $\theta_w$. Wikipedia describes $\theta_w$ as a measure of the "population mutation rate", which seems appropriate enough; when calculated over sliding windows, it appears to represent the density of genetic variants in a region, scaled slightly for the number of individuals in the sample. This seems to be corroborated by <a href="https://sourceforge.net/p/vcftools/mailman/message/30548155/" target="_blank">a response from the vcftools developer</a> to a user looking for $\theta_w$, indicating that the <span style="font-family: "courier new" , "courier" , monospace;">--SNPdensity</span> option of vcftools provides all of the information necessary for calculating the Watterson estimator.<br />
<br />
As a final note, this appears to be the calculation under the infinite-sites assumption. The finite-sites calculation is evidently different, <a href="http://ldhat.sourceforge.net/manual.pdf" target="_blank">as indicated in the LDhat documentation</a>, though I'm not sure yet what that difference is. It looks like <a href="http://www.genetics.org/content/160/3/1231" target="_blank">McVean et al. (2002)</a> has additional information regarding the difference.<br />
<br />
<b>Nucleotide diversity (pi or $\pi$) calculation (per site and per window)</b> (2016-11-07)<br />
<br />
Part of a project I'm working on looks at how genetic variation accumulates in genomes. Population genetics has developed a number of metrics for evaluating diversity, and while I see them used quite often in papers, it's not always clear to me what they're actually measuring.<br />
<br />
Today I had a look at a measurement of nucleotide diversity called pi ($\pi$). Looking for a good definition, I repeatedly came across the one provided by <a href="https://en.wikipedia.org/wiki/Nucleotide_diversity" target="_blank">Wikipedia</a>: "the average number of nucleotide differences per site between any two DNA sequences chosen randomly from the sample population".<br />
<br />
$\pi = \sum\limits_{ij}^{} x_i x_j \pi_{ij} = 2 \times \sum\limits_{i=2}^{n} \sum\limits_{j=1}^{i-1} x_i x_j \pi_{ij}$<br />
<br />
where $x_i$ and $x_j$ are the respective frequencies of the $i$th and $j$th sequences, $\pi_{ij}$ is the number of nucleotide differences per nucleotide site between the $i$th and $j$th sequences, and $n$ is the number of sequences in the sample. <br />
<br />
I sort of conceptually understand that in the context of looking at specific genes across a population, but I didn't really grok the actual $\pi$ values reported in genome-wide scans. So, to figure out what those plots are actually showing, I had a look at the calculations made by <a href="https://vcftools.github.io/man_latest.html" target="_blank">vcftools</a>. Specifically, I looked at how the results of <span style="font-family: "courier new" , "courier" , monospace;">--site-pi</span> and <span style="font-family: "courier new" , "courier" , monospace;">--window-pi</span> are calculated.<br />
<br />
For<span style="font-family: "courier new" , "courier" , monospace;"><span style="font-family: inherit;"> </span>--site-pi</span>, the function of interest is <span style="font-family: "courier new" , "courier" , monospace;">output_per_site_nucleotide_diversity()</span> in variant_file_output.cpp, and the code of interest is<br />
<br />
<pre class="brush: cpp">for (unsigned int allele = 0; allele < N_alleles; allele++)
{
    int other_alleles_count = (total_alleles - allele_counts[allele]);
    mismatches += (allele_counts[allele] * other_alleles_count);
}
// pairs is the number of pairwise combinations.
int pairs = (total_alleles * (total_alleles - 1));
double pi = (mismatches / static_cast<double>(pairs));
</pre>
<br />
So, given a biallelic locus, $\pi$ for a single position can be calculated directly from the VCF annotations AC (alternate allele count) and AN (allele number). The denominator is the number of ordered pairwise allelic combinations, AN * (AN - 1), and the numerator is the number of those comparisons that would mismatch, AC * (AN - AC) + (AN - AC) * AC.<br />
<br />
So, for this variant:
<br />
<pre class="brush: cpp">chromosome_1 186 . G T 546.80 PASS AC=4;AF=0.040;AN=100;DP=750;FS=0.000;GQ_MEAN=39.08;GQ_STDDEV=36.12;InbreedingCoeff=0.6092;MQ=60.00;MQ0=0;NCC=4;QD=27.34;SOR=2.531;VQSLOD=12.99;culprit=FS
</pre>
<br />
$\pi = ( 4 * (100 - 4) + (100 - 4) * 4 ) / (100 * (100 - 1)) = 0.0775758 $
<br />
<br />
As far as I can tell, the single-site $\pi$ shouldn't go beyond 0.5 for practical applications with meaningfully large population samples (though you'll note that one heterozygous individual will give a $\pi$ of 1).<br />
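To make the single-site arithmetic concrete, here's a small Python sketch of the same calculation (mirroring the vcftools logic described above, not its actual code; the <span style="font-family: "courier new" , "courier" , monospace;">site_pi</span> helper is my own):

```python
def site_pi(ac, an):
    """Per-site nucleotide diversity for a biallelic site, from the VCF
    annotations AC (alternate allele count) and AN (total allele number)."""
    mismatches = 2 * ac * (an - ac)  # ordered allele pairs that differ
    pairs = an * (an - 1)            # all ordered pairs of alleles
    return mismatches / pairs

print(site_pi(4, 100))  # the example variant above: ~0.0775758
print(site_pi(1, 2))    # one heterozygous individual gives pi = 1.0
```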
<br />
Okay, so what about the <span style="font-family: "courier new" , "courier" , monospace;">--window-pi</span> option? The function of interest in <a href="https://vcftools.github.io/man_latest.html" target="_blank">vcftools</a> is <span style="font-family: "courier new" , "courier" , monospace;">output_windowed_nucleotide_diversity()</span>, and it's basically the same thing as the single site calculation except it considers the monomorphic sites in the window as well:<br />
<br />
<pre class="brush: cpp">N_pairs = bins[CHROM][s][N_variant_site_pairs] + (N_monomorphic_sites * N_comparisons);
pi = bins[CHROM][s][N_mismatches] / double(N_pairs);
</pre>
<br />
So, suppose we have the window 201 - 300 and the variants:
<br />
<pre class="brush: cpp">chromosome_1 275 . C T 7509.57 PASS AC=20;AF=0.196;AN=102;DP=968;FS=0.000;GQ_MEAN=50.00;GQ_STDDEV=44.84;InbreedingCoeff=0.9368;MQ=60.00;MQ0=0;NCC=3;QD=31.29;SOR=1.456;VQSLOD=12.40;culprit=FS
chromosome_1 291 . T A 1247.11 PASS AC=4;AF=0.040;AN=100;DP=945;FS=0.000;GQ_MEAN=45.96;GQ_STDDEV=32.66;InbreedingCoeff=0.8539;MQ=60.00;MQ0=0;NCC=4;QD=29.00;SOR=1.455;VQSLOD=12.26;culprit=FS
</pre>
<br />
This time we also consider the monomorphic sites in the calculation (though it doesn't seem to gracefully handle an absence of read data), which increases the number of pairs considered. The number of pairwise comparisons is now the sum over variant sites of AN * (AN - 1), plus the number of monomorphic sites (2 variants, so 98 sites in the 100 bp window were non-variant) times the number of pairwise chromosome comparisons (here, 52 diploid samples, so 104 * 103). So we have:<br />
<br />
$nPairs = ( 102 * (102 - 1) + 100 * (100 - 1) ) + (100 - 2) * (104 * (104 - 1)) = 1069978 $<br />
<br />
And the number of mismatches is the same as before except summed for each variant (i.e. AC * (AN - AC) + (AN-AC) * AC):<br />
<br />
$nMismatches = (20 * (102 - 20) + (102 -20) * 20) + (4 * (100 - 4) + (100 - 4) * 4) = 4048 $<br />
<br />
So $\pi$ for this 100 bp window is:<br />
<br />
$\pi = 4048 / 1069978 = 0.0037832554 $<br />
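The windowed version can be sketched in Python as well; again, this mirrors the arithmetic rather than the vcftools implementation, and the <span style="font-family: "courier new" , "courier" , monospace;">window_pi</span> helper and its argument names are my own. It assumes the monomorphic sites were observed in all sampled chromosomes (vcftools actually uses per-site coverage):

```python
def window_pi(variants, window_size, num_chromosomes):
    """Windowed nucleotide diversity from a list of (AC, AN) pairs
    for the variant sites that fall in the window."""
    mismatches = sum(2 * ac * (an - ac) for ac, an in variants)
    variant_pairs = sum(an * (an - 1) for _, an in variants)
    # Monomorphic sites contribute pairs but no mismatches:
    monomorphic_sites = window_size - len(variants)
    pairs = variant_pairs + monomorphic_sites * num_chromosomes * (num_chromosomes - 1)
    return mismatches / pairs

# The two variants in the 100 bp window above, with 52 diploid samples:
print(window_pi([(20, 102), (4, 100)], 100, 104))  # ~0.0037833
```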
<br />
Okay, so as far as I can tell, $\pi$ is proportional to how many variants are present (for the window-based calculation) and increases as minor allele frequency grows (for both). That seems like an appropriate metric to describe how much nucleotide diversity is present at a given locus in a given population.<br />
<br />
<b>Postmortem: Chillennium 2016 48-Hour Game Jam - Tardigrade's Dangerous Day</b> (2016-10-01)<br />
<br />
The Llama and I participated in the <a href="http://www.chillennium.com/" target="_blank">Chillennium 2016 Game Jam</a> a couple of weekends ago and made our first video game: Tardigrade's Dangerous Day. The game jam was a 48-hour competition to make a game with the theme "foofaraw" (a big fuss over a minor matter). If you're interested in the technical details, read on; otherwise, you can check it out on <a href="https://frogee.itch.io/tardigrades-dangerous-day" target="_blank">itch.io here</a> and play it on <a href="https://violet-crow.hyperdev.space/" target="_blank">HyperDev here</a>; please note that the audio only seems to work in Google Chrome.<br />
<br />
Since <a href="https://en.wikipedia.org/wiki/Tardigrade" target="_blank">tardigrades</a> had been in the <a href="http://www.nature.com/news/tardigrade-protein-helps-human-dna-withstand-radiation-1.20648" target="_blank">science news recently</a> and tardigrades are all around pretty bonkers, we decided to make the game about a (not to scale) tardigrade.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-HSs1ge78Rss/V-_Yp0g4AcI/AAAAAAAAA98/D7Fzrnz7MTMSlvpHYcVxIHcqoI8C2F4YQCLcB/s1600/TDD3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="238" src="https://4.bp.blogspot.com/-HSs1ge78Rss/V-_Yp0g4AcI/AAAAAAAAA98/D7Fzrnz7MTMSlvpHYcVxIHcqoI8C2F4YQCLcB/s320/TDD3.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">Our stalwart tardigrade protagonist</span></td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
We wanted the game to be browser-based so that it would be portable and let users try the game quickly. Writing JavaScript directly, or using something like <a href="https://en.wikipedia.org/wiki/Emscripten" target="_blank">Emscripten</a> to compile C++ to JavaScript, are the only good options I know of for writing browser-based games. While we aren't that familiar with JavaScript, we chose to write the game in it because we didn't want to deal with the additional complexity of Emscripten.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
Our experience with JavaScript is limited, but it does seem to be a powerful language. Projects like <a href="https://nodejs.org/en/">node.js</a> let you write both your server-side and client-side code in JavaScript, so one language can be used for the full stack. While we didn't have much (or anything, really) in the way of server-side code, node.js was used server-side to serve up the game page.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
We used <a href="https://hyperdev.com/about/" target="_blank">HyperDev</a> as a collaborative IDE for web application development. It updates your application in real time as you write code (and also uses node.js for serving), and it functions mostly like Google Docs in terms of multiple people being able to concurrently write in a document. HyperDev is still in beta, but it worked well for the most part (though it didn't play nicely with an intermittent internet connection). <br /><div class="separator" style="clear: both; text-align: left;">
</div>
<div class="separator" style="clear: both; text-align: left;">
With respect to game engines, there are a handful of good JavaScript engines out there, and we chose <a href="http://phaser.io/" target="_blank">Phaser</a>. It handles asset loading, display, and user input; has a basic physics engine; and offers good tutorials and an active community. Overall, it worked very well. It doesn't seem to have dedicated functionality for multiplayer games that would run the physics simulation server-side and update the clients, but everything in our game was single-player and client-side, so that wasn't a problem. The only issue we ran into was that the audio only seemed to work in Google Chrome; Mozilla Firefox and Internet Explorer didn't seem to want to play the .ogg audio files, but I'm not sure if that's a Phaser issue or something else. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-VomunMNiycs/V-_YrzCKIVI/AAAAAAAAA-A/5BfGkIf0M5kaQzXolsV5f2x_YSu58QHOQCLcB/s1600/TDD2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="238" src="https://1.bp.blogspot.com/-VomunMNiycs/V-_YrzCKIVI/AAAAAAAAA-A/5BfGkIf0M5kaQzXolsV5f2x_YSu58QHOQCLcB/s320/TDD2.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">UV rays are no joke, safety crab</span></td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: left;">
The artwork was done in <a href="https://inkscape.org/en/" target="_blank">Inkscape</a>, entirely by the Llama. Using SVG assets was great for size-scaling purposes, and I much preferred them to pixel-based sprites from both an artistic and a technical perspective. The Llama's artistic ability is top-notch, and the artwork is definitely the best thing about the game.</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The music was recorded using <a href="http://www.audacityteam.org/" target="_blank">Audacity</a> by setting up a microphone in front of a speaker hooked up to my electric guitar. This wasn't ideal, and the sound quality suffered (it definitely did not capture the full awesome of my sick guitar skillzzzz). Buying a cable that connects the guitar straight to the computer for recording would probably have been much better. </div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://2.bp.blogspot.com/-91DxX2E2W-s/V-_Yt9aH8AI/AAAAAAAAA-E/_NceCeQuz2wca8rz3kLezqoZWRcQRWg6QCEw/s1600/TDD1.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="239" src="https://2.bp.blogspot.com/-91DxX2E2W-s/V-_Yt9aH8AI/AAAAAAAAA-E/_NceCeQuz2wca8rz3kLezqoZWRcQRWg6QCEw/s320/TDD1.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">Get ready for 60 seconds of fun and laughter</span></td></tr>
</tbody></table>
The game only lasts about 60 seconds, but overall we're happy with how it turned out. We went into the jam with a simple goal since we had never made a game before: make a program that draws assets on the screen and responds to user input. We succeeded in meeting that goal, and we made something that gives us a laugh when we play it.<br />
<br />
Now for what worked and what didn't.<br />
<br />
<b>What worked</b><br />
<ul>
<li><b>Free software and hosting platforms</b> - <a href="https://hyperdev.com/about/" target="_blank">HyperDev</a> as a web app development platform, <a href="http://phaser.io/" target="_blank">Phaser</a> for the game engine, <a href="https://inkscape.org/en/" target="_blank">Inkscape</a> for art assets, <a href="http://www.audacityteam.org/" target="_blank">Audacity</a> for music assets, and <a href="https://itch.io/" target="_blank">itch.io</a> for hosting; they were all great. I saw a lot of folks using Unity and Unreal Engine, but I'm quite happy with how our toolchain worked.</li>
<li><b>The Llama's artwork</b> - If we fail to make it as scientists, I'm pretty sure the Llama could make it as a graphic designer or artist.</li>
<li><b>Chillennium, its organizers, and its volunteers</b> - They put together a great game jam with good food and attentive staff, and they stayed on schedule.</li>
<li><b>Not gorging on candy and Red Bull, and getting some sleep </b>- I'm pretty sure there were enough chips, candies, Slim Jims, sodas, and Red Bulls around that we could have subsisted exclusively on them, and in my initial excitement at their availability I almost did. Fortunately, better judgement won out, and we didn't destroy ourselves. We probably only spent a total of 18 hours working (so about 36 person-hours); we stuck to a regular sleep schedule and ran regular weekend errands like grocery shopping. Maybe not putting in more hours contributed to not winning any awards, but I'm happy that we didn't have to spend any time "recovering" from the jam.</li>
</ul>
<br />
<b>What didn't:</b><br />
<ul>
<li><b>My limited understanding of JavaScript </b>- I think this hurt production the most. I spent valuable time trying to figure out the scope of variables in callbacks, and I still don't really understand it. I also didn't really grok how to instantiate member variables in a class, or even define a class; it seemed like you could just write <span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">this.memberVariable</span></span> and poof, now your object has <span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">memberVariable</span></span>. I need to spend some more time learning the basics since JavaScript is definitely not like the C family of languages.<span style="font-family: inherit;"><span style="font-family: inherit;"></span> </span></li>
<li><b>Cross-Origin Resource Sharing </b>- I spent a silly amount of time trying to figure out how to deal with a Cross Origin error: "Cross-origin image load denied by Cross-Origin Resource Sharing policy". Since the assets were loaded from HyperDev and I didn't control the server with the assets, I couldn't enable CORS on it to serve up the images. Setting the Phaser loader to anonymous by <span style="font-size: x-small;"><span style="font-family: "Courier New",Courier,monospace;">this.load.crossOrigin = "anonymous";</span> </span><span style="font-family: inherit;">worked, but I still need to learn more about what CORS is.</span></li>
<li><span style="font-family: inherit;"><b>Audio works only in Google Chrome </b>- Still haven't gotten to the bottom of this one. Other browsers show the audio icon as though something should be playing, and I'm using .ogg audio files that should be compatible, but no sound comes out. </span></li>
<li><span style="font-family: inherit;"><b>Too much text </b>- We watched a few of our labmates play our game, and it was evident that we did not consider our audience. As casual gamers ourselves, we now realize that we don't often read text in games. However, we lacked the time (or maybe the will) to create more graphical content to tell our story; instead we just wrote it out in text. Our players clicked right through it without reading, and then were confused later. We needed to make the whole experience more intuitive with less text. </span></li>
</ul>
<br />
<b>Postmortem: 3D sorghum reconstructions from depth images identify QTL regulating shoot architecture</b> (2016-08-22)<br />
<br />
One of the approaches that researchers are using to improve plant productivity is high-throughput plant phenotyping. The objective is to use computing and robotics to increase the number of plants that can be evaluated for traits of interest. As the number of plants that can be screened increases, the rate at which the genetic basis of traits can be identified increases, and we can make better selection decisions for breeding. The way I see it, the future of plant phenotyping will be large numbers of cheap, autonomous robots that crawl or fly over a field or greenhouse, gather data, and send it off for genetic analysis, breeding decisions, and precision agriculture; I think the major hurdles are a lack of software, both for robot autonomy and for data analysis, and the energy density of batteries.<br />
<br />
The project that ultimately led to our recent manuscript "<a href="http://www.plantphysiol.org/content/early/2016/08/16/pp.16.00948.full.pdf+html" target="_blank">3D sorghum reconstructions from depth images identify QTL regulating shoot architecture</a>" originated sometime around October 2014. Given our results with <a href="http://g3journal.org/content/5/4/655.full" target="_blank">the RIG manuscript</a>, I agreed with the general sentiment in the field that genotyping was no longer the bottleneck for crop improvement; phenotyping was the limitation. While industry platforms like <a href="http://www.lemnatec.com/" target="_blank">Lemnatec</a> and <a href="http://www.cropdesign.com/tech_traitmill.php" target="_blank">Traitmill</a> and public platforms like PlantScan had been out for a little while, we were looking for something that required less static infrastructure, especially since one of our major interests is bioenergy sorghum; 3+ meter tall bioenergy sorghum plants require more flexible acquisition platforms.<br />
<br />
Sometime around Fall, I tend to get on an annual game development kick (notice that last Fall, in November 2015, there are posts on procedural content generation, and this year the Llama and I are registered for the September <a href="http://chillennium.com/" target="_blank">Chillennium game jam at Texas A&M</a>). In Fall 2014 I was on that same kick, and I was interested in the idea of the Microsoft Kinect (version 2, to be specific) as an input device. Since the Kinect (and related cameras that use structured light or time-of-flight) directly samples the geometry of a scene at high rates and was relatively cheap, it seemed like a good fit for building a plant morphology phenotyping platform. Even if it didn't work, we'd only put the boss out ~$150, and we'd still learn a little about hardware interfaces and image processing.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://4.bp.blogspot.com/-rR0yzh0GHMA/V7M0dt2X2BI/AAAAAAAAA88/Qcs7A9SeVlU2mBeqWljL0rFoz22VgCF3QCLcB/s1600/3D_sorghum.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="400" src="https://4.bp.blogspot.com/-rR0yzh0GHMA/V7M0dt2X2BI/AAAAAAAAA88/Qcs7A9SeVlU2mBeqWljL0rFoz22VgCF3QCLcB/s400/3D_sorghum.png" width="308" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: small;">3D sorghum - we have more than a thousand of these things now.</span></td></tr>
</tbody></table>
As it turns out, it worked pretty well, and after about 6 months of tooling around with some prototypes, we had a workflow in place and had enough working test cases to merit a larger scale experiment. This was around the time of the ARPA-E TERRA project proposals, so plant phenotyping was starting to get more attention in the United States, and our boss was onboard.<br />
<br />
We grew the plants, imaged them loads of times, processed the data, and wrote the manuscript. The review process was productive; the editor and reviewers provided very helpful and critical feedback that improved the manuscript substantially.<br />
<br />
The publication of this manuscript marks the completion of much of what the Llama and I had hoped to accomplish in our graduate studies. When we entered the TAMU Genetics program 4 years ago, we wanted to become proficient in interdisciplinary research and use mathematics and computing to answer questions in genetics. We now have publications that use modeling, high-performance computing, and image processing to understand the genetic basis of quantitative traits and plant physiology, we can program in multiple languages, and we can comfortably talk shop with a range of disciplines. We're happy with how things turned out in graduate school, and we're grateful for the training opportunities and interactions that we've had while we've been here. Over the next year we'll try to polish off some additional manuscripts in the queue and find a job in research.<br />
<br />
Now for what worked and what didn't.<br />
<br />
<b>What worked:</b><br />
<ul>
<li><b>Open source libraries</b> - Particularly <a href="http://rapidxml.sourceforge.net/" target="_blank">RapidXML</a>, <a href="http://opencv.org/" target="_blank">OpenCV</a>, and <a href="http://pointclouds.org/" target="_blank">PCL</a>. Open source is the best. Being able to browse and step through the source code was priceless for learning and debugging. In this spirit, we put all of our code on GitHub.</li>
<li><b><a href="https://github.com/Frogee/SorghumReconstructionAndPhenotyping" target="_blank">GitHub</a> for code hosting </b>- Free, simple, publicly accessible, and has a good user interface. If you write non-trivial code for an academic publication, please just put it online somewhere so that folks like me can look at it. Ideas are cheap; implementations are valuable. </li>
<li><b><a href="http://datadryad.org/" target="_blank">Dryad</a> for data hosting </b>- Acting as a data repository for research data seems to be an impossible and thankless task, particularly as data sets are exploding in size. Dryad was professional and efficient.</li>
</ul>
<b>What didn't: </b><br />
<ul>
<li><b>Getting bogged down in the details</b> - The first draft of the manuscript that we sent for review was a bit heavy-handed with extraneous detail. We didn't develop any elegant algorithms, but that didn't stop me from discussing our methods excessively and using far too much jargon, both of which got in the way of the overall message of our manuscript. Pruning this out and moving some of it to the supplement benefited the manuscript.</li>
<li><b>Not merging RGB and depth and not using templating properly</b> - Originally, I was only interested in the depth images since I figured the RGB data wouldn't provide much useful information (the plant is rather uniformly green after all, and variation in environmental lighting was a complexity I didn't want to handle). As a result, most of the code base ended up designed specifically for clouds with point types of pcl::PointXYZ rather than written against a template parameter. If I were to do it again, I would have gone ahead and mapped the RGB to the depth data for figure generation and used templates to make the code base more generic (the same way PCL uses templates).</li>
</ul>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-10702271387142674522016-07-15T13:49:00.002-07:002016-07-15T13:50:20.823-07:00Our first bioRxiv submissionOur recent manuscript on image-based phenotyping in sorghum has been in the review process for a while now, so we decided to post it on the <a href="http://biorxiv.org/about-biorxiv" target="_blank">bioRxiv preprint server</a> (after we received permission from the journal peer review manager and editor) to get it out in the wild prior to publication. The premise of the bioRxiv (and its inspiration in the physical sciences, the arXiv) is to make scientific findings immediately available, and <a href="http://biorxiv.org/content/early/2016/03/25/045799.full.pdf+html" target="_blank">Needhi Bhalla wrote a useful primer on the pros and cons of submitting your work to a preprint server prior to publication</a>.<br />
<br />
I'm not sure if we'll always choose to post a preprint, but I think it makes sense this time since (1) we know the manuscript has already been shared with others outside of our intended circle of collaborators, and (2) this contribution is more timely than our past findings given the goals of the <a href="http://arpa-e.energy.gov/?q=arpa-e-programs/terra" target="_blank">ARPA-E TERRA initiative</a>. Even though preprints don't hold the same prestige as peer-reviewed journal articles, there will at least be timestamped documentation and source code that can be used by others and can be shown to potential future employers.<br />
<br />
Without further ado, the following text links to a preprint documenting an image-based phenotyping platform we developed and quantitative trait loci that we identified using it: <a href="http://biorxiv.org/content/early/2016/07/15/062174" target="_blank">"3D sorghum reconstructions from depth images enable identification of quantitative trait loci regulating shoot architecture" by McCormick, Truong, and Mullet</a>.<br />
<br />
The software we developed can be found on <a href="https://github.com/Frogee/SorghumReconstructionAndPhenotyping" target="_blank">GitHub by following this link</a>. Many previous posts on this blog were written in the context of developing that code.<br />
<br />
And most importantly, here are some 3D sorghum plants.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-eWhyPx8glBc/V4kRtbv0DyI/AAAAAAAAA7M/6H-1F8HRCloDDvmZ059fWtVRh-DbDz22wCLcB/s1600/meshCollage.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="432" src="https://4.bp.blogspot.com/-eWhyPx8glBc/V4kRtbv0DyI/AAAAAAAAA7M/6H-1F8HRCloDDvmZ059fWtVRh-DbDz22wCLcB/s640/meshCollage.png" width="640" /></a></div>
<br />
<br />Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-82958709162101777492016-06-09T09:08:00.000-07:002016-06-09T11:12:33.625-07:00Robot Hexapod (RHex) Lately we've been musing about what the right strategy will be for agricultural phenotyping solutions, and we're currently thinking about the advantages of large numbers of cheap, small, legged, autonomous robots that can crawl the fields and image plants. These are a few ideas that we want to come back to later if we ever get around to prototyping something.<br />
<br />
Boston Dynamics built a rugged version of the robot hexapod archetype (RHex).<br />
<br />
<div style="text-align: center;">
<iframe allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/ISznqY3kESI" width="560"></iframe>
</div>
<br />
The Koditschek lab works extensively on the RHex (I believe RHex was born of a DARPA project that Koditschek was on):<br />
<a href="http://kodlab.seas.upenn.edu/XRHex/Features" target="_blank">Koditschek lab site link</a> <br />
<br />
And a couple of relevant manuscripts from Koditschek:<br />
<a href="http://ijr.sagepub.com/content/20/7/616.short" target="_blank">Saranli, Buehler, Koditschek (2001) RHex: A simple and highly mobile hexapod robot</a><br />
<a href="http://kodlab.seas.upenn.edu/uploads/Kod/Weingarten03.pdf" target="_blank">Weingarten, Buehler, Groff, Koditschek (2003) Gait generation and optimization for legged robots</a><br />
<a href="http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1354436" target="_blank">Haynes, Pusey, Knopf, Johnson, Koditschek (2012) Laboratory on legs: an architecture for adjustable morphology with legged robots</a><br />
<br />
This is a student project for building an RHex that uses a Kinect for navigation (RHex Box); it has a bit of content for a DIY RHex:<br />
<a href="http://rhexbox.blogspot.com/" target="_blank">Project website link</a><br />
<br />
<div style="text-align: center;">
<iframe allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/kOd_kaAUjr4" width="560"></iframe>
</div>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-28753667818518959912016-04-20T12:06:00.001-07:002016-04-20T12:10:41.366-07:00Getting started with Intel's Threading Building Blocks: installing and compiling an example that uses a std::vector and the Code::Blocks IDEI hit a problem where it looks like I'll have to parallelize execution. After reading a bit about it, I opted to use <a href="https://www.threadingbuildingblocks.org/" target="_blank">Intel's Threading Building Blocks</a> over OpenMP. The consensus seemed to be that Intel TBB is better suited for slotting into C++ projects. TBB also has a bunch of birds on its webpage, which is a plus in my book.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://1.bp.blogspot.com/-ARr3Jj-_X34/VxfS0PxESHI/AAAAAAAAAyI/liSS608nf24STQY7f-IRxG-FwwE0A8KBwCLcB/s1600/bird-illustration-tbb.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="113" src="https://1.bp.blogspot.com/-ARr3Jj-_X34/VxfS0PxESHI/AAAAAAAAAyI/liSS608nf24STQY7f-IRxG-FwwE0A8KBwCLcB/s200/bird-illustration-tbb.png" width="200" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The most excellent TBB bird logo.</td></tr>
</tbody></table>
<br />
To download and install, I got the .tgz corresponding to the source code at the download page: <a href="https://www.threadingbuildingblocks.org/download" target="_blank">https://www.threadingbuildingblocks.org/download</a><br />
<br />
After unarchiving and decompressing, I invoked "make" in the directory (this was on an Ubuntu 14.04 OS). This built the debug and release targets in the "build" directory.<br />
<br />
You can also invoke "make" in the "examples" directory; it builds and runs a number of fun examples that use TBB.<br />
<br />
To set that up to compile with the Code::Blocks IDE, I did the following:<br />
<br />
Go to the Project's Build Options -> Search Directories -> Linker and add the debug and release build directories (e.g. tbb44_20160128oss/build/linux_intel64_gcc_cc4.8_libc2.19_kernel3.19.0_debug).<br />
<br />
Go to the Project's Build Options -> Search Directories -> Compiler and add the include directory (e.g. tbb44_20160128oss/include).<br />
<br />
Go to the Project's Build Options -> Linker settings, and, under "Other linker options" add "-ltbb" (don't include the quotes).<br />
<br />
To test the compilation with a simple use case, I modified <a href="https://www.threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_for_func.htm" target="_blank">one of the examples in the TBB parallel_for documentation</a>. Here's the code:<br />
<br />
<pre class="brush: cpp">#include <iostream>
#include <cstdint>
#include <vector>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Example adapted from: https://www.threadingbuildingblocks.org/docs/help/reference/algorithms/parallel_for_func.htm
class AverageWithVectors {
public:
    std::vector<float> *input;
    std::vector<float> *output;
    void operator()( const tbb::blocked_range<int>& range ) const {
        for( int i = range.begin(); i != range.end(); ++i )
            (*output)[i] = ( (*input)[i-1] + (*input)[i] + (*input)[i+1] ) * (1/3.f);
    }
};

int main() {
    std::vector<float> inputVector = {30.0, 70.0, 90.0, 120.0, 133.0, 199.0, 245.0, 266.0, 289.0};
    std::vector<float> outputVector = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    AverageWithVectors avg;
    avg.input = &inputVector;
    avg.output = &outputVector;
    // Average the interior elements (indices 1 through 7); the blocked_range's
    // end is exclusive, so the range is [1, 8).
    tbb::parallel_for( tbb::blocked_range<int>( 1, 8 ), avg );
    for (uint32_t i = 0; i < (*avg.output).size(); i++) {
        std::cout << "output of element " << i << ": " << (*avg.output)[i] << std::endl;
    }
    return(0);
}
</pre>
<br />
The output should be something like:<br />
<span style="font-size: x-small;">output of element 0: 0
</span><br />
<span style="font-size: x-small;">output of element 1: 63.3333
</span><br />
<span style="font-size: x-small;">output of element 2: 93.3333
</span><br />
<span style="font-size: x-small;">output of element 3: 114.333
</span><br />
<span style="font-size: x-small;">output of element 4: 150.667
</span><br />
<span style="font-size: x-small;">output of element 5: 192.333
</span><br />
<span style="font-size: x-small;">output of element 6: 236.667
</span><br />
<span style="font-size: x-small;">output of element 7: 266.667
</span><br />
<span style="font-size: x-small;">output of element 8: 0
</span><br />
<br />
Note that this doesn't actually prove anything is being executed in parallel, only that it compiles and successfully calls tbb::parallel_for().<br />
<br />
Finally, I came across <a href="https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/328652" target="_blank">a helpful post on the Intel forums from a user named Jim Dempsey.</a> Most of it is reproduced below. <br />
<br />
<span style="font-size: x-small;">Generally speaking: </span><br />
<span style="font-size: x-small;">Use for (or for_each) when the amount of computation is small (not worth the overhead to parallelize). </span><br />
<span style="font-size: x-small;">Use parallel_for when you have a large number of objects and each object has equal work of small work. </span><br />
<span style="font-size: x-small;">Use parallel_for with appropriate grain size when you have a large number of objects and each object has un-equal work of small work. </span><br />
<span style="font-size: x-small;">Use parallel_for_each when each objects work is relatively large and number of objects is few and each object has un-equal work. </span><br />
<span style="font-size: x-small;">When number of objects is very small and known, consider using switch(nObjects) and cases using parallel_invoke in each case that you wish to impliment and parallel_for_each for the default case </span>Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-15893562023334529082016-04-09T12:20:00.000-07:002016-04-09T15:47:24.589-07:00Access point correspondences from registration method in Point Cloud Library (a.k.a. access protected member in library class without modifying library source).This post was inspired by an answer by user D.J.Duff on StackOverflow <a href="http://stackoverflow.com/questions/24237760/get-inlier-points-not-rejected-by-correspondence-distance-in-pcl-registration-me/36521355#36521355" target="_blank">at this post</a><br />
<br />
The goal was to get the point correspondences between two point clouds that were identified during registration using the Point Cloud Library (PCL). The correspondences appear to be saved, but as a protected member of the Registration class. I wasn't interested in modifying the PCL source to add a getter function, so I took the route of exposing the member via inheritance. This should work with any registration method that inherits from the Registration class. Violating encapsulation this way is admittedly poor form, but I'm not sure what other options are available short of modifying the library source.
<br />
<pre class="brush: cpp">/** \brief This is a mock class with the sole purpose of accessing a protected member of a class it inherits from.
 *
 * Some of the relevant documentation for correspondences: http://docs.pointclouds.org/trunk/correspondence_8h_source.html#l00092
 */
template <typename PointSource, typename PointTarget, typename Scalar = float>
class IterativeClosestPointNonLinear_Exposed : public pcl::IterativeClosestPointNonLinear<PointSource, PointTarget, Scalar> {
public:
    pcl::CorrespondencesPtr getCorrespondencesPtr() {
        for (uint32_t i = 0; i < this->correspondences_->size(); i++) {
            pcl::Correspondence currentCorrespondence = (*this->correspondences_)[i];
            std::cout << "Index of the source point: " << currentCorrespondence.index_query << std::endl;
            std::cout << "Index of the matching target point: " << currentCorrespondence.index_match << std::endl;
            std::cout << "Distance between the corresponding points: " << currentCorrespondence.distance << std::endl;
            std::cout << "Weight of the confidence in the correspondence: " << currentCorrespondence.weight << std::endl;
        }
        return this->correspondences_;
    }
};
</pre>
<br />
And a bit of additional content from my answer on StackOverflow:<br />
<br />
Here are some lines drawn between corresponding points between two point clouds obtained in this manner. Notably, the pcl::Correspondence.distance value between two points doesn't always seem to be strictly less than the value of the registration method's maxCorrespondenceDistance, so you may have to check the distances to make sure you're getting only the correspondences you want.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-gG4YIIgqdcs/VwmGbAff_wI/AAAAAAAAAxo/g6Djw0HQiH81gTrvfUd3oVhOampKsHM1w/s1600/PointCorrespondences.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="265" src="https://1.bp.blogspot.com/-gG4YIIgqdcs/VwmGbAff_wI/AAAAAAAAAxo/g6Djw0HQiH81gTrvfUd3oVhOampKsHM1w/s320/PointCorrespondences.png" width="320" /></a></div>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-51388031594384551452016-03-15T13:20:00.002-07:002016-03-16T07:39:33.027-07:00Resumes and C.V.s for graduating Ph.D.sThe Llama and I are starting to gear up for a transition into the job market since graduation is on the (far) distant horizon. We're planning to apply for jobs in both the public and the private sector since we've heard that the job market is fairly bleak. By way of a summary of the advice for Ph.D. graduates to find a job, we've heard:<br />
<ul>
<li>Maintain a web presence; employers should be able to Google you to learn more about you.</li>
<ul>
<li>At minimum, maintain a Google Scholar and LinkedIn profile.</li>
</ul>
<li>Prepare multiple job application documents targeted towards different domains; private industry generally isn't interested in the same factors as academia. <a href="https://www.themuse.com/advice/cv-vs-resumehere-are-the-differences" target="_blank">This description of the difference between CVs and resumes seems to agree with other sources I've read.</a></li>
<ul>
<li>Industry resumes should be short (i.e. 1 page) and focus on your skills and what you've delivered with those skills.</li>
<li>Academic C.V.s can be longer and should focus on your publication record. </li>
</ul>
<li>You have an incredibly small window of time during which you are under consideration, and you are competing against thousands of other applicants. You need to convince the reader within 5 to 7 seconds that you are worth further consideration. </li>
</ul>
The advice regarding content differences between the academic C.V. and the industry resume confuses me a bit. <a href="http://cheekyscientist.com/industry-resume/" target="_blank">Some advice recommends excluding publications.</a> To me, peer-reviewed publications are the "deliverable" in academia; they are demonstrative of both your skill sets and ability to deliver. However, if the advice is recommending that the line<br />
<ul>
<li>Developed and implemented a genome assembly algorithm and used it to assemble 1000 human genomes <i>de novo </i>in 50% less time than the field's standard.</li>
</ul>
is better than the line<br />
<ul>
<li> Barkley, C., Bird, L., Jordan, M. (2015) Sequence assembly using de Bruijn graphs and colored dunking. Bioinfoballers 3:182-198.</li>
</ul>
then I agree with the premise.<br />
Here are a few $\LaTeX$ templates that we think look pretty nice:<br />
<br />
The pedigree on this template seems to be Matthew Boedicker begets David J. Grant begets Todd C. Miller begets Derek R. Hildreth; I found it both at the David Grant and the Derek Hildreth generation: <br />
<a href="http://www.davidgrant.ca/latex_resume_template" target="_blank">David Grant's template</a> and <a href="http://www.thelinuxdaily.com/2008/10/latex-resume-examples/" target="_blank">Derek Hildreth's template</a><br />
<br />
Another nice one originates from a $\LaTeX$ <a href="http://www.biostat.umn.edu/~brad/8400/HQuick_cv.tex" target="_blank">file posted by Bradley P. Carlin</a>, which itself was adapted from a template by <a href="http://www.tedpavlic.com/post_resume_cv_latex_example.php" target="_blank">Theodore P. Pavlic</a> (Dr. Pavlic's original can be found <a href="http://links.tedpavlic.com/tex/tpavlic_cv_faculty.tex" target="_blank">here</a>).<br />
<br />
And another one: <a href="http://www.latextemplates.com/template/moderncv-cv-and-cover-letter" target="_blank">Modern CV at latextemplates.com</a>. Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-47980176371742998062015-11-04T06:37:00.002-08:002015-11-04T06:41:15.282-08:00Procedural content generation: map construction via blind agent (a.k.a. building a Rogue-like map)I guess this is a continuation of <a href="http://codextechnicanum.blogspot.com/2015/11/procedural-content-generation-map.html" target="_blank">the previous post on procedural content generation</a>. As before, I'm just following along and implementing the pseudocode from <a href="http://pcgbook.com/" target="_blank">the Procedural Content in Games book</a> (in chapter 3; <a href="http://pcgbook.com/wp-content/uploads/chapter03.pdf" target="_blank">link to the .pdf</a>).<br />
<br />
This time we're using an agent-based approach as opposed to the binary space partitioning (BSP) used last time. This implementation is a blind digger; it seems to generate more organic layouts, but it also more frequently generates nonsensical layouts relative to the BSP method. Maybe that would be good for a cave map or something of that sort.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-AlPU-2_bEiE/VjU2ubKv6gI/AAAAAAAAAPU/pXrneS_jgBM/s1600/BlindAgent_map2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="148" src="http://1.bp.blogspot.com/-AlPU-2_bEiE/VjU2ubKv6gI/AAAAAAAAAPU/pXrneS_jgBM/s200/BlindAgent_map2.png" width="200" /></a><a href="http://1.bp.blogspot.com/-77gJ4TALfew/VjU2uSfnmRI/AAAAAAAAAPQ/CbBLLbsv2zk/s1600/BlindAgent_map3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="148" src="http://1.bp.blogspot.com/-77gJ4TALfew/VjU2uSfnmRI/AAAAAAAAAPQ/CbBLLbsv2zk/s200/BlindAgent_map3.png" width="200" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-c_ecEJ-F-ZU/VjU2uatgJ9I/AAAAAAAAAPY/GJSQ4yl2li8/s1600/BlindAgent_map1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="296" src="http://4.bp.blogspot.com/-c_ecEJ-F-ZU/VjU2uatgJ9I/AAAAAAAAAPY/GJSQ4yl2li8/s400/BlindAgent_map1.png" width="400" /></a></div>
<br />
<br />
<br />
I used Python again. Not having to implement a tree data structure this time around made it much faster to write. The function that controls the overall flow is below, and you can find <a href="https://github.com/Frogee/proceduralGenerationPrototyping/blob/master/ProcGenExample_AgentDigger.py" target="_blank">the entire script on my GitHub here</a>. I'm pretty sure there are some bugs crawling around still, but it seems to work.<br />
<br />
<br />
<pre class="brush: python">def generateAgentDiggerMap():
    mapHeight = 50
    mapWidth = 50
    digger = BlindDigger()
    diggingMap = DiggingMap(mapWidth, mapHeight)
    digger.initializeDig(diggingMap)
    while (diggingMap.percentAreaDug < 40):
        digger.performDigIteration(diggingMap)
    diggingMap.plotDiggingMap()
</pre>
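As a self-contained illustration of the blind-digger idea (the BlindDigger and DiggingMap classes live in the linked script; the grid size, the tile symbols, and the border clamping below are my own assumptions for the sketch, not the script's actual behavior):

```python
import random

def dig_map(width=20, height=20, target_percent=40, seed=1):
    # A blind digger: start in the middle and take random steps, carving out
    # floor tiles ('.') from solid rock ('#') until enough of the map is dug.
    rng = random.Random(seed)
    grid = [["#"] * width for _ in range(height)]
    x, y = width // 2, height // 2
    dug = set()
    while 100 * len(dug) / (width * height) < target_percent:
        grid[y][x] = "."
        dug.add((x, y))
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        # Step blindly; clamp so the digger never breaches the map border.
        x = min(max(x + dx, 1), width - 2)
        y = min(max(y + dy, 1), height - 2)
    return grid

grid = dig_map()
print("\n".join("".join(row) for row in grid))
```

Even this toy version shows the trade-off described above: the dug regions look organic, but nothing stops the walk from producing a nonsensical blob.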
<br />
Both the Llama and I passed prelims! <a href="http://codextechnicanum.blogspot.com/2015/11/procedural-content-generation-map.html" target="_blank">Click here for part 1</a>.Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-28843768620992616812015-11-03T17:17:00.002-08:002015-11-04T06:38:42.630-08:00Procedural content generation: map construction via binary space partitioning (a.k.a. building a Rogue-like map)There are times that I question my market value and future job prospects as a life science Ph.D. (or whether I'll be able to finish the Ph.D. at all). When the Llama feels like this, she studies for Actuarial exams. I pursue something equally lucrative.<br />
<br />
Video games.<br />
<br />
These past few weeks have been spent preparing for our preliminary exams. The concomitant feelings of inadequacy led me to want to procedurally generate video game content. Seeing as I don't know the first thing about procedural content generation (PCG), I figured I'd make some maps. And make some maps I did.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-Rl15bqSX4T8/VjP4fhbVxaI/AAAAAAAAAOg/B3K-q3ohi10/s1600/PCG_map1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="147" src="http://4.bp.blogspot.com/-Rl15bqSX4T8/VjP4fhbVxaI/AAAAAAAAAOg/B3K-q3ohi10/s200/PCG_map1.png" width="200" /></a><a href="http://3.bp.blogspot.com/-hsK4dZavlAk/VjP4fmyVRLI/AAAAAAAAAOc/5Onu9t0Oxvw/s1600/PCG_map2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="147" src="http://3.bp.blogspot.com/-hsK4dZavlAk/VjP4fmyVRLI/AAAAAAAAAOc/5Onu9t0Oxvw/s200/PCG_map2.png" width="200" /></a></div>
<br />
<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://2.bp.blogspot.com/-zq-UfWKYCxs/VjP4fh_oJTI/AAAAAAAAAOY/mPe8VqEtmSA/s1600/PCG_map3.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="236" src="http://2.bp.blogspot.com/-zq-UfWKYCxs/VjP4fh_oJTI/AAAAAAAAAOY/mPe8VqEtmSA/s320/PCG_map3.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Who wants to adventure through these bad boys!?</td></tr>
</tbody></table>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
The method I used is based on binary space partitioning, and all I really did was implement the algorithm described by <a href="http://pcgbook.com/" target="_blank">this book</a> in chapter 3 (<a href="http://pcgbook.com/wp-content/uploads/chapter03.pdf" target="_blank">link to .pdf</a>).</div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
I implemented it in Python because I figured it'd be faster to prototype in than C++; I'm not sure that was actually the case. I couldn't find a built-in tree data structure, so I ended up implementing one (...and it's a bit of a mess). I also really wanted a static variable to keep track of things during the recursive tree searches, but Python doesn't offer one; I ended up passing lists around instead, which seems odd but works because lists are mutable while booleans and integers are simply rebound inside the callee. Anywho, it works for the most part. The function that controls the overall flow is below, and you can find <a href="https://github.com/Frogee/proceduralGenerationPrototyping/blob/master/ProcGenExample_BSP.py" target="_blank">the entire script on my GitHub here</a>. I'm pretty sure there are still some bugs crawling around.<br />
<br />
<pre class="brush: python">def generateBSPMap():
    # 1: start with the entire area (root node of the BSP tree)
    rootNodeBox = Box((0, 0), 256, 256)  # The dimensions should be evenly divisible by 2
    rootNode = AreaNode("root", defaultdict(AreaNode), rootNodeBox)
    tree = AreaTree(rootNode)
    firstPartitionNames = ("A", "B")
    # 2: divide the area along a horizontal or vertical line
    tree.partitionNode("root", firstPartitionNames)
    currentArea = rootNodeBox.getArea()
    currentPartitionNames = firstPartitionNames
    MAGICMINIMUMAREA = (0.03125) * 256 * 256
    #MAGICMINIMUMAREA = (0.10) * 256 * 256
    while (currentArea > MAGICMINIMUMAREA):
        # 3: select one of the two new partition cells
        chosenIndex = random.choice([0, 1])
        chosenPartition = currentPartitionNames[chosenIndex]
        if (chosenIndex == 0):
            otherPartition = currentPartitionNames[1]
        else:
            otherPartition = currentPartitionNames[0]
        # 4: if this cell is bigger than the minimal acceptable size:
        print("Chosen partition " + chosenPartition + " has node area " + str(tree.getNodeArea(chosenPartition)))
        if (tree.getNodeArea(chosenPartition) > MAGICMINIMUMAREA):
            # 5: go to step 2 (using this cell as the area to be divided)
            newPartitionNames = (chosenPartition + "_0", chosenPartition + "_1")
            tree.partitionNode(chosenPartition, newPartitionNames)
        # 6: select the other partition cell, and go to step 4
        if (tree.getNodeArea(otherPartition) > MAGICMINIMUMAREA):
            newPartitionNames = (otherPartition + "_0", otherPartition + "_1")
            tree.partitionNode(otherPartition, newPartitionNames)
        currentArea = min([tree.getNodeArea(chosenPartition), tree.getNodeArea(otherPartition)])
        partitionNameList = []
        tree.getListOfLeafPairs(partitionNameList)
        currentPartitionNames = random.choice(partitionNameList)
    # 7: for every partition cell:
    # 8: create a room within the cell by randomly
    #    choosing two points (top left and bottom right)
    #    within its boundaries
    li_areasAreConnected = []
    terminationIterator = 0
    while (li_areasAreConnected == [False] or li_areasAreConnected == []):
        tree.resetSubAreas()
        tree.constructSubAreas()
        # 9: starting from the lowest layers, draw corridors to connect
        #    rooms in the nodes of the BSP tree with children of the same
        #    parent
        # 10: repeat 9 until the children of the root node are connected
        li_areasAreConnected = []
        tree.connectSubAreas(li_areasAreConnected)
        terminationIterator += 1
        if (terminationIterator > 50):
            print("Attempted too many iterations. Terminating.")
            print(li_areasAreConnected)
            break
    if (li_areasAreConnected == [True]):
        print(tree)
        tree.showAreaTree()</pre>
<br />
Run that thing and booooooooom.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-kZ9gaCIKjPk/VjP9mhhcvTI/AAAAAAAAAO8/fyx2ksnOl0c/s1600/PCG_map4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="496" src="http://3.bp.blogspot.com/-kZ9gaCIKjPk/VjP9mhhcvTI/AAAAAAAAAO8/fyx2ksnOl0c/s640/PCG_map4.png" width="640" /></a></div>
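The pass-a-list workaround mentioned earlier can be isolated into a tiny example (the dict-based tree below is a stand-in I made up for the sketch, not the AreaTree class from the script). Rebinding an integer inside a recursive call is local to that call's frame, while mutating a list is visible to every frame holding a reference to it:

```python
def count_leaves_rebind(node, counter):
    # Broken: 'counter += 1' rebinds a local name, so the increments
    # performed in recursive calls are lost.
    if not node["children"]:
        counter += 1
    for child in node["children"]:
        count_leaves_rebind(child, counter)
    return counter

def count_leaves_mutate(node, counter):
    # Works: appending mutates the one list object shared with the caller.
    if not node["children"]:
        counter.append(1)
    for child in node["children"]:
        count_leaves_mutate(child, counter)

# A root with one leaf child and one child whose only child is a leaf.
tree = {"children": [{"children": []},
                     {"children": [{"children": []}]}]}
print(count_leaves_rebind(tree, 0))  # 0: the increments were lost
hits = []
count_leaves_mutate(tree, hits)
print(len(hits))  # 2: both leaves counted
```

That's the whole trick: the list plays the role of the mutable "static" flag during the recursive tree walks.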
<br />
<br />
<br />
I also used an agent-based approach <a href="http://codextechnicanum.blogspot.com/2015/11/procedural-content-generation-map_4.html" target="_blank">in part 2 here</a>.Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-49837277999954806422015-07-08T12:50:00.000-07:002015-07-08T13:22:22.973-07:00Rotate 3D vector to same orientation as another 3D vector - Rodrigues' rotation formulaI had a normal that I wanted to orient in a particular way, so I wanted to find the rotation necessary to orient the normal. Help from the Llama and some Googling brought me to <a href="http://math.stackexchange.com/questions/293116/rotating-one-3-vector-to-another?rq=1" target="_blank">this post at Math StackExchange</a> and the <a href="https://en.wikipedia.org/wiki/Rodrigues%27_rotation_formula" target="_blank">Rodrigues' Rotation Formula</a>.<br />
<br />
As for background, I had a point cloud that I wanted to align to the Z axis. Since the pot of the potted plant is a relatively constant feature in these images, I wanted to use the pot to orient the rest of the cloud.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-asqIDQ2NOEA/VZ15O21TlKI/AAAAAAAAAMg/stEElrRhEqU/s1600/OriginalPlant.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-asqIDQ2NOEA/VZ15O21TlKI/AAAAAAAAAMg/stEElrRhEqU/s320/OriginalPlant.png" /></a></div>
<br />
I segmented a circle out to get the border of the pot. Here's the segmented circle, floating out in space.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-mVCkSB0Ghqg/VZ16R38NbWI/AAAAAAAAAMs/c9673IKAZ_k/s1600/CircleInSpace.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="198" src="http://3.bp.blogspot.com/-mVCkSB0Ghqg/VZ16R38NbWI/AAAAAAAAAMs/c9673IKAZ_k/s320/CircleInSpace.png" width="320" /></a></div>
<br />
In the context of <a href="http://math.stackexchange.com/questions/293116/rotating-one-3-vector-to-another?rq=1" target="_blank">the Stack Exchange post</a>, I had a normal corresponding to a point in the center of the circle that I wanted to rotate to orient with the Z-axis (0, 0, 1). Between <a href="http://math.stackexchange.com/questions/293116/rotating-one-3-vector-to-another?rq=1" target="_blank">the Stack Exchange post</a>, <a href="https://en.wikipedia.org/wiki/Rodrigues%27_rotation_formula" target="_blank">Wikipedia</a>, and the Llama, I was able to write an implementation without actually knowing anything about linear algebra (see below). Everything appears to work just fine (the cyan circle is after transformation).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-Nh3oEov9vLo/VZ179P_ZeAI/AAAAAAAAAM8/KQ60_1KZC4Y/s1600/AlignedCircle.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="318" src="http://3.bp.blogspot.com/-Nh3oEov9vLo/VZ179P_ZeAI/AAAAAAAAAM8/KQ60_1KZC4Y/s320/AlignedCircle.png" width="320" /></a></div>
<br />
<br />
This implementation uses the Eigen library and the PCL library. The nomenclature is similar to <a href="http://math.stackexchange.com/questions/293116/rotating-one-3-vector-to-another?rq=1" target="_blank">the Stack Exchange post</a>.<br />
<br />
<pre class="brush: cpp">// This is an implementation of Rodrigues' Rotation Formula
// https://en.wikipedia.org/wiki/Rodrigues'_rotation_formula
// Following http://math.stackexchange.com/questions/293116/rotating-one-3-vector-to-another?rq=1
// Problem: Given two 3-vectors, A and B, find the rotation of A so that its orientation matches B.
// There are edge cases where this implementation will fail, notably if the norm of the cross product = 0 (i.e., vector_A and vector_B are parallel or antiparallel).
// Step 1: Find axis (X)
Eigen::Vector3f crossProduct = vector_A.cross(vector_B);
float crossProductNorm = crossProduct.norm();
Eigen::Vector3f vector_X = (crossProduct / crossProductNorm);
// Step 2: Find angle (theta)
float dotProduct = vector_A.dot(vector_B);
float norm_A = vector_A.norm();
float norm_B = vector_B.norm();
float productOfNorms = norm_A * norm_B;
float cosineOfAngle = (dotProduct / productOfNorms);
float thetaAngleRad = acos(cosineOfAngle);  // Consider clamping cosineOfAngle to [-1, 1] to guard acos against floating point overshoot.
// Step 3: Construct A, the skew-symmetric matrix corresponding to X
Eigen::Matrix3f matrix_A = Eigen::Matrix3f::Identity();
matrix_A(0,0) = 0.0;
matrix_A(0,1) = -1.0 * (vector_X(2));
matrix_A(0,2) = vector_X(1);
matrix_A(1,0) = vector_X(2);
matrix_A(1,1) = 0.0;
matrix_A(1,2) = -1.0 * (vector_X(0));
matrix_A(2,0) = -1.0 * (vector_X(1));
matrix_A(2,1) = vector_X(0);
matrix_A(2,2) = 0.0;
// Step 4: Plug and chug.
Eigen::Matrix3f IdentityMat = Eigen::Matrix3f::Identity();
Eigen::Matrix3f firstTerm = sin(thetaAngleRad) * matrix_A;
Eigen::Matrix3f secondTerm = (1.0 - cos(thetaAngleRad)) * matrix_A * matrix_A;
Eigen::Matrix3f matrix_R = IdentityMat + firstTerm + secondTerm;
// This is the rotation matrix. Finished with the Rodrigues' Rotation Formula implementation.
std::cout << "matrix_R" << std::endl << matrix_R << std::endl;
// We copy the rotation matrix into the matrix that will be used for the transformation.
Eigen::Matrix4f Transform = Eigen::Matrix4f::Identity();
Transform(0,0) = matrix_R(0,0);
Transform(0,1) = matrix_R(0,1);
Transform(0,2) = matrix_R(0,2);
Transform(1,0) = matrix_R(1,0);
Transform(1,1) = matrix_R(1,1);
Transform(1,2) = matrix_R(1,2);
Transform(2,0) = matrix_R(2,0);
Transform(2,1) = matrix_R(2,1);
Transform(2,2) = matrix_R(2,2);
// Now that we have the rotation matrix, we can use it to also find the translation to move the cloud to the origin.
// First, rotate a point of interest (modelVectorAxisPoint, e.g. the circle's center, defined elsewhere) to the new location.
Eigen::Vector3f modelVectorAxisPointTransformed = matrix_R * modelVectorAxisPoint;
// Add the translation to the matrix.
Transform(0,3) = modelVectorAxisPointTransformed(0) * (-1.0);
Transform(1,3) = modelVectorAxisPointTransformed(1) * (-1.0);
Transform(2,3) = modelVectorAxisPointTransformed(2) * (-1.0);
// Perform the transformation
pcl::transformPointCloud(*cloudPot, *cloudPotTransformed, Transform);
</pre>
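As the comments note, this implementation divides by the norm of the cross product, so it breaks when the two vectors are parallel or antiparallel. Here is a rough, Eigen-free sketch (plain std::array, hypothetical helper names) of how those edge cases might be handled; the formula itself is unchanged:

```cpp
#include <algorithm>
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

static Vec3 cross(const Vec3 &a, const Vec3 &b) {
    return {a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}
static double dot(const Vec3 &a, const Vec3 &b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
static double norm(const Vec3 &a) { return std::sqrt(dot(a, a)); }

// Rodrigues' formula: rotation taking the direction of a onto the direction of b,
// including the parallel/antiparallel cases the Eigen snippet above skips.
Mat3 rotationBetween(const Vec3 &a, const Vec3 &b) {
    const Mat3 I = {{{1, 0, 0}, {0, 1, 0}, {0, 0, 1}}};
    Vec3 axis = cross(a, b);
    double axisNorm = norm(axis);
    double cosTheta = dot(a, b) / (norm(a) * norm(b));
    if (axisNorm < 1e-12) {
        if (cosTheta > 0.0) return I;  // Same direction: identity rotation.
        // Antiparallel: rotate by pi about any axis perpendicular to a.
        Vec3 helper = (std::fabs(a[0]) < 0.9) ? Vec3{1, 0, 0} : Vec3{0, 1, 0};
        axis = cross(a, helper);
        axisNorm = norm(axis);
    }
    for (double &c : axis) c /= axisNorm;  // Normalized rotation axis (X).
    // Clamp guards acos against tiny floating point overshoot.
    const double theta = std::acos(std::max(-1.0, std::min(1.0, cosTheta)));
    // Skew-symmetric matrix A corresponding to the axis.
    const Mat3 A = {{{0.0, -axis[2], axis[1]},
                     {axis[2], 0.0, -axis[0]},
                     {-axis[1], axis[0], 0.0}}};
    Mat3 R{};
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            double AA = 0.0;  // (A * A)(i, j)
            for (int k = 0; k < 3; ++k) AA += A[i][k] * A[k][j];
            // R = I + sin(theta) * A + (1 - cos(theta)) * A^2
            R[i][j] = I[i][j] + std::sin(theta) * A[i][j] + (1.0 - std::cos(theta)) * AA;
        }
    }
    return R;
}
```

With Eigen, the same guard amounts to checking crossProductNorm before the division and either returning the identity or picking any vector perpendicular to vector_A as the half-turn axis.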
<br />
And finally, an image showing all the viewports.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-iQLqZhfDVIE/VZ18LxVTN9I/AAAAAAAAANE/EIiyTuUYayI/s1600/FinalImage.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-iQLqZhfDVIE/VZ18LxVTN9I/AAAAAAAAANE/EIiyTuUYayI/s320/FinalImage.png" /></a></div>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com3tag:blogger.com,1999:blog-8291871830763638050.post-26148385042915395382015-07-02T13:43:00.000-07:002015-07-03T07:46:25.889-07:00Desktop Screen Recording and Video Editing (Ubuntu 14.04)I wanted to make a desktop screen recording of my new sorghum leaf segmentation prototype in action. I started with <a href="http://recordmydesktop.sourceforge.net/downloads.php" target="_blank">gtkRecordMyDesktop</a>, and it worked well; however, it outputs only <a href="https://en.wikipedia.org/wiki/Ogg" target="_blank">.ogv</a> files. I was unable to find a video editor that was happy to take .ogv files, and after multiple failed attempts at converting the .ogv files to a different format, I turned to different screen capture software: <a href="http://www.maartenbaert.be/simplescreenrecorder/" target="_blank">SimpleScreenRecorder</a>. As per the instructions on their site, this is as simple as adding the repository and installing it:<br />
<pre class="brush: perl">sudo add-apt-repository ppa:maarten-baert/simplescreenrecorder
sudo apt-get update
sudo apt-get install simplescreenrecorder
simplescreenrecorder
</pre>
SimpleScreenRecorder was very straightforward and seemed more versatile than gtkRecordMyDesktop. It also writes to multiple file formats, including .mp4, which meant the recording could be edited in OpenShot.
<br />
<pre class="brush: perl">sudo apt-get install openshot
openshot
</pre>
<br />
OpenShot worked very well as a video editing tool. The end result is below.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe allowfullscreen='allowfullscreen' webkitallowfullscreen='webkitallowfullscreen' mozallowfullscreen='mozallowfullscreen' width='320' height='266' src='https://www.blogger.com/video.g?token=AD6v5dzxD3mdgbj4rdIaH4Sg7IGNi99FKF7_ZU1T8Ytjo62IYcU4SdNnfkNDhmC-VvZIn52KwBQ8E2W17NgXsSijSA' class='b-hbp-video b-uploaded' frameborder='0'></iframe></div>
<br />Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-67436813781525882142015-04-30T08:19:00.001-07:002015-05-01T10:35:12.192-07:00Find minimum oriented bounding box of point cloud (C++ and PCL)Here we're trying to get the <a href="http://en.wikipedia.org/wiki/Minimum_bounding_box" target="_blank">minimum oriented bounding box</a> of a point cloud using C++ and the <a href="http://pointclouds.org/" target="_blank">Point Cloud Library (PCL)</a>. Most of the code originates from user Nicola Fioraio on the PCL forums in <a href="http://www.pcl-users.org/Finding-oriented-bounding-box-of-a-cloud-td4024616.html" target="_blank">this post</a>.<br />
<br />
Here is how user Nicola Fioraio describes the process:
<br />
1) compute the centroid (c0, c1, c2) and the normalized covariance<br />
2) compute the eigenvectors e0, e1, e2. The reference system will be (e0, e1, e0 X e1) --- note: e0 X e1 = +/- e2<br />
3) move the points in that RF --- note: the transformation given by the rotation matrix (e0, e1, e0 X e1) & (c0, c1, c2) must be inverted<br />
4) compute the max, the min and the center of the diagonal<br />
5) given a box centered at the origin with size (max_pt.x - min_pt.x, max_pt.y - min_pt.y, max_pt.z - min_pt.z) the transformation you have to apply is Rotation = (e0, e1, e0 X e1) & Translation = Rotation * center_diag + (c0, c1, c2)<br />
<br />
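Step 1 of that recipe is simple enough to sketch without PCL. This is an illustrative, dependency-free version of what I understand pcl::compute3DCentroid and pcl::computeCovarianceMatrixNormalized to be doing (here "normalized" is assumed to mean divided by the point count):

```cpp
#include <array>
#include <vector>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

// Step 1: centroid (c0, c1, c2) of the cloud. Assumes a non-empty cloud.
Vec3 centroid(const std::vector<Vec3> &cloud) {
    Vec3 c{0, 0, 0};
    for (const Vec3 &p : cloud)
        for (int i = 0; i < 3; ++i) c[i] += p[i];
    for (int i = 0; i < 3; ++i) c[i] /= static_cast<double>(cloud.size());
    return c;
}

// Step 1 (continued): covariance of the cloud about its centroid,
// normalized by the number of points. Assumes a non-empty cloud.
Mat3 normalizedCovariance(const std::vector<Vec3> &cloud, const Vec3 &c) {
    Mat3 cov{};
    for (const Vec3 &p : cloud)
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                cov[i][j] += (p[i] - c[i]) * (p[j] - c[j]);
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            cov[i][j] /= static_cast<double>(cloud.size());
    return cov;
}
```

The eigenvectors of this covariance matrix (step 2) are the principal axes that everything below is built on.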
And here is my interpretation of the process along with the code. Here's the cloud we're starting with (it's a sorghum plant):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-vGlCU9R4FV0/VUI9o2LX6qI/AAAAAAAAAKc/GySrLXgFJSU/s1600/OriginalCloud.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-vGlCU9R4FV0/VUI9o2LX6qI/AAAAAAAAAKc/GySrLXgFJSU/s1600/OriginalCloud.png" height="320" width="317" /></a></div>
<br />
As I understand it, we first find the eigenvectors for the covariance matrix of the point cloud (i.e. principal component analysis, PCA). You'll replace the cloudSegmented pointer with a pointer to the cloud you want to find the oriented bounding box for.<br />
<pre class="brush: cpp">// Compute principal directions
Eigen::Vector4f pcaCentroid;
pcl::compute3DCentroid(*cloudSegmented, pcaCentroid);
Eigen::Matrix3f covariance;
computeCovarianceMatrixNormalized(*cloudSegmented, pcaCentroid, covariance);
Eigen::SelfAdjointEigenSolver<Eigen::Matrix3f> eigen_solver(covariance, Eigen::ComputeEigenvectors);
Eigen::Matrix3f eigenVectorsPCA = eigen_solver.eigenvectors();
eigenVectorsPCA.col(2) = eigenVectorsPCA.col(0).cross(eigenVectorsPCA.col(1)); /// This line is necessary for proper orientation in some cases. The numbers come out the same without it, but
/// the signs are different and the box doesn't get correctly oriented in some cases.
/* // Note that getting the eigenvectors can also be obtained via the PCL PCA interface with something like:
pcl::PointCloud<pcl::PointXYZ>::Ptr cloudPCAprojection (new pcl::PointCloud<pcl::PointXYZ>);
pcl::PCA<pcl::PointXYZ> pca;
pca.setInputCloud(cloudSegmented);
pca.project(*cloudSegmented, *cloudPCAprojection);
std::cerr << std::endl << "EigenVectors: " << pca.getEigenVectors() << std::endl;
std::cerr << std::endl << "EigenValues: " << pca.getEigenValues() << std::endl;
// In this case, pca.getEigenVectors() gives similar eigenVectors to eigenVectorsPCA.
*/
</pre>
These eigenvectors are used to transform the point cloud so that its centroid sits at the origin (0, 0, 0) and the eigenvectors correspond to the axes of the space. The minimum point, the maximum point, and the middle of the diagonal between these two points are then calculated for the transformed cloud (also referred to as the projected cloud when using PCL's PCA interface, or as the reference cloud by Nicola).
<br />
<pre class="brush: cpp">// Transform the original cloud to the origin where the principal components correspond to the axes.
Eigen::Matrix4f projectionTransform(Eigen::Matrix4f::Identity());
projectionTransform.block<3,3>(0,0) = eigenVectorsPCA.transpose();
projectionTransform.block<3,1>(0,3) = -1.f * (projectionTransform.block<3,3>(0,0) * pcaCentroid.head<3>());
pcl::PointCloud<pcl::PointXYZ>::Ptr cloudPointsProjected (new pcl::PointCloud<pcl::PointXYZ>);
pcl::transformPointCloud(*cloudSegmented, *cloudPointsProjected, projectionTransform);
// Get the minimum and maximum points of the transformed cloud.
pcl::PointXYZ minPoint, maxPoint;
pcl::getMinMax3D(*cloudPointsProjected, minPoint, maxPoint);
const Eigen::Vector3f meanDiagonal = 0.5f*(maxPoint.getVector3fMap() + minPoint.getVector3fMap());
</pre>
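For step 4, the per-axis min/max scan and the mean diagonal can likewise be sketched as a toy stand-in for pcl::getMinMax3D:

```cpp
#include <algorithm>
#include <array>
#include <vector>

using Vec3 = std::array<double, 3>;

// Per-axis min and max of a non-empty cloud, plus the center of the diagonal
// between them -- the quantities used for the box size and translation below.
void minMaxAndMeanDiagonal(const std::vector<Vec3> &cloud,
                           Vec3 &minPt, Vec3 &maxPt, Vec3 &meanDiagonal) {
    minPt = maxPt = cloud.front();
    for (const Vec3 &p : cloud) {
        for (int i = 0; i < 3; ++i) {
            minPt[i] = std::min(minPt[i], p[i]);
            maxPt[i] = std::max(maxPt[i], p[i]);
        }
    }
    for (int i = 0; i < 3; ++i) meanDiagonal[i] = 0.5 * (minPt[i] + maxPt[i]);
}
```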
This gives us something like this; the orange cloud is the transformed cloud (shown with an axis-aligned bounding box, but we want the oriented bounding box so we're not done yet).<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-hLoAmlDoQJc/VUI-xKgL7LI/AAAAAAAAAKk/nQTPl1gmDoQ/s1600/projectedWithbbox.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-hLoAmlDoQJc/VUI-xKgL7LI/AAAAAAAAAKk/nQTPl1gmDoQ/s1600/projectedWithbbox.png" height="242" width="320" /></a></div>
<br />
Finally, the quaternion is calculated using the eigenvectors (which determines how the final box gets rotated), and the transform to put the box in the correct location is calculated. The minimum and maximum points are used to determine the box width, height, and depth.
<br />
<pre class="brush: cpp">// Final transform
const Eigen::Quaternionf bboxQuaternion(eigenVectorsPCA); //Quaternions are a way to do rotations https://www.youtube.com/watch?v=mHVwd8gYLnI
const Eigen::Vector3f bboxTransform = eigenVectorsPCA * meanDiagonal + pcaCentroid.head<3>();
// This viewer has 4 windows, but is only showing images in one of them as written here.
pcl::visualization::PCLVisualizer *visu;
visu = new pcl::visualization::PCLVisualizer (argc, argv, "PlyViewer");
int mesh_vp_1, mesh_vp_2, mesh_vp_3, mesh_vp_4;
visu->createViewPort (0.0, 0.5, 0.5, 1.0, mesh_vp_1);
visu->createViewPort (0.5, 0.5, 1.0, 1.0, mesh_vp_2);
visu->createViewPort (0.0, 0, 0.5, 0.5, mesh_vp_3);
visu->createViewPort (0.5, 0, 1.0, 0.5, mesh_vp_4);
// ColorHandlerXYZ is assumed to be a typedef along the lines of:
// typedef pcl::visualization::PointCloudColorHandlerCustom<pcl::PointXYZ> ColorHandlerXYZ;
visu->addPointCloud(cloudSegmented, ColorHandlerXYZ(cloudSegmented, 30, 144, 255), "bboxedCloud", mesh_vp_3);
visu->addCube(bboxTransform, bboxQuaternion, maxPoint.x - minPoint.x, maxPoint.y - minPoint.y, maxPoint.z - minPoint.z, "bbox", mesh_vp_3);
</pre>
And the final product is the bounding box around the original blue cloud.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-ORuRtSKD6_c/VUJD4f1WL7I/AAAAAAAAAK4/Hj66p67DOLg/s1600/orientedBox.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-ORuRtSKD6_c/VUJD4f1WL7I/AAAAAAAAAK4/Hj66p67DOLg/s1600/orientedBox.png" height="320" width="270" /></a></div>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com5tag:blogger.com,1999:blog-8291871830763638050.post-90336301132989852852015-03-09T08:38:00.001-07:002015-03-09T08:39:46.919-07:00Postmortem: RIG: Recalibration and Interrelation of Genomic Sequence Data with the GATKAs mentioned in the previous post, the Llama and I recently got a workflow for using the GATK with a non-model organism published in G3: "<a href="http://www.g3journal.org/content/early/2015/02/13/g3.115.017012.abstract" target="_blank">RIG: Recalibration and Interrelation of Genomic Sequence Data with the GATK</a>", and it's postmortem time (the code base can be found <a href="https://github.com/MulletLab/RIG" target="_blank">on GitHub here</a>).<br />
<br />
This manuscript was mostly a product of fortunate timing. Around early 2013, The Llama and I had been accepted as students in the lab we're currently in, and one of the things we were given to work on was converting restriction enzyme based reduced representation sequence data (a.k.a. Digital Genotyping or RAD-seq) to genotypes for the purpose of genetic map construction and QTL mapping. After iterating through a few ideas, we found that the software <a href="http://creskolab.uoregon.edu/stacks/" target="_blank">Stacks</a> would be a good fit for our use case. We used Stacks for a little while, and its reference genome based pipeline worked fairly well for us.<br />
<br />
Somewhere around mid to late 2013 we needed to do some variant calling from whole genome sequence data. Previously, CLC Genomics Workbench had been used by the lab for this application. However, the Llama and I go out of our way to avoid non-open source software, and we looked for alternatives. We caught wind of the <a href="https://www.broadinstitute.org/gatk/" target="_blank">GATK</a> and its many blessings, and started to learn the ropes for whole genome sequence variant calling. For the time capsule, <a href="http://gatkforums.broadinstitute.org/discussion/comment/10724/#Comment_10724" target="_blank">here's a forum post</a> where I asked the GATK forum for advice on generating a variant set for base quality score recalibration (BQSR) and variant quality score recalibration (VQSR), and <a href="https://groups.google.com/forum/#!topic/stacks-users/wovcid9tcmo" target="_blank">here's a forum post</a> where I asked the Stacks forum for advice on generating a VCF file for use in recalibration with the GATK; that Stacks post was the nascent form of what ultimately evolved into the RIG workflow.<br />
<br />
Since we had a reference genome, we transitioned away from Stacks and exclusively used the GATK. This let us call variants from both our RR and WGS data with one tool that produces standard file formats, which made it much easier to identify the same variants from both types of data in downstream analyses. So, we started using the UnifiedGenotyper for the large sample numbers in the RR biparental family and association population analyses, and the HaplotypeCaller for the small number of WGS samples.<br />
<br />
Then came the GATK 3.0 release which brought the power of the HaplotypeCaller to bear on as many samples as desired thanks to the new database of likelihoods model. By this time we had something like 6 full sequencing runs (48 lanes) of RR sequence data and were running into scaling problems trying to run the analyses on a local workstation. To get around this, we decided to invest some time porting our current analysis pipeline to use the GATK's Queue framework on a university high performance computing cluster.<br />
<br />
For initial testing, we set up Sun Grid Engine on a single workstation (setting it as both a submission and execution host; is that weird?) to prototype pipelining in Scala with Queue and got our whole pipeline up and running beautifully on the single machine: Queue was happily Scatter-Gathering and HaplotypeCaller was churning through 100's of samples in reasonable timescales (although not fast enough that it would be sustainable to not move to a compute cluster). It was awesome.<br />
<br />
Unfortunately it wasn't so awesome on the compute cluster. We ran into a problem where Queue didn't interact with the resource manager (Univa Grid Engine) nicely, and the job submissions slowed to a crawl over time:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-CzmuhUZfPTo/VPUCRbFymdI/AAAAAAAAAJI/IhQn_5U8nLs/s1600/SubmissionSlowsOverTime.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-CzmuhUZfPTo/VPUCRbFymdI/AAAAAAAAAJI/IhQn_5U8nLs/s1600/SubmissionSlowsOverTime.png" height="494" width="640" /></a></div>
<br />
...<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-r47HyNkDfUk/VPUCzW8yaxI/AAAAAAAAAJQ/7zX0EOlHTfo/s1600/garrys-mod-wat.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-r47HyNkDfUk/VPUCzW8yaxI/AAAAAAAAAJQ/7zX0EOlHTfo/s1600/garrys-mod-wat.gif" /></a></div>
<br />
<br />
After much consternation, <a href="http://gatkforums.broadinstitute.org/discussion/4077/queue-job-submission-rate-slows-over-time" target="_blank">some forum posts</a>, and lots of things that didn't work, we eventually threw in the towel. While I suspect that figuring it out would have been instructive from a computing standpoint, there were data that needed to be analyzed. We figured it would be safer to do a rewrite in Bash that we knew would work rather than try to debug the interaction between our Queue pipeline and the cluster architecture, so we threw away the Scala code and rewrote the pipeline in Bash. Unfortunately, we had to recreate some of what made Queue so valuable, including logging and dependency tracking, but at least it was working on the cluster.<br />
<br />
Around that time, we were designing the first iterations of what would become the RIG workflow. Back then it was internally called the LERRG workflow (Leveraging Existing Resources for Recalibration with the GATK). Say it out loud. <br />
<br />
LERRG.<br />
<br />
It sounds like something you might cough up in the morning, and it looked similarly awful:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-FYyQ_pR8HEw/VPYBQI5DiMI/AAAAAAAAAJg/hyHn4ZDxGcQ/s1600/LERRGimplmentationV1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-FYyQ_pR8HEw/VPYBQI5DiMI/AAAAAAAAAJg/hyHn4ZDxGcQ/s1600/LERRGimplmentationV1.png" height="308" width="320" /></a></div>
<br />
The basic premise was to use variants called from RR data to inform VQSR of WGS data, and then to use the recalibrated WGS variants for BQSR of the WGS reads, followed by another round of VQSR after variant calling from the BQSRed WGS reads.<br />
<br />
Once we had the idea and code in place, it was just a
matter of execution. As expected, the workflow works pretty well; the
GATK's methods are solid so it would have been surprising if it didn't
work. It also worked just fine to recalibrate the RR reads and variants.<br />
<br />
We started to kick around the idea of submitting the workflow as a manuscript since it looked like there was a market for it given that the GATK forum often has
questions on how to use the GATK in non-model organisms (e.g. <a href="http://gatkforums.broadinstitute.org/discussion/3286/quality-score-recalibration-for-non-model-organisms" target="_blank">here</a> and <a href="http://gatkforums.broadinstitute.org/discussion/4707/non-model-organism-sub-forum" target="_blank">here</a>),
and the workflow was too complex to try to keep describing it in the
methods sections of other papers from the lab. So, we wrote up the
manuscript; during internal revisions it got renamed to RIG:
Recalibration and Interrelation of Genomic Sequence Data with the GATK. It also took considerable effort on the Llama's part to convince me that the diagram I had constructed for LERRG was, in fact, terrible; I doubt the manuscript would have ever been accepted had she not intervened and redesigned the thing from the ground up as we were working on the manuscript.<br />
<br />
After a rapid reject from Molecular Plant, we targeted G3 since the review process there has been fair and reasonably fast in the past, and G3's readership seems to have a good number of communities that were in similar situations to ours: genomics labs working on organisms with a reference genome and a sizeable chunk of sequence data. The reviews we got back suggested testing the workflow in a better characterized system, and so we benchmarked it in <i>Arabidopsis</i> with good results. <br />
<br />
<b>What worked:</b><br />
<b>Timing: </b>This was mostly a case of being in the right place at the right time. The GATK's 3.0 release occurred just as we were scaling up and in position to take full advantage of it, and our internal analysis pipeline hadn't been locked in so we were free to test out different options. The GATK had already been widely adopted by the human genomics community, and interest was growing in other communities in applying its benefits.<br />
<br />
<b>Prototyping, Refactoring, and Iteration:</b> This was a relatively small-scale project without many cooks in the kitchen, and many iterations of prototyping, testing, and redesigning or refactoring worked well. I doubt it would work well on a larger project, but the agility let us learn and improve things as we went.<br />
<br />
<b>What didn't work:</b><br />
<b>Development environment didn't represent deployment environment:</b> We wasted a good chunk of time trying to debug why the Queue pipeline we wrote worked locally with Sun Grid Engine but not on the compute cluster with Univa Grid Engine. My lack of understanding of the internals of the resource manager and Queue prevented me from ever getting to the bottom of it, but we would have known about it much sooner if we had started developing and testing on the compute cluster where we would actually be running the analyses. Fortunately, we still learned something about resource managers by setting up SGE locally and learned a bit about Scala from the prototype.<br />
<br />
<b>Slow test cases: </b>I still haven't generated any rapid test cases for testing and development of the workflow. Unfortunately, many of the problems arise mostly as a consequence of scale (e.g. keeping too many intermediate files consume hard drive space allocation before the pipeline finishes; file handle limit too small on the compute cluster; non-deterministic thread concurrency crashes with parallelized HaplotypeCaller). I haven't found a way to test for these without running an analysis of a full dataset, and that can require hours before the crash occurs; simple debugging takes days sometimes.Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-41953512754598308952015-02-13T14:34:00.002-08:002015-02-23T14:57:54.442-08:00Postmortem: Resolution of Genetic Map Expansion Caused by Excess Heterozygosity in Plant Recombinant Inred PopulationsThe Llama and I just got a manuscript through review on the <a href="https://github.com/Frogee/RIG" target="_blank">RIG workflow</a>, and it's a good time for some postmortems. This postmortem is about <a href="http://www.g3journal.org/content/4/10/1963.short" target="_blank">Truong <i>et al. </i>(2014) Resolution of Genetic Map Expansion Caused by Excess Heterozygosity in Plant Recombinant Inbred Populations</a>.<br />
<br />
The story for this one starts when the Llama and I received about 6 lanes' worth of sequence data for an experimental sorghum cross composed of 400+ individuals. We got the sequence data processed to genotypes using an early iteration of the RIG workflow (at that time it had a weirder, more awesomer moniker: the LERRG workflow), and we were trying to use <a href="http://www.rqtl.org/">R/qtl</a> to construct a genetic map from the 10,000+ genetic variants that had been called. However, regardless of what we tried, we kept getting massive genetic maps that were well over 100% larger than what we expected. We even posted <a href="https://groups.google.com/forum/#!topic/rqtl-disc/-4ks_fDQjwg" target="_blank">a question to the R/qtl forums</a>.<br />
<br />
As hashed out in that forum thread, it turns out that something generates what manifests as tight double crossovers in the genotype data. This phenomenon seems to occur with multiple types of genotyping technologies, so it's unclear whether these are biologically real or artifacts of the technology. Removing these tight double crossovers shrinks the map to more reasonable sizes, but we were still left with an unexpectedly large map. Nearing the end of our rope, we continued to scour the literature for causes of genetic map expansion. We came across <a href="http://www.genetics.org/content/162/2/861.short" target="_blank">a publication by Knox and Ellis (2002) titled "Excess Heterozygosity Contributes to Genetic Map Expansion in Pea Recombinant Inbred Populations"</a>; our population also displayed excess levels of heterozygosity relative to a Mendelian model. While the Knox and Ellis paper didn't give conclusive reasoning as to why excess heterozygosity caused genetic map expansion, it provided enough precedent to pursue it further.<br />
<br />
The Llama went back to the old <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1201105/pdf/357.pdf" target="_blank">Haldane and Waddington</a> models and some newer models on differential zygotic viability, and she cranked out a general solution for the genotypic frequencies expected for a recombinant inbred line population that hadn't yet gone to fixation and that had differential fitness of heterozygotes. I chucked it into the right spot in the R/qtl C code for genetic map estimation, and it worked out just fine. Or at least that's how we would have liked for it to have gone. The ride wasn't quite so smooth.<br />
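For intuition (this is just a single-locus sketch with illustrative parameters, not the general solution from the manuscript): when a heterozygote is selfed, its offspring are 1/4 AA, 1/2 Aa, and 1/4 aa, so under a Mendelian model the heterozygote frequency halves each generation; giving heterozygotes a fitness advantage slows that decay, which is how excess heterozygosity can persist.

```cpp
// Expected heterozygote frequency after t generations of selfing an F1,
// with relative heterozygote fitness w (homozygote fitness = 1).
// With w = 1 this reduces to the Mendelian expectation h_t = (1/2)^t.
// Single-locus sketch only; ignores linkage, parameters are illustrative.
double heterozygoteFrequency(int generations, double w) {
    double h = 1.0;  // The F1 is heterozygous at all segregating loci.
    for (int t = 0; t < generations; ++t) {
        double het = w * (h / 2.0);    // Heterozygous offspring, weighted by fitness.
        double hom = 1.0 - (h / 2.0);  // Homozygous offspring breed true.
        h = het / (het + hom);         // Renormalize to get the new frequency.
    }
    return h;
}
```

At w = 1 and five generations this gives 1/32, as expected; pushing w above 1 holds the population above the Mendelian expectation, which matches the excess heterozygosity we were seeing.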
<br />
Getting things working involved a large number of false starts because of things like the general solution wasn't quite right, or the implementation wasn't quite right. It became somewhat of an obsession to make it work, and we'd spend hours after we got home at night testing, fixing, and rebuilding. We eventually got it worked out in a San Francisco airport on <a href="http://codextechnicanum.blogspot.com/2014/03/keystone-symposia-on-big-data-in-biology.html" target="_blank">our way home from the Keystone Symposia on Big Data in Biology</a>: we had a general solution and an implementation that worked for the test cases.<br />
<br />
As predicted by Knox and Ellis, the excess heterozygosity did indeed expand the genetic map. However, it turned out that excess heterozygosity doesn't always expand the map; it depends on the generation interval and the amount of heterozygosity. Ultimately, the take home message is that tight double crossovers are a major source of genetic map expansion, and that segregation distortion can also lead to genetic map expansion.<br />
<br />
We wrote the manuscript up, and after a couple of rapid rejects from PLOS Genetics and Genetics, G3: Genes|Genomes|Genetics sent it off for review. The G3 reviewers and editor were fair and informed. One reviewer suggested that we do a simulation study to demonstrate the method's efficacy, and, in hindsight, we're grateful for that suggestion.<br />
<br />
<b>What worked: </b><br />
<b>Test cases</b> - The only way we got the method working was that we finally made test cases with known output for the expected genotype probabilities given a generation interval and heterozygosity levels. Having this made it much easier to diagnose problems.<br />
<b>G3</b> - The associate editor and reviewers gave the manuscript a fair and complete review. They picked out reasonable weaknesses, and made useful suggestions that improved the paper.<br />
<b>R/qtl</b> - This work almost certainly could not have been done if <a href="http://kbroman.org/pages/about.html" target="_blank">Karl Broman</a> (and colleagues) had not made the <a href="http://www.rqtl.org/index.html" target="_blank">R/qtl</a> code base available on GitHub. R/qtl may not be the prettiest code base around, but it was sufficiently documented that we could find our way around and bolt our pieces onto it. R/qtl has been chugging along for more than 10 years, and we hope it chugs along for many more.<br />
<br />
<b>What didn't work:</b><br />
<b>Not making test cases earlier -</b> We probably could have saved ourselves a few weeks of consternation had we initially set up test cases. Guess what we did in the beginning - We took the implementation, tossed the 10,000+ markers in, and estimated the whole map to see if the size changed. When that didn't work, we tried it for individual chromosomes. Then we tried it for subsets of individual chromosomes. Then we tried it for pairs of markers. It wasn't until around the time of the Big Data conference that we wised up and got to the root of the output: the genotype probabilities. Only then, when we knew the expected probabilities for a given generation interval and heterozygote advantage, were we able to properly debug the issue.
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-54079212265911649342014-10-31T15:04:00.001-07:002014-10-31T15:04:58.535-07:00OpenCV Mat typesI'm currently working on reading in a depth image with OpenCV and converting it to a point cloud. After wrestling with data types for a while, I came upon a tremendously useful listing of the OpenCV Mat types and their corresponding numerical values, along with the data types corresponding to those Mat types. For example, a CV_16U OpenCV Mat outputs 2 with the type() member function, and stores the channel data as unsigned shorts. <a href="http://ninghang.blogspot.com/2012/11/list-of-mat-type-in-opencv.html" target="_blank">The mapping is posted on this blog here.</a><br />
<br />
E.g.:
<br />
<pre class="brush: cpp">cv::Mat depthImage;
depthImage = cv::imread("banana_1_1_1_depth.png", CV_LOAD_IMAGE_ANYDEPTH | CV_LOAD_IMAGE_ANYCOLOR); // Read the file
if (!depthImage.data) // Check for invalid input
{
    std::cout << "Could not open or find the image" << std::endl;
    return -1;
}
std::cout << "Size of depthImage:\t" << depthImage.size() << std::endl;
std::cout << "Type of depthImage:\t" << depthImage.type() << std::endl;
std::cout << "Depth image at coord 0,0:\t" << depthImage.at<ushort>(0,0) << std::endl;
</pre>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-35743462622878712052014-10-30T14:16:00.000-07:002015-02-12T13:11:31.156-08:00Installing Point Cloud Library and configuring Code::Blocks to compile (Ubuntu 14.04 LTS)Most of the credit for this post goes to <a href="http://aravindev.blogspot.com/2013/06/configuring-and-installing-point-cloud.html" target="_blank">this blog post</a> and <a href="http://www.pointclouds.org/downloads/linux.html" target="_blank">this documentation at the Point Cloud Library website</a>. However, those didn't quite get me to the finish line, so here's what got me there:<br />
<br />
In anticipation of getting a <a href="http://www.microsoft.com/en-us/kinectforwindows/purchase/default.aspx" target="_blank">Microsoft Kinect for Windows v2</a> to image some plants, I wanted to get the <a href="http://pointclouds.org/" target="_blank">Point Cloud Library (PCL)</a> up and running so that we could define 3D entities from the 2D RGB-D images we'll get from the Kinect. I'll be developing in C++ with Code::Blocks so my first goal was to compile some code using the PCL.<br />
<br />
Mercifully, prebuilt binaries for the PCL are available through the apt package manager. As documented <a href="http://www.pointclouds.org/downloads/linux.html" target="_blank">at the PCL website</a>:
<br />
<pre class="brush: perl">sudo add-apt-repository ppa:v-launchpad-jochen-sprickerhof-de/pcl
sudo apt-get update
sudo apt-get install libpcl-all
</pre>
As for configuring Code::Blocks, I followed <a href="http://aravindev.blogspot.com/2013/06/configuring-and-installing-point-cloud.html" target="_blank">a blog post at this site</a>, but it seemed that some of the shared object library requirements had changed since that post was written. Ultimately, it required adding the following:
<br />
Go to Project->Build Options->Search Directories Tab->Compiler Tab and add the following locations:
<br />
<pre class="brush: perl">/usr/include/eigen3
/usr/include/pcl-1.7
/usr/include/pcl-1.7/pcl/surface
/usr/include/vtk-5.8
</pre>
Then in Project->Build Options->Linker Settings Tab add the following link libraries:
<br />
<pre class="brush: perl">/usr/lib/libvtk* (all of the libvtk .so files)
/usr/lib/libpcl* (all of the libpcl .so files)
/usr/lib/i386-linux-gnu/libboost_thread.so
/usr/lib/i386-linux-gnu/libpthread.so
/usr/lib/i386-linux-gnu/libboost_filesystem.so
/usr/lib/i386-linux-gnu/libboost_iostreams.so
/usr/lib/i386-linux-gnu/libboost_system.so
</pre>
These files may not be in these locations on all systems, so use find to track them down, e.g.:
<br />
<pre class="brush: perl">find /usr/lib* -name "*boost*.so"
</pre>
The code I used to test if I could compile was obtained <a href="http://pointclouds.org/documentation/tutorials/writing_pcd.php" target="_blank">here from the Point Cloud Library</a>, and it's also reproduced here:
<br />
<pre class="brush: cpp">#include <iostream>
#include <pcl/io/pcd_io.h>
#include <pcl/point_types.h>
int main (int argc, char** argv)
{
pcl::PointCloud<pcl::PointXYZ> cloud;
// Fill in the cloud data
cloud.width = 5;
cloud.height = 1;
cloud.is_dense = false;
cloud.points.resize (cloud.width * cloud.height);
for (size_t i = 0; i < cloud.points.size (); ++i)
{
cloud.points[i].x = 1024 * rand () / (RAND_MAX + 1.0f);
cloud.points[i].y = 1024 * rand () / (RAND_MAX + 1.0f);
cloud.points[i].z = 1024 * rand () / (RAND_MAX + 1.0f);
}
pcl::io::savePCDFileASCII ("test_pcd.pcd", cloud);
std::cerr << "Saved " << cloud.points.size () << " data points to test_pcd.pcd." << std::endl;
for (size_t i = 0; i < cloud.points.size (); ++i)
std::cerr << " " << cloud.points[i].x << " " << cloud.points[i].y << " " << cloud.points[i].z << std::endl;
return (0);
}
</pre>
And hopefully you get the output:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-2GFQTVGOXv8/VFKuG9XzRRI/AAAAAAAAAIU/Uebb64QGQiQ/s1600/PCLTest.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-2GFQTVGOXv8/VFKuG9XzRRI/AAAAAAAAAIU/Uebb64QGQiQ/s1600/PCLTest.png" height="226" width="640" /></a></div>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com7tag:blogger.com,1999:blog-8291871830763638050.post-90644164335892437602014-10-30T10:59:00.003-07:002015-09-10T13:50:12.303-07:00Code::Blocks Son of Obsidian theme---- <br />
Edit: 09/10/2015 -<br />
I had some trouble getting back to the Code Blocks forum link to get the Son of Obsidian .conf file. <a href="https://0x1mason.wordpress.com/2012/11/11/son-of-obsidian-for-codeblocks/" target="_blank">Here is a link where you can get the .conf file</a> from what I think is the original author.<br />
---- <br />
<br />
Working in Code::Blocks? Might as well make it easy on the eyes.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-cOIZNah9BgY/VFJ7fPsbtYI/AAAAAAAAAIA/G5GwDqusNBk/s1600/Sonofobsidian.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="318" src="http://2.bp.blogspot.com/-cOIZNah9BgY/VFJ7fPsbtYI/AAAAAAAAAIA/G5GwDqusNBk/s1600/Sonofobsidian.png" width="640" /></a></div>
<br />
In <a href="http://forums.codeblocks.org/index.php?topic=17354.0" target="_blank">this post at the Code Blocks forums</a>, a user provides a port of the Visual Studio color scheme Son of Obsidian as a .conf file. The .conf file can be imported using cb_share_config as <a href="http://tafhim.blogspot.com/2010/08/exporting-and-importing-codeblocks.html" target="_blank">detailed in this blog post</a>. Importing the .conf file to the default.conf file gives the colors in the image above.Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-79956866336649032292014-10-16T12:27:00.002-07:002014-10-16T12:30:56.630-07:00BibTeX .bst file generationNot all of the life science journals we submit to support $\LaTeX$ and friends, so getting the right bibliography formatting can be a challenge sometimes. I just discovered a web application that allows you to answer a series of questions about the citation formatting, after which it spits out a .dbj file that can be converted to a .bst bibliography style file for BibTeX. 
It's appropriately called the Bst generator, and can be found at <a href="http://www.podoblaz.net/cml/index.php?id=39" target="_blank">http://www.podoblaz.net/cml/index.php?id=39</a> .Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-48229123929612523712014-08-25T13:59:00.005-07:002014-09-27T12:30:25.410-07:00Recurrent artificial neural network evolution using a genetic algorithm for reverse engineering of gene regulatory networks - PostmortemDuring the <a href="http://codextechnicanum.blogspot.com/2014/08/siam-conference-on-life-sciences-2014.html" target="_blank">SIAM Conferences on the Life Sciences poster session</a>, one of my poster neighbors was a couple (who also worked in the same lab!); their poster was on a method to evolve an artificial neural network using a genetic algorithm to reverse engineer a gene regulatory network given expression data. This was the first time I had heard of such a concept, and it sounded like a great idea. Once the Llama and I returned from the conference, I started work on implementing such a method.<br />
<div>
<br /></div>
<div>
I started with the poster authors' original paper, <a href="http://dl.acm.org/citation.cfm?id=2598435" target="_blank">Kordmahalleh and Sefidmazgi et al. (2014)</a>, but I was unable to determine how to apply the methods in that paper to reverse engineering of gene regulatory networks. I eventually ended up on a paper by <a href="http://link.springer.com/chapter/10.1007/978-4-431-54394-7_8" target="_blank">Noman, Palafox, and Iba (2013)</a> using a recurrent neural network model that was sufficiently detailed for me to get a handle on how such a model and genetic algorithm might be implemented. While this got me about 80% of the way to a successful artificial neural network and genetic algorithm, the method by which they evaluated the fitness function for the genetic algorithm wasn't described in enough detail for me to implement. My interpretation of the paper was that their method evaluated the fitness of each node independently of the other nodes, but I was unable to determine how they actually evaluated a node's output independently of the output of the other nodes at a given time point. The implementations I tried failed to reconstitute the synthetic network that generated the input data.</div>
<div>
<br /></div>
<div>
After that, I abandoned the Noman, Palafox, and Iba (2013) subproblem method and tried solving the entire system of equations for the network. Coming into this, I didn't know a thing about solving differential equations; one of the reasons I went for the Noman et al. (2013) method was that I perceived it wouldn't require the solution of a system of differential equations. Fortunately for me, the Llama gave me plenty of guidance. Once I had gotten my bearings in Matlab, I tried to find a C++ library that could numerically solve a system of ordinary differential equations. Thankfully, there is a nice library called <a href="http://headmyshoulder.github.io/odeint-v2/" target="_blank">odeint</a>, and odeint has been incorporated into the <a href="http://www.boost.org/" target="_blank">Boost</a> libraries. After grabbing the Boost libraries, I could solve for the entire system of ordinary differential equations given network parameters, and use the mean squared error to determine the fitness of the network. With the network fitness in hand, I was in business. As it turns out, this method is very similar to the one described in <a href="http://www.sciencedirect.com/science/article/pii/S0303264799000908" target="_blank">Wahde and Hertz (2000)</a>.</div>
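With the full system solved numerically, the fitness evaluation boils down to comparing the solved trajectories against the observed expression data. Here is a minimal sketch of that mean-squared-error fitness; the function name and data layout (one vector of gene concentrations per time point) are my own illustration, not code from the repo:

```cpp
#include <vector>
#include <cstddef>

// Mean squared error between simulated and observed trajectories.
// Both are indexed as [timePoint][gene]. Lower is better, so a
// genetic algorithm would minimize this value when ranking networks.
double networkFitness(const std::vector<std::vector<double>> &simulated,
                      const std::vector<std::vector<double>> &observed) {
    double sumSquaredError = 0.0;
    std::size_t count = 0;
    for (std::size_t t = 0; t < observed.size(); ++t) {
        for (std::size_t g = 0; g < observed[t].size(); ++g) {
            double diff = simulated[t][g] - observed[t][g];
            sumSquaredError += diff * diff;
            ++count;
        }
    }
    return sumSquaredError / count;
}
```

A perfect reproduction of the observed data gives a fitness of 0, and any deviation at any gene or time point pushes the score up.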
<div>
<br /></div>
<div>
The code base can be found at the <a href="https://github.com/Frogee/ANN_GA_prototype" target="_blank">ANN_GA_prototype repo on my Github</a>. To date, if a realistic amount of input data is provided, it can vaguely reproduce a synthetic network's output after about 100,000 generations, but the topology is very different from the target synthetic network. It may be the case that more input data is required than is realistically available from most biological experiments to accurately reverse engineer network topology.</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-_wjR1tASFlg/U_uitn_tb0I/AAAAAAAAAHw/MKj0_9bDEDs/s1600/ANNoutputVsObserved.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-_wjR1tASFlg/U_uitn_tb0I/AAAAAAAAAHw/MKj0_9bDEDs/s1600/ANNoutputVsObserved.png" height="640" width="640" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<div>
And now for the postmortem:</div>
<div>
<br /></div>
<div>
What worked:</div>
<div>
<a href="http://www.cc2e.com/Default.aspx" target="_blank">Code Complete</a>: the Llama and I have been reading Code Complete as part of a weekly meeting we attend with a research group from the Mathematics department. When I first started writing the code, I tried to pull in some of the ideas from Code Complete, such as using abstraction to manage complexity and hiding implementation details. This worked great in the beginning since I had spent some time on design prior to writing any code. However, I ultimately failed to maintain Code Complete's tenets since I changed methodology in the middle (see more in What didn't work).</div>
<div>
<br /></div>
<div>
Object Oriented Programming: I usually don't work on code that is big enough to merit much OOP, but the abstraction helped a lot here.</div>
<div>
<br /></div>
<div>
odeint: While odeint wasn't as easy to use as Matlab, it was considerably faster at numerically solving the system of ordinary differential equations. That said, it was still quite approachable; I don't know a thing about solving ordinary differential equations, so the fact that I got it working is a testament to its usability.</div>
<div>
<br /></div>
<div>
What didn't work:</div>
<div>
Changing design midstream: the code base degraded quite a bit when I switched to implementing a different method to evolve the network. Basically, the original design was to operate on a population of nodes (as I had interpreted from Noman et al. (2013)). However, when this didn't work and I had to switch to operating on a population of networks, I was pretty frustrated and just wanted to get to a working solution. Now the code base is littered with remnants and re-tooled duplicates of the original method that need to be cleaned out. However, I think this was the best path to take since I could have potentially wasted time designing for a second method that also may not have worked. Now that it's working, I can go back and clean things up.</div>
<div>
<br /></div>
<div>
Defining the system of equations for odeint at runtime: There must be a better way to do this, but as it stands, I define the network size (i.e. the number of equations in the system that odeint solves) as a macro. I have been unable to work out how to define the network size at runtime in a way that odeint will take. The result is that I have to change the NETWORK_SIZE macro and recompile if I want to modify the number of equations in the system that odeint solves for. I'd much rather be able to do this at runtime based on the number of genes in the input file.</div>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-550039958970148352014-08-15T06:26:00.000-07:002014-08-15T06:27:00.600-07:00Split string on delimiter in C++Some suggest using the Boost libraries for this, but there's an answer to <a href="http://stackoverflow.com/questions/236129/how-to-split-a-string-in-c" target="_blank">this question on StackOverflow</a> by user <a href="http://stackoverflow.com/users/13430/evan-teran" target="_blank">Evan Teran</a> (albeit not the accepted answer) that I much prefer because it uses the standard libraries.<br />
<div>
<br /></div>
<div>
As posted on StackOverflow:</div>
The first puts the results in a pre-constructed vector, the second returns a new vector.
<br />
<pre class="brush: cpp">std::vector<std::string> &split(const std::string &s, char delim, std::vector<std::string> &elems) {
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, elems);
    return elems;
}
</pre>
As noted on the StackOverflow post, it can be used like:
<br />
<pre class="brush: cpp">std::vector<std::string> x = split("one:two::three", ':');
</pre>
This overloads the split function, and the second function requires the first, so you need both. As a secondary example, here is a version that I slightly expanded for readability and used as a member function of a class called TimeSeriesSetDataFileParser:
<br />
<pre class="brush: cpp">std::vector<std::string> &TimeSeriesSetDataFileParser::SplitString(const std::string &inputString, char delimiter, std::vector<std::string> &elements) {
    std::stringstream sstream(inputString); // Taken from http://stackoverflow.com/questions/236129/how-to-split-a-string-in-c
    std::string element;
    while (std::getline(sstream, element, delimiter)) {
        elements.push_back(element);
    }
    return elements;
}

std::vector<std::string> TimeSeriesSetDataFileParser::SplitString(const std::string &inputString, char delimiter) {
    std::vector<std::string> elements;
    this->SplitString(inputString, delimiter, elements);
    return elements;
}
</pre>
And I used it to parse a data file to a 2D vector of strings (the member functions Good() and GetLine() are just methods encapsulating the good() and getline() member functions inherited by <a href="http://www.cplusplus.com/reference/fstream/ifstream/">ifstream</a>).
<br />
<pre class="brush: cpp">std::cout << "Parsing input data file" << std::endl;
std::string line;
std::vector<std::vector<std::string> > vvstr_data;
std::vector<std::string> vstr_splitLine;
while (inputTimeSeriesSetDataFile->Good()) {
    line = inputTimeSeriesSetDataFile->GetLine();
    if (inputTimeSeriesSetDataFile->Good()) {
        vstr_splitLine = this->SplitString(line, '\t');
        vvstr_data.push_back(vstr_splitLine); // For now we just read it into a 2-dimensional vector of strings.
    }
}
</pre>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-6229547258002072242014-08-11T15:44:00.000-07:002014-08-11T15:45:58.180-07:00SIAM Conference on the Life Sciences 2014 - Joint Recap and Post MortemThis post was written by both Precocious Llama and Frogee:<br />
<br />
We've just returned from Charlotte, North Carolina where we attended the 2014 <a href="http://www.siam.org/meetings/ls14/" target="_blank">SIAM Conference on the Life Sciences</a>. Both of us were fortunate to have received student travel awards to present our respective posters at the conference, and between the plenary talks, the minisymposiums, the panels, and the poster session, we took away many new (at least to us) ideas.<br />
<div>
<br /></div>
<div>
There were two talks that we found particularly motivational. The first, which we heard during a minisymposium on Genetic and Biochemical networks, was by <a href="http://www.math.rutgers.edu/~sontag/index.html" target="_blank">Eduardo Sontag</a>; he told us of work on resolving scale-invariance (the phenomenon of fold-change detection) in biological networks by considering multiple time scales. He told stories of proposed biological models that didn't hold up to mathematical inspection with respect to scale-invariance, and how, by tweaking a model, they uncovered an unknown component of the biochemical network (at least this was our understanding). The second was a plenary talk by <a href="http://www.bu.edu/abl/" target="_blank">James Collins</a> on synthetic biology. Collins took us through the timeline of his work in synthetic biology, from modelling toggle switches to implementing them in bacteria, and all the applications they have been involved in since.<br />
<br />
When Frogee and I discussed these two talks after the conference, we enumerated some qualities that we appreciated about these talks:<br />
<br />
<ol>
<li>These talks struck a nice balance between the application and the mathematics used to solve the problems at hand. These were well-motivated stories.</li>
<li>Through their collaborations and research it was clear that they were open to learning new fields, and that by doing so they had a mastery of both the mathematics and life sciences. They didn't self-identify as mathematicians or physicists. They were scientists solving problems.</li>
<li>Both these speakers had phrases that began like "About 40 years ago, I was interested in ____". It's interesting to hear the insights of somebody who has been working on related problems for 40 years.</li>
<li>Neither of these talks are directly related to the research that we are currently pursuing, but we really enjoyed their accessibility; it reminded us why it's beneficial to make your work accessible across disciplines.</li>
</ol>
</div>
<div>
Although the conference advertised that it would provide a cross-disciplinary forum to catalyze applied mathematical research to the life sciences, it seemed that the majority of the minisymposia presenters did not clearly bridge their theorems and algorithms to applications in the life sciences, nor did they care to provide a life-science motivation for their research. Multiple speakers even proclaimed that there was no real-life application to their work, but rather they were just exploring the properties of different mathematical models. Unfortunately, this caused many of the minisymposia talks to have far less impact than the plenary talks. Regardless, we were still able to take away a few ideas that are translatable to our work.</div>
<div>
<br /></div>
<div>
Throughout the conference, there seemed to be a strong focus on neuronal signaling and biochemical reaction networks, with cellular behavior/movement/biophysics following close behind. Cancer modeling also made a strong showing, though not nearly as dominant as we expected it to be. Gene regulatory networks had a brief gasp of a showing. We found it very surprising that genomics and genetics in general were almost entirely absent from the conference. Of note, there was one plenary talk by <a href="http://www.maths.manchester.ac.uk/~ojensen/" target="_blank">Oliver Jensen</a> on plant root modeling that was relevant to the modeling that we have started pursuing in our work.</div>
<div>
<br /></div>
<div>
Unfortunately, there was an undercurrent of condescension towards both females and life scientists at the conference. Specifically, comments like "You should have more equations; it's what draws people in at this conference", and "Women tend to take research more personally" were particularly disappointing. We applaud one of the panelists who, during the Lee Segal forum, expressed his displeasure with a male conference attendee (who remained anonymous) with respect to a misogynistic attitude. The panelist and the attendee had been on the hotel elevator along with a female non-attendee; in response to the female non-attendee describing her position as a manager at a bank, the male conference attendee replied "Oh, so you have an easy job." </div>
<div>
<br /></div>
<div>
I think we took away many useful and innovative ideas from the conference. Overall, the conference was productive, and we'd like to attend again if we're given the opportunity to do so.<br />
<br />
And now for the post-mortem.<br />
<br />
What worked:<br />
<br />
<ol>
<li>We extended a poster mailing tube we had scavenged from around the department using duct tape and cardboard. We had no problems carrying this on the plane as a carry-on (American Airlines).</li>
<li>Going to the grocery store to get some snacks for the hotel room. We actually ended up getting full-blown meal materials from the grocery store so that we could eat meals in the hotel room given the relatively short duration of the meal times.</li>
<li>Using the coffee pot to cook oatmeal and macaroni and cheese.</li>
<li>Getting some sleep. At the beginning we tried to attend every session, but the 8am to 10pm duration of the conference every day eventually wore us down. We then started prioritizing sessions so that we could get a decent night's rest.</li>
</ol>
What didn't work:<br />
<br />
<ol>
<li>Having only a 2 hour layover between connecting flights. A delay on the first flight caused us to nearly miss the next one. 3 hours seems a reasonable buffer.</li>
</ol>
That's it. Here we are the morning after the poster session!<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-kQCof9P9eFU/U-kbr_VmzdI/AAAAAAAAAKY/5IvnnLY6LGo/s1600/IMG_3010.JPG" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-kQCof9P9eFU/U-kbr_VmzdI/AAAAAAAAAKY/5IvnnLY6LGo/s1600/IMG_3010.JPG" height="480" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Ryan McCormick at poster session for SIAM LS 2014</td></tr>
</tbody></table>
<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://1.bp.blogspot.com/-Rh_1GqrGUlk/U-kbo7ZQ_vI/AAAAAAAAAKQ/HSSS59NPt64/s1600/IMG_3011.JPG" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" src="http://1.bp.blogspot.com/-Rh_1GqrGUlk/U-kbo7ZQ_vI/AAAAAAAAAKQ/HSSS59NPt64/s1600/IMG_3011.JPG" height="480" width="640" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Sandra Truong at poster session for SIAM LS 2014</td></tr>
</tbody></table>
</div>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0tag:blogger.com,1999:blog-8291871830763638050.post-28035639880204188752014-06-26T06:18:00.005-07:002014-06-26T09:49:29.301-07:00sra to gzipped fastqOne way to take NCBI's SRA format to a gzipped fastq file (since BWA can take a gzipped fastq as input) is to pipe the <a href="http://eutils.ncbi.nih.gov/Traces/sra/?view=software" target="_blank">SRA Toolkit's</a> fastq-dump to gzip and direct that to an output file. Here's an example that uses the -X flag to only take 10 spots from the SRA file to test it out:<br />
<pre class="brush: perl">#!/bin/bash
#Get an sra file:
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads//BySample/sra/SRS/SRS399/SRS399719/SRR771638/SRR771638.sra
#Designate where the SRAToolkit executables are
SRATOOLKITBIN=./sratoolkit.2.3.5-2-ubuntu64/bin/
#fastq-dump converts the sra format to fastq format. -X designates how many spots to convert, and -Z designates to write to stdout.
${SRATOOLKITBIN}fastq-dump -X 10 -Z SRR771638.sra | gzip > SRR771638.fastq.gz
echo "Finished compression"
#Take a peek at the gzipped file.
gunzip -c SRR771638.fastq.gz | head -n 10
</pre>
Frogeehttp://www.blogger.com/profile/10185110445452841092noreply@blogger.com0