Monday, August 25, 2014

Recurrent artificial neural network evolution using a genetic algorithm for reverse engineering of gene regulatory networks - Postmortem

During the SIAM Conference on the Life Sciences poster session, one of my poster neighbors was a couple (who also worked in the same lab!); their poster was on a method to evolve an artificial neural network using a genetic algorithm to reverse engineer a gene regulatory network given expression data. This was the first time I had heard of such a concept, and it sounded like a great idea. Once the Llama and I returned from the conference, I started work on implementing such a method.

I started with the poster authors' original paper, Kordmahalleh and Sefidmazgi et al. (2014), but I was unable to determine how to apply the methods in that paper to reverse engineering of gene regulatory networks. I eventually landed on a paper by Noman, Palafox, and Iba (2013) that used a recurrent neural network model and was sufficiently detailed for me to get a handle on how such a model and genetic algorithm might be implemented. While this got me about 80% of the way to a working artificial neural network and genetic algorithm, the method by which they evaluated the fitness function for the genetic algorithm wasn't described in enough detail for me to implement it. My interpretation was that their method evaluated the fitness of each node independently of the other nodes, but I couldn't determine how they actually computed a node's output independently of the other nodes' outputs at a given time point. The implementations I tried failed to reconstitute the synthetic network that generated the input data.

After that, I abandoned the Noman, Palafox, and Iba (2013) subproblem method and tried solving the entire system of equations for the network. Coming into this, I didn't know a thing about solving differential equations; one of the reasons I went for the Noman et al. (2013) method in the first place was that I perceived it wouldn't require solving a system of differential equations. Fortunately for me, the Llama gave me plenty of guidance. Once I had gotten my bearings in Matlab, I looked for a C++ library that could numerically solve a system of ordinary differential equations. Thankfully, there is a nice library called odeint, and it has been incorporated into the Boost libraries. After grabbing the Boost libraries, I could solve the entire system of ordinary differential equations given the network parameters, and use the mean squared error against the input data to determine the fitness of the network. With the network fitness in hand, I was in business. As it turns out, this method is very similar to the one described in Wahde and Hertz (2000).

The code base can be found at the ANN_GA_prototype repo on my GitHub. To date, if a realistic amount of input data is provided, it can vaguely reproduce a synthetic network's output after about 100,000 generations, but the resulting topology is very different from that of the target synthetic network. It may be that accurately reverse engineering network topology requires more input data than is realistically available from most biological experiments.



And now for the postmortem:

What worked:
Code Complete: the Llama and I have been reading Code Complete as part of a weekly meeting we attend with a research group from the Mathematics department. When I first started writing the code, I tried to pull in some of the ideas from Code Complete, such as using abstraction to manage complexity and hiding implementation details. This worked great in the beginning since I had spent some time on design prior to writing any code. However, I ultimately failed to maintain Code Complete's tenets since I changed methodology in the middle (see more in What didn't work).

Object Oriented Programming: I usually don't work on code that is big enough to merit much OOP, but the abstraction helped a lot here.

odeint: while odeint wasn't as easy to use as Matlab, it was considerably faster at numerically solving the system of ordinary differential equations. That said, it was still quite approachable; given that I don't know a thing about solving ordinary differential equations, the fact that I got it working at all is a testament to its usability.

What didn't work:
Changing design midstream: the code base degraded quite a bit when I switched to implementing a different method to evolve the network. Basically, the original design was to operate on a population of nodes (as I had interpreted from Noman et al. (2013)). However, when this didn't work and I had to switch to operating on a population of networks, I was pretty frustrated and just wanted to get to a working solution. Now the code base is littered with remnants and re-tooled duplicates of the original method that need to be cleaned out. However, I think this was the best path to take since I could have potentially wasted time designing for a second method that also may not have worked. Now that it's working, I can go back and clean things up.

Defining the system of equations for odeint at runtime: There must be a better way to do this, but as it stands, I define the network size (i.e., the number of equations in the system that odeint solves) as a macro. I haven't worked out how to define the network size at runtime in a way that odeint will accept. The result is that I have to change the NETWORK_SIZE macro and recompile whenever I want to change the number of equations in the system. I'd much rather do this at runtime based on the number of genes in the input file.

Friday, August 15, 2014

Split string on delimiter in C++

Some suggest using the Boost libraries for this, but there's an answer to this question on StackOverflow by user Evan Teran (albeit not the accepted answer) that I much prefer because it uses only the standard library.

As posted on StackOverflow:
The first puts the results in a pre-constructed vector, the second returns a new vector.
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> &split(const std::string &s, char delim, std::vector<std::string> &elems) {
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim)) {
        elems.push_back(item);
    }
    return elems;
}

std::vector<std::string> split(const std::string &s, char delim) {
    std::vector<std::string> elems;
    split(s, delim, elems);
    return elems;
}
As noted on the StackOverflow post, it can be used like:
std::vector<std::string> x = split("one:two::three", ':');
This overloads the split function, and the second function requires the first, so you need both. As a secondary example, here is a version that I slightly expanded for readability and used as a member function of a class called TimeSeriesSetDataFileParser:
std::vector<std::string> & TimeSeriesSetDataFileParser::SplitString(const std::string &inputString, char delimiter, std::vector<std::string> &elements) {
    std::stringstream sstream(inputString);  //Taken from http://stackoverflow.com/questions/236129/how-to-split-a-string-in-c
    std::string element;
    while (std::getline(sstream, element, delimiter)) {
        elements.push_back(element);
    }
    return elements;
}

std::vector<std::string> TimeSeriesSetDataFileParser::SplitString(const std::string &inputString, char delimiter) {
    std::vector<std::string> elements;
    this->SplitString(inputString, delimiter, elements);
    return elements;
}
And I used it to parse a data file into a 2D vector of strings (the member functions Good() and GetLine() are just thin wrappers around the underlying std::ifstream's good() member function and std::getline()).
    std::cout << "Parsing input data file" << std::endl;
    std::string line;
    std::vector<std::vector<std::string> > vvstr_data;
    std::vector<std::string> vstr_splitLine;

    while (inputTimeSeriesSetDataFile->Good()) {
        line = inputTimeSeriesSetDataFile->GetLine();
        if(inputTimeSeriesSetDataFile->Good()) {
            vstr_splitLine = this->SplitString(line, '\t');
            vvstr_data.push_back(vstr_splitLine);   //For now we just read it into a 2 dimensional vector of strings.
        }
    }

Monday, August 11, 2014

SIAM Conference on the Life Sciences 2014 - Joint Recap and Post Mortem

This post was written by both Precocious Llama and Frogee:

We've just returned from Charlotte, North Carolina, where we attended the 2014 SIAM Conference on the Life Sciences. Both of us were fortunate to have received student travel awards to present our respective posters at the conference, and between the plenary talks, the minisymposia, the panels, and the poster session, we took away many new (at least to us) ideas.

There were two talks that we found particularly motivational. The first was during a minisymposium on genetic and biochemical networks, by Eduardo Sontag; he told us of work on resolving scale-invariance, the phenomenon of fold-change detection, in biological networks by considering multiple time scales. He told stories of proposed biological models that didn't hold up to mathematical inspection with respect to scale-invariance, and of how tweaking a model uncovered a previously unknown component of the biochemical network (at least, this was our understanding). The second was a plenary talk by James Collins on synthetic biology. Collins took us through the timeline of his work in synthetic biology, from modelling toggle switches to implementing them in bacteria, and all the applications they have been involved in since.

When Frogee and I discussed these two talks after the conference, we enumerated some qualities that we appreciated about these talks:

  1. These talks struck a nice balance between the application and the mathematics used to solve the problems at hand. These were well-motivated stories.
  2. Through their collaborations and research it was clear that they were open to learning new fields, and that by doing so they had a mastery of both the mathematics and life sciences. They didn't self-identify as mathematicians or physicists. They were scientists solving problems.
  3. Both these speakers had phrases that began like "About 40 years ago, I was interested in ____". It's interesting to hear the insights of somebody who has been working on related problems for 40 years.
  4. Neither of these talks was directly related to the research we are currently pursuing, but we really enjoyed their accessibility; it reminded us why it's beneficial to make your work accessible across disciplines.
Although the conference advertised that it would provide a cross-disciplinary forum to catalyze applied mathematical research in the life sciences, the majority of minisymposia presenters did not clearly bridge their theorems and algorithms to applications in the life sciences, nor did they care to provide a life-science motivation for their research. Multiple speakers even proclaimed that there was no real-life application to their work; rather, they were just exploring the properties of different mathematical models. Unfortunately, this caused many of the minisymposia talks to have far less impact than the plenary talks. Regardless, we were still able to take away a few ideas that are translatable to our work.

Throughout the conference, there seemed to be a strong focus on neuronal signaling and biochemical reaction networks, with cellular behavior/movement/biophysics following close behind. Cancer modeling also made a strong showing, though not nearly as dominant as we expected it to be. Gene regulatory networks had a brief gasp of a showing. We found it very surprising that genomics and genetics in general were almost entirely absent from the conference. Of note, there was one plenary talk by Oliver Jensen on plant root modeling that was relevant to the modeling that we have started pursuing in our work.

Unfortunately, there was an undercurrent of condescension towards both females and life scientists at the conference. Specifically, comments like "You should have more equations; it's what draws people in at this conference", and "Women tend to take research more personally" were particularly disappointing. We applaud one of the panelists who, during the Lee Segal forum, expressed his displeasure with a male conference attendee (who remained anonymous) with respect to a misogynistic attitude. The panelist and the attendee had been on the hotel elevator along with a female non-attendee; in response to the female non-attendee describing her position as a manager at a bank, the male conference attendee replied "Oh, so you have an easy job." 

I think we took away many useful and innovative ideas from the conference. Overall, the conference was productive, and we'd like to attend again if we're given the opportunity to do so.

And now for the post-mortem.

What worked:

  1. We extended a poster mailing tube we had scavenged from around the department using duct tape and cardboard. We had no problems carrying this on the plane as a carry-on (American Airlines).
  2. Going to the grocery store to get some snacks for the hotel room. We actually ended up getting full-blown meal materials from the grocery store so that we could eat meals in the hotel room given the relatively short duration of the meal times.
  3. Using the coffee pot to cook oatmeal and macaroni and cheese.
  4. Getting some sleep. At the beginning we tried to attend every session, but the 8am to 10pm duration of the conference every day eventually wore us down. We then started prioritizing sessions so that we could get a decent night's rest.
What didn't work:

  1. Having only a 2 hour layover between connecting flights. A delay on the first flight caused us to nearly miss the next one. 3 hours seems a reasonable buffer.
That's it. Here we are the morning after the poster session!
Ryan McCormick at poster session for SIAM LS 2014
Sandra Truong at poster session for SIAM LS 2014