Recurrent artificial neural network evolution using a genetic algorithm for reverse engineering of gene regulatory networks - Postmortem
During the SIAM Conferences on the Life Sciences poster session, one of my poster neighbors was a couple (who also worked in the same lab!); their poster was on a method to evolve an artificial neural network using a genetic algorithm to reverse engineer a gene regulatory network given expression data. This was the first time I had heard of such a concept, and it sounded like a great idea. Once the Llama and I returned from the conference, I started work on implementing such a method.
I started with the poster authors' original paper, Kordmahalleh and Sefidmazgi et. al. (2014), but I was unable to determine how to apply the methods in that paper to reverse engineering of gene regulatory networks. I eventually ended up on a paper by Noman, Palafox, and Iba (2013) using a recurrent neural network model that was sufficiently detailed for me to get a handle on how such a model and genetic algorithm might be implemented. While this got me about 80% of the way to a successful artificial neural network and genetic algorithm, the method by which they evaluated the fitness function for the genetic algorithm wasn't described in enough detail for me to implement. My interpretation of the paper was that their method evaluated the fitness of each node independently of the other nodes, but I was unable to determine how they actually evaluated a node's output independently of the output of the other nodes at a given time point. The implementations I tried failed to reconstitute the synthetic network that generated the input data.
After that, I abandoned the Noman, Palafox, and Iba (2013) subproblem method and tried solving the entire system of equations for the network. Coming into this, I didn't know a thing about solving differential equations; one of the reasons I went for the Noman et al. (2013) method was that I perceived it wouldn't require the solution of a system of differential equations. Fortunately for me, the Llama gave me plenty of guidance. Once I had gotten my bearings in Matlab, I tried to find a C++ library that could numerically solve a system of ordinary differential equations. Thankfully, there is a nice library called odeint, and odeint has been incorporated into the Boost libraries. After grabbing the Boost libraries, I could solve for the entire system of ordinary differential equations given network parameters, and use the mean squared error to determine the fitness of the network. With the network fitness in hand, I was in business. As it turns out, this method is very similar to the one described in Wahde and Hertz (2000).
The code base can be found at the ANN_GA_prototype repo on my Github. To date, if a realistic amount of input data is provided, it can vaguely reproduce a synthetic network's output after about 100,000 generations, but the topology is very different from the target synthetic network. It may be the case that more input data is required than is realistically available from most biological experiments to accurately reverse engineer network topology.
And now for the postmortem:
Code Complete: the Llama and I have been reading Code Complete as part of a weekly meeting we attend with a research group from the Mathematics department. When I first started writing the code, I tried to pull in some of the ideas from Code Complete, such as using abstraction to manage complexity and hiding implementation details. This worked great in the beginning since I had spent some time on design prior to writing any code. However, I ultimately failed to maintain Code Complete's tenets since I changed methodology in the middle (see more in What didn't work).
Object Oriented Programming: I usually don't work on code that is big enough to merit much OOP, but the abstraction helped a lot here.
odeint: While odeint wasn't as easy to use as Matlab, it was considerably faster at numerical solution of the system of ordinary differential equations than Matlab. However, that doesn't mean it wasn't easy to use; I don't know a thing about solving ordinary differential equations, so I think that's a testament to its usability.
What didn't work:
Changing design midstream: the code base degraded quite a bit when I switched to implementing a different method to evolve the network. Basically, the original design was to operate on a population of nodes (as I had interpreted from Noman et al. (2013)). However, when this didn't work and I had to switch to operating on a population of networks, I was pretty frustrated and just wanted to get to a working solution. Now the code base is littered with remnants and re-tooled duplicates of the original method that need to be cleaned out. However, I think this was the best path to take since I could have potentially wasted time designing for a second method that also may not have worked. Now that it's working, I can go back and clean things up.
Defining the system of equations for odeint at runtime: There must be a better way to do this, but as it stands, I define the network size (i.e. the number of equations in the system that odeint solves) as a macro. I have been unable to work out how to define the network size at runtime in a way that odeint will take. The result is that I have to change the NETWORK_SIZE macro and recompile if I want to modify the number of equations in the system that odeint solves for. I'd much rather be able to do this at runtime based on the number of genes in the input file.