Thursday, December 12, 2013

Embedding Python in C++: converting C++ vectors to numpy arrays, and plotting C++ vector contents using matplotlib

Edit: A comment on Stack Overflow from user4815162342 gave a helpful suggestion:
You really should look into using PyArray_SimpleNewFromData, as the OP proposed in the question. For larger vectors it is very important to avoid creating a Python list or tuple, because they require all of their elements to be Python objects, which is very memory- and CPU-inefficient for large vectors.
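For reference, here is a minimal sketch of that zero-copy approach (my addition, not from the original post; it assumes the NumPy C API headers are available and that import_array() has been called after Py_Initialize()):

     //Hedged sketch: wrap an existing std::vector<double> as a 1-D NumPy array
     //without copying each element into a Python float object.
     #include <Python.h>
     #include <numpy/arrayobject.h>
     #include <vector>

     PyObject* vectorToNumpyNoCopy(std::vector<double>& vec) {
          npy_intp dims[1] = { static_cast<npy_intp>(vec.size()) };
          //The array borrows vec's buffer, so vec must outlive the array.
          return PyArray_SimpleNewFromData(1, dims, NPY_DOUBLE, &vec[0]);
     }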

-----------------------------------------------

I haven't found a plotting library for C++ that I like as much as Python's matplotlib, but the prospect of writing data calculated in a C++ program to a file and then reading it into a Python program for plotting was not appealing. Fortunately, there's a Python/C API that enables Python to be embedded into C/C++.

I managed to cobble together code that makes two C++ vectors, converts them to Python tuples, passes them to Python, converts them to NumPy arrays, then plots them using matplotlib. There is probably a better way that folks with more experience can provide, but this seems to work.

Credit goes to these pages for helping me cobble it together:

A Makefile along with the C++ and Python code can be found on my GitHub, but here are some of the highlights.

Here is a bit of the important stuff from the .cpp file:
     //Make some vectors containing the data
     static const double xarr[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14};
     std::vector<double> xvec (xarr, xarr + sizeof(xarr) / sizeof(xarr[0]) );
     static const double yarr[] = {0,0,1,1,0,0,2,2,0,0,1,1,0,0};
     std::vector<double> yvec (yarr, yarr + sizeof(yarr) / sizeof(yarr[0]) );

     //Transfer the C++ vector to a python tuple
     pXVec = PyTuple_New(xvec.size()); 
     for (i = 0; i < xvec.size(); ++i) {
          pValue = PyFloat_FromDouble(xvec[i]);
          if (!pValue) {
               Py_DECREF(pXVec);
               Py_DECREF(pModule);
               fprintf(stderr, "Cannot convert array value\n");
               return 1;
          }
          PyTuple_SetItem(pXVec, i, pValue);
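          //Note: PyTuple_SetItem steals the reference to pValue, so no
          //separate Py_DECREF(pValue) is needed after a successful set.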
     }

     //Transfer the other C++ vector to a python tuple
     pYVec = PyTuple_New(yvec.size()); 
     for (i = 0; i < yvec.size(); ++i) {
          pValue = PyFloat_FromDouble(yvec[i]);
          if (!pValue) {
               Py_DECREF(pYVec);
               Py_DECREF(pModule);
               fprintf(stderr, "Cannot convert array value\n");
               return 1;
          }
          PyTuple_SetItem(pYVec, i, pValue);
     }

     //Set the argument tuple to contain the two input tuples
     PyTuple_SetItem(pArgTuple, 0, pXVec);
     PyTuple_SetItem(pArgTuple, 1, pYVec);

     //Call the python function
     pValue = PyObject_CallObject(pFunc, pArgTuple);
Here's the entire .py file:
def plotStdVectors(x, y):
    import numpy as np
    import matplotlib.pyplot as plt
    print "Printing from Python in plotStdVectors()"
    print x
    print y
    x = np.fromiter(x, dtype = np.float)
    y = np.fromiter(y, dtype = np.float)
    print x
    print y
    plt.plot(x, y)
    plt.show()
    return 0
And, after compiling with the Makefile (which is for Ubuntu 12.10 using the system's default Python installation), the program can be run with:
$ ./testEmbed pythonToEmbed plotStdVectors
Hello from main
Hello from runPython()
Printing from Python in plotStdVectors()
(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0)
(0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0)
[  1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.  14.]
[ 0.  0.  1.  1.  0.  0.  2.  2.  0.  0.  1.  1.  0.  0.]
Result of call: 0
Program finished
And the plot:

Thursday, December 5, 2013

Getting started with FFTW for computing the Discrete Fourier Transform

FFTW (the Fastest Fourier Transform in the West) is a library for, among other things, computing the Discrete Fourier Transform (DFT) of real data.

I have been using SciPy to compute the DFT of real data, but I have a love/hate relationship with Python, so I wanted the same capability in C/C++. Additionally, my understanding is that FFTW offers exceptional performance when computing the DFT of many arrays of the same size, because it invests in an initialization step that tailors the computation to the hardware (the "plan").
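As a concrete sketch of that plan-then-execute workflow (my own minimal example against the documented FFTW 3 API, not the code from my Github):

     //Minimal real-to-complex DFT sketch with FFTW 3.
     //Compile with something like: g++ fftw_sketch.cpp -lfftw3 -lm
     #include <fftw3.h>

     int main() {
          const int n = 16;
          double* in = (double*) fftw_malloc(sizeof(double) * n);
          fftw_complex* out = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * (n/2 + 1));

          //The "plan" step: FFTW chooses a strategy for this transform size once.
          //Plan before filling the input; some flags (e.g. FFTW_MEASURE) overwrite
          //the arrays during planning.
          fftw_plan plan = fftw_plan_dft_r2c_1d(n, in, out, FFTW_ESTIMATE);

          for (int i = 0; i < n; ++i) in[i] = (i % 4 < 2) ? 0.0 : 1.0; //0 0 1 1 ...

          //The plan can then be executed repeatedly on arrays of this size.
          fftw_execute(plan);

          fftw_destroy_plan(plan);
          fftw_free(in);
          fftw_free(out);
          return 0;
     }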

The installation on Ubuntu 12.10 went smoothly after downloading the package from their website. The usual round of
./configure
make
sudo make install
took care of everything.

The FFTW documentation was very helpful and got me up and running. My code and Makefile are posted on Github. The code generates an array of real numbers and computes its DFT twice: the first function returns an array of complex numbers, and the second returns the "halfcomplex" format (check the FFTW documentation for more on the "halfcomplex" format):
$./fftw_test 
Hello from main

Hello from real2complex()

This takes in a real number array and returns a complex number array:
Input array: 
 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
Output array: 
 (8,0) (0,0) (0,0) (0,0) (-4,4) (0,0) (0,0) (0,0) (0,0) (0,0) (0,0) (0,0) (0,0) (0,0) (0,0) (0,0)

Hello from real2real()

This takes in a real number array and returns a real number array:
Input array
 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
"Halfcomplex" output array: 
 8 0 0 0 -4 0 0 0 0 0 0 0 4 0 0 0
Power spectrum
 0 0 0 5.65685 0 0 0
Frequencies of spectrum 
 0.0625 0.125 0.1875 0.25 0.3125 0.375 0.4375
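To make the "halfcomplex" layout above a bit more concrete: for even n, FFTW stores r_0, r_1, ..., r_{n/2} followed by i_{n/2-1}, ..., i_1 (the k=0 and k=n/2 bins are purely real). Here is a minimal sketch of mine (not the Github code) that recovers the power spectrum shown above, e.g. sqrt((-4)^2 + 4^2) = 5.65685 at the fourth bin:

     #include <cmath>
     #include <cstdio>

     //Print the power spectrum of an n-point "halfcomplex" array hc.
     void powerSpectrum(const double* hc, int n) {
          //Bins 1 .. n/2-1 have a real part hc[k] and an imaginary part hc[n-k].
          for (int k = 1; k < n / 2; ++k)
               std::printf("%g ", std::sqrt(hc[k] * hc[k] + hc[n - k] * hc[n - k]));
          std::printf("\n");
     }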
Please let me know if you find any errors, and use the code at your own risk; my theoretical understanding of the DFT certainly isn't reliable!

Tuesday, December 3, 2013

grep: select non-matching lines

I recently discovered grep's -v option, which selects non-matching lines; it simply inverts the matching of the grep command.

For example, to look through a .vcf file for lines that don't have the QD annotation:
grep -v QD variants.vcf
You can also count such lines: here the first grep uses -P (a Perl regex) to exclude the header lines, and the second uses -c to count the lines lacking QD:
grep -v -P "^#" variants.vcf | grep -v -c QD
This post came about because I'm working with the Broad's GATK on a variant calling pipeline for an organism lacking an established variant database. While working on testing different variant hard-filters to bootstrap a "truth" set for variant calling, the GATK VariantFiltration walker threw some warnings that I wanted to look into further; I found grep's -v option from this post on the GATK forums.

Thursday, November 21, 2013

Comment out block of code in bash shell script

Credit goes to this post on Stack Overflow. I didn't fully understand the syntax at first, but it works: the : is the shell's null command, and quoting the here-document delimiter ('END') prevents any expansion, so the whole block is read as input to : and simply discarded.
#!/bin/bash
echo "Before commented block"
: <<'END'
command 1 # Doesn't get executed
command 2 # Doesn't get executed
END
echo "After commented block"
Another commenter pointed out that the string 'END' does not have a specific meaning and any string can be used. In fact, it may be advantageous to use a different string that you know won't be present in your block of code. Much to my amusement, he suggested SNURFLE_BURGERS:
: <<'SNURFLE_BURGERS'
echo this is not executed
SNURFLE_BURGERS
I'm torn as to which naming convention to adopt.

Unarchive and decompress .tar.bz2 and .tar.gz (Ubuntu 12.04)

Archived and compressed files can be unarchived and decompressed with tar:
tar -xjvf targetfile.tar.bz2

tar -xzvf targetfile.tar.gz
Archived files can be unarchived with tar:
tar -xvf targetfile.tar
Compressed files can be decompressed with bunzip2 or gunzip:
bunzip2 targetfile.bz2

gunzip targetfile.gz

Saturday, November 16, 2013

Texas A&M ENG-LIFE Workshop + Journal Club

The Llama and I attended the first Texas A&M University ENG-LIFE Workshop: At the Interface of Engineering and Life Sciences. I think it was a successful and productive workshop, with strong attendance from multiple disciplines and a fair amount of research that I consider to be interdisciplinary.

The talks were all short (15 minutes) and covered a wide range of topics, from bioenergy to microfluidics to biological pathways modeled as logic circuits. I think this format was successful because it let researchers from different departments get a brief overview of the work going on elsewhere at the university and identify potential new collaborators.

The Llama and I were fortunate to have a brief discussion with Professor Aniruddha Datta about his work on modeling biological systems using pathway logic after his talk. I suspect the relaxed format of the conference along with the relatively small size encouraged many such introductions and discussions.

One of the presentations was from Professor Arum Han, a collaborator of Professor Tim Devarenne, one of the professors who heads our journal club. Professor Han spoke on his work with microfluidic devices tailored for biological research, and his talk was relevant to this week's journal club paper.

This week's journal club paper: In vivo lipidomics using single-cell Raman spectroscopy by Wu et al. (2011). The presenters were two members of the Devarenne lab, so my comment was directed to them.

My comment:
"At the ENG-LIFE Workship today, one of your collaborators, Professor Arum Han, spoke on microfluidic platforms for oil production analysis. Are you all considering combining Raman spectroscopy or something similar with a microfluidics platform for high-throughput screening?"


Journal club: the trend towards rapid "high-throughput" development


Comment:
I think their method is a nice demonstration of where life science research strives to go, since it quickly samples in vivo conditions. But, I have a question concerning their proposal to scale this analysis to a rapid "high-throughput" method. It seems like their characterization is limited to what they already have in their training data, sort of like a microarray can only assay what we already know. Is this true, or did I misunderstand the technique?

Saturday, November 9, 2013

OpenGL vs Direct3D: A history

I've been searching around for information comparing and contrasting OpenGL and Direct3D (a component of DirectX). I happened upon this post on the Programmers Stack Exchange. The accepted answer, contributed by Nicol Bolas, is a wonderfully written tale of the origins of OpenGL and Direct3D, and of how Direct3D rose to dominance.

Here is an excerpt from the answer:
"
So, the OpenGL development environment was fractured for a time. No cross-hardware shaders, no cross-hardware GPU vertex storage, while D3D users enjoyed both. Could it get worse? 

You... you could say that. Enter 3D Labs. 

Who are they, you might ask? They are a defunct company whom I consider to be the true killers of OpenGL. Sure, the ARB's general ineptness made OpenGL vulnerable when it should have been owning D3D. But 3D Labs is perhaps the single biggest reason to my mind for OpenGL's current market state. What could they have possibly done to cause that? 

They designed the OpenGL Shading Language.
"

Friday, November 8, 2013

Journal club: One happy epigenetic family

This week's journal club paper: A SWI/SNF Chromatin-Remodeling Complex Acts in Noncoding RNA-Mediated Transcriptional Silencing by Zhu et al. (2013).

This paper was great.

My comment:
"I thought this paper was excellent. It painted a great story showing that the three major elements typically implicated in epigenetic regulation are elements of the same pathway. I don't think they made a very strong case that this regulatory pathway is important for elements other than transposons and repeats, but seeing as how transposons are such a large component of genomes, I think this work is important. This was good stuff."

Saturday, November 2, 2013

Using Emscripten to compile C++ to JavaScript on Windows 8

Small victories.

Credit goes to this post from Joshua Granick for providing the guidance for doing this.

Joshua Granick's post pointed to where the brilliant mind behind Emscripten (Alon Zakai) put up a guide for Using Emscripten on Windows.

His post also pointed to his code for drawing an image in C++ and compiling it to JavaScript found as a gist here.

Successful compilation of the code in his gist produces the image that he posted here of a red triangle on a black background.

I don't know anything about the Simple DirectMedia Layer yet, but I was able to change the colors and size of the image and display it in the browser. Below is a screenshot of the JavaScript-produced image.


This means that C++ code was turned to JavaScript that was then executed in a browser. Magnificent.

Friday, November 1, 2013

Journal Club: Ugly dogs

This week's journal club paper: Integration of responses within and across Arabidopsis natural accessions uncovers loci controlling root systems architecture by Rosas et al. (2013)

This paper was like one of those dogs that's so ugly it's cute.

My comment:
"I think this is a flawed paper, but the authors' logic was so strange that I liked it. I think the idea that phenotypic plasticity is a mechanism leading to fixed genetic diversity is an odd one, and I don't think that they demonstrated that their candidate genes mediate phenotypic plasticity. However, I can't propose a better way to convincingly test their proposed hypothesis."

Journal club wonkiness

I think their hypothesis is entirely wonko as it aims to answer a question on the origin of natural variance. #ihatefrogs

This paper introduces an interesting experimental approach to identify genes that both enable phenotypic plasticity and underlie adaptive changes across natural variants. I think that the authors should have defended the significance of their Principal Component phenotypes with respect to how those phenotypes are relevant in natural settings, otherwise there isn't a good reason to actually use the PC phenotypes in the downstream analyses.



Wednesday, October 30, 2013

Communicating science without undermining its complexities: different sizes of infinity

I just came across this YouTube video that I believe explains different sizes of infinity in a very understandable way. The video does so without introducing diagonalization, which I felt was instrumental in my own understanding of the different sizes of infinity (when Professor Harold Boas first introduced it in my first "real math" class, it blew my mind).
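For anyone who does want the diagonalization sketch (my wording, not the video's): suppose the reals in $(0,1)$ could be listed as $x_1, x_2, x_3, \ldots$ with decimal expansions $x_i = 0.d_{i1}d_{i2}d_{i3}\ldots$, and build $y = 0.e_1e_2e_3\ldots$ by choosing $e_i = 5$ if $d_{ii} \neq 5$ and $e_i = 6$ otherwise. Then $y$ differs from every $x_i$ in the $i$-th digit, so $y$ appears nowhere on the list; no list can exhaust $(0,1)$, which is therefore a strictly larger infinity than the countable natural numbers.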

However, this video reminds me that a critical aspect of practising science is the ability to communicate its importance without requiring extensive background knowledge. In my opinion, this video does this very well without undermining the complexity of the problem.


Monday, October 28, 2013

Comment and annotate PDF in Ubuntu (12.10)

I finally found a somewhat satisfactory way to comment on and annotate PDF files in Ubuntu after trying a few of the solutions suggested by the interwebs: a piece of software called Foxit Reader can be run through Wine.

I installed Wine a while back, and I don't remember doing anything special; if I recall correctly, I followed these directions at the Ubuntu site.

After downloading the Foxit installer (.exe) from the Foxit site, I just ran it with Wine and installed using all the defaults.

wine FoxitReader606.0722_enu_Setup.exe


Oh glorious day.

Thursday, October 24, 2013

Open terminal to current working directory from nautilus (Ubuntu 12.10)

Credit goes to this Ask Ubuntu forum post.

There is a package that lets you right-click in Nautilus' directory window and open a terminal whose working directory is set to the directory shown in Nautilus.

Install the package via the package manager and then restart nautilus. 
sudo apt-get install nautilus-open-terminal
nautilus -q


Then right click and choose open in terminal.

Saturday, October 19, 2013

awk and GNU parallel: problems with quotes

Embarrassing as it is to admit, I spent about two hours trying to work out how to parallelize an awk command with GNU parallel. I think the conclusion is that I don't understand quoting in bash as well as I should.

My goal was to run the awk code from my previous post in parallel. Credit goes to this post on Stack Overflow for getting me to the solution.

One of the answers in the Stack Overflow post suggested storing the awk command in a string, and that worked for me: bash expands $awk_body inside the double-quoted command string, while the embedded single quotes are passed through literally to the shell that parallel spawns:
#!/bin/bash

awk_body='(substr($0, 1, 1) != "@") && ($0 != "+") {print substr($0, 6)}
          (substr($0, 1, 1) == "@") || ($0 == "+") {print $0}'

ls *.fastq | parallel --gnu -j-2 --eta "cat {} | awk '$awk_body' > trimmed{}"

Kudos to Ole Tange for GNU parallel.

Edit:

Ole Tange stopped by and left some good pointers in the comments:

The GNU Parallel man page has a section dedicated to quoting: http://www.gnu.org/software/parallel/man.html#quoting

Often you can simply add \' around every '.

Or use --shellquote to generate a quoted version of your string.

For readability you might want to look into writing a Bash function instead: http://www.gnu.org/software/parallel/man.html#example__calling_bash_functions

Thanks for the suggestions, Ole!


Friday, October 18, 2013

Journal club: Ontologies to hypotheses. UhLAWLNo

The design for data generation and the actual data generated seem useful and add to the field of characterizing totipotent cells. However, the authors build bridges from the gene ontologies of hierarchical clusters to hypotheses with little to no support in between. For example, when they found that ribosomal protein genes were enriched in four clusters, implying differential expression, they hypothesized that "ribosomal equipment needs to be precisely regulated during re-entry in to the cell cycle". But what if this is just peripheral information that is non-essential to the gene state, and the cell can still enter the cell cycle even when it is not precisely regulated? The analyses seem peppered with weakly supported hypotheses.


Comment
I think the authors' story was poised to take better advantage of the many similar data sets generated by other labs. Had they made better use of those datasets, they could have
  1. potentially given their hypotheses more support, and 
  2. more importantly, made a stronger case for their dataset's contribution to the field.
Edit
During journal club, another student asked why they didn't collect meristem cells if that is essentially the state they are trying to assay, and it dawned on me that these protoplasts may be an "erased" state with no tissue identity, not even as quiescent cells. This leads back to Dr. Maggert's question of whether, with respect to epigenetics, there is a temporal/spatial difference between "erasure" and "establishment"; I think this could be an example of a cell that has been "erased" but not yet "established". I also think that Dr. Maggert would 100% argue that this event falls into the category of development, which he defines as mutually exclusive with epigenetics.

I appreciate Dr. Devarenne's insights in journal club; I think they add value to learning how to critique science.

Journal club: On data generation and analysis

This week's paper: Characterization of the early events leading to totipotency in an Arabidopsis protoplast liquid culture by transcript profiling by Chupeau et al. (2013)

I think the experiment is a good idea, and the data could be informative; unfortunately, I don't think the authors had the expertise to analyze the data they generated. They probably would have been better off adding another author or two and passing the data to a collaborator with more bioinformatics experience; they may have been able to make a more cogent story.

My observation has been that some (many?) life scientists are quite antagonistic towards bioinformaticians/computational scientists; I've also met some on the computing side of the fence who are quite antagonistic towards the life scientists. The life scientists think the computational folk just push buttons and use meaningless statistics or models. The computational folk think the life scientists are just observational scientists who don't appreciate scientific rigor.

My comment:
"I think the experiment behind this paper is worthwhile and that the methods and data will be useful. However, I also think that the paper demonstrates a common issue in modern genomics research, which is that our data generation capacity frequently outpaces our technical ability to analyze the data in meaningful ways."

Tuesday, October 15, 2013

Modeling of convection

The Llama and I consider Professor Wolfgang Bangerth to be one of our intellectual and professional role models. He allows the Llama and me to attend his research group meetings, and he's a member of my advisory committee. He just posted a beautiful video depicting convection, solved using the ASPECT and deal.II software libraries that he develops for.


The "About" section from the video:
"Convection in a box with heated bottom and cooled top. Solved with the finite element code ASPECT (http://aspect.dealii.org) using the 3dbox.prm input file set up by Timo Heister. Simulation by Wolfgang Bangerth on 64 cores. The simulation has ~10M unknowns and runs for ~10,000 time steps. Further discussion in the Aspect manual."

Sunday, October 13, 2013

Code golf

I recently discovered code golf. Code golf is basically a competition to write a program that solves a given problem in the minimum number of characters (or memory) possible. In general, a problem is presented and then the golfers propose their solutions with character counts. I think that the formal competitions are limited to particular languages, but the code golf on the Stack Exchange usually has solutions in many different languages.

I haven't golfed myself, but reading the golfing of others always gives me a good larf. I particularly enjoy the C/C++ golfers. As an example, this post on the Stack Exchange has a C++ implementation of Dijkstra's Algorithm for pathfinding in 799 characters.

Friday, October 11, 2013

Journal Club: Is statistical significance always biological significance?

Edit: The journal club presenter stated that the moths lay ~6 to 8 eggs, so perhaps the one-egg difference mentioned below is meaningful.

This week's journal club article: Feeding-induced rearrangement of green leaf volatiles reduces moth oviposition by Allmann et al. (2013).

Today marks the last green leaf volatile paper for the semester.

This was the first time I've read a paper from the journal eLife. I liked the journal's layout. eLife has made a bit of a splash, and I think it's one of many journals that will gain popularity as the open access movement continues to gain momentum. I expect that the future of scientific literature will be more of an open-access, crowd-sourced effort where peer review is done in the comments section of a publication by hundreds of scientists; search engines will surface relevant papers, and crowd review efforts will determine what is accurate and important and let the garbage filter to the bottom.

Will this be better than what we have now? I have no idea. I just don't believe that the traditional publication system is capable of disseminating the tremendous volume of information that the world's research institutions can now generate.

As for the paper, the authors reported many statistically significant differences for the differential production of the green leaf volatile (GLV) isoforms in response to the M. sexta oral secretions (OS), and they reported that a behavioral difference in oviposition occurred because of reception of one of the particular GLV isoforms. While I don't doubt the differences they found were statistically significant, I'm not convinced that they are actually of any biological significance.

In their field trials, the same experiments did not yield statistically significant differences in the first trial, but they did in the second (although the visual representation of the data isn't very convincing that a difference exists for the second trial).

The calcium-activity data from the antennal lobe were more convincing that there is a difference in perception of the two isoforms, although a difference in perception doesn't demonstrate a difference in behavior.

Finally, I think Figure 7 was supposed to demonstrate the key point of the paper: behavioral changes based on differential perception of the two isoforms. While they did again see significant differences, it looks like there was only ever an average difference of one egg oviposited between the two treatments. I think it's potentially misleading that they test the absolute difference, $\mu_1 - \mu_2 \neq 0$, rather than the relative difference, $\mu_1 / \mu_2 \neq 1$, since, as reported, I have no clue how many eggs these things lay. Is it a difference between 100 and 101? Or 2 and 3? As such, I'm not very convinced that this paper has shown anything new.

 My comment:
"I thought this paper's hypothesis was fun, but I didn't find the data very convincing. If I'm reading it correctly, Figure 7 reports that there is an average difference of just one egg between the two isoform treatments. I have no idea how many eggs these moths lay, and I'm not convinced that the one egg difference is demonstrative of a behavioral difference as a result of differential interpretation of the two isoforms."

Journal club: neuroscience entomology

Feeding-induced rearrangement of green leaf volatiles reduces moth oviposition

The authors painted a very intriguing dynamic of green leaf volatiles (GLVs). Typically, from my sample size of two other GLV papers from journal club, I think of GLVs as chemicals released by the plant in response to damage in order to recruit predators of the organism causing the damage.

This paper takes a slight twist in that the GLV is constitutively released by the plant, and it is the caterpillar that changes the isomer composition of the GLV, which in turn attracts the caterpillar's predator! Moreover, the new isomer is also detected by the moth (the caterpillar's adult stage), which then oviposits (lays eggs) on a different plant, as the new isomer likely indicates predators that could eat the moth's eggs.

Unfortunately, I'm not sure the authors found enough supporting evidence. While they found that the antennal lobe seems able to differentiate the isomers, they didn't convincingly show that this is consequential, i.e., that different isomer ratios change oviposition.

Comment:
With respect to the calcium responses in the antennal lobes, the authors mention that they cannot directly compare individuals because they lack a physiological map of the antennal lobe. Given that, I couldn't work out how they determined the regions of interest (ROIs). If they couldn't define regions based on some physiological map before they saw the calcium fluorescence fluxes, did they define them based on the fluorescence time course after treatment? Regions 3 and 4 seem pretty close together, and misplacing the region of detection seems like it could significantly change the results.

Wednesday, October 9, 2013

Using awk: removing restriction site sequence from reads in fastq format

I had a fastq file of RAD-seq reads in which the sequences still contained the restriction enzyme (RE) site. In order to use them with reads that already had the site trimmed off, I needed to get rid of the RE site.

The fastq file was of the format:
@1_1101_11092_1965_1
CCGGCACAGGCCTTGGAGGCGCTGGAGGCACACGANGTGGCACGTTGATCTGTGTGTGAGGCGCCGCGGGTTTGTGAATGTGGACCGGACTGCTTCTGCTGCT
+
HIHHIIHGGH;FHIIIIGIGIIIIIBGHEIHHHED#,,;?CC@BBBBBCCECCC:?5?>CBBB@B9BBBB08?@??3:@@:CC>@?BB39<@C@CAC::>@AC
@1_1101_11396_1963_1
CCGGCGACTCATAGGCAGTGGCTTGGTTAAGGGAANGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCGATTGTCACTGCTTATGGACCCGAA
+
HJJIJJJJJJJJJJJJJJGIJJJJJJGIIJEIIHH#-;BEFDDDDDDDDDDDDDDDBDDDDDD<BB>CDEDEEDDDDDD@DBDCDEDDDDDDDDCDCDDDDBD

I used awk's substr() function to get rid of the first 5 characters (corresponding to the RE site) in each sequence or quality line, as so:
#!/bin/bash
awk '(substr($0, 1, 1) != "@") && ($0 != "+") {print substr($0, 6)}
     (substr($0, 1, 1) == "@") || ($0 == "+") {print $0}' input.fastq > output.fastq

Output sans RE site:
@1_1101_11092_1965_1
ACAGGCCTTGGAGGCGCTGGAGGCACACGANGTGGCACGTTGATCTGTGTGTGAGGCGCCGCGGGTTTGTGAATGTGGACCGGACTGCTTCTGCTGCT
+
IHGGH;FHIIIIGIGIIIIIBGHEIHHHED#,,;?CC@BBBBBCCECCC:?5?>CBBB@B9BBBB08?@??3:@@:CC>@?BB39<@C@CAC::>@AC
@1_1101_11396_1963_1
GACTCATAGGCAGTGGCTTGGTTAAGGGAANGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCGATTGTCACTGCTTATGGACCCGAA
+
JJJJJJJJJJJJJGIJJJJJJGIIJEIIHH#-;BEFDDDDDDDDDDDDDDDBDDDDDD<BB>CDEDEEDDDDDD@DBDCDEDDDDDDDDCDCDDDDBD

Monday, October 7, 2013

Using awk: bitwise operations and string manipulation

I learned a few new things trying to reformat a sam file with awk. I also learned that what I was attempting to do wouldn't ultimately solve my problem, but I wanted to keep what I learned about bitwise operations and other string manipulations in awk here.

To test for the presence of a certain bit, I had to use GNU awk (gawk) since the awk on my system didn't have these functions, and I used the and() function (see here for bitwise operation documentation; see here and here for additional sam bit flag information).

I used substr() for string manipulation and length() to access elements from the end of the string (credit goes to here and here).

I used printf when I wanted to avoid generating a newline character (credit goes to here).

The original goal was to trim restriction sites from RAD reads and quality scores, but this doesn't account for the CIGAR string or the alignment coordinate so it ultimately doesn't work. Here is the gawk script in its entirety (with some comments):

#!/bin/bash
#Skip the header lines
gawk '(substr($0, 1, 1) != "@") {
        #Test whether the bit flag has the 0x10 bit set (indicating a read on the reverse strand) using the and() function.
        if (and($2,0x10)) {
                #Don't adjust the first 9 columns.
                for (i=1; i<=9; i++) {printf "%s\t" , $i}
                #Trim 5 bases from the end of the string and the quality score
                printf "%s\t", substr($10, 1, length($10)-5)
                printf "%s\t", substr($11, 1, length($11)-5)
                #Print any remaining columns; NF is a special variable containing the number of fields. 
                for (i=12; i<NF; i++) {printf "%s\t", $i}
                printf "%s\n", $NF
        }
        #Else it is not a reverse strand read; trim 5 bases from the start.
        else {
                for (i=1; i<=9; i++) {printf "%s\t" , $i}
                printf "%s\t", substr($10, 6)
                printf "%s\t", substr($11, 6) 
                for (i=12; i<NF; i++) {printf "%s\t", $i}
                printf "%s\n", $NF
        }
}
#Include the header lines.
(substr($0, 1, 1) == "@") {print $0}
' input.sam

Saturday, October 5, 2013

Journal Club: short ORFs and gene prediction

This week's journal club paper: Small open reading frames associated with morphogenesis are hidden in plant genomes by Hanada et al. (2013).

The implications of the idea that many functional, small open reading frames (sORFs) exist in eukaryotic genomes are pretty exciting. The paper didn't entirely convince me that this was the case, but I was sufficiently persuaded that it merits further investigation.

I didn't see any major red flags in the paper. I would have preferred biological replicates for their sORF arrays over technical replicates. Additionally, overexpression mutants can cause all kinds of wacky effects that aren't necessarily a consequence of the functional properties of the overexpressed genes; I'm not sure the conclusion that overexpressing sORFs causes phenotypic consequences more frequently than overexpressing randomly chosen genes will hold up to further scrutiny (maybe sORF products accumulate more effectively than regular protein products and cause strange side effects). However, again, I think it's sufficient to merit further investment in the topic, such as the knock-down/knockouts they propose.

My comment:

"I think it's worth continuing to pursue study of these small ORFs as the authors propose. They may have already shown this in one of their previous publications, but I'm interested in the properties of these small ORFs in the context of gene prediction. They said they used hexamer composition bias for prediction; are there other properties that might be useful for prediction, like CpG island promoters?"

Friday, October 4, 2013

Journal club: uORFs, eukaryotic "operon"?

Paper: Small open reading frames associated with morphogenesis are hidden in plant genomes by Hanada et al. (2013).

Overall, I liked the paper. It was the first time I had been made aware of the abundance of un-annotated small open reading frames, and I think it adds awareness in the field to something new, a method of detection, and an example of its applications.

Comment 1: Their investigation of expression at both the RNA and protein levels reminded me that we often make assumptions about the relationship between transcript abundance and protein abundance; even if we detect translation, it doesn't guarantee that we can detect transcripts accumulating, and vice versa.

Comment 2: I had never heard of upstream ORFs; they sound conceptually similar to the operons/polycistronic transcripts found in prokaryotes, so I thought it was interesting that they exist in a eukaryotic system.

Comment 3: The paper described some techniques (controls) that I had never heard of before that I thought were interesting and useful. For example, determining the null distribution of the Ka/Ks ratio with random sequences to obtain a neutrality baseline to identify sORFs under selection was a new idea to me.

Comment/Question 4: What tissue types were used for the 17 different environmental conditions? They refer to seedling, but what tissue on the seedling?

Comment 5: They mentioned that other organisms, including humans, contain sORFs, and specifically they referenced the ENCODE project. I was wondering what ENCODE used to identify these sORFs; was it a similar algorithm, and if not, could it be used to identify more un-annotated genes?

Comment 6: I'm a bit skeptical of their claim that their prediction algorithm has the best performance for identifying small ORFs. I agree that it performed well and identified a significant number of previously un-annotated genes; however, the training set used to obtain their a priori hexamer composition bias came from the same organism, Arabidopsis, that they go on to test in. I question whether it would perform as well in other organisms.
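As a note to myself on what "hexamer composition bias" likely means in practice (my sketch of the general technique, not the authors' actual algorithm): score each candidate ORF by the log-odds of its hexamers under coding versus noncoding frequency tables estimated from training sequences, and call it coding if the score clears a threshold.

     #include <cmath>
     #include <string>
     #include <unordered_map>

     //Hedged sketch of hexamer log-odds scoring; positive scores favor coding.
     double hexamerScore(const std::string& orf,
                         const std::unordered_map<std::string, double>& pCoding,
                         const std::unordered_map<std::string, double>& pNoncoding) {
          double score = 0.0;
          for (std::size_t i = 0; i + 6 <= orf.size(); i += 3) { //step by codon
               const std::string hex = orf.substr(i, 6);
               const auto c = pCoding.find(hex);
               const auto nc = pNoncoding.find(hex);
               if (c == pCoding.end() || nc == pNoncoding.end()) continue;
               score += std::log(c->second / nc->second);
          }
          return score;
     }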

Wednesday, October 2, 2013

awk to transpose (super fast)

Credit goes to Stack Overflow - Transpose a file in bash. I was looking for a fast way to transpose a tab-separated value file, file.tsv, so per the instructions, I created the following shell script, transpose.sh:
#! /bin/bash
awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j] 
        for(i=2; i<=NR; i++){
            str=str"\t"a[i,j]; 
            # delimiting variable can be replaced (i.e. "," or " " or "\t")
        }
        print str
    }
}' file.tsv

Then, I ran transpose.sh from the command line and printed the transposed file.tsv to tr_file.tsv:
 ./transpose.sh > tr_file.tsv

Friday, September 27, 2013

Journal club: publishable units?

Paper: UV-C-irradiated Arabidopsis and Tobacco emit volatiles that trigger genomic instability in neighboring plants from Yao et al. (2011).

After reading the paper, Frogee told me that the HRF (high recombination frequency) detection method used as a proxy for genomic instability was not well supported, and actually seems to be unaccepted.

This paper's story hinges on plant signalling that induces genomic instability; however, it seems that they have not validated that the detection assay is a reliable one. That is, there is no evidence that "a single recombination event in the recombination cassette restores luciferase activity".

Comment:

The authors' interpretations are hard to swallow; I don't believe they showed what they claim, namely that "a single recombination event in the recombination cassette [will] restore luciferase activity". Nonetheless, the genetic experimentation with mutants compromised for SA and/or JA synthesis and/or perception supported that these two compounds are volatile, excreted under certain stresses, and perceived by plants. Whether or not these findings were a publishable unit without the stretch to high-frequency recombination, I'm not sure.


The presenter did a good job: relevant background information, and questions fielded phenomenally well.

Journal Club: Homologous recombination reporter lines

This week's journal club paper: UV-C-irradiated Arabidopsis and Tobacco emit volatiles that trigger genomic instability in neighboring plants from Yao et al. (2011).

I was pretty surprised that this one got through peer review. Everything in this paper hinges on the ability of the reporter line to faithfully report homologous recombination events, but I was unable to find any evidence suggesting this is the case. I was also unable to find such evidence in any of the citations.

Fortunately, I wasn't the only one. A strong rebuttal regarding publications using these somatic homologous recombination reporter lines was published a year later in the same journal: Reevaluation of the reliability and usefulness of the somatic homologous recombination reporter lines from Ulker et al. (2012).

I suspect everyone's comment in journal club will be the same, and something along the lines of:

"This paper presents an interesting idea, but I think it fails to demonstrate that their reporter line actually reports only homologous recombination. I think that a 2012 rebuttal paper from Ulker et al. about these somatic homologous recombination report lines successfully defends the idea that there are many other interpretations of reporter activity in these lines, including stress induced read-through transcription."

Monday, September 23, 2013

Change MySQL data file directory (Ubuntu 12.10 64-bit)

Since our MySQL database is stored on an individual hard drive, and the database is growing beyond the size limits of the hard drive, a temporary solution is to move it to a larger drive. Another temporary measure is to clear the binary logs as described in a previous post. I suspect the real solution is to use a distributed database (also see the Wikipedia entry), but I don't think our use case is quite ready for that yet. This post is on how to change the MySQL data file directory.

This is using a MySQL 5.5.29 server installed via the apt package manager on Ubuntu 12.10 64-bit (MySQL server version 5.5.29-0ubuntu0.12.10.1-log)

Credit goes to this post and this forum post for the information.

If you don't already know where on the hard drive your MySQL database files are stored, identify their location, and then stop the server:
#Enter the mysql shell
$ mysql

> SHOW VARIABLES LIKE 'datadir';

+---------------+-----------------+
| Variable_name | Value           |
+---------------+-----------------+
| datadir       | /var/lib/mysql/ |
+---------------+-----------------+

#Exit the shell and shut down the server
> exit
$ sudo service mysql stop
Now we want to copy the database over to the new location and retain the previous permissions. I'm choosing to copy instead of move in case something goes wrong (although I recommend having additional backups of your database in the event of something like drive failure).
sudo cp -p -r /var/lib/mysql /path/to/new/location
Next we want to tell mysql where to look for the data directory on startup. I prefer to modify the my.cnf configuration file to save having to specify it when starting. My text editor of choice is vim.
sudo vim /etc/mysql/my.cnf
Here you want to change the datadir line under the [mysqld] section
#I'm choosing to comment out the old line in case disaster strikes and I need to get back to start.
#datadir     = /var/lib/mysql
datadir      = /path/to/new/location/mysql
Save those edits. We also need to modify the apparmor profile:
sudo vim /etc/apparmor.d/usr.sbin.mysqld
Add the following near the end before the closing bracket:
#You should see similar directives for your previous data directory, so you can follow their lead (i.e. /var/lib/mysql/ r, and /var/lib/mysql/** rwk,)
/path/to/new/location/mysql/ r,
/path/to/new/location/mysql/** rwk,
There was a report that having lines for the previous data directory would cause problems. Leaving those lines didn't seem to cause any adverse effects for me, but they may need to be removed or commented out in other cases.
Save the edits and then restart apparmor:
sudo /etc/init.d/apparmor restart
If that went well, you can start the mysql server and make sure everything is as expected.
sudo service mysql start

Saturday, September 21, 2013

Journal club: sweet plots

This week's journal club paper is: Genome-wide analysis of histone H3.1 and H3.3 variants in Arabidopsis thaliana Stroud et al. (2012)

I thought it was very resourceful of Stroud et al. to pool together large datasets to draw connections about the functional deposition of H3.1 and H3.3. I have a couple of questions:

Is it possible that their use of Myc-tagged H3.1/H3.3 transformed into the plant limits their ability to assay native H3.1/H3.3? In other words, maybe their absence of signal isn't necessarily evidence of absence? And is there concern for differential binding of the Myc-tags?

Frogee said that making Myc-tags is probably easier than finding specific antibodies for H3.1/H3.3. He's probably right, but I'm still curious how much more difficult making antibodies would have been, and whether doing ChIP-seq with those antibodies would have yielded differences.

This is the first time I've encountered, or maybe paid attention to, plots that show values plotted against enrichment and distance from enrichment. I have no understanding of how this is done, but I think I need to explore this technique of clustering these regions, because, at least to me, the figures look convincing.
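My guess at the mechanics of those plots (purely my assumption, not taken from the paper): anchor every enriched region at a reference point such as its midpoint, then average the signal at each fixed offset across all regions, producing one mean value per distance from the anchor. A minimal sketch:

     #include <cstddef>
     #include <vector>

     //Average a genome-wide signal around a set of anchor positions
     //(e.g. midpoints of enriched regions); returns one mean per offset
     //in [-flank, flank].
     std::vector<double> metaProfile(const std::vector<double>& signal,
                                     const std::vector<std::size_t>& anchors,
                                     int flank) {
          std::vector<double> mean(2 * flank + 1, 0.0);
          std::vector<int> count(2 * flank + 1, 0);
          for (std::size_t a : anchors) {
               for (int off = -flank; off <= flank; ++off) {
                    const long pos = static_cast<long>(a) + off;
                    if (pos < 0 || pos >= static_cast<long>(signal.size())) continue;
                    mean[off + flank] += signal[pos];
                    ++count[off + flank];
               }
          }
          for (std::size_t i = 0; i < mean.size(); ++i)
               if (count[i] > 0) mean[i] /= count[i];
          return mean;
     }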

I'm curious as to how H3.1 and H3.3 have been shown to evolve separately in plants and animals.

My comment:

This is the first time I've encountered plots that show values plotted against enrichment and distance from enrichment. I don't know how this technique of clustering these regions is done, but I want to find out, because, at least to me, the figures look like a convincing way to summarize data from many different regions of the genome.

Friday, September 20, 2013

Journal Club: Nucleosome prediction

This week's journal club paper was: Genome-wide analysis of histone H3.1 and H3.3 variants in Arabidopsis thaliana Stroud et al. (2012)

The number of existing datasets that Stroud et al. were able to pull genomic data from was, I think, a nice example of putting Arabidopsis' rich research history to good use. It's also nice to see when general chromatin features are conserved between plants and animals, although it's quite surprising, borderline unbelievable, that H3.1's and H3.3's functions in both systems are a product of convergent evolution.

My comment:

"It's not really the purpose of the paper, but I was excited to see that there exists a nucleosome prediction algorithm that appears to work well in C. elegans and, as this paper suggests, in Arabidopsis.  I think the idea that DNA sequence is a good predictor of nucleosomal location across large evolutionary distances could permit searching for higher order chromatin structures on the basis of DNA sequence."

Monday, September 16, 2013

Gamasutra Post: Overview of Motion Planning by Matthew Klingensmith

Motion planning is one of the research interests of one of my committee members. This post on Gamasutra written by Matthew Klingensmith discusses the use of motion planning for AI pathfinding; it was a helpful introduction for me. This stuff is amazing.

An excerpt from the article:

"In game development literature, the act of getting from point A to point B is usually called "pathfinding," yet I've been calling it "motion planning." Basically, motion planning is a generalization of pathfinding with a looser definition of "point" and "path" to include higher dimensional spaces (like rotations, or joints).

Motion planning can be thought of as pathfinding in the configuration space of the Agent."

Saturday, September 14, 2013

Journal club: Specialist phenomena stems from differential concentrations

Journal article for 09/13/2013:
An amino acid substitution inhibits specialist herbivore production of an antagonist effector and recovers insect-induced plant defenses by Schmelz et al. (2012)

Overall, the research presented seems reasonably executed and reasonably interpreted.

It's a misnomer to call [Vu-In$^{-A}$] "inactive" when it has antagonistic, repressive functionality.

The dynamics of the plant's reaction to the different ratios of inceptin concentrations produced by FAW and VBC reminded me of a somewhat recent news article, Plants perform molecular maths. These inceptin classes might just be competing for receptor real estate, which is not as complex as the regulation of starch degradation rates discussed in the news article, but the data suggest that there might be something more complex occurring.

After finding that VBC repressed the plant's reaction to predation through production of more [Vu-In$^{-A}$] than [Vu-In], they performed a titration of the two compounds to gauge the plant's reaction to different ratios.

Figure 3B shows the results of the different concentrations. I found it interesting that when [Vu-In$^{-A}$]:[Vu-In] was 1:1, the ET production was the same as when the entire treatment was [Vu-In]. More interestingly, the pseudo dose-response of each inceptin can be seen in Figure 3A, which helps explain the paired dose response.

So, since this specialist phenomenon seems to be the result of different concentrations of inceptins, it would have been nice to see the paired dose response for Vu-In$^{-A}$ and Vu-In$^{\bigtriangleup V}$ to see if the defense response is considerably antagonized by the ~77% Vu-In$^{-A}$, as it was in Fig 3B.

TL;DR (comment):
With respect to Figure 3B, I found it interesting that in the paired-dose response, when the two inceptins were 1:1, the ET production was similar to when the entire treatment was [Vu-In]. I agree with the authors that, based on their data, it looks like this specialist phenomenon stems from differential concentrations of inceptins in the OS. I think it would have been nice to see the paired dose response for Vu-In$^{-A}$ and Vu-In$^{\bigtriangleup V}$ to see if the defense response is again considerably antagonized by the ~77% Vu-In$^{-A}$, as it was in Fig 3B.

Friday, September 13, 2013

Journal club - Green leafy volatiles

Green leafy volatiles got voted in as one of the topics for this semester's journal club.

Today's journal club paper is An amino acid substitution inhibits specialist herbivore production of an antagonist effector and recovers insect-induced plant defenses by Schmelz et al. (2012).

In my opinion, this paper was reasonably solid but poorly delivered. The experiments and concepts presented were simple, but the authors seemed to go out of their way to make them opaque; the introduction and discussion were a bit aimless. I think a simple re-write would increase the cogency and leave more room to make more intuitive figures.

The conclusions seem fine, although I'm surprised the authors didn't follow up on the aspartic acid substitution (Fig 4B) a bit further, since the 11-mer dramatically reduced the Vu-In$^{-A}$ recovery (even if the 19-mer version didn't perform as well).

The Llama, Precocious as she is, suggested that it would have also been nice to see a plot like Fig 3B (the paired dose response) for Vu-In$^{-A}$ and Vu-In$^{\bigtriangleup V}$ to see if the defense response is considerably antagonized by the ~77% Vu-In$^{-A}$ as it was in Fig 3B. However, I suppose we'd expect it to be similar to the Vu-In$^{-A}$ and Vu-In pairing in Fig 3B, since it looks like the response difference is caused by differences in proteolysis efficiency that change the relative abundance in the saliva (Fig 4B and C).

And for my comment:

"I thought the authors' interpretations of their findings were reasonable, although showing what caused the VBC larvae to produce the altered inceptin ratio would be more convincing. This paper reminded me of the Red Queen hypothesis for host/parasite or predator/prey relationships, and for me its amusing to think that plants evolved to detect digested bits of themselves; it seems reasonable that this provides more information regarding the source of the damage and what the appropriate defense response should be rather than just general mechanical damage sensing."

Thursday, September 12, 2013

Gamasutra and Ars Technica posts on WebGL and JavaScript performance

From a post on Gamasutra by Jasmine Kent about 3D rendering in the browser using WebGL (found here), I ended up on a post on Ars Technica by Peter Bright describing Mozilla's ability to achieve near-native performance using JavaScript (found here).

An excerpt from the Gamasutra article:

"WebGL is OpenGL for the browser, providing access to the power of the GPU, but with some constraints. Importantly, CPU-side data processing with JavaScript is slower than in a native app. Transferring data to the GPU involves more security checks than in a native app, to keep Web users safe. However, once the data is on the GPU, drawing with it is fast."

An excerpt from the Ars Technica article:

"The fact that asm.js doesn't look like JavaScript any human would produce might seem like a problem. Scant few developers of native code programs use assembler, and asm.js is even more feature-deprived than most real assembly languages. Mozilla doesn't really intend for developers to write asm.js programs directly, however. Instead, the idea is that compilers use asm.js as the target, with programs themselves written in some other language.

That language is typically C or C++, and the compiler used to produce asm.js programs is another Mozilla project: Emscripten. Emscripten is a compiler based on the LLVM compiler infrastructure and the Clang C/C++ front-end. The Clang compiler reads C and C++ source code and produces an intermediate platform-independent assembler-like output called LLVM Intermediate Representation. LLVM optimizes the LLVM IR. LLVM IR is then fed into a backend code generator—the part that actually produces executable code. Traditionally, this code generator would emit x86 code. With Emscripten, it's used to produce JavaScript."

I'd only come across WebGL in passing, and I had never heard of asm.js, nor of the ability to compile C/C++ to JavaScript. 3D rendering and native performance, all in the browser? Awesome-o-meter set to maximum indeed, if this stuff works as intended.

Tuesday, September 10, 2013

Gamasutra post: Watch the intriguing '3-Sweep' 3D modeling technique in action

I read some pretty interesting articles today, mostly on DNA methylation, but this post from Gamasutra containing a video of a 3D modeling technology called the "3-Sweep" method topped the charts of the awesome-o-meter for today.


Friday, September 6, 2013

Questions about models to analyze selective pressures

Once a week, we attend a journal club where we read an article and watch a presentation given by the student who chose the paper. This is my second semester in a journal club, and I want to start documenting my thoughts/questions on certain papers.

Today's paper is Comparative transcriptomics reveals patterns of selection in domesticated and wild tomato by Koenig et al. (2013). This paper presented some new-to-me analyses of gene expression, and in particular I hope to learn more about the models of evolution that they used in their analysis of selective pressures.

Currently, I'm having a difficult time understanding the conclusions they draw from fitting three models of evolution: single-rate Brownian motion, Ornstein-Uhlenbeck, and two-rate Brownian motion. As a disclaimer, I'm not all that familiar with these models or their analyses, so my concern may be totally off-base.
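To keep the three models straight for myself (these are the standard textbook forms, not equations from the paper): single-rate Brownian motion models a trait as $dX_t = \sigma \, dW_t$, so variance accumulates at a single rate $\sigma^2$ across the whole tree; the Ornstein-Uhlenbeck model adds a pull toward an optimum $\theta$, $dX_t = \alpha(\theta - X_t)\,dt + \sigma \, dW_t$; and the two-rate Brownian motion model lets $\sigma$ take a different value on a focal branch. The fits are then compared with the Akaike information criterion, $\mathrm{AIC} = 2k - 2\ln\hat{L}$, where $k$ is the number of parameters and $\hat{L}$ the maximized likelihood.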

My interpretation: after fitting the models and comparing them with the Akaike information criterion, they found genes that best fit the two-rate Brownian motion model, and for those genes:
  •  S. pennellii branch had the largest number of genes 
  •  S. lycopersicum branch had the largest proportion of unique genes 
(Like I said, I'm unclear on how those were calculated.) So, on the off chance that I understood that correctly, my question is why they only hypothesized that the "rapid divergence in gene expression that has occurred in S. pennelli can be explained by neutral process."  

My confusion:
  • Divergence is calculated/assessed by comparing two objects, in this case branches. So, shouldn't there be an equally reasonable argument for evolution on the other branch, e.g. that human selection (domestication) on the other branch leads to the "rapid divergence"? 
  • And, why do they choose the neutral process? My limited understanding of Brownian motion says that BM does not only describe random drift; other mechanisms follow this trend, such as randomly changing selective regimes and continued change in independent additive factors of small effect. 

Conclusion:
Another reason I'm aware that I could be way off-base: in their discussion, given what I believe are results of the same type of analysis, they draw conclusions the way I expect them to be drawn:

"The most extensive network requiring that we discovered in S. lycopersicum relates to light responsiveness. Loss of connectivity in this network may reflect selection for reduced light response in S. lycopersicum or may reflect a more robust response in the desert-adapted S. pennellii..."

Their acknowledgement that either of those changes may be what we are seeing is reasonable, whereas before their seemingly one-sided conclusion seems unreasonable.

Halp, please.

TL;DR (comment for journal club)

This paper presents a lot of new-to-me analyses. In particular I think it was innovative to use models of evolution on gene expression data when I believe that they are more commonly used on more traditional phenotype data like physical characteristics. However, their interpretation of the results with respect to fitting the models caused some confusion for me:

My limited understanding of the Brownian motion model is that it does not only describe random drift. Other mechanisms follow this trend, such as randomly changing selective regimes or continued change in independent additive factors of small effects. However, of the multiple interpretations of the Brownian motion model, the authors only chose the neutral process interpretation without defending it.

Non-synonymous base substitutions

The Precocious Llama had the idea to keep track of some thoughts for journal club, and I'm joining in because I thought it was a good idea.

The paper was Comparative transcriptomics reveals patterns of selection in domesticated and wild tomato by Koenig et al. (2013).

I thought the paper was a fun introduction to quite a few analyses I hadn't encountered before, including the models the Precocious Llama discusses in her post. I can't speak to their technical merits since I know very little about them, but it didn't seem like terribly many actionable conclusions were drawn. The Llama pointed out these examples of vague conclusions: "Enrichment for these categories indicates that abiotic and biotic stresses have played a major role driving transcriptional variation among these species" (do any stresses exist outside of these two categories?), and "Taken together, our studies highlight both parallels and contrasts between natural and artificial selection and their effects on genome evolution".

My specific comment for journal club:
  • At least in the mammalian and yeast literature, there is evidence that some synonymous base substitutions (which the authors call neutral divergence) are actually under selection, especially in regions of condensed chromatin structure. I think this provides an alternate explanation for what the authors say is a reduction in neutral divergence near the centromere shown in Figure 1. 
References for my comment:

Thursday, September 5, 2013

Nature Jobs article: Two-body blessing by J.T. Neal

This post was written by J.T. Neal and posted at Nature Jobs. The situation is commonly referred to as the "two-body problem", but I agree with Dr. Neal; it's completely awesome to get to work together all the time. I wouldn't trade it for anything.

Wednesday, September 4, 2013

Gamasutra post: Modeling by numbers by Jayelinda Suridge

This post on procedural geometry, written by Jayelinda Suridge and posted on Gamasutra, looks to be a useful tutorial for building 3D models from code. I suspect C++ has mesh-building libraries as well.

Edit:

Additional parts:

Sunday, September 1, 2013

Postmortem: stacksToVCF

Now that the stacksToVCF program is, for the most part, complete, I wanted to do a brief postmortem on it.

What worked:

  • External library: This was the first time I'd used a library outside of the C++ Standard Library. I used mysql++, which is a wrapper for the MySQL C API. Using an external library was much easier than I had thought it would be, probably thanks to the excellent documentation for mysql++.
  • Object Oriented Programming: This was the first time I'd tried to organize and implement a program using classes. I had an easier time thinking about the code with this abstraction in place.
  • C++: I had considered doing this in Python, but since it was a slightly bigger project than my typical throwaway code, I went with C++ because I believe it forces me to be less sloppy (that's not to say that I write clean code in C++). With Python, I almost never think about memory or data structures, and I like that C/C++ are a little less forgiving. For example, I had an issue where I was getting really strange behavior from sprintf(): it was changing the value of a different variable. As it turns out, I was writing outside the bounds of the array I was passing, which causes undefined behavior. I think programming in C/C++ is much more instructive than Python (although it's tough to beat Python in terms of time to write).
  • vim: I used gvim as an IDE, using the c.vim plugin. For my purposes, it worked great, and I hope to get better with it.
  • Git: Using git and GitHub was useful and relatively painless. I hope to get more familiar with these tools. 


What didn't work:

  • My understanding of C/C++: I definitely need to work on my understanding of the relationship between pointers, pointers to arrays, and strings, i.e. char *, char array[], and std::string. This caused me a bit of trouble, and I eventually got lazy and went with std::string, even though that required a bit of type conversion from mysqlpp::String to std::string (mysqlpp::String has an automatic type conversion to a const char *, but not to a std::string). 

Lastly, I asked Precocious Llama's father, who spent his career doing systems software engineering, to take a look at my code and provide suggestions on how to improve. He suggested the following:
"
Here are a few things that I would do differently:

1)  Command line error outputs:  You have 3 different error outputs.  I would only have one.  When any part of the command line is incorrect, I would just output the "manual".  The manual just basically says:  Here's how to use this program...

2)  for (int i = 0; ...)
While this is perfectly correct for C++, it's not acceptable by most C compilers.  And it offers no advantage over the traditional C way.  So I would do:

main (...)
{
   int i;

...
   for (i = 0; ...)
...
}

3) C++ allows variable redefinition.  I don't see it as an advantage.  That feature can lead to very confusing (and thus hard to debug) code.  While you are not doing it now, it could become a habit later on:  You declare new variables inside your code and just before they are needed:

  int outputCounter = 0;
  int errCounter = 0;

I like the following better:
routine (...)
{
   // All variables declared here

   // Code begins here.  No variable definition from this point on
}
"

Thursday, August 29, 2013

Update Stacks installation

Julian Catchen has written a very useful piece of software called Stacks for analysis of restriction site associated DNA (RAD) data. The stacksToVCF code that I recently wrote may have exposed a bug in how Stacks processes the CIGAR string from read alignments with indels, and I'm updating my Stacks installation to test a fix pushed out by Julian.

The following is only for updating, not for a fresh installation.

Note that you need to have Google's sparsehash and samtools' bam library and header available for these instructions. In my case, I compiled both sparsehash and samtools from source, and installed sparsehash via make install. Since I didn't do that for the bam header and library, I have to specify their locations below.
#Get the most recent version (1.06 in this case):
wget http://creskolab.uoregon.edu/stacks/source/stacks-1.06.tar.gz

#Unarchive and decompress the .tar.gz:
tar -zxvf stacks-1.06.tar.gz

#Move to the directory and configure:
cd stacks-1.06
./configure --enable-sparsehash --enable-bam \
--with-bam-include-path=/path/to/directory/with/samtools_bam.h \
--with-bam-lib-path=/path/to/directory/with/samtools_bam.o

#make
make

#install
sudo make install
Then you need to reconfigure the stacks_export_notify.pl file to be able to send email:
sudo vim /usr/local/bin/stacks_export_notify.pl
And modify the configuration to suit your system; in my case, that just meant pointing the url variable at my server. All other settings can be left as they are. You can test that the update was successful by invoking the version option on any of the component programs:
$ ref_map.pl -v
ref_map.pl 1.06

Monday, August 26, 2013

Bioconductor is out-of-date: update R to version 3.0 in ubuntu

I wanted to install cummeRbund on a new workstation, per the protocol of TopHat and Cufflinks (Trapnell et al. 2012); however, I ran into a problem with the R version and Bioconductor. This may not be relevant for much longer, since we believe R 3.0 will soon be put into the apt-get repository, but it may still be useful. Here is the error that I got after going into R:
$ R
> source("http://bioconductor.org/biocLite.R")
I got the error:
Your Bioconductor is out-of-date, upgrade to version %s by following instructions at http://bioconductor.org/install
However, for me this was not the case, and following the instructions on Bioconductor's website did not help. I found that updating my version of R from 2.14 to 3.0 solved the problem. The instructions to update R to 3.0 can be found on the R website. Here is what I did:
1. Remove the older version of R:
$ sudo apt-get remove r-base-core
2. Add the deb line to sources.list using vim:
2a. Open sources.list in vim:
$ sudo vim /etc/apt/sources.list
2b. Scroll to the bottom of the file (hold Ctrl+F to page down, or press G to jump to the end)
2c. Press "i" to enter insert mode, and type in the following:
## yourname modification on date    
deb http://cran.rstudio.com/bin/linux/ubuntu precise/
(My Ubuntu version is precise.)
2d. Press "Esc" to exit insert mode
2e. type ":w" and "Enter" to save
2f. type ":q" and "Enter" to exit vim editor
3. Add key to sign CRAN packages
$ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
4. Add specific PPA to the system
$ sudo add-apt-repository ppa:marutter/rdev
$ sudo apt-get update
$ sudo apt-get upgrade
5. Install latest version of R:
$ sudo apt-get install r-base
Now I had the right version of R, so I could install Bioconductor and then cummeRbund:
1. Open R:
$ R
2. Install CummeRbund package:
> source("http://www.bioconductor.org/biocLite.R")
> biocLite('cummeRbund')
3. load CummeRbund library:
> library(cummeRbund)
4. Now you should be able to use the cummeRbund package as described in the protocol!

Friday, August 23, 2013

mysql++ compile example Linux Ubuntu 32-bit (a.k.a. example of linking libraries, makefile, headers for C++)

This is more of an exercise in linking libraries and Makefiles than something specific to mysql++, but since I didn't know how to do it before, here goes:

To compile one of the mysql++ examples on my system, I had to specify where the header files were for both the mysqlclient, which I had installed with the apt package manager (see this post), and for mysql++, which I had compiled from source (again, see this post). I also had to specify where the shared object libraries were. The -I option contains the include paths, the -L option contains the shared object paths, and the -l option contains the names of the .so files (excluding the lib and .so component; e.g. -lmysqlpp looks for libmysqlpp.so). On my system, this worked for compilation:
g++ simple1.cpp -I/usr/include/mysql -I/usr/local/include/mysql++ -L/usr/local/lib -L/usr/lib/i386-linux-gnu -lmysqlpp -lmysqlclient -o simple1
At runtime, the .so files are searched for again, so their locations need to be specified. The libmysqlclient.so was installed by the package manager into /usr/lib/i386-linux-gnu, and the libmysqlpp.so was installed to /usr/local/lib by default. I ran the following and also appended it to my .bashrc file:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/i386-linux-gnu:/usr/local/lib
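To confirm that the dynamic linker can now resolve both libraries, ldd is a quick check (a sketch; simple1 is the binary compiled above, and grep just filters the output):
#List the shared objects the binary will load, and look for the mysql entries
ldd simple1 | grep -i mysql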
Once that was working, I moved over to my code. My goal is to learn how to interact with a MySQL database using C++, but first I needed to get some basics down; namely, compiling multiple files, since I've evidently forgotten the small amount I knew about C++.

I have four files to start with:

The main file, stacksToVCF.cpp, containing:
#include <stdlib.h>
#include <iostream>
#include "stacksDB.h"
int main ( int argc, char *argv[] )
{
 testcon(argc, argv);
 std::cout << "Hello from main" << std::endl;
 return EXIT_SUCCESS;
} 
stacksDB.h, containing:
#ifndef  STACKSDB_H_INC
#define  STACKSDB_H_INC

int testcon(int argc, char *argv[]);

#endif 
stacksDB.cpp, containing:
#include <stdlib.h>
#include <iostream>
#include <mysql++.h>
#include "stacksDB.h"


int testcon (int argc, char *argv[])
{
 mysqlpp::String greeting("Hello from mysqlpp");
 std::cout << greeting << std::endl; 
 return EXIT_SUCCESS;
}
and the Makefile, containing:
CXXFLAGS := -I/usr/include/mysql -I/usr/local/include/mysql++
LDFLAGS := -L/usr/local/lib -L/usr/lib/i386-linux-gnu
LDLIBS := -lmysqlpp -lmysqlclient 
EXECUTABLE := stacksToVCF

all: $(EXECUTABLE)

stacksToVCF: stacksToVCF.o stacksDB.o
        g++ stacksToVCF.o stacksDB.o $(CXXFLAGS) $(LDFLAGS) $(LDLIBS) -o $(EXECUTABLE)

stacksToVCF.o: stacksToVCF.cpp
        g++ -c stacksToVCF.cpp

stacksDB.o: stacksDB.cpp
        g++ -c stacksDB.cpp $(CXXFLAGS)

clean:
        rm -f *.o $(EXECUTABLE)
This compiles successfully with make and will be the starting point.

compile mysql++ from source on Linux Ubuntu 32-bit

Trying to compile mysql++ from source on Ubuntu 12.10 32-bit: I downloaded the most recent version of mysql++ from here, and made sure I had the C API on which mysql++ depends:
sudo apt-get install libmysqlclient-dev
I then moved to the mysql++ directory and configured it for the mysql installation (note that I found where these files were located with this, and got tips from the mysql++ README):
./configure --with-mysql-lib=/usr/lib/i386-linux-gnu --with-mysql-include=/usr/include/mysql
make
sudo make install
And the final output after the install, noting the install directories:
mkdir -p /usr/local/lib
/usr/bin/install -c -m 644 libmysqlpp.so /usr/local/lib
/usr/bin/install -c libmysqlpp.so.3.2.0 /usr/local/lib
(cd /usr/local/lib ; rm -f libmysqlpp.so libmysqlpp.so.3; ln -s libmysqlpp.so.3.2.0 libmysqlpp.so.3; ln -s libmysqlpp.so.3 libmysqlpp.so)
mkdir -p /usr/local/include/mysql++
(cd . ; /usr/bin/install -c -m 644  lib/*.h /usr/local/include/mysql++)

Thursday, August 22, 2013

Queue command to run after existing command finishes

Credit goes to this post at superuser. Probably due to poor planning or pipeline development, I sometimes want to queue up other commands to execute after an existing, long-running process has already been running for a while. I use Ctrl-z (pause) and fg (foreground) for this.
#Execute a long running program
./long_running_program.sh
#Realize you want to have another program execute after the first concludes, and hit Control-z to pause
^Z
#Set up a bash command to resume the paused program with fg, and run the next in queue.
fg; ./next_long_running_program.sh
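If the long-running job lives in a different shell, so Ctrl-z and fg aren't available, one alternative sketch (my own, not from the superuser post; the PID 12345 is hypothetical) is to poll the process until it exits:
#Check every 60 seconds whether the process still exists; kill -0 sends no signal
while kill -0 12345 2>/dev/null; do sleep 60; done; ./next_long_running_program.sh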

Wednesday, August 21, 2013

awk print every other line (or every Nth line) in fasta file

This specific line of awk doesn't have much general utility, but it was intended to pull out every other sequence record in a .fasta file. It can be applied to every Nth record as well by changing the modulo statement (see the sketch at the end of this post). It only applies to .fasta files in which the sequence string isn't wrapped onto multiple lines.

Here it is in its one-liner form:
awk 'BEGIN{i=0} (substr($0,1,1) == ">") { if (i%2 == 0) {print $0; getline; print $0} i++}' test.fa
And it makes a bit more sense when formatted:
awk 'BEGIN{i=0} (substr($0,1,1) == ">") {
 if (i%2 == 0) {
  print $0
  getline
  print $0
 }
 i++
}' test.fa
This assumes the .fasta file is of the format:
>SequenceID1
ATGACTA
>SequenceID2
AGGCATG
and the sequence string is contained entirely on one line.
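For example, to pull out every third record instead of every other one, only the modulo test changes; this is the same one-liner with the divisor swapped:
awk 'BEGIN{i=0} (substr($0,1,1) == ">") { if (i%3 == 0) {print $0; getline; print $0} i++}' test.fa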

Monday, August 19, 2013

Create directory from R

Credit goes to this post at StackOverflow. As our genotyping and QTL mapping pipelines continue to scale up, it is helpful to automatically isolate the R/qtl output for each trait into its own directory. As stated in the StackOverflow post by user robbrit, the dir.create() function creates a directory in the working directory, and throws a warning if it already exists.
dir.create("testDir")
I believe robbrit's use of file.path() is a more flexible, platform-independent way to do it. "testDir" could be a traditional file path (e.g. "/home/user_name/Documents/testDir"); otherwise the directory is created in the working directory.
dir.create(file.path("testDir"))
And finally set the working directory to the new path.
dir.create(file.path("testDir"))
setwd(file.path("testDir"))

Tuesday, August 13, 2013

MySQL Purge Binary Logs (aka MySQL is taking up too much hard drive space)

Credit goes to this MySQL documentation.

My MySQL installation by default keeps a set of binary log files recording data changes. With many database modifications, these grow to be very large (terabytes). It isn't mission critical that we be able to recover our database at any given moment since our raw data is archived, and the database can be repopulated if need be, so I typically clean these out once hard drive space starts becoming an issue.
$ mysql -u root
mysql> PURGE BINARY LOGS BEFORE CURDATE();
This gets rid of all binary logs made before the date you run the command. You could also set the system variable expire_logs_days.
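For the latter, something like this one-liner should work (a sketch assuming root access; the 7-day retention window is just an example value, and setting the variable in my.cnf instead would make it persist across server restarts):
#Tell MySQL to automatically expire binary logs older than 7 days
mysql -u root -e "SET GLOBAL expire_logs_days = 7;"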

Location of MySQL database files on Linux Ubuntu

I installed MySQL from the apt package manager on Linux Ubuntu (12.10), and my database files are located at:
/var/lib/mysql
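If your installation differs, you can ask the server directly where its data directory is (datadir is a standard MySQL system variable):
#Report the server's data directory
mysql -u root -e "SHOW VARIABLES LIKE 'datadir';"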

Wednesday, August 7, 2013

Compile Genome Analysis Toolkit (GATK) from source


07/28/14 Edit: I don't think this post is up to date any more. Word on the street is that they're using something else as the build tool.
----------------------------------------------------------------------------------------------------------------------------------


I'm planning on using the Broad Institute's Genome Analysis Toolkit (GATK) for some variant calling. The Java source code is available on GitHub, so I forked it over so I could have a look at the code (and play with it one day?). I've never compiled any Java code before, so I figured it was a good time to learn. I was doing this on a brand new workstation in our lab that I'm administrating, so I had to get a few dependencies first:
sudo apt-get install git ant openjdk-7-jre openjdk-7-jdk
Git is version control software; ant is (from my understanding) a build tool for Java, conceptually similar to GNU make; and, based on the names, I assume the others are the Java 7 Runtime Environment and Development Kit, respectively. Then I cloned my forked repository:
git clone https://github.com/Frogee/gatk-protected.git
My best guess is that a build.xml file for ant is conceptually similar to a Makefile for make, although I'm not entirely sure. Changing to the top level of the cloned directory, I simply ran:
ant
And everything compiled successfully. From then on, I could move to the dist directory and test out the GATK:
java -jar GenomeAnalysisTK.jar --version
Good times ahead.

Tuesday, August 6, 2013

VirtualLeaf tutorial: Fatal error message stepwise underflow in rkqs, with h=0.000000 and htry = 0.100000

TL;DR: In order to run a model successfully, you need a suitable geometry for the model to initialize.

When trying to run parts of the VirtualLeaf tutorial, I have been running into the following error during simulation.
Fatal error message stepwise underflow in rkqs, with h=0.000000 and htry = 0.100000

Per usual, this turns out to be my lack of understanding of how VirtualLeaf works (baby steps). From my current understanding, one can build models (e.g. mymodel.cpp) and run those models on different initial cellular geometries called "leaves" (*.xml files).

It turns out that I was lacking the correct leaf geometry: the *.xml file that is set as the default in the model's header file (e.g. mymodel.h). In order to run a model successfully, you need a suitable geometry for the model to initialize. I know how to modify this in two ways:

1) Through your model's header file, yourfavoritemodel.h :
 virtual QString DefaultLeafML(void) {
  return QString("yourfavoriteleafgeometry.xml");
 }

You can change yourfavoriteleafgeometry.xml to any of the *.xml files found in VirtualLeaf-v1-0.1-src/data/leaves/. Save the file, run make, reopen VirtualLeaf, and simulate.

2) Or, you can create your own leaf geometry: stop the simulation, then save it via File -> Save Leaf -> yourfavoriteleafgeometry.xml.

Just make sure that the DefaultLeafML QString is the correct leaf geometry you need for your model!

Gamasutra post by Timo Heinapurola: "What documentation?"

This post, written originally by Timo Heinapurola here and reposted by Gamasutra, discusses the topic of code documentation. I'm guilty of writing code thinking that it's going to be "throwaway code", only to find myself needing to use it again months later.

Monday, August 5, 2013

Gamasutra blog post by Lee Perry: "One reason we see so many clones? Communication."

This article by Lee Perry posted on the video game business news website Gamasutra is strangely applicable to research as well.

"I want to make a figure like they did in this paper."

I could also use a Mindlink 2000.

Saturday, August 3, 2013

Make bootable USB Linux Ubuntu

I find myself frequently making bootable USBs, so I wanted to document the process here.

I can't speak for other distributions, but the versions of Ubuntu I use have an application called Startup Disk Creator. Simply launch that application, choose the .iso file you'd like to use, and choose the USB you'd like to use. I usually reformat the USB (with the Erase Disk button) before each try.

I've had troubles with Startup Disk Creator crashing, but I just rinse and repeat until it successfully completes.
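When Startup Disk Creator won't stop crashing, a command-line alternative I know of (not part of my usual routine above, so treat this as a sketch; ubuntu.iso and /dev/sdX are placeholders, and dd will silently overwrite whatever device you point it at) is dd:
#Write the image directly to the (hypothetical) USB device, then flush write buffers
sudo dd if=ubuntu.iso of=/dev/sdX bs=4M
sync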

Tuesday, July 30, 2013

Rename error in Linux VirtualLeaf tutorial: "Bareword "tutorial0" not allowed while "strict subs" in use at (eval 1) line 1."

I'm going through the VirtualLeaf tutorial, and while it has instructions for both Windows and MacOSX/Linux, they do not always work for me.

When trying to "rename all the remaining files ..." (step 4 page 339 of Building Simulation Models of Developing Plant Organs Using Virtual Leaf)

I got the following error:

$ rename tutorial0 mymodel *
Bareword "tutorial0" not allowed while "strict subs" in use at (eval 1) line 1.

This is because the rename syntax is different on Ubuntu, where rename is the Perl version and expects:
$ rename 's/search/replace/;' file1 [file2 file3...]

So, I used the following
$ rename 's/tutorial0/mymodel/;' *

and checked it with
$ ls

to see that it indeed had renamed my files! I found the answer to this here.

Nature News article: Quantum boost for artificial intelligence

The original article is posted on the Nature website here. I don't understand it, but quantum computing sounds amazing if it actually works.

Monday, July 29, 2013

Beginnings with Git

I've heard many times before that version control is important, especially as soon as more than one developer is working on a project. Since Precocious Llama and I are likely to be working on the same code in the future, I figure it's about time to bite the bullet and learn a way to do version control.

After some research, it looks like Git in conjunction with GitHub will serve our purposes best.

GitHub has some nice tutorials for getting started (here), as does Git (here). Lots to learn, and it definitely looks useful as things start to scale up.

Lastly, the Git repository we're playing with can be found here.

Advice from a systems software engineer

Precocious Llama's father spent most of his career working in R&D at HP developing systems software. This weekend, he stated that in making a hiring decision, he would expect any entry level computer scientist to have some background in the following topics:
  1. Data Structures
  2. Algorithms
  3. Operating Systems
  4. Networking/Communications (know at least one network protocol well)
He also stated that it was one thing to know theory, and another thing altogether to be able to implement code to solve a problem. He said he expected that an interviewee would have some kind of portfolio to demonstrate that the individual can be presented with a problem and successfully code and test a solution.

Friday, July 26, 2013

compile errors in Linux VirtualLeaf installation

I will want to remember this for my next installation of VirtualLeaf (Merks et al. 2011).

For the Linux installation from the source code VirtualLeaf-v1.0.1-src.zip:

I followed the instructions found in Building Simulation Models of Developing Plant Organs Using VirtualLeaf by Merks and Guravage. One first needs to install Qt from qt-project.org per the instructions. Then download and extract VirtualLeaf.

Before compiling... you will need to modify the mesh.h header file (/home/precociousllama_youruser/VirtualLeaf-v1.0.1-src/src/mesh.h)
You can open mesh.h in any editor, and find the following:

  template<class Op> void RandomlyLoopNodes(Op f) {

    MyUrand r(shuffled_nodes.size());
    random_shuffle(shuffled_nodes.begin(),shuffled_nodes.end(),r);

    for (vector<Node>::const_iterator i=shuffled_nodes.begin();
  i!=shuffled_nodes.end();
  i++) {
      f(*shuffled_nodes[*i]);
    }
  }

  template<class Op> void RandomlyLoopCells(Op f) {

    MyUrand r(shuffled_cells.size());
    random_shuffle(shuffled_cells.begin(),shuffled_cells.end(),r);

    for (vector<Cell>::const_iterator i=shuffled_cells.begin();
  i!=shuffled_cells.end();
  i++) {
      f(*shuffled_cells[*i]);
    }
  }

This will cause errors in compilation as pointed out by the developer Merks (issue 6).

Comment this section out by adding "/*" at the beginning and "*/" at the end, so that it looks like:
 /*
  template<class Op> void RandomlyLoopNodes(Op f) {

    MyUrand r(shuffled_nodes.size());
    random_shuffle(shuffled_nodes.begin(),shuffled_nodes.end(),r);

    for (vector<Node>::const_iterator i=shuffled_nodes.begin();
  i!=shuffled_nodes.end();
  i++) {
      f(*shuffled_nodes[*i]);
    }
  }

  template<class Op> void RandomlyLoopCells(Op f) {

    MyUrand r(shuffled_cells.size());
    random_shuffle(shuffled_cells.begin(),shuffled_cells.end(),r);

    for (vector<Cell>::const_iterator i=shuffled_cells.begin();
  i!=shuffled_cells.end();
  i++) {
      f(*shuffled_cells[*i]);
    }
  }
*/

Save mesh.h and continue with the compilation procedure.

If you get an error that says something along the lines of "qt3"

Check the versions of your qmake and qmake-qt4. You can do so in the Ubuntu command line with:

$ qmake -v

and

$ qmake-qt4 -v

and make sure the versions are the same. Then, try again!

I haven't done anything but look at the pretty pre-loaded models, but it's going to be awesome, for sure. Thanks Merks and Guravage!

Their original paper is titled: VirtualLeaf: An open-source framework for cell-based modeling of plant tissue growth and development. Roeland M.H. Merks*, Michael Guravage, Dirk Inze, and Gerrit T.S. Beemster. Plant Physiology February 2011.

Thursday, July 25, 2013

Configuring an IGV Genome Server

This one consumed about 2 hours before I finally figured out the problem; it was simply a mistake in compressing the files. Thanks to Precocious Llama and another lab colleague for finding and troubleshooting the issue with me!

The Integrative Genomics Viewer (IGV) is a genome browser tool put out by the Broad Institute. When trying to host a genome for our lab to view, I ran into some problems. My mistake was in compressing the .genome file. What worked for me was:

Make a new directory. Copy a reference genome and GFF3 annotation to it. Use the IGV client to make the .genome file (instructions found here) from the copied reference. Decompress the resulting .genome file and configure the property.txt file (instructions found here) within. Of note, the sequenceLocation field is not a directory as stated by the instructions, but a web accessible path to the reference fasta (e.g. http://www.yoursite.com/IGV/genome/sbi1.fasta).

My issue was that I kept zip-compressing the directory containing the files (property.txt et al.). Instead, the zip archive needed to be made of all of the files within the directory (and renamed to id.genome as per the instructions).
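In shell terms, the difference looks something like this sketch (the directory name myGenomeDir is hypothetical):
#Wrong: this puts the directory itself inside the archive, nesting property.txt one level down
zip -r id.genome myGenomeDir
#Right: archive only the files, from inside the directory
cd myGenomeDir
zip ../id.genome *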

Once I correctly archived the files, everything worked fine.