Wednesday, October 30, 2013

Communicating science without undermining its complexities: different sizes of infinity

I just came across this YouTube video that I believe explains different sizes of infinity in a very understandable way.The video does so without the introduction of diagonalization (when Professor Harold Boas first introduced this in my first "real math" class it blew my mind), which I felt was instrumental in my understanding of the different sizes of infinity.

However, this video reminds me that a critical aspect of practising science is the ability to communicate its importance without requiring extensive background knowledge. In my opinion, this video does this very well without undermining the complexity of the problem.

   

Monday, October 28, 2013

Comment and annotate PDF in Ubuntu (12.10)

I finally found a somewhat satisfactory way to comment and annotate PDF files in Ubuntu after trying a few of the solutions suggested by the interwebs. A piece of software called foxit can be run through wine.

I installed wine a while back, but I don't remember if I did anything special. If I recall correctly, I installed wine using these directions at the Ubuntu site.

After downloading the foxit exe from the foxit site, I just ran it with wine and installed using all the defaults.

wine FoxitReader606.0722_enu_Setup.exe


Oh glorious day.

Thursday, October 24, 2013

Open terminal to current working directory from nautilus (Ubuntu 12.10)

Credit goes to this Ask Ubuntu forum post.

There is a package that enables right clicking nautilus' directory window and choosing to open a terminal with the working directory set to the directory in the nautilus window.

Install the package via the package manager and then restart nautilus. 
sudo apt-get install nautilus-open-terminal
nautilus -q


Then right click and choose open in terminal.

Saturday, October 19, 2013

awk and GNU parallel: problems with quotes

Embarrassing as it is to admit, I spent about two hours trying to work out how to parallelize an awk command with GNU parallel. I think the conclusion is that I don't understand quotes in bash as well as I should.

My goal was to run the awk code from my previous post in parallel. Credit goes to this post on Stack Overflow for getting me to the solution.

One of the answers in the Stack Overflow post suggested storing the awk command in a string, and that worked for me:
#!/bin/bash

awk_body='(substr($0, 1, 1) != "@") && ($0 != "+") {print substr($0, 6)}
          (substr($0, 1, 1) == "@") || ($0 == "+") {print $0}'

ls *.fastq | parallel --gnu -j-2 --eta "cat {} | awk '$awk_body' > trimmed{}"

Kudos to Ole Tange for GNU parallel.

Edit:

Ole Tange stopped by and left some good pointers in the comments:

The GNU Parallel man page has a section dedicated to quoting: http://www.gnu.org/software/parallel/man.html#quoting

Often you can simply add \' around every '.

Or use --shellquote to generate a quoted version of your string.

For readability you might want to look into writing a Bash function instead: http://www.gnu.org/software/parallel/man.html#example__calling_bash_functions

Thanks for the suggestions, Ole!


Friday, October 18, 2013

Journal club: Ontologies to hypotheses. UhLAWLNo

The design for data generation and the actual data generated seem useful and adds to the field of characterizing totipotent cells. The authors' build bridges from gene ontologies of hierarchical clusters to hypotheses with little to no support in between, for example when they found that ribosomal protein genes were enriched in four clusters, implying differential expression, they hypothesize that "ribosomal equipment needs to be precisely regulated during re-entry in to the cell cycle". But, what if this is just peripheral information that is non-essential to the gene state, and that even when it is not precisely regulated the cell can still enter the cell cycle. The analyses seems peppered with weakly supported hypotheses.


Comment
I think the authors' story was poised to take better advantage of the many similar data sets generated by other labs. And, I think that had they made better use of the other datasets, they could have
  1. potentially given their hypotheses more support, and 
  2. I think more importantly, make a stronger case for their dataset's contribution to the field.
Edit
During journal club, another student asked why they didn't collect meristem cells if that is essentially the state they are trying to assay; and it dawned on me that these protoplasts may be an "erased" state, and have no tissue identity, not even as quiescent cells. This leads back to Dr. Maggert's question, with respect to epigenetics, if there is a temporal/spatial difference between "erasure" and "establishment", and I think that this could be an example of a cell that has been "erased", but not yet "established". I think that Dr. Maggert would 100% argue that this event falls into the category of development, which he defines as mutually exclusive to epigenetic. 

I appreciate Dr. Devarrene's insights in Journal club. I think they add value to learning how to critique science.

Journal club: On data generation and analysis

This week's paper: Characterization of the early events leading to totipotency in an Arabidopsis protoplast liquid culture by transcript profiling by Chupeau et al. (2013)

I think the experiment is a good idea, and the data could be informative; unfortunately I don't think the authors had the expertise to analyze the data they generated. They probably would have been better off adding another author or two and passing the data to a collaborator with more bioinformatic experience; they may have been able to make a more cogent story.

My observation has been that some (many?) life scientists are quite antagonistic towards bioinformaticians/computational scientists; I've also met some on the computing side of the fence who are quite antagonistic towards the life scientists. The life scientists think the computional folk just push buttons and use meaningless statistics or models. The computional folk think the life scientists are just observational scientists who don't appreciate scientific rigor.

My comment:
"I think the experiment behind this paper is worthwhile and that the methods and data will be useful. However, I also think that the paper demonstrates a common issue in modern genomics research, which is that our data generation capacity frequently outpaces our technical ability to analyze the data in meaningful ways."

Tuesday, October 15, 2013

Modeling of convection

The Llama and I consider Professor Wolfgang Bangerth to be one of our intellectual and professional role models. He allows the Llama and I to attend his research group meetings, and he's a member of my advisory committee. He just posted a beautiful video depicting convection solved using the ASPECT and deal.ii software libraries that he develops for.

   

The "About" section from the video:
"Convection in a box with heated bottom and cooled top. Solved with the finite element code ASPECT (http://aspect.dealii.org) using the 3dbox.prm input file set up by Timo Heister. Simulation by Wolfgang Bangerth on 64 cores. The simulation has ~10M unknowns and runs for ~10,000 time steps. Further discussion in the Aspect manual."

Sunday, October 13, 2013

Code golf

I recently discovered code golf. Code golf is basically a competition to write a program that solves a given problem in the minimum number of characters (or memory) possible. In general, a problem is presented and then the golfers propose their solutions with character counts. I think that the formal competitions are limited to particular languages, but the code golf on the Stack Exchange usually has solutions in many different languages.

I haven't golfed myself, but reading the golfing of others always gives me a good larf. I particularly enjoy the C/C++ golfers.  As an example, this post on the Stack Exchange has an C++ implementation of Dijkstra's Algorithm for pathfinding in 799 characters.

Friday, October 11, 2013

Journal Club: Is statistical significance always biological significance?

Edit: The journal club presenter stated that the moths lay ~6 to 8 eggs, so perhaps the 1 egg difference mentioned below is meaningful.

This weeks journal club article: Feeding-induced rearrangement of green leaf volatiles reduces moth oviposition from Allmann et al. (2013).

Today marks the last green leaf volatile paper for the semester.

This was the first time I've read a paper from the eLIFE journal. I think I liked the journal's layout. eLIFE has made a bit of a splash, and I think it's one of many that will gain popularity as the open access movement continues to gain momentum. I expect that the future of scientific literature will be more of an open-access, crowd-sourced effort where peer review is done in the comments section of a publication by hundreds of scientists; search engines will allow relevant papers to be searched for, and crowd review efforts will determine what is accurate and important and let the garbage filter to the bottom.

Will this be better than what we have now? I have no idea. I just don't believe that the traditional publication system is capable of disseminating the tremendous volume of information that the world's research institutions can now generate.

As for the paper, the authors reported many statistically significant differences for the differential production of the green leaf volatile (GLV) isoforms in response to the M. sexta oral secretions (OS), and they reported that a behavioral difference in oviposition occurred because of reception of one of the particular GLV isoforms. While I don't doubt the differences they found were statistically significant, I'm not convinced that they are actually of any biological significance.

In their field trials, the same experiments were not statistically significant in the first trial, but they were in the second (although the visual representation of the data aren't very convincing that a difference exists for the second trial).

The calcium activity in the antennal lobe data were more convincing that there is a difference in perception of the two isoforms, although it doesn't demonstrate that a difference in perception translates to a behavioral difference.

Finally, I think Figure 7 was supposed to demonstrate the key point of the paper: behavioral changes based on differential perception of the two isoforms. However, while they did again see significant differences, it looks like there was only ever an average difference of one egg oviposited between the two treatments. I think it's potentially misleading that they're testing that $\mu_1 - \mu_2 \neq 0$ rather than testing $\mu_1 \neq \mu_2$ since, as reported, I have no clue how many eggs these things lay. Is it a difference between 100 and 101? or 2 and 3? As such, I'm not very convinced that this paper has shown anything new.

 My comment:
"I thought this paper's hypothesis was fun, but I didn't find the data very convincing. If I'm reading it correctly, Figure 7 reports that there is an average difference of just one egg between the two isoform treatments. I have no idea how many eggs these moths lay, and I'm not convinced that the one egg difference is demonstrative of a behavioral difference as a result of differential interpretation of the two isoforms."

Journal club: neuroscience entomology

Feeding-induced rearrangement of green leaf volatiles reduces moth oviposition

The author's painted a very intriguing dynamic of green leaf volatiles (GLV). Typically, from my sample size of 2 other GLV papers from journal club, I think of GLV as chemicals released by the plant in response to damage in order to recruit predators of the organism causing damage.

This paper takes a slight twist in that the GLV is constitutively released by the plant, and it is the caterpillar that changes the isometry of the GLV which in turn attracts the caterpillar's predator! Moreover, the new isomer is also detected by the moth (caterpillar's next life stage) to oviposit (lay eggs) on a different plant as it is more likely that the new isomer indicates predators that could eat the moth's eggs.

Unfortunately, I'm not sure the authors found enough supporting evidence. While they found that the antennal lobe seems to be able to differentiate different isomers, they didn't convincingly show that it was consequential, changes in ovipositon due to different isomer ratios.

Comment:
With respect to the calcium responses in the antennal lobes, the authors mention that they cannot directly compare individuals due to not having a physiological map of the antennal lobe. Given that, I couldn't work out how they determined the regions of interest (ROIs). If they couldn't define regions before they saw the calcium fluorescent fluxes based on some physiological, did they do it based on the flourescent time course after treatment? Region 3 and 4 seem pretty close, and it seems it could significantly change results if they misplace their region of detection.

Wednesday, October 9, 2013

Using awk: removing restriction site sequence from reads in fastq format

I had a fastq file in which the sequence reads contained the restriction enzyme site from RAD-seq reads. In order to use them with reads that had the site trimmed off already, I needed to get rid of the RE site.

The fastq file was of the format:
@1_1101_11092_1965_1
CCGGCACAGGCCTTGGAGGCGCTGGAGGCACACGANGTGGCACGTTGATCTGTGTGTGAGGCGCCGCGGGTTTGTGAATGTGGACCGGACTGCTTCTGCTGCT
+
HIHHIIHGGH;FHIIIIGIGIIIIIBGHEIHHHED#,,;?CC@BBBBBCCECCC:?5?>CBBB@B9BBBB08?@??3:@@:CC>@?BB39<@C@CAC::>@AC
@1_1101_11396_1963_1
CCGGCGACTCATAGGCAGTGGCTTGGTTAAGGGAANGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCGATTGTCACTGCTTATGGACCCGAA
+
HJJIJJJJJJJJJJJJJJGIJJJJJJGIIJEIIHH#-;BEFDDDDDDDDDDDDDDDBDDDDDD<BB>CDEDEEDDDDDD@DBDCDEDDDDDDDDCDCDDDDBD

I used the substring() function of awk to get rid of the first 5 characters (corresponding to the RE site) in each sequence or quality line as so:
#!/bin/bash
awk '(substr($0, 1, 1) != "@") && ($0 != "+") {print substr($0, 6)}
     (substr($0, 1, 1) == "@") || ($0 == "+") {print $0}' input.fastq > output.fastq

Output sans RE site:
@1_1101_11092_1965_1
ACAGGCCTTGGAGGCGCTGGAGGCACACGANGTGGCACGTTGATCTGTGTGTGAGGCGCCGCGGGTTTGTGAATGTGGACCGGACTGCTTCTGCTGCT
+
IHGGH;FHIIIIGIGIIIIIBGHEIHHHED#,,;?CC@BBBBBCCECCC:?5?>CBBB@B9BBBB08?@??3:@@:CC>@?BB39<@C@CAC::>@AC
@1_1101_11396_1963_1
GACTCATAGGCAGTGGCTTGGTTAAGGGAANGGAACCCACCGGAGCCGTAGCGAAAGCGAGTCTTCATAGGGCGATTGTCACTGCTTATGGACCCGAA
+
JJJJJJJJJJJJJGIJJJJJJGIIJEIIHH#-;BEFDDDDDDDDDDDDDDDBDDDDDD<BB>CDEDEEDDDDDD@DBDCDEDDDDDDDDCDCDDDDBD

Monday, October 7, 2013

Using awk: bitwise operations and string manipulation

I learned a few new things trying to reformat a sam file with awk. I also learned that what I was attempting to do wouldn't ultimately solve my problem, but I wanted to keep what I learned about bitwise operations and other string manipulations in awk here.

To test for the presence of a certain bit, I had to use GNU awk (gawk) since the awk on my system didn't have these functions, and I used the and() function (see here for bitwise operation documentation; see here and here for additional sam bit flag information).

I used substring() for string manipulation and length() to access elements from the end of the string (credit goes to here and here).

I used printf when I wanted to avoid generating a newline character (credit goes to here).

The original goal was to trim restriction sites from RAD reads and quality scores, but this doesn't account for the CIGAR string or the alignment coordinate so it ultimately doesn't work. Here is the gawk script in its entirety (with some comments):

#!/bin/bash
#Skip the header lines
gawk '(substr($0, 1, 1) != "@") {
        #Test if the bit flag has the 4th bit set (to indicate a read on the reverse strand using the and() function.  
        if (and($2,0x10)) {
                #Don't adjust the first 9 columns.
                for (i=1; i<=9; i++) {printf "%s\t" , $i}
                #Trim 5 bases from the end of the string and the quality score
                printf "%s\t", substr($10, 1, length($10)-5)
                printf "%s\t", substr($11, 1, length($11)-5)
                #Print any remaining columns; NF is a special variable containing the number of fields. 
                for (i=12; i<NF; i++) {printf "%s\t", $i}
                printf "%s\n", $NF
        }
        #Else it is not a reverse strand read; trim 5 bases from the start.
        else {
                for (i=1; i<=9; i++) {printf "%s\t" , $i}
                printf "%s\t", substr($10, 6)
                printf "%s\t", substr($11, 6) 
                for (i=12; i<NF; i++) {printf "%s\t", $i}
                printf "%s\n", $NF
        }
}
#Include the header lines.
(substr($0, 1, 1) == "@") {print $0}
' input.sam

Saturday, October 5, 2013

Journal Club: short ORFs and gene prediction

This week's journal club paper: Small open reading frames associated with morphogenesis are hidden in plant genomes by Hanada et al. (2013).

The implications of the idea that many functional, small open reading frames (sORFs) exist in eukaryotic genomes are pretty exciting. The paper didn't entirely convince me that this was the case, but I was sufficiently persuaded that it merits further investigation.

I didn't see any major red flags with the paper. I would have preferred biological replicates for their sORF arrays over technical replicates. Additionally, overexpression mutants can cause all kinds of wacky effects that aren't necessarily a consequence of the functional properties of the overexpressed genes; I'm not sure concluding that overexpression mutants of sORFs causes phenotypic consequences more frequently than random chosen genes is going to hold up to further scrutiny (maybe sORFs accumulate more effectively than regular protein products to cause strange side effects). However, again, I think it's sufficient to merit further investment in the topic, such as the knock-down/knockouts they propose.

My comment:

"I think it's worth continuing to pursue study of these small ORFs as the authors propose. They may have already shown this in one of their previous publications, but I'm interested in the properties of these small ORFs in the context of gene prediction. They said they used hexamer composition bias for prediction; are there other properties that might be useful for prediction, like CpG island promoters?"

Friday, October 4, 2013

Journal club: uORFs, eukaryotic "operon"?

Paper: Small open reading frames associated with morphogenesis are hidden in plant genomes. (Hanada et al 2013).

Overall, I liked the paper. It was the first time I had been made aware of the abundance of un-annotated small open reading frames, and I think it adds awareness in the field to something new, a method of detection, and an example of its applications.

Comment 1: Their investigation of expression of both RNA and protein reminded me that we often make assumptions about the relationships between transcript abundance and protein abundance; even if we detect translation, it doesn't guarantee that the we can detect transcription accumulating, and vice versa.

Comment 2: I had never heard of upstream ORFs; they sound conceptually similar to operons/polycistronic DNA found in prokaryotes, so I thought it was interesting that they exist in a eukaryotic system.

Comment 3: The paper described some techniques (controls) that I had never heard of before that I thought were interesting and useful. For example, determining the null distribution of the Ka/Ks ratio with random sequences to obtain a neutrality baseline to identify sORFs under selection was a new idea to me.

Comment/Question 4: What tissue types were used for the 17 different environmental conditions? They refer to seedling, but what tissue on the seedling?

Comment 5: They made a mention that other organisms such as humans contain sORFs, and specifically they reference the ENCODE project. I was wondering what ENCODE used to identify these sORFs; was it a similar algorithm, and if not, maybe it could be used to identify more un-annotated genes?

Comment 6: I'm a bit skeptical of their claim that their prediction algorithm has the best performance of identifying small ORFs. I agree that it performed well and was able to identify a significant number of previously un-annotated genes, however their training set to obtain their a prior hexamer composition bias was done in the same organism, Arabidopsis, that they go on to test in. I question whether it would perform as well in other organisms.

Wednesday, October 2, 2013

awk to transpose (super fast)

Credit goes to Stack Overflow - Transpose a file in bash. I was looking for a fast way to transpose a tab separated variable file, file.tsv, and so per instructions, I created the following shell script, transpose.sh:
#! /bin/bash
awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j] 
        for(i=2; i<=NR; i++){
            str=str"\t"a[i,j]; 
            # delimitting variable can be replaced (i.e "," or " " or "\t")
        }
        print str
    }
}' file.tsv

Then, I ran transpose.sh through the command line and printed the transposed file.tsv to transpose_file.tsv
 transpose.sh > tr_file.tsv