Tuesday, July 30, 2013

Rename error in Linux VirtualLeaf tutorial: "Bareword "tutorial0" not allowed while "strict subs" in use at (eval 1) line 1."

I'm going through the VirtualLeaf tutorial, and while it has instructions for both Windows and Mac OS X/Linux, they do not always work for me.

When trying to "rename all the remaining files ..." (step 4, page 339 of Building Simulation Models of Developing Plant Organs Using VirtualLeaf), I got the following error:

$ rename tutorial0 mymodel *
Bareword "tutorial0" not allowed while "strict subs" in use at (eval 1) line 1.

This is because the rename syntax is different on Ubuntu, where rename is the Perl-based version and expects
$ rename 's/search/replace/;' file1 [file2 file3...]

So I used the following:
$ rename 's/tutorial0/mymodel/;' *

and checked it with
$ ls

to confirm that it had indeed renamed my files! I found the answer to this here.

Nature News article: Quantum boost for artificial intelligence

The original article is posted on the Nature website here. I don't understand it, but quantum computing sounds amazing if it actually works.

Monday, July 29, 2013

Beginnings with Git

I've heard many times before that version control is important, especially as soon as more than one developer is working on a project. Since Precocious Llama and I are likely to be working on the same code in the future, I figure it's about time to bite the bullet and learn a way to do version control.

After some research, it looks like Git in conjunction with GitHub will serve our purposes best.

GitHub has some nice tutorials for getting started (here), as does Git (here). Lots to learn, and it definitely looks useful as things start to scale up.
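
For my own reference, here's a minimal sketch of the kind of first session those tutorials walk through (the repository name, file, and remote URL below are just placeholders, not our actual repo):

$ mkdir myproject && cd myproject
$ git init
$ echo "print('hello')" > analysis.py
$ git add analysis.py
$ git commit -m "First commit"
#Point the local repo at an empty GitHub repo and push
$ git remote add origin https://github.com/username/myproject.git
$ git push -u origin master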

Lastly, the Git repository we're playing with can be found here.

Advice from a systems software engineer

Precocious Llama's father spent most of his career working in R&D at HP developing systems software. This weekend, he stated that in making a hiring decision, he would expect any entry-level computer scientist to have some background in the following topics:
  1. Data Structures
  2. Algorithms
  3. Operating Systems
  4. Networking/Communications (know at least one network protocol well)
He also stated that it is one thing to know theory, and another thing altogether to be able to implement code to solve a problem. He said he expected an interviewee to have some kind of portfolio demonstrating that he or she can be presented with a problem and successfully code and test a solution.

Friday, July 26, 2013

Compile errors in Linux VirtualLeaf installation

I will want to remember this for my next installation of VirtualLeaf (Merks et al. 2011).

For the Linux installation from the source code (VirtualLeaf-v1.0.1-src.zip):

I followed the instructions found in Building Simulation Models of Developing Plant Organs Using VirtualLeaf by Merks and Guravage. One first needs to install Qt from qt-project.org per the instructions. Then download and extract VirtualLeaf.
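
For the download-and-extract step, something along these lines should do it (assuming the zip was saved to ~/Downloads and unpacks into a VirtualLeaf-v1.0.1-src directory):

$ cd ~/Downloads
$ unzip VirtualLeaf-v1.0.1-src.zip -d ~/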

Before compiling, you will need to modify the mesh.h header file (/home/precociousllama_youruser/VirtualLeaf-v1.0.1-src/src/mesh.h).
You can open mesh.h in any editor and find the following:

  template<class Op> void RandomlyLoopNodes(Op f) {

    MyUrand r(shuffled_nodes.size());
    random_shuffle(shuffled_nodes.begin(),shuffled_nodes.end(),r);

    for (vector<Node>::const_iterator i=shuffled_nodes.begin();
         i!=shuffled_nodes.end();
         i++) {
      f(*shuffled_nodes[*i]);
    }
  }

  template<class Op> void RandomlyLoopCells(Op f) {

    MyUrand r(shuffled_cells.size());
    random_shuffle(shuffled_cells.begin(),shuffled_cells.end(),r);

    for (vector<Cell>::const_iterator i=shuffled_cells.begin();
         i!=shuffled_cells.end();
         i++) {
      f(*shuffled_cells[*i]);
    }
  }

This will cause errors in compilation as pointed out by the developer Merks (issue 6).

Comment this section out by adding "/*" at the beginning and "*/" at the end, so that it looks like:
 /*
  template<class Op> void RandomlyLoopNodes(Op f) {

    MyUrand r(shuffled_nodes.size());
    random_shuffle(shuffled_nodes.begin(),shuffled_nodes.end(),r);

    for (vector<Node>::const_iterator i=shuffled_nodes.begin();
         i!=shuffled_nodes.end();
         i++) {
      f(*shuffled_nodes[*i]);
    }
  }

  template<class Op> void RandomlyLoopCells(Op f) {

    MyUrand r(shuffled_cells.size());
    random_shuffle(shuffled_cells.begin(),shuffled_cells.end(),r);

    for (vector<Cell>::const_iterator i=shuffled_cells.begin();
         i!=shuffled_cells.end();
         i++) {
      f(*shuffled_cells[*i]);
    }
  }
*/

Save mesh.h and continue with the compilation procedure.
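
For reference, the compilation procedure itself boils down to the usual qmake/make steps; this is only a sketch, and the exact directory and project file layout may differ in your copy of the source:

$ cd ~/VirtualLeaf-v1.0.1-src/src
#qmake picks up the Qt project (.pro) file in the current directory and generates the Makefiles
$ qmake-qt4
$ make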

If you get an error that says something along the lines of "qt3", check the versions of your qmake and qmake-qt4. You can do so from the Ubuntu command line with:

$ qmake -v

and

$ qmake-qt4 -v

and make sure the versions are the same. Then, try again!

I haven't done anything but look at the pretty pre-loaded models, but it's going to be awesome, for sure. Thanks Merks and Guravage!

Their original paper: Roeland M.H. Merks, Michael Guravage, Dirk Inzé, and Gerrit T.S. Beemster. VirtualLeaf: an open-source framework for cell-based modeling of plant tissue growth and development. Plant Physiology, February 2011.

Thursday, July 25, 2013

Configuring an IGV Genome Server

This one consumed about 2 hours before I finally figured out the problem; it was simply a mistake in compressing the files. Thanks to Precocious Llama and another lab colleague for finding and troubleshooting the issue with me!

The Integrative Genomics Viewer (IGV) is a genome browser tool put out by the Broad Institute. When trying to host a genome for our lab to view, I ran into some problems. My mistake was in compressing the .genome file. What worked for me was:

Make a new directory. Copy a reference genome and GFF3 annotation to it. Use the IGV client to make the .genome file (instructions found here) from the copied reference. Decompress the resulting .genome file and configure the property.txt file within it (instructions found here). Of note, the sequenceLocation field is not a directory as stated by the instructions, but a web-accessible path to the reference fasta (e.g. http://www.yoursite.com/IGV/genome/sbi1.fasta).

My issue was that I kept zip-compressing the directory containing the files (property.txt et al.). Instead, the zip archive needs to be made of the files within the directory (and renamed to id.genome per the instructions).
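
In other words, with a placeholder genome id (substitute your own), the difference is:

#Wrong: this archives the directory itself, so property.txt ends up nested one level deep
$ zip -r sbi1.genome my_genome_dir

#Right: archive the files from inside the directory, then rename/move as needed
$ cd my_genome_dir
$ zip ../sbi1.genome *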

Once I correctly archived the files, everything worked fine.


Tuesday, July 23, 2013

On the number of markers and the number of individuals

This thread passed through the Stacks mailing list, and it covered some things I hadn't really considered all that thoroughly before. I'm posting it here so that I can remember to consider it some more.

The original question posted by a user named Anpn:

Hi,

Sorry, this question is way off topic. Since the question is about linkage
mapping, it might be of interest to some.

Does anybody have suggestions for finding the right number of markers for a
particular number of individuals?

If i have 300 RAD markers on a genome that is say 200 CentiMorgan long in
total, how many (minimum and maximum) F2 individuals should i use to get each
chromosome in one linkage group?

My understanding is that one needs more markers as the number of individuals
increases(number of recombination events increases), or is this not the
case?

Thanks,
Anpn

And the response from Julian Catchen (the developer of Stacks), with a piece from a colleague of his named Angel:

Hi Anpn,

Ideally you want to have one or more markers on either end of a recombination
event. So, as the number of individuals in your cross increases you will get
more recombination events and hence will need more markers. However, using RAD
the number of recombination events should easily be the limiting factor, not the
number of markers.

Given one recombination event per chromosome per progeny you could roughly
estimate the number of markers you need. And since you have an F2 cross, you
have another set of recombination events in the second generation. So, one
progeny, one recombination = two markers. One progeny, two generations and
therefore two recombinations = three markers.

 From my linkage mapping colleague, Angel:

> You don't "need" more markers per se when you increase the number of
> individuals, but if you have a limited number of markers, past a certain
> number of individuals you won't get any more resolution and therefore the
> extra effort is wasted. If you get 1-2 recombination events per individual
> per chromosome and you have a couple hundred markers, with a couple hundred
> individuals you could potentially have a recombination event between any
> given pair of markers and then be able to order every single marker in every
> LG. In that case, additional individuals would have additional recombinations
> but those additional recombinations will only show up as additional distance
> between markers. So if you have 10 markers in a LG, you'll be able to order
> them 1-10, but after that the order is established.

> If you have a limited number of progeny (almost always the case) but a large
> number of markers, you'll still be able to link and order the markers, but in
> that case you won't be able to resolve all the markers because of the lack of
> recombination events. So in the same case of before, if you have 100 markers,
> you could have them in the same order as before but now with 10 markers at
> any given point.

> In this case, additional progeny could supply additional recombination events
> that would further resolve those groups of markers, so if you get 10 more
> recombinations you could now have 20 groups of 5 markers each. Given enough
> recombinations you could ideally resolve all 100 markers in that LG, the
> assumption is always that the recombination events and the markers are
> distributed equally along the genome, and that's seldom the case. After a
> certain number of progeny you start getting diminishing returns and most
> recombinations would occur around the same places and you probably wouldn't
> be able to resolve the ends of the LG.

> When there are recombination hot spots or just places with more
> recombination, having more markers will increase the chances than one marker
> will expand the gap and link things together. With more individuals you'll
> also be able to increase the LOD and be sure that things are really linked
> together.


Best,

julian


Monday, July 22, 2013

Determine size of folders and files in directory

Credit goes to this post.

I'm frequently interested in finding how much disk space the files or directories within a directory are consuming. To find it, I'll use:
du -sh ./*
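When a directory has many entries, piping through sort makes the largest ones easy to spot (GNU sort's -h option understands the human-readable size suffixes):
du -sh ./* | sort -h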

Sunday, July 21, 2013

Extracting sequence information from Stacks samples


TL;DR: On the Stacks mailing list, Ryan Waples posted what looks to be a useful bit of Python code for extracting sequence information from Stacks output. With his permission, I've reposted some of the dialogue and Waples' code here.

And for the full post:

An email passed through the Stacks listserv from a user named Scott Hotaling, who requested the following:

"What I'm after is essentially the X.haplotypes.tsv file that populations outputs with the full sequence context instead of only the variable position(s). I understand this information could be determined manually via catalog.snps.tsv file. But, it would be great to have the full sequence information as an output option of populations."

In response, Ryan Waples replied with the following:

Scott, 

You can try to use the attached python script.   It is not a polished release but I think it should be able to get something close to what you want.

It looks in the catalog and individual files and constructs full fasta seqeunces from the individual haplotypes and the catalog's consensus.  I haven't used the data from this myself, so please double check this is doing what you expect.

You can use it with a command line as follows:
python   stacks_to_fasta_matches.py   stacks_path   catalog_name   individual_name  subset_file   FASTA_file

stacks_path  =  path to the stacks data you want to use, must contain the catalog and individual files [/stacks/data/]
catalog name = name of the catalog [batch_3.catalog]
individual name = name of ind [PHOOD_2131]
subset_file  = A list of catalog IDs, one per line, A fasta entry will be generated for each haplotype of the individual at these loci [./subset.txt]
FASTA_file =  the output. [./output.fa]

For 'consensus' loci, just a single sequence will be printed, not two copies.  This script doesn't check the 'model' calls in the tags files of the individuals, and I think it would spit out more than two seqeunces if more than two haplotypes were seen in an individual.

Good luck,

-Ryan

It may be of interest to do something similar with our data in the future, so I wanted to keep the code available. It also employs a few practices that I'd like to learn from. Thanks again to Ryan Waples for allowing it to be reposted here. Here is his code:

#!/usr/bin/env python
# Python 2.7
# stacks_to_fasta_matches.py v1.1
# 5/21/13
# Ryan Waples



# Command line outline
# > python   stacks_to_fasta_matches.py   stacks_path   catalog_name individual_name  subset_file   FASTA_file

# Usage example
# > python ./stacks_to_fasta_matches.py /stacks/data/ batch_3.catalog   PHOOD_2131  ./subset.txt  ./output.fa


# Inputs
 # A set of stacks catalog and individual files (*.tags.tsv, *.snp.tsv, *.alleles.tsv)
  # specified by a [path] + [catalog_name] + [individual_name]
 # A 'whitelist' of tags.
  # specefied by a [filename]
  # tags_subset_file  format - one tag_id per line.
   # 1
   # 43
   # 12395
   # 6549
   # etc..

# Outputs 
 # An FASTA file 
  # For each white listed tag:
   # Each haplotype present in the catalog will generate one FASTA sequence entry.
  # FASTA header: 
   # >[TAG#]_[ALLELE]_[SNP1_pos]-[SNP2_pos]-[SNP3_pos]-[...]

import sys

def stacks_to_fasta_matches(stacks_path, catalog_name, individual_name, subset_file, out_file):
 """
 
 Takes a whitelist of catalog tags and prints each observed haplotype in the supplied indiviudal
 into a sequence to a FASTA file.
 
 Function also returns a dictionary of FASTA entries.
 
 """
 print "\n"
 if not stacks_path.endswith('/'):
  stacks_path = stacks_path + "/"
 print ("\n".join(['stacks_path ->\t' + stacks_path, 
      'catalog_name ->\t' + catalog_name, 
      "individual_name ->\t" + individual_name,
      'subset_file ->\t' + subset_file, 
      'out_file ->\t'+ out_file]))
 print "\n" 
 
 #used variables
 tags_to_keep  = set()  # strings
 seq_of_tag  = dict() # strings
 snp_pos_of_tag = dict() # lists of ints
 alleles_of_tag = dict() # lists of strings
 ind_ids_of_cat_id = dict() # set of strings
 FASTA_dict  = dict() # strings
 


 with open(subset_file, "r") as SUBSET:
  for line in SUBSET:
   tag = line.strip()
   tags_to_keep.add(tag)

 with open("{0}{1}{2}".format(stacks_path, catalog_name, ".tags.tsv" ), "r") as TAGS:
  for line in TAGS:
   columns = line.strip().split("\t")
   id = columns[2] 
   seq = columns[9]
   seq_type = columns[6]
   if (id in tags_to_keep and seq_type == 'consensus'):
    seq_of_tag[id] = seq

 with open("{0}{1}{2}".format(stacks_path, catalog_name, ".snps.tsv" ), "r") as SNPS: 
  for line in SNPS:
   columns = line.strip().split("\t")
   id = columns[2]
   pos = columns[3]
   if id in tags_to_keep and id in seq_of_tag:
    snp_pos_of_tag.setdefault(id, [])
    snp_pos_of_tag[id].append(int(pos))
   

 with open("{0}{1}{2}".format(stacks_path, individual_name, ".matches.tsv" ), "r") as MATCHES:
  for line in MATCHES:
   columns = line.strip().split("\t")
   cat_id = columns[2]
   ind_id = columns[4]
   allele = columns[5]
   if (cat_id in tags_to_keep and cat_id in seq_of_tag):
    alleles_of_tag.setdefault(cat_id, [])
    alleles_of_tag[cat_id].append(allele)
    ind_ids_of_cat_id.setdefault(cat_id, set())
    ind_ids_of_cat_id[cat_id].add(ind_id)
       
 with open(out_file, "w") as FASTA:
  for cat_id in sorted(alleles_of_tag.keys(), key=int):
   base_seq = seq_of_tag[cat_id]
   haplotype = base_seq[:]
   for allele in alleles_of_tag[cat_id]:
    snp_index = 0
    try:
     for snp in sorted(snp_pos_of_tag[cat_id]):
      haplotype = haplotype[:snp] + allele[snp_index] + haplotype[snp+1:]
      snp_index += 1
    except KeyError: # happens on 'consensus' genotypes
     haplotype = haplotype
    try:
     snp_positions = "_".join([str(x) for x in snp_pos_of_tag[cat_id]] )
    except KeyError:
     snp_positions = "none"
    ind_id = "_".join(ind_ids_of_cat_id[cat_id])
    header = ">cat:{}|ind:{}|allele:{}|pos:{}".format(cat_id, ind_id, allele, snp_positions)
    FASTA_dict[header] = haplotype
    header_line = header + "\n"
    seq_line = haplotype + "\n"
    FASTA.write(header_line)
    FASTA.write(seq_line)

 return FASTA_dict 


def main(argv=None):
 if argv is None:
  argv = sys.argv[1:]

 stacks_to_fasta_matches(*argv)
 
if __name__ == "__main__":
 main(sys.argv[1:])


Thursday, July 18, 2013

Search and replace in a file without opening it

Occasionally I'll need to perform a search and replace in a file so large that I can't reasonably open it in Vim. Using Perl has performed well for me in cases like this:
perl -pi -e 's/toFind/toReplaceWith/g' filePath
For example, sometimes I need to change the names of scaffolds in a large SAM file:
perl -pi -e 's/chromosome_//g' ./example.sam
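Since -i edits the file in place without a backup, I like to preview the substitution first by dropping -i, or keep a backup copy with -i.bak:
#Preview the first few substituted lines without touching the file
perl -pe 's/chromosome_//g' ./example.sam | head
#Edit in place, keeping the original as example.sam.bak
perl -pi.bak -e 's/chromosome_//g' ./example.sam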

Wednesday, July 17, 2013

Apache error log location in Linux Ubuntu 12.10

I can never remember where the Apache error log is on my system. This is for Apache 2 installed via the apt package manager on Ubuntu 12.10:
/var/log/apache2/error.log
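To watch it live while reproducing an error (sudo may be needed depending on your user's groups):
tail -f /var/log/apache2/error.log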

Monday, July 15, 2013

Compression of files in parallel using GNU parallel

*Update*

In the comments, Ole Tange (GNU Parallel's author!) himself weighed in with a much more elegant method than what I previously wrote.
parallel --gnu -j-2 --eta gzip ::: *.fastq

parallel --gnu -j-2 --eta gunzip ::: *.fastq.gz
He also pointed out the helpful --bibtex option:
parallel --gnu --bibtex
Which returns the bibtex citation:
@article{Tange2011a,
 title = {GNU Parallel - The Command-Line Power Tool},
 author = {O. Tange},
 address = {Frederiksberg, Denmark},
 journal = {;login: The USENIX Magazine},
 month = {Feb},
 number = {1},
 volume = {36},
 url = {http://www.gnu.org/s/parallel},
 year = {2011},
 pages = {42-47}
}


*Resuming previous post*

I'm very fond of GNU parallel for making it easy to parallelize commands. Here is an example of using it to parallelize gzip compression and decompression of a set of fastq format files. The --gnu option designates that we want to use GNU parallel, and the -j-2 option tells it to run two fewer jobs than the number of available CPU cores.
#!/bin/bash
#Gzip compress files with GNU parallel

#Use this for compression
ls ./*.fastq | parallel --gnu -j-2 --eta \
'gzip {}'

#Use this for decompression
ls ./*.fastq.gz | parallel --gnu -j-2 --eta \
'gzip -d {}'
Helpful documentation for GNU parallel can be found here.

Saturday, July 13, 2013

In silico restriction digest and retrieval of restriction site associated DNA

My initial goal was to be able to use Biopieces to perform an in silico restriction digest of an entire genome and retrieve the DNA sequence flanking restriction sites. Biopieces performed well for scanning the genome and identifying locations with rescan_seq, but digest_seq did not scale well to the whole chromosome; digest_seq consumed over 16GB of RAM on a chromosome of ~70Mbp.

Just counting the number of restriction sites was fairly rapid (requires Biopieces, a fasta file with the reference genome, and a restriction enzyme from the rescan_seq list):
#transliterate_seq replaces N's with A's so that the sequence can be properly scanned for restriction sites.
#Without converting, rescan_seq is a bit greedy and will consider a string of N's a possible restriction site.
#rescan_seq in this example scans for the FseI sites.
read_fasta -i sbi1.fasta | transliterate_seq -s 'Nn' -r 'aa' | rescan_seq -r FseI -o | awk ' BEGIN{i=0} $1=="RE_COUNT:" {i=i+$2} END{print i} '
But getting the sequence was a little more challenging due to the memory issues of digest_seq. Ultimately I worked up a pipeline to extract the sequence, albeit a very slow one at this point (~1 hour to get the sequence for 40k+ RAD tags from 20k+ RE sites).
The pipeline requires Biopieces, a fasta file containing the reference genome (in my case the Sorghum bicolor reference genome build Sbi1), and a restriction enzyme from the rescan_seq list.
#transliterate_seq replaces N's with A's so that the sequence can be properly scanned for restriction sites.
#Without converting, rescan_seq is a bit greedy and will consider a string of N's a possible restriction site.
#rescan_seq in this example scans for the FseI sites.
#We use awk to avoid capturing the very long "SEQ:" field that is output by rescan_seq since we don't need it.
read_fasta -i sbi1.fasta | transliterate_seq -s 'Nn' -r 'aa' | rescan_seq -r FseI | awk ' $1!="SEQ:" {print;} ' > FseIout.txt
At this point, FseIout.txt contains information on the location of matches for the restriction sites in the genome. I wrote a script in Python to take this output and retrieve the restriction site associated DNA. It's quite slow, and I think the bottleneck is the system calls to get_genome_seq for each match:
#!/usr/bin/python
# -*- coding: utf-8 -*-
#Ryan McCormick 07/12/13
#Texas A&M University
#For use in conjunction with a Biopieces pipeline
#Version 0.1

import sys
from subprocess import check_output

#Modify these variables based on genome name, desired length of RAD tag, and RE site.
str_genome = "Sbi1"
int_len = 10
int_lenREsite = 8
###

try:
    li_tsv = [line.strip() for line in open(sys.argv[1])]  # Open target file, strip newline characters from lines, define as list.
except IndexError:  # Check if arguments were given
    sys.stdout.write("No arguments received. Please check your input:\n")
    sys.stdout.write("\t$ python.py input.tsv\n")
    sys.exit()
except IOError:  # Check if file is unabled to be opened.
    sys.stdout.write("Cannot open target file. Please check your input:\n")
    sys.stdout.write("\t$ python.py input.tsv\n")
    sys.exit()

lili_allContigs = []
li_contig = []

for i in range(len(li_tsv)):
    if li_tsv[i] == "---":
        lili_allContigs.append(li_contig)
        li_contig = []
    else:
        li_contig.append(li_tsv[i])

int_numREsites = 0
for i in range(len(lili_allContigs)):
    str_chr = (lili_allContigs[i][4].split(" "))[1]
    str_RE = (lili_allContigs[i][0].split(" "))[1]
    try:
        li_matches = ((lili_allContigs[i][2].split(" "))[1]).split(";")
    except IndexError:
        li_matches = []
    sys.stderr.write("On %s\n" % str_chr)
    for j in range(len(li_matches)):
        if j % 1000 == 0:
            sys.stderr.write("On site %s\n" % j)
        int_beg = int(li_matches[j]) + 1 - int_len
        if int_beg <= 1:
            int_beg = 1
        str_argGen = "-g" + str_genome
        str_argChr = "-c" + str_chr
        str_argBeg = "-b%s" % int_beg
        str_argLen = "-l%s" % (int_len + int_len + int_lenREsite)
        output = check_output(["get_genome_seq", str_argGen, str_argChr, str_argBeg, str_argLen])
        li_output = output.split("\n")
        str_seq = (li_output[1].split(" "))[1]
        str_IDChr = (li_output[4].split(" "))[1]
        #Get upstream of the RE site
        str_IDBegU = str(int_beg)
        str_IDEndU = str(int_beg + int_len + int_lenREsite)
        str_fastaID = ">" + str_IDChr + "_" + str_IDBegU + "_" + str_IDEndU + "_" + str_RE + "_U"
        sys.stdout.write(str_fastaID + "\n")
        sys.stdout.write(str_seq[0:int_len + int_lenREsite] + "\n")
        #Get downstream of the RE site
        str_IDBegD = str(int_beg + int_len)
        str_IDEndD = str(int_beg + int_len + int_len + int_lenREsite)
        str_fastaID = ">" + str_IDChr + "_" + str_IDBegD + "_" + str_IDEndD + "_" + str_RE + "_D"
        sys.stdout.write(str_fastaID + "\n")
        sys.stdout.write(str_seq[int_len:] + "\n")
        int_numREsites = int_numREsites + 1
sys.stderr.write("Number of restriction sites: %s\n" % int_numREsites)
To use it, the reference genome has to be indexed with Biopieces:
#Index the reference genome with Biopieces so that we can search it with get_genome_seq
read_fasta -i sbi1.fasta | format_genome -f fasta -g Sbi1 -x
#Then run the Python script. This takes a very long time in the current implementation (~1 hour for 22k RE sites). 
time python python.py FseIout.txt > FseIRADtags.fa

Friday, July 12, 2013

Determine the type of an object in Python

Python 2.7.2
Credit goes to this post at Stack Overflow:
type(object)
And here are the examples from the aforementioned post, provided by a poster named poke:
>>> type( [] ) == list
True
>>> type( {} ) == dict
True
>>> type( "" ) == str
True
>>> type( 0 ) == int
True
>>> class Test1 ( object ):
    pass
>>> class Test2 ( Test1 ):
    pass
>>> a = Test1()
>>> b = Test2()
>>> type( a ) == Test1
True
>>> type( b ) == Test2
True
>>> type( b ) == Test1
False
>>> isinstance( b, Test1 )
True
>>> isinstance( b, Test2 )
True
>>> isinstance( a, Test1 )
True
>>> isinstance( a, Test2 )
False
>>> isinstance( [], list )
True
>>> isinstance( {}, dict )
True

Call external program from Python and capture its output

Credit goes to this Stack Overflow post, and this Stack Overflow post.
Using Python 2.7.3. This example calls ls with the -l option:

#!/usr/bin/python
# -*- coding: utf-8 -*-
 
from subprocess import check_output
 
output = check_output(["ls", "-l"])
print(output)

Thursday, July 11, 2013

Installation of Biopieces on Ubuntu Linux 12.10 - Ruby installation

Installing Biopieces was not as trivial as I had hoped due to some version issues with Ruby; using the Debian package for Ruby did not work for me. Here's a combination of what ultimately worked for me on two of my systems (32-bit Ubuntu Linux 12.10 and 64-bit Ubuntu Linux 12.10) after considerable troubleshooting.

Install these dependencies to be able to compile Ruby from source; credit goes to this blog for listing them.
$ sudo apt-get install build-essential vim git-core curl  
  
$ sudo apt-get install bison openssl libreadline6 libreadline6-dev zlib1g zlib1g-dev libssl-dev libyaml-dev libxml2-dev libxslt-dev autoconf libc6-dev ncurses-dev  
  
$ sudo apt-get install libcurl4-openssl-dev libopenssl-ruby apache2-prefork-dev libapr1-dev libaprutil1-dev  
  
$ sudo apt-get install libx11-dev libffi-dev tcl-dev tk-dev
And also get subversion (Warning! I may be missing some dependencies that were already on my systems, such as Perl and CPAN).

$ sudo apt-get install subversion

After those dependencies were installed, I used the Biopieces install script to guide my search for additional dependencies. Running the script checks for its dependencies and reports which ones are missing.

#Change to the directory to which it was downloaded (cd ~/Downloads in my case). One could also use wget here to download it.
$ chmod u+x biopieces_install.sh
$ ./biopieces_install.sh

At this point, the installer would abort after citing missing Perl dependencies. Each Perl dependency was obtained from CPAN (Perl's package manager).

$ sudo cpan
> install Inline
> install JSON::XS
#...and so on until all packages were obtained

And finally the installer would abort after being unable to find Ruby.

The real problems came with installing Ruby. I ran into Ruby and Rubygems (Ruby's package manager) version problems, and an issue installing the gem RubyInline due to an error in a RubyInline dependency called ZenTest; ZenTest required a different version of the Rubygems package manager than the one I had installed. The best solution I found was to avoid the Debian packages for Ruby and Rubygems and compile them from source. Due to pilot error while troubleshooting the Ruby installation, I had unwittingly put multiple versions on my system, so the first thing I needed to do was get a fresh start by wiping all Ruby and Rubygems off my system. To check for them, I tested whether ruby and gem were in my path; if they are, these commands return version numbers. For me, Biopieces required a Ruby version > 1.9 and a gem version > 2.0.

$ ruby -v
$ gem -v

So I just started removing Debian packages until those commands were no longer found.

$ sudo apt-get remove ruby
$ sudo apt-get remove rubygems
$ sudo apt-get remove ruby1.9
$ sudo apt-get remove ruby1.8

And even then the executable for Ruby was still in my path. So I did the inelegant thing to get it out of my path:

$ sudo rm /usr/bin/ruby

Finally, the version checks returned nothing; neither ruby nor gem was in my path.

Without Ruby on the system I had a bit of a fresh start. At this point, I grabbed the ruby_installer.sh script from the Biopieces site. This is a script to compile Ruby from source.

#Change to the directory to which it was downloaded (cd ~/Downloads in my case). One could also use wget here to download it.
$ chmod u+x ruby_installer.sh
$ ./ruby_installer.sh

This will compile Ruby from source and try to install the necessary gems, but for me it ultimately failed due to the issues with ZenTest and the provided version of Rubygems. As far as I could tell, this was because Rubygems was not able to update itself through the ruby_installer.sh script as intended. The error I got was:

Installing RubyInline gem.
Fetching: ZenTest-4.9.2.gem (100%)
Invalid gemspec in [/home/user_name/ruby_install/lib/ruby/gems/1.9.1/specifications/ZenTest-4.9.2.gemspec]: Illformed requirement ["< 2.1, >= 1.8"]
Invalid gemspec in [/home/user_name/ruby_install/lib/ruby/gems/1.9.1/specifications/ZenTest-4.9.2.gemspec]: Illformed requirement ["< 2.1, >= 1.8"]
Fetching: RubyInline-3.12.2.gem (100%)
Invalid gemspec in [/home/user_name/ruby_install/lib/ruby/gems/1.9.1/specifications/ZenTest-4.9.2.gemspec]: Illformed requirement ["< 2.1, >= 1.8"]
ERROR:  Error installing RubyInline:
 RubyInline requires ZenTest (~> 4.3)

Importantly, at the end it states:

All done. Now append the following to your ~/.bashrc file:
export PATH="/home/user_name/ruby_install/bin:$PATH"

Here I added the ruby_install bin directory to my path by running that export command:

$ export PATH="/home/user_name/ruby_install/bin:$PATH"

To resolve the ZenTest "Illformed requirement" error, I had to update Rubygems:

#This also gave some warnings about invalid gemspec and illformed requirement for ZenTest
$ gem update --system

And finally:

$ gem install RubyInline

At that point, the biopieces_install.sh installer worked. Don't forget to add the ruby_install bin directory to your PATH in your ~/.bashrc file. Lastly, this still leaves the third-party software (BWA, Velvet, Bowtie, etc.) to be installed.

In silico restriction enzyme digest of eukaryotic genome

Update: see this newer post.

I'm trying to find a resource for in silico restriction enzyme digestion of a eukaryotic genome. Here's what I've found so far:

A suggestion in the SeqAnswers forum is to use Biopieces, which looks like it was written by Martin Hansen. This link contains the documentation for the digest_seq component, and this link to the documentation for extract_seq may also be useful. Here is a bit from that post on SeqAnswers:
read_fasta -i genome.fna | digest_seq -p GGATCC -c 1 | plot_lendist -k SEQ_LEN -x

Some code in Java was posted here at the BioStar forums by a user named Pierre Lindenbaum. The posted code follows:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.zip.GZIPInputStream;



public class Digest
    {
    private static final String CHROMOSOMES[]={
        "chr1","chr2","chr3",
        /* etc */
        "chrY"
        };
    private static class Enzyme
        {
        String name;
        String site;
        public Enzyme(String name,String site)
            {
            this.name=name;
            this.site=site.toUpperCase();
            }
        }
    //see   
    private static final Enzyme ENZYMES[]=new Enzyme[]
        {
        new Enzyme("AatII","gacgtc"),
        new Enzyme("AbsI","cctcgagg"),
        new Enzyme("Acc65I","ggtacc"),
        /* (...) */
        new Enzyme("BamHI","ggatcc"),
        /* (...) */
        new Enzyme("EcoRI","gaattc"),
        /* (...) */
        new Enzyme("XmnI","gaannnnttc")
        };

    private static boolean compatible(char a,char b)
        {
        b=Character.toUpperCase(b);
        if("ATGC".indexOf(b)==-1) return false;
        switch(a)
            {
            case 'A': return b==a;
            case 'T': return b==a;
            case 'G': return b==a;
            case 'C': return b==a;


            case 'W': return "AT".indexOf(b)!=-1;
            case 'S': return "GC".indexOf(b)!=-1;
            case 'R': return "AG".indexOf(b)!=-1;
            case 'Y': return "CT".indexOf(b)!=-1;
            case 'M': return "AC".indexOf(b)!=-1;
            case 'K': return "GT".indexOf(b)!=-1;


            case 'B': return "CGT".indexOf(b)!=-1;
            case 'D': return "AGT".indexOf(b)!=-1;
            case 'H': return "ACT".indexOf(b)!=-1;
            case 'V': return "ACG".indexOf(b)!=-1;

            case 'N': return "ACGT".indexOf(b)!=-1;

            default: return false;
            }
        }



    public static void main(String[] args)
        {
        try
            {
            for(Enzyme enzyme:ENZYMES)
                {
                for(String chr:CHROMOSOMES)
                    {
                    URL url=new URL("http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/"+chr+".fa.gz");
                    BufferedReader r=new BufferedReader(new InputStreamReader(new GZIPInputStream(url.openStream())));
                    int c;
                    int genome_pos=0;
                    int previous_pos=0;
                    char buffer[]=new char[enzyme.site.length()];
                    int buffer_length=0;
                    while((c=r.read())!=-1)
                        {
                        if(c=='>')
                            {
                            while((c=r.read())!=-1 && c!='\n') { /* skip */}
                            genome_pos=0;
                            buffer_length=0;
                            previous_pos=0;
                            continue;
                            }
                        if(!Character.isLetter(c)) continue;
                        buffer[buffer_length++]=(char)c;
                        genome_pos++;

                        if(buffer_length==buffer.length)
                            {
                            int i=0;
                            for(i=0;i< buffer.length;++i)
                                {
                                if(!compatible(enzyme.site.charAt(i),buffer[i])) break;
                                }
                            if(i==buffer.length)
                                {
                                System.out.println(
                                    chr+"\t"+enzyme.name+"\t"+(previous_pos)+
                                    "\t"+(genome_pos-buffer.length)+
                                    "\t"+((genome_pos-buffer.length)-previous_pos)
                                    );
                                previous_pos=genome_pos-buffer.length;
                                }
                            buffer_length--;
                            System.arraycopy(buffer, 1, buffer, 0, buffer.length-1);
                            }

                        }
                    r.close();
                    System.out.println(
                            chr+"\t"+enzyme.name+"\t"+(previous_pos)+
                            "\t"+(genome_pos-buffer.length)+
                            "\t"+((genome_pos-buffer.length)-previous_pos)
                            );
                    }
                }
            }
        catch (Exception e)
            {
            e.printStackTrace();
            }
        }
    }
Compilation:
javac Digest.java
Execution:
java Digest
Output:
chr1    AatII    0    47476    47476
chr1    AatII    47476    70600    23124
chr1    AatII    70600    164453    93853
chr1    AatII    164453    170555    6102
chr1    AatII    170555    175979    5424
chr1    AatII    175979    325093    149114
chr1    AatII    325093    360629    35536
chr1    AatII    360629    369596    8967
chr1    AatII    369596    568607    199011
chr1    AatII    568607    568665    58
chr1    AatII    568665    620090    51425
(...)

Wednesday, July 10, 2013

Count number of reads in a SAM file above or below a mapping quality score with awk

Another newbie use of awk. This is a one-line program for counting the number of reads in a SAM file based on a mapping quality score threshold (column 5). It can also be easily modified to count lines meeting some other condition.

In this example, we're testing whether the value of column 5 for each row is greater than or equal to 20 (a 99% probability that the read was mapped correctly) and incrementing a counter variable when it is. The test can be modified to any condition that is of interest to count.
$ awk 'BEGIN{i=0} $5>=20 {i=1+i} END{print i}' inputsam.sam

awk also has logical operators for "and" and "or": "&&" and "||":
$ awk 'BEGIN{i=0} $5>=20 && $5<=30 {i=1+i} END{print i}' inputsam.sam

Tuesday, July 9, 2013

Find CPU information in Ubuntu (12.10)

To get information regarding your CPU(s) from the command line in Ubuntu:

$ cat /proc/cpuinfo


To get the model of your CPU(s) specifically:
$ cat /proc/cpuinfo | grep model\ name
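
And to count the number of logical CPUs (handy for deciding how many jobs to run with GNU parallel), either of these works:
$ grep -c "model name" /proc/cpuinfo
$ nproc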

Interesting response to Nature Article "Biology must develop its own big data systems"

The original article, found here, seems to place the blame on the database engineers. I also found the response by a commenter, Tony Berno, interesting:

Tony Berno•2013-07-08 03:45 PM

This editorial is frustrating because it could have been written fifteen years ago, and there has been little in the way of progress since then. If "people are not going to change" and "the problem is not technical", then what opportunity is there for progress? We already have excellent systems for representing, maintaining, and querying data; conventional relational databases are a particularly mature example. The SQL language used to query them has a few counter-intuitive features, but on the whole it is difficult to imagine an easier or more comprehensive technology for asking a wide variety of questions about large, complex datasets. Yet I regularly encounter senior scientists and informatics directors who look at relational data models and proclaim, almost defiantly, that they "mean nothing" to them.

Replacements for this well-established technology are sometimes more powerful or scalable, and are usually sold on their technical merits, but in reality they are motivated by magical thinking. It is believed that if scientists can formulate questions "without programming", it is assumed they will obtain answers without effort. Unfortunately, what is called "programming" is actually the easiest part of this process - the discipline required to think rigorously about data and formulate precise questions about it is much harder to learn, and it cannot be delegated or sidestepped by replacing the underlying technology with something "visual" or "intuitive". This approach actually makes the situation worse by establishing unrealistic expectations about the nature of the problem.

Also, the requirement of mapping data to clean, transparent, and shared ontologies at the time of production (or before!) cannot be avoided by any current technology. Databases that are "flexible" in accepting varying data structures simply push the data curation problem to the time of the query, at which point it must be solved independently by multiple users who are even less prepared to address it. This inevitably results in a "roach motel" database - data goes in, but doesn't come out.

It is not reasonable to expect every biologist to become a software engineer, but the basics of data modelling and query construction are not especially hard to learn. The current situation is akin to construction workers that refuse to read architectural diagrams. This is simply unsustainable, and it seems to me that as a first step, an introductory course in database technology must be required of all biology graduates. It is an unfortunate truth that many biologists chose their specialty out of a desire to avoid the rigorous mathematics of physics or chemistry. This is a critical mistake. Biology has changed dramatically, becoming one of the most mathematics- and data-intensive of all the sciences. If its culture does not fully embrace the intellectual challenge presented by its own data models, it will forever fall short of its potential.

Read in file by command line argument and split on tab delimiter in Python

An example of how to read a file into a list based on a command line argument and split on a tab delimiter in Python, with some error catching.

#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys

try:
    li_tsv = [line.strip() for line in open(sys.argv[1])]  # Open target file, strip newline characters from lines, define as list.
    li_tsv = [element.split('\t') for element in li_tsv]  # Split elements of list based on tab delimiters.
except IndexError:  # Check if arguments were given
    sys.stdout.write("No arguments received. Please check your input:\n")
    sys.stdout.write("\t$ python.py input.tsv\n")
    sys.exit()
except IOError:  # Check if file is unabled to be opened.
    sys.stdout.write("Cannot open target file. Please check your input:\n")
    sys.stdout.write("\t$ python.py input.tsv\n")
    sys.exit()

Monday, July 8, 2013

$\LaTeX$ in LibreOffice Impress

I'm currently trying to prepare a presentation for my first ever committee meeting (eeks!), and I am using LibreOffice Impress. I found this LaTeX add-on, and now equation presentation won't be my downfall (phew!).

After downloading and installing the extension from:

http://extensions.libreoffice.org/extension-center/texmaths-1

Open a new Impress instance and you will find a $\pi$ icon in the upper-left toolbar. Type your equations in good ol' $\LaTeX$ and they will be displayed as an SVG or PNG. The equation source is saved, so you can edit it later as well!
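
For example, typing just the equation body into the TexMaths window is enough; something like:
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i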

Newbie awk usage example

I just started using awk for pulling data out of a flatfile. The $\$$number values (e.g. $\$$1) represent columns (e.g. $\$$1 is column 1) of a tab-separated value file (designated with the -F'\t' option). Here, awk is receiving lines piped to it from head and outputting information for rows in which column 14 is equal to 4.
#Test of using Aho, Weinberger, and Kernighan.

#Statements (i.e. print statements) can be put in the brackets for Begin and End for execution.
head -n 50 LeafAngle_BTx6230-IS3620C_Field_06-13-13.csv | awk -F'\t' '\
BEGIN { print "Plant\tRep\tLeaf4_Area\tLeaf4_Max_Width\tLeaf4_Average_Width\tLeaf4_Length" } \
$14=="4" { print $12"\t"$13"\t"$15"\t"$17"\t"$18"\t"$19 } \
END { } \ 
'
And for use without head:
#Test of using Aho, Weinberger, and Kernighan.

#Statements (i.e. print statements) can be put in the brackets for Begin and End for execution.
awk -F'\t' '\
BEGIN { print "Plant\tRep\tLeaf4_Area\tLeaf4_Max_Width\tLeaf4_Average_Width\tLeaf4_Length" } \
$14=="4" { print $12"\t"$13"\t"$15"\t"$17"\t"$18"\t"$19 } \
END { } \ 
' LeafAngle_BTx6230-IS3620C_Field_06-13-13.csv > Leaf_4_out.tsv

Vim regex

Regular expression (regex) resource for Vim: VimRegex

How to show line numbers in vim

To display line numbers in vim, use the set number command.

:set number

To turn off line numbers in vim, toggle the setting with the set number! command (or use :set nonumber).

:set number!

Write ordered list in LaTeX using enumerate

One way to write an ordered list in LaTeX is to use the enumerate package. Include it with:
\usepackage{enumerate}
Then use the enumerate environment to write ordered lists:
\begin{enumerate}
 \item List 1 Item
 \begin{enumerate}
  \item Sublist of 1 Item 1
  \item Sublist of 1 Item 2
 \end{enumerate}
 \item List 2 Item
\end{enumerate}
This yields a two-level numbered list, with the sublist items indented and numbered beneath the first item.

How to show LaTeX in Blogger

How to show LaTeX in Blogger, taken from this blog. See this post to view an example in our blog.
Include this before </head> in your Template:
<script type='text/x-mathjax-config'> MathJax.Hub.Config({tex2jax: {inlineMath: [[&#39;$&#39;,&#39;$&#39;], [&#39;\\(&#39;,&#39;\\)&#39;]]}}); </script> 
<script src='http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' type='text/javascript'> </script>
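
With that configuration in place, math can be typed directly into a post. A quick example (the inline delimiters match the inlineMath settings above; MathJax's default \[ \] delimiters handle displayed equations):
Inline: $E = mc^2$
Displayed: \[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \]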

Sunday, July 7, 2013

How to show code snippets in Blogger

Information taken from this blog.

This works for now; I don't know if it will be optimal. See also this blog; those approaches may be better.

Add the script and stylesheet includes from the linked blog to your Template, above </head>.

Then you can use the <pre> tag for code.
<pre class="brush: cpp">
int x = 2;
</pre>
gives
int x = 2;

smart and graduate studies are mutually exclusive

Rest assured fellow graduate students:

Let $S$ be the set of smart individuals and $P$ the set of PhD students; then we can safely assume that
\[S\cap P=\emptyset.\]

I have been reading advice from the following blog: http://matt.might.net/articles/successful-phd-students/.

I'm not smart, so it has been very calming for me to hear that being smart is "neither sufficient nor necessary" to obtain a PhD. In fact, smart people don't go to graduate school.

Phew!


Collaborative meeting platform

Scheduling a committee meeting is akin to herding cats. Here's the best solution for a collaborative meeting platform we've found so far:
Doodle:
http://www.doodle.com/

Count number of files in directory using bash

To count the number of files in a directory using a bash shell, one can use a variation of:
ls -1 | wc -l
such as
ls -1 *.tsv | wc -l
Note that the option after ls is the number "1" and not the letter "l"; it makes ls print one entry per line. The output of ls is then piped to wc with the -l option, which counts the number of newline characters (i.e., lines).
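
A variation using find counts files recursively through subdirectories as well:
find . -type f | wc -l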