Wednesday, December 15, 2010

SSAKE ... better assembler?

Performance

SSAKE is a genomics application for assembling millions of very short DNA sequences.

Project Description

The Short Sequence Assembly by K-mer search and 3' read Extension (SSAKE) is a genomics application for aggressively assembling millions of short nucleotide sequences by progressively searching for perfect 3'-most k-mers using a DNA prefix tree. SSAKE is designed to help leverage the information from short sequence reads by stringently clustering them into contigs that can be used to characterize novel sequencing targets.
*Best performance is achieved by quality-trimming your reads before assembly

Summary

SSAKE is written in Perl and runs on Linux. SSAKE cycles through short sequence reads stored in a hash table and progressively searches through a prefix tree for the longest possible identical overlap between any two sequences. The algorithm was used to assemble 25-36 bp sequence reads from viral, bacterial and fungal genomes, and forty million 25-mers simulated from the whole-genome shotgun (WGS) sequence data of the Sargasso Sea metagenomics project. Considering the number of sequences to assemble, SSAKE is robust and tractable.
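The overlap search described above can be illustrated with a small Python sketch. This is not SSAKE's actual Perl code: the function names (`build_prefix_index`, `extend_3prime`) are hypothetical, a plain dictionary of read prefixes stands in for the prefix tree, and reverse-complement handling and consensus building are left out. It only shows the core idea of greedily extending a seed at its 3' end using the longest exact overlap.

```python
from collections import defaultdict


def build_prefix_index(reads, min_overlap):
    """Map every proper prefix of length >= min_overlap to the reads that start with it."""
    index = defaultdict(set)
    for read in reads:
        for k in range(min_overlap, len(read)):
            index[read[:k]].add(read)
    return index


def extend_3prime(seed, reads, min_overlap):
    """Greedily extend `seed` at its 3' end, always preferring the longest exact overlap."""
    unused = set(reads) - {seed}
    index = build_prefix_index(unused, min_overlap)
    longest_prefix = max((len(r) for r in unused), default=0) - 1
    contig = seed
    extended = True
    while extended:
        extended = False
        # Try the longest 3'-most k-mer first, then progressively shorter ones.
        for k in range(min(len(contig), longest_prefix), min_overlap - 1, -1):
            suffix = contig[-k:]
            candidates = sorted(r for r in index.get(suffix, ()) if r in unused)
            if candidates:
                read = candidates[0]  # deterministic tie-break (an assumption of this sketch)
                contig += read[k:]    # append only the bases beyond the overlap
                unused.discard(read)  # each read is used at most once
                extended = True
                break
    return contig


reads = ["ACGTAC", "GTACGG", "CGGTTA"]
print(extend_3prime("ACGTAC", reads, min_overlap=3))  # ACGTACGGTTA
```

Each read is consumed once it has been used, so the loop always terminates. A real assembler would also have to deal with sequencing errors and repeats, which is one reason quality-trimming the reads matters.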

Documentation

René L Warren, Granger G Sutton, Steven JM Jones, Robert A Holt. 2007 (epub 2006 Dec 8). Assembling millions of short DNA sequences using SSAKE. Bioinformatics. 23:500-501.

License

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Credits

René Warren, Granger Sutton, Steven Jones and Robert Holt

Source: http://www.bcgsc.ca/platform/bioinfo/software/ssake

Cloud computing for Bioinformatics

We are used to having huge datasets pouring out of high-throughput genome centres, but with the advent of ultra high-throughput sequencing, genotyping and other functional genomics in every laboratory, we are facing a scary new era of petabyte-scale data. For example, the 1000 Genomes Project will probably produce about 1 Tb of finished data. To process the data, this project required about 100 Tb of scratch disk. Working at this level, real technical limitations start to hamper progress. One has to consider storage: not just having enough, but making sure it's available to your compute (network) and that you have sufficient I/O to do anything in real time. Software language and implementation become critical factors when dealing with terabytes of data. With such high-intensity computing, power (getting enough), cooling, etc. become real issues. How do you let anyone else access the data? Is the data backed up and, even if it is, how many years would it take to restore from tape?

So how will we solve all these technical hurdles? Each of these can be solved with technical knowledge. But you do not want to have to worry about working within these constraints.

When working with large datasets, these constraints can continually hamper progress on getting real research done. Whilst one can choose to solve each of these individual problems, the impact of these constraints on the scientific workflow can be considerable. It would be wiser to optimize for productivity.

In software development, similar constraints are addressed with abstraction layers. Database access is mediated through relational mapping tools, and visualization is aided by powerful graphical packages, which saves individual research groups from having to reinvent the wheel. Examples include Rails, Eclipse, Processing, Hibernate and Catalyst.

Cloud computing offers a similar level of abstraction for many of the constraints encountered when dealing with extremely large datasets. You might have encountered similar ideas when using hosted services such as Google Mail, ManyEyes (http://manyeyes.alphaworks.ibm.com) and others. These tools provide an example of what we would ideally like in the perfect world of Bioinformatics. We do not have to worry about how the data is stored or about keeping the software up to date. It's all taken care of for you. First steps have been taken along these lines by companies such as Amazon, Google and Microsoft. Amazon has started to provide Bioinformatics datasets such as Ensembl and GenBank in their publicly hosted datasets (http://aws.amazon.com/publicdatasets/).

A recent requirement to assemble a full human genome from 454 short read data provided a good real-life example of these approaches. The 140 million individual reads that required alignment using SSAHA exceeded the available compute capacity in our own data centre, so a build was performed on Amazon's elastic compute service, EC2. In an afternoon, a scalable, ad hoc cluster with queue management and replicated data storage was constructed with nothing more than a few web service calls and a valid credit card. No service contracts. No consultation with the vendor, just 100 nodes performing SSAHA alignments.

This has implications for large-scale data centers, which are currently engineered to provide peak capacity that often goes unused in idle periods. The elastic, pay-as-you-go nature of cloud services such as AWS means lower infrastructure overheads, as only in-use compute and storage are billable.

Cloud computing has green credentials too, so long as the off-site compute is located where renewable sources of energy are used preferentially. Additionally, whilst unused compute may still require cooling and power in a local data centre, it can be reused by others in the cloud.

The transfer of large datasets can also be simplified with cloud approaches. As an alternative to shipping the data for others to analyse, cloud approaches allow the compute to remain close to the data. Allowing others to access your compute infrastructure may be preferable to distributing large datasets.

Source: Bioinformatics (2009) 25 (12): 1475. doi: 10.1093/bioinformatics/btp274 First published online: May 12, 2009

Contact: agb@sanger.ac.uk

Harvard, Princeton researchers developing implantable "biocomputers"

Researchers at Harvard and Princeton have announced that they've made a "crucial step" in the development of so-called "biocomputers," which could one day be implanted in patients to directly attack diseased cells or tissues, Fantastic Voyage-style. According to Physorg, the computers are actually constructed entirely out of DNA, RNA, and proteins, and are able to translate complex cellular signatures like the activities of multiple genes into a form that can be more readily observed. Currently, the researchers have demonstrated that the biocomputers can work in human kidney cells in culture, although they seem confident that they'll eventually find a wide range of uses, including working in conjunction with biosensors or medicine delivery systems to target, for instance, only cancerous or diseased cells, without causing any harm to the patient's healthy cells.

The results will be published this week in the journal Nature Biotechnology.
"Each human cell already has all of the tools required to build these biocomputers on its own," says Harvard's Yaakov (Kobi) Benenson, a Bauer Fellow in the Faculty of Arts and Sciences' Center for Systems Biology. "All that must be provided is a genetic blueprint of the machine and our own biology will do the rest. Your cells will literally build these biocomputers for you."
Evaluating Boolean logic equations inside cells, these molecular automata will detect anything from the presence of a mutated gene to the activity of genes within the cell. The biocomputers' "input" is RNA, proteins, and chemicals found in the cytoplasm; "output" molecules indicating the presence of the telltale signals are easily discernable with basic laboratory equipment.
"Currently we have no tools for reading cellular signals," Benenson says. "These biocomputers can translate complex cellular signatures, such as activities of multiple genes, into a readily observed output. They can even be programmed to automatically translate that output into a concrete action, meaning they could either be used to label a cell for a clinician to treat or they could trigger therapeutic action themselves."
Benenson and his colleagues demonstrate in their Nature Biotechnology paper that biocomputers can work in human kidney cells in culture. Research into the system's ability to monitor and interact with intracellular cues such as mutations and abnormal gene levels is still in progress.
Benenson and colleagues including Ron Weiss, associate professor of electrical engineering at Princeton, have also developed a conceptual framework by which various phenotypes could be represented logically.
A biocomputer's calculations, while mathematically simple, could allow researchers to build biosensors or medicine delivery systems capable of singling out very specific types or groups of cells in the human body. Molecular automata could allow doctors to specifically target only cancerous or diseased cells via a sophisticated integration of intracellular disease signals, leaving healthy cells completely unaffected.
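As a rough software analogy for the Boolean evaluation described above, the sketch below checks a cellular "signature" against a set of logic clauses. The marker names, the clause layout (an AND of OR-clauses) and the function `evaluate_signature` are illustrative assumptions of this post, not details taken from the Nature Biotechnology paper.

```python
def evaluate_signature(signals, clauses):
    """Return True when every clause has at least one satisfied condition.

    `signals` maps a marker name to True (present/active) or False.
    `clauses` is a list of clauses; each clause is a list of
    (marker_name, expected_state) pairs that are OR-ed together,
    and the clauses themselves are AND-ed.
    """
    return all(
        any(signals.get(name, False) == expected for name, expected in clause)
        for clause in clauses
    )


# Hypothetical disease signature:
# gene_a active AND (gene_b active OR gene_c active) AND gene_d NOT active
disease_signature = [
    [("gene_a", True)],
    [("gene_b", True), ("gene_c", True)],
    [("gene_d", False)],
]

cell = {"gene_a": True, "gene_b": False, "gene_c": True, "gene_d": False}
print(evaluate_signature(cell, disease_signature))  # True
```

In the cell, the "output" would be a molecule rather than a printed value, but the logic is the same: the output appears only when the whole expression evaluates to true.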

Source: Harvard University

Thursday, December 2, 2010

NASA Finds New Life

NASA has discovered a new life form, a bacterium called GFAJ-1 that is unlike anything currently living on planet Earth. It's capable of using arsenic to build its DNA, RNA, proteins, and cell membranes. This changes everything. NASA is saying that this is "life as we do not know it". The reason is that all life on Earth is made of six components: carbon, hydrogen, nitrogen, oxygen, phosphorus and sulfur. Every being, from the smallest amoeba to the largest whale, shares the same life stream. Our DNA blocks are all the same.

That was true until today. In a surprising revelation, NASA scientist Felisa Wolfe-Simon and her team have found a bacterium whose DNA is completely alien to what we know today, working differently than the rest of the organisms on the planet. Instead of using phosphorus, the newly discovered microorganism, called GFAJ-1 and found in Mono Lake, California, uses the poisonous arsenic for its building blocks. Arsenic is an element poisonous to every other living creature on the planet except for a few specialized microscopic creatures.

(Mono Lake, California. Image by Sathish J, Creative Commons)

Here's the organism and a computer simulation of how it substitutes arsenic for phosphorus in its DNA.

Talking at the NASA conference, Wolfe-Simon said that the important thing in their study is that it breaks our ideas on how life can be created and grow, pointing out that scientists will now be looking for new types of organisms and metabolisms that use not only arsenic, but other elements as well. She says that she's working on a few possibilities herself.
NASA geobiologist Pamela Conrad thinks that the discovery is huge and "phenomenal," comparing it to the Star Trek episode in which the Enterprise crew finds the Horta, a silicon-based alien life form that can't be detected with tricorders because it isn't carbon-based. It's like saying that we may be looking for new life in the wrong places with the wrong methods.

Indeed, NASA tweeted that this discovery "will change how we search for life elsewhere in the Universe."

I don't know about you but I've not been so excited about bacteria since my STD tests came back clean. And that's without counting yesterday's announcement on the discovery of a massive number of red dwarf stars, which may harbor a trillion Earths, dramatically increasing our chances of finding extraterrestrial life.

The new life forms up close, at five micrometers.

Source: http://gizmodo.com/5704158/nasa-finds-new-life

Wednesday, September 15, 2010

Bioinformatics is now Harder Better FASTER Stronger.. Courtesy GPU Computing

GPU computing or GPGPU is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.

The model for GPU computing is to use a CPU and GPU together in a heterogeneous co-processing computing model. The sequential part of the application runs on the CPU and the computationally intensive part is accelerated by the GPU. From the user’s perspective, the application just runs faster because it is using the high performance of the GPU.

The NVIDIA Tesla Bio Workbench provides bio-physicists and computational chemists with the tools to push the boundaries of bio-chemical research, optimizing the scientific workflow and accelerating the pace of research. Sequencing and protein docking are very compute-intensive tasks that see a large performance benefit from using a CUDA-enabled GPU. There is quite a bit of ongoing work on using GPUs for a range of bioinformatics and life sciences codes.

Some examples are given below:

GPU-HMMER accelerates the hmmsearch tool using GPUs and gets speed-ups ranging from 60-100x. GPU-HMMER can take advantage of multiple Tesla GPUs in a workstation to reduce the search from hours on a CPU to minutes using a GPU.

MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from NVIDIA to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies.

CUDA-BLASTP running on a workstation with two Tesla C1060 GPUs is 10x faster than NCBI BLAST (2.2.22) running on an Intel i7-920 CPU. This cuts compute time from minutes on CPUs to seconds using GPUs.

CUDASW++ is bioinformatics software for Smith-Waterman protein database searches that takes advantage of the massively parallel CUDA architecture of NVIDIA Tesla GPUs to perform sequence searches 10x-50x faster than NCBI BLAST. CUDASW++ supports query lengths up to 59K.
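For readers unfamiliar with Smith-Waterman, here is a minimal CPU-only Python sketch of the local alignment score that tools like CUDASW++ compute on the GPU. The scoring values (match, mismatch and a linear gap penalty) and the function name `smith_waterman_score` are simplifying assumptions for illustration; protein searches normally use a substitution matrix and affine gap penalties, and this is not CUDASW++ code.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment can restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 8
```

A database search repeats this calculation for the query against every database sequence. Those comparisons are independent of each other, which is what makes the problem a good fit for a massively parallel GPU.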

VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting. Several key kernels and applications in VMD now take advantage of the massively parallel CUDA architecture of NVIDIA’s GPUs. These applications run 20x to 100x faster when using an NVIDIA CUDA GPU compared to running them on a CPU only.

For more information, visit http://www.nvidia.com/object/tesla_bio_workbench.html

Source: www.nvidia.com
