Bringing Programming to Biology
Posted On August 1, 2001 by Ramdas S filed under Enterprise , Miscellaneous
Programming has mattered in biology since well before the last decade, and it has a significant future now that it is a recognised part of research in many areas of medicine and basic biology. This may not be news to biologists. Either way, we'll look at some interesting bioinformatics programming tools that make biocomputing an easier job.
Bioperl
Bioperl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. It provides reusable modules for sequence manipulation, for accessing databases using a range of data formats, and for running and parsing the results of various molecular biology programs, including Blast, clustalw, TCoffee, genscan, ESTscan and HMMER. Consequently, Bioperl makes it possible to write scripts that analyze large quantities of sequence data in ways that are typically difficult or impossible with web-based systems.
The Bioperl project is a coordinated effort to collect computational methods routinely used in bioinformatics into a set of standard, well-documented, and freely available Perl modules.
In order to take advantage of Bioperl, the user needs a basic understanding of the Perl programming language, including how to use Perl references, modules, objects and methods. Many developers in the bioinformatics community make extensive use of Perl, which excels at such tasks. However, despite the widespread use of Perl by computational biologists, there are no standard Perl modules available that are designed specifically for bioinformatics. Different developers have therefore had to independently write code for common tasks in bioinformatics. This has led to the creation of similar but mutually incompatible programs, modules, and functions. The Bioperl project addresses this deficiency by coordinating the development of a set of core Perl modules with well-documented interfaces that provide standard functionality for molecular biological data management. Bioperl modules are offered as "common currency" for data management and algorithmic operations frequently required in bioinformatics. By disseminating standard modules for common objects and operations, Bioperl hopes to promote the development of interoperable programs and decrease the amount of redundant effort by bioinformatics developers. Bioperl modules currently under development address the management of nucleotide and protein sequences, sequence alignments, BLAST reports, and 3D structures.
Software
Modules were developed using Perl version 5.002 or higher and rely heavily on the object-oriented features of Perl. The Bioperl modules are distributed in a standard CPAN (Comprehensive Perl Archive Network) form, which makes them easy to install and test. Source code for modules, as well as the Perl source, is free for use under the Perl Artistic License.
Core Objects
Sequence: Seq.pm implements a Perl object containing a single nucleotide or peptide sequence. Programs and other objects that create, produce, or edit sequences are expected to operate on and return sequence objects.
Sequence Alignment: There are two modules which can be used for multiple sequence alignments. One of them, SimpleAlign.pm, provides a simplistic view of an alignment, treating it as immutable text that can be converted into a number of formats. Despite this limitation, it provides a convenient way to store, interconvert and manipulate alignments when the alignment is fixed.
BLAST: The Blast.pm sequence analysis module is an example of a utility, or wrapper type of module. Such modules provide an interface for key computational tools whose use is ubiquitous within bioinformatics. The widespread, routine use of the BLAST sequence alignment algorithm has created a need for a standard object which can execute commands to generate BLAST reports as well as parse and analyze the BLAST output.
Alignment Factories: A set of alignment factories (objects which make alignments) has been implemented for the protein Smith-Waterman algorithm. Each factory stores the alignment parameters and produces a SimpleAlign object from two sequences.
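The factory idea can be sketched in a few lines of Python. This is an illustration only, not Bioperl's code: the class name LocalAlignmentFactory, the scoring values and the linear gap penalty are all assumptions, and the result is returned as plain strings rather than a SimpleAlign object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalAlignmentFactory:
    """Hypothetical factory: the scoring parameters live here, not in each call."""
    match: int = 2
    mismatch: int = -1
    gap: int = -2  # linear gap penalty, chosen only for illustration

    def _sub(self, x, y):
        return self.match if x == y else self.mismatch

    def align(self, a, b):
        """Return (score, aligned_a, aligned_b) for the best local alignment."""
        rows, cols = len(a) + 1, len(b) + 1
        score = [[0] * cols for _ in range(rows)]
        best, best_pos = 0, (0, 0)
        # Fill the Smith-Waterman matrix; scores never drop below zero.
        for i in range(1, rows):
            for j in range(1, cols):
                score[i][j] = max(
                    0,
                    score[i - 1][j - 1] + self._sub(a[i - 1], b[j - 1]),
                    score[i - 1][j] + self.gap,
                    score[i][j - 1] + self.gap,
                )
                if score[i][j] > best:
                    best, best_pos = score[i][j], (i, j)
        # Trace back from the best cell until a zero score is reached.
        i, j = best_pos
        top, bottom = [], []
        while i > 0 and j > 0 and score[i][j] > 0:
            if score[i][j] == score[i - 1][j - 1] + self._sub(a[i - 1], b[j - 1]):
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
            elif score[i][j] == score[i - 1][j] + self.gap:
                top.append(a[i - 1])
                bottom.append("-")
                i -= 1
            else:
                top.append("-")
                bottom.append(b[j - 1])
                j -= 1
        return best, "".join(reversed(top)), "".join(reversed(bottom))


factory = LocalAlignmentFactory()
result = factory.align("XXHELLOYY", "ZZHELLOWW")
print(result)
```

Because the parameters are stored in the factory, the same object can align many sequence pairs under one scoring scheme, and a second factory with different values can be used side by side.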
3D Structure: The Bioperl structure object is intended to encapsulate coordinate data and annotation for 3D structures as well as core data management and possibly analysis features. Just as Seq.pm permits format-independence for sequence data, the structure object is expected to permit format-independent manipulation of 3D structural data.
Perl has matured from a simple scripting language into a powerful programming environment for both object-oriented and procedural styles. The standard modules being developed by Bioperl are intended to provide a coherent set of tools with well-documented interfaces for standard molecular biological data management tasks. Widespread use of common modules will minimize reinvention and foster better code reuse.
BioJava
BioJava is an open-source initiative with the aim of providing the biological community with objects for representing and manipulating biological information. BioJava’s framework can build everything from simple scripts to complete applications. It is designed to be used as a library, so to make it usable we must:
- Design by Interface but provide working implementations so that you can always extend or replace behaviour and implementations.
- Provide extensive API documentation as well as a clear overview of how it all fits together.
- Give simple examples that show how to use the APIs.
The BioJava Project is an open-source project dedicated to providing Java tools for processing biological data. This will include objects for manipulating sequences, file parsers, CORBA interoperability, access to ACeDB, dynamic programming, and simple statistical routines. The distribution contains the BioJava classes and resources, along with scripts for building the .class files and the documentation tree. Currently, there are objects for sequences and features (IO, processing, storing, manipulating and visualizing); for dynamic programming (single-sequence and pairwise HMMs, forward and backward algorithms, and model training); for external file formats and programs (GFF, Blast, meme); and for sequence databases.
BioJava aims to provide a comprehensive set of Java components for the rapid development of applications in Bioinformatics. It contains interfaces for representing Sequences, Features, and other important bioinformatics concepts. It can also read and write sequence data in a variety of common formats and communicate with Ensembl databases and with DAS and BioCorba servers.
To use BioJava, add the BioJava and XML jar files to your CLASSPATH environment variable.
UNIX (bourne shell)
export CLASSPATH=/home/thomas/biojava.jar:/home/thomas/xml.jar:.
UNIX (C shell)
setenv CLASSPATH /home/thomas/biojava.jar:/home/thomas/xml.jar:.
Windows from command line
set CLASSPATH=C:\biojava.jar;C:\xml.jar;.
Windows from autoexec.bat
set CLASSPATH=C:\biojava.jar;C:\xml.jar;.
BioJava programs can be compiled and run using the javac and java commands. BioJava uses its own build tool, implemented in pure Java. The first time you build BioJava, you may need to compile the build tool:
javac build/Builder.java
You can then compile the whole project using:
java build.Builder all
This creates a biojava.jar file, which can be used just like the binaries downloaded from the FTP site. You can also build the API documentation:
java build.Builder docs
The BioJava library is useful for automating daily, mundane bioinformatics tasks. As the library matures, the BioJava libraries will provide a foundation upon which both free software and commercial packages can be developed.
BioXML
XML is a rapidly developing set of technologies that can help in the transfer and visualization of data, particularly over the web. BioXML is an open-source, free software project dedicated to providing a set of standard XML formats for the exchange of biological data.
Building blocks
DTD-let
The most basic building block in BioXML is something called a DTD-let. A DTD-let is a small, useful, non-trivial DTD that solves one bio-data problem and can easily be combined with other DTDs to attack larger problems.
An example DTD-let is the BioXML SeqDtd, which represents a bio-sequence. Bio-sequences are at the root of much of bioinformatics, so this is an important concept that can be built upon. The SeqDtd is general enough to represent any bio-sequence and doesn't try to do anything else. This is the essence of a DTD-let: it does a small, important job well and can act as a building block.
IDs and IDREFs
Typically, most XML information is expressed by nesting tags inside one another. But the XML spec provides another way to link data points: elements can have ID attributes and IDREF attributes. An ID gives an element a unique name within the document, and an IDREF on another element points to that name. Using this method of linking data, as opposed to nested tags, can often eliminate redundancies and simplify the data model. IDs and IDREFs are an integral part of the data model as well as of BioXlinks.
<xml><toad id="ae3"/><prince toad="ae3"/></xml>
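How a program follows such a link can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration under stated assumptions: the pond wrapper element and the name attribute are made up, and because no DTD is loaded the parser does not know which attributes are declared as ID or IDREF, so the sketch treats an attribute called id as the ID by convention.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the toad/prince example.
# The <pond> wrapper and the name attribute are hypothetical additions.
DOC = '<pond><toad id="ae3" name="Bufo"/><prince toad="ae3"/></pond>'

root = ET.fromstring(DOC)

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}


def resolve(element, ref_attr):
    """Follow an IDREF-style attribute to the element it points at."""
    return by_id[element.get(ref_attr)]


prince = root.find("prince")
target = resolve(prince, "toad")
print(target.tag, target.get("name"))
```

The prince element carries only the reference; everything known about the toad is stored once, on the toad element itself.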
Namespaces
Namespaces are essential for XML in general, and for BioXML in particular. If you list the DTD-lets in biology that require a "database" tag, you'll come up with pretty much all of them, so the tag names clash as soon as DTD-lets are combined.
That is where namespaces come into play. You can prefix each term with a namespace ID followed by a colon, so the two "database" tags become bx-dbxref:database and bx-computation:database.
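A short Python sketch shows the two tags living side by side in one document. Only the prefixes come from the text; the namespace URIs, the record wrapper and the element contents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; only the two prefixes appear in the article.
NS = {
    "bx-dbxref": "http://example.org/bioxml/dbxref",
    "bx-computation": "http://example.org/bioxml/computation",
}

DOC = """
<record xmlns:bx-dbxref="http://example.org/bioxml/dbxref"
        xmlns:bx-computation="http://example.org/bioxml/computation">
  <bx-dbxref:database>GenBank</bx-dbxref:database>
  <bx-computation:database>nr</bx-computation:database>
</record>
"""

root = ET.fromstring(DOC)

# The same local name, "database", resolves to two distinct elements.
xref_db = root.find("bx-dbxref:database", NS).text
search_db = root.find("bx-computation:database", NS).text
print(xref_db, search_db)
```

A lookup for a bare database tag finds nothing here, which is the point: each DTD-let's tag is only reachable through its own namespace.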
BioDAS
The distributed annotation system (DAS) is a client-server system in which a single client integrates information from multiple servers. It allows a single machine to gather up genome annotation information from multiple distant web sites, collate the information, and display it to the user in a single view. Little coordination is needed among the various information providers.
Currently the DAS code base consists of:
1. The DAS specification
2. Complete DAS clients:
- Geodesic,
- DasClient,
- OmniDAS/OmniGene
3. Client Libraries
- BioJava client library
- Bio::DAS perl client library
4. Servers
- Dazzle Java server
- Lightweight DAS server (Perl)
- An ACeDB adaptor script (in main DAS distribution)
The current DAS specification aims to provide a distributed annotation system for genomic sequences. The aim of the project is to explore, based on the current specification and implementations, the suitability of DAS for distributed annotation of protein sequences.
A DAS server responds to client queries for information about sequence annotations. A DAS client may consist of a program that graphically displays annotations, or a program that simply captures the data for further processing, perhaps for loading into a local database. A DAS server returns data to the client in XML format. You can directly view XML data using Internet Explorer 5.0 or higher, or Netscape 5.5 or higher.
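The "capture the data for further processing" kind of client can be sketched as follows. The two responses are invented, and their layout is a simplified, DASGFF-like structure assumed for the example rather than the full format defined by the DAS specification; a real client would fetch the XML over HTTP from each server.

```python
import xml.etree.ElementTree as ET

# Made-up responses from two hypothetical annotation servers.
SERVER_A = """<DASGFF><GFF><SEGMENT id="chr1">
  <FEATURE id="geneA"><TYPE>gene</TYPE><START>100</START><END>900</END></FEATURE>
</SEGMENT></GFF></DASGFF>"""

SERVER_B = """<DASGFF><GFF><SEGMENT id="chr1">
  <FEATURE id="snp1"><TYPE>variation</TYPE><START>450</START><END>450</END></FEATURE>
  <FEATURE id="exon1"><TYPE>exon</TYPE><START>50</START><END>80</END></FEATURE>
</SEGMENT></GFF></DASGFF>"""


def parse_features(xml_text, source):
    """Turn one server's XML response into a list of plain dictionaries."""
    features = []
    for segment in ET.fromstring(xml_text).iter("SEGMENT"):
        for feat in segment.iter("FEATURE"):
            features.append({
                "source": source,
                "segment": segment.get("id"),
                "id": feat.get("id"),
                "type": feat.findtext("TYPE"),
                "start": int(feat.findtext("START")),
                "end": int(feat.findtext("END")),
            })
    return features


def integrate(responses):
    """Collate annotations from several servers into one ordered view."""
    merged = []
    for source, text in responses.items():
        merged.extend(parse_features(text, source))
    return sorted(merged, key=lambda f: (f["segment"], f["start"]))


merged = integrate({"server-a": SERVER_A, "server-b": SERVER_B})
for f in merged:
    print(f["segment"], f["start"], f["end"], f["type"], f["id"], f["source"])
```

The servers never talk to each other; the client alone decides how to combine what they return, which is why little coordination is needed among the providers.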
There are at least two DAS client programs: Geodesic, a standalone Java application, and DASview, a Perl application that runs as a server-side script. Both connect to one or more servers, construct an integrated image, and display a genomic map. Knowledge of molecular biology and of Java is required to work on BioDAS; familiarity with JDBC is a plus.
The DAS relies on a common "reference sequence" on which to base annotations. The reference sequence consists of a set of "entry points" into the sequence, and the length of each entry point. The identity of an entry point will vary from genome to genome. The entry points describe the top-level items on the reference sequence map. Each entry point can have substructure: a series of subsequences (components) with their start and end points. This structure is recursive. Each annotation is unambiguously located by giving its start and stop positions relative to a "reference sequence", which can be one of the entry points or any of the subsequences within an entry point.
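The recursive structure makes it simple to translate a position given relative to a subsequence into entry-point coordinates. The sketch below is a hypothetical model, not the DAS schema: the Segment class, the names chrI, contig7 and cloneB, and the use of 1-based inclusive coordinates are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """An entry point or one of its components (hypothetical model)."""
    name: str
    length: int
    # Each child is (start_on_parent, segment); coordinates are 1-based, inclusive.
    children: list = field(default_factory=list)

    def add(self, start, child):
        """Place a component on this segment and return it."""
        if start < 1 or start + child.length - 1 > self.length:
            raise ValueError("component does not fit on " + self.name)
        self.children.append((start, child))
        return child


def to_entry_point(root, path, start, stop):
    """Translate positions on the segment reached via `path` into root coordinates."""
    offset, node = 0, root
    for name in path:
        for child_start, child in node.children:
            if child.name == name:
                offset += child_start - 1
                node = child
                break
        else:
            raise KeyError(name)
    if not 1 <= start <= stop <= node.length:
        raise ValueError("position lies outside " + node.name)
    return start + offset, stop + offset


chrom = Segment("chrI", 10000)
contig = chrom.add(2001, Segment("contig7", 3000))
clone = contig.add(501, Segment("cloneB", 1000))

# An annotation at 10..20 on cloneB, expressed on the entry point chrI.
on_chrom = to_entry_point(chrom, ["contig7", "cloneB"], 10, 20)
print(on_chrom)
```

Whichever level an annotation server chooses as its reference sequence, a client can walk the same tree to place the annotation on the top-level map.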
BioPython
Biopython is an international collaboration to collect and produce open source bioinformatics tools written in the Python programming language. The distribution includes interfaces to query local and distributed bioinformatics tools and databases, data structures for manipulating biological resources, a generalized parsing framework which unifies the many flat-file formats, libraries for working with different database systems, and servers to provide remote access to these services.
The main biopython releases have lots of functionality, including:
- The ability to parse bioinformatics files into Python utilizable data structures, including support for the following formats:
  - Blast output, both from standalone and WWW Blast
  - Clustalw
  - FASTA
  - GenBank
  - PubMed and Medline
  - Expasy files, like Enzyme, Prodoc and Prosite
  - SCOP, including 'dom' and 'lin' files
  - Rebase
  - UniGene
  - SwissProt
- Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.
- Code to deal with popular on-line bioinformatics destinations such as:
  - NCBI: Blast, Entrez and PubMed services
  - Expasy: Prodoc and Prosite entries
- Interfaces to common bioinformatics programs such as:
  - Standalone Blast from NCBI
  - Clustalw alignment program
- A standard sequence class that deals with sequences, ids on sequences, and sequence features.
- Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
- Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
- Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
- Code making it easy to split up parallelizable tasks into separate processes.
- GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
- Extensive documentation and help with using the modules, including this file, on-line wiki documentation, the web site, and the mailing list.
- Integration with other languages, including the Bioperl and Biojava projects, using the BioCorba interface standard (available with the biopython-corba module).
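To give a feel for a few of the items above (record-by-record iteration, dictionary-style access, transcription and translation), here is a sketch in plain Python. It deliberately does not use Biopython's own classes: the function names are made up, and the codon table is a tiny partial one chosen for the demo, so treat it as an illustration of the ideas rather than of the library's API.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def transcribe(dna):
    """DNA coding strand to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


# Deliberately partial codon table, just enough for the demo records.
CODONS = {"AUG": "M", "UUU": "F", "GGC": "G", "UAA": "*"}


def translate(rna):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODONS.get(rna[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


DATA = ">seq1 demo record\nATGTTT\nGGCTAA\n>seq2\nATGGGC\n"

# Iterate record by record...
records = list(read_fasta(io.StringIO(DATA)))
# ...or index the records and look them up like dictionary entries.
index = dict(read_fasta(io.StringIO(DATA)))

proteins = {name: translate(transcribe(seq)) for name, seq in records}
print(proteins)
```

A library such as Biopython packages this kind of code, with complete codon tables and many more formats, so that it does not have to be rewritten for every project.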
In general you will need to have at least some programming experience (in Python, of course!) or at least an interest in learning to program. Biopython's job is to make your job easier as a programmer by supplying reusable libraries, so that you can focus on answering your specific question of interest instead of on the internals of parsing a particular file format.
BioCORBA
The BioCORBA Project provides an object-oriented, language-neutral, platform-independent method for describing and solving bioinformatics problems. BioCORBA's mission is to leverage the code of the other Bio projects in a simple and easy-to-use fashion. For example, the language-neutral environment allows users to write programs using BioPython and access BioPerl modules through the CORBA server. Bioinformatics problems rarely have a one-language, one-platform solution. CORBA provides a middleware layer for language- and platform-neutral solutions, and BioCORBA provides a toolkit of generic biological objects suitable for creating specialized servers and clients in any language.
A consortium of developers from OMG, EBI, and the Open Bioinformatics Foundation helped create an interface design for standard objects in biological computing. There are bindings for BioCORBA to the BioPerl, BioJava, and BioPython (Bio{*}) open-source toolkits. Additional implementations have been written to the EBI analysis services and to proprietary systems by vendors such as NetGenetics. OMG is considering a proposal for BioCORBA to be the next Sequence Analysis standard.
