The Role of CORBA in Molecular Biology

CORBA (Common Object Request Broker Architecture) is an open standard that is widely regarded as a good solution for developing and deploying applications in distributed heterogeneous environments.

This technology can be applied in bioinformatics to improve the utilization and management of biological resources and the interoperation between them.

Introduction

Distributed object technology is considered to be a revolution for software design in heterogeneous computing environments. Using this approach, an application can be abstracted and divided into self-managing objects that can interoperate across heterogeneous networks and operating systems. The Object Management Group (OMG), a large consortium of more than 700 member organisations, has focused on specifying an architecture on which objects written by different vendors can interoperate across networks and operating systems. Its main product is the CORBA specification.

The specification describes a software bus, called the Object Request Broker (ORB), that provides an infrastructure on which a client can invoke the methods of server objects without knowing where the server objects are located, how they are implemented, whether they are currently active, or what communication mechanisms are used. Usually, a client only needs to know what operations an object provides, i.e. the object's interface. Object interfaces are defined using the OMG's Interface Definition Language (IDL). IDL is a declarative language that defines object interfaces separately from their implementation. This separation is what supports language independence, an essential feature of CORBA.
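The separation of interface from implementation can be illustrated in plain Python, without a real ORB. The sketch below is only an analogy under stated assumptions: `SequenceStore`, `InMemoryStore` and `ToyBroker` are hypothetical names, the abstract class stands in for what an IDL definition would express, and the broker is a simple in-process registry rather than a CORBA implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict


class SequenceStore(ABC):
    """Hypothetical interface; plays the role an IDL definition would play."""

    @abstractmethod
    def get_sequence(self, accession: str) -> str:
        """Return the sequence stored under the given accession."""


class InMemoryStore(SequenceStore):
    """One possible implementation; the client never refers to this class."""

    def __init__(self, records: Dict[str, str]) -> None:
        self._records = dict(records)

    def get_sequence(self, accession: str) -> str:
        return self._records[accession]


class ToyBroker:
    """Stand-in for an ORB: maps object names to registered implementations."""

    def __init__(self) -> None:
        self._objects: Dict[str, SequenceStore] = {}

    def register(self, name: str, obj: SequenceStore) -> None:
        self._objects[name] = obj

    def resolve(self, name: str) -> SequenceStore:
        return self._objects[name]


broker = ToyBroker()
broker.register("est_store", InMemoryStore({"EST0001": "ACGTACGT"}))

# The client depends only on the SequenceStore interface and a name.
store = broker.resolve("est_store")
sequence = store.get_sequence("EST0001")
```

Replacing `InMemoryStore` with another class that implements `SequenceStore` would leave the client lines unchanged, which is the property the interface definition is meant to guarantee.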

The use of CORBA in a biological context was introduced by Hu et al. and Lijnzaad et al.

Expressed Sequence Tags (ESTs) are cheap, easy, and quick to obtain relative to full genomic sequencing and currently sample more eukaryotic genes than any other data source. ESTs are particularly useful for developing Sequence Tagged Sites (STSs) for mapping, polymorphism discovery, disease gene hunting, mass spectrometry proteomics and, most ironically, for finding genes and predicting gene structure after the great effort of genomic sequencing.

Sequence databases are currently the most familiar type of biological database and a typical source of EST sequences. Traditionally these databases are presented and distributed using "flat file" views of the sequence as originally submitted, but these static raw views are being superseded by novel interfaces better suited to the needs of academic and industrial researchers. The new browsing interfaces are needed because current research may need to investigate nucleotide or protein sequences grouped by their function, phylogeny, map position, cellular location, expression regulation, disease association, annotation similarity or metabolic pathway.

To support these new investigative cross-database clients, it is useful to have standardized API-level access to both data and services, a requirement for which CORBA is particularly appropriate. In summary, ESTs are inherently difficult data to handle, yet extremely useful, so they are a natural target for new efforts to improve data quality, cross-database browsing and novel visualization tools. Of all the biological databases, EST databases have been the most conspicuous in recent years, both because of their prodigious growth rates and because of their growing importance in connecting different areas of biological research.

ESTs have many problems that stem from their means of production. The cDNA library from which any ESTs are drawn samples the levels of expression in a particular tissue at a particular time: rare transcripts will be missed and highly expressed genes will be over-represented. This latter problem, redundancy, is both wasteful and difficult to handle because of the extreme volume of error-prone data. Redundancy can be reduced in the laboratory using normalization techniques.

To address the redundancy problem computationally, a variety of EST clustering programs have been implemented. Once clustered, gene fragments can be assembled using sequence assembly programs, or both steps can be combined in large-scale EST assembly and gene indexing protocols. An assembly stage adds two main benefits: first, it produces contigs and consensus sequences, which can completely hide EST redundancy; second, it should improve the length and quality of the gene reconstructions beyond what is available from any one EST. For some applications, such as the early stages of STS mapping, clustering is all that is required; for others, sophisticated cluster partitioning or complex assemblies are more suitable.
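As a rough illustration of the clustering step, the sketch below groups ESTs that share at least one exact word of length `k`, using a union-find structure for single-linkage merging. It is a toy under stated assumptions, not the method of any particular clustering program: the function names, the word length and the example sequences are all hypothetical, and sequencing errors and reverse-complement matches are ignored.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group ESTs that share at least one exact k-mer (single linkage)."""
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    words = {name: kmers(seq, k) for name, seq in ests.items()}
    for a, b in combinations(ests, 2):
        if words[a] & words[b]:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in ests:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up sequences: est1 and est2 overlap on "GGTTCAGC"; est3 is unrelated.
example_ests = {
    "est1": "ACGTACGGTTCAGC",
    "est2": "GGTTCAGCATTAGC",
    "est3": "TTTTTTTTCCCCCCCC",
}
clusters = cluster_ests(example_ests)
```

Here `clusters` places `est1` and `est2` together and leaves `est3` on its own. An assembly stage would then align the members of each cluster to build a consensus sequence, which this sketch does not attempt.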

Introduction to EST Clustering Algorithms: Computational Complexity

A typical EST sequencing project may produce between fifty thousand and a few million ESTs, which need to be compared against each other. Whichever algorithm is chosen must handle this formidable scaling problem before user considerations such as input requirements and presentation style can be addressed. Algorithmic tools with the potential to reduce the comparison complexity include suffix trees, hashes, finite state machines, indexing and statistical DNA-word frequency analyses.
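To make the scaling problem concrete: comparing every EST with every other requires n(n-1)/2 comparisons, which is about 1.25 billion for fifty thousand ESTs. The sketch below shows one of the listed ideas, a hash index of DNA words, in its simplest form: only ESTs that fall into the same word bucket are proposed as candidate pairs. The function names, word length and sequences are hypothetical choices for illustration, and no specific clustering program is being described.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n):
    """Number of comparisons in a naive all-against-all scheme."""
    return n * (n - 1) // 2


def candidate_pairs(ests, k=8):
    """Return pairs of EST names that share at least one exact k-mer."""
    index = defaultdict(set)  # k-mer -> names of ESTs containing it
    for name, seq in ests.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)

    pairs = set()
    for names in index.values():
        # Only ESTs in the same bucket need a detailed comparison.
        for a, b in combinations(sorted(names), 2):
            pairs.add((a, b))
    return pairs


# Made-up sequences: only est1 and est2 share an 8-letter word.
example_ests = {
    "est1": "ACGTACGGTTCAGC",
    "est2": "GGTTCAGCATTAGC",
    "est3": "TTTTTTTTCCCCCCCC",
}
pairs = candidate_pairs(example_ests)
naive_small = all_pairs_count(len(example_ests))
naive_large = all_pairs_count(50_000)
```

In this toy case the index proposes one candidate pair where the naive scheme would make three comparisons. Building the index takes time proportional to the total sequence length, but very common words can still produce large buckets, so the saving depends on the data.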

To be Concluded