Java and CORBA for Bioinformatics

1. Introduction

Scientists trying to find cures for genetic diseases such as cancer, heart disease and Alzheimer's disease need faster access to research results and other vital information. Java and CORBA are making this possible. Java facilitates object-oriented and component-based software development, whereas CORBA integrates the resulting objects and components seamlessly, even when they are geographically distributed.

Java has attracted a great deal of attention recently, and for good reason. It provides a new kind of cross-platform computing environment that can be placed on top of, and work with, existing systems and networks.

The result is a powerful computing platform and language whose uses go beyond desktop computing. It can be expected to appear in personal digital assistants, cellular phones, cars, and anything else that uses a microprocessor or microcontroller. Research scientists all over the world are leading the way with new and exciting Java projects that are making the computing world wake up and smell the coffee.

In this paper, we discuss some general features of Java and CORBA and briefly describe some Java-based molecular biology tools and packages. We also discuss the significance of Java for developing software for challenging bioinformatics applications. We then explain the interoperability of biological databases and how it can be achieved by combining two of today's most prominent technologies, Java and CORBA. Finally, we present some practical applications of Java to the information technology requirements of molecular biology and the benefits of Java for this data-rich field.

2. The best features of Java and CORBA

The Java language is a general-purpose programming language for platform-independent software development. As described in Java white papers, ``Java is simple, object-oriented, distributed, interpreted, robust, secure, architecture-neutral, portable, high-performance, multithreaded and dynamic.''

Java is also a valuable programming language for distributed network environments, and Internet programmers regard it as a very good tool. For software developers who do not depend on networks, Java helps to reduce bugs. It offers features such as automatic garbage collection, type-safe references, and multithreading to ease the task of developing robust, reliable, high-quality and complex software systems. These features make it easier to construct readable and manageable software.

Java offers features such as exception handling, object reflection, and inner classes that make programming simpler. Through reflection, the objects created in a Java program can report information about themselves. Control transfers in a Java program can be based on various types of information, including exceptions. Inner classes bring the structure of a Java program closer to the real-world application that the program models.
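The three language features named above can be illustrated with a minimal sketch. The classes DnaSequence, Cursor and LanguageFeaturesDemo are hypothetical names chosen for this illustration and do not belong to any library discussed in this paper.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class used only for this illustration.
class DnaSequence {
    private final String residues;

    DnaSequence(String residues) {
        // Exception handling: invalid input transfers control to the caller's catch block.
        if (!residues.matches("[ACGT]*")) {
            throw new IllegalArgumentException("Not a DNA sequence: " + residues);
        }
        this.residues = residues;
    }

    public int length() {
        return residues.length();
    }

    // Inner class: each Cursor is tied to one DnaSequence and reads its private field.
    class Cursor {
        private int position = 0;

        boolean hasNext() {
            return position < residues.length();
        }

        char next() {
            return residues.charAt(position++);
        }
    }

    Cursor cursor() {
        return new Cursor();
    }
}

class LanguageFeaturesDemo {
    // Reflection: ask an object's class for the names of its public methods.
    static List<String> publicMethodNames(Object target) {
        List<String> names = new ArrayList<>();
        for (Method method : target.getClass().getDeclaredMethods()) {
            if (Modifier.isPublic(method.getModifiers())) {
                names.add(method.getName());
            }
        }
        return names;
    }

    // Returns false instead of failing when the constructor rejects the input.
    static boolean isValidDna(String candidate) {
        try {
            new DnaSequence(candidate);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Walks the sequence through its inner-class cursor.
    static String readAll(DnaSequence sequence) {
        StringBuilder out = new StringBuilder();
        DnaSequence.Cursor cursor = sequence.cursor();
        while (cursor.hasNext()) {
            out.append(cursor.next());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        DnaSequence sequence = new DnaSequence("GATTACA");
        System.out.println(publicMethodNames(sequence));
        System.out.println(isValidDna("GATTAXA"));
        System.out.println(readAll(sequence));
    }
}
```

The cursor is modeled as an inner class because it has no meaning without the sequence it traverses, which is the kind of real-world containment the text refers to.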

The Common Object Request Broker Architecture (CORBA [13]) is a set of industry standards for distributed object-based computing, designed to facilitate reliable, platform-independent execution of object-oriented software in wide- and local-area network environments. The key element of CORBA technology is the Object Request Broker (ORB), which acts as a software bus managing access to and from objects in an application, linking them to other objects, monitoring their function, tracking their location and managing communications with other ORBs. An ORB is a form of middleware that runs on top of the application's operating system (Unix, Windows NT, Linux, VxWorks, etc.). CORBA objects are application software modules written in a programming language such as C++ or Java. They can represent application code, text, graphics, audio, control parameters, algorithm inputs and so on.

The ORB is the main mechanism for simplifying the development of CORBA-compliant applications. The simplification results from three properties: location independence, platform interoperability, and language interoperability. Location independence means that an ORB treats all objects it is aware of as local objects, even if they exist on remote systems. Platform interoperability means that objects created on one hardware/software computing platform (for example, a Pentium-based Windows NT system) can work with any other CORBA-equipped platform.

Language interoperability means that objects written in one language can interact with applications written in another, thanks to CORBA's Interface Definition Language (IDL). Objects themselves can be coded in any common language (C, C++, Smalltalk, Java, ...) and still work on non-native systems. CORBA also includes mechanisms for communication among objects across a network. The General Inter-ORB Protocol (GIOP) specifies message formats and data representations that ensure object interoperability among ORBs. The Internet Inter-ORB Protocol (IIOP) defines the specific details for using GIOP over TCP/IP.
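To make the idea of a GIOP message format concrete, the following sketch encodes and decodes the fixed 12-byte header that, to our understanding of the GIOP 1.0 specification, precedes every message: the four characters "GIOP", a major and minor version, a byte-order flag, a message type and the body length. The class and method names are our own, and the sketch is an illustration rather than a substitute for an ORB.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative only: a real ORB builds and parses these headers internally.
class GiopHeaderSketch {
    static final int HEADER_LENGTH = 12;
    static final byte REQUEST = 0; // message type code for a Request

    // Builds a big-endian GIOP 1.0 header announcing a body of the given length.
    static byte[] encode(byte messageType, int bodyLength) {
        ByteBuffer buffer = ByteBuffer.allocate(HEADER_LENGTH).order(ByteOrder.BIG_ENDIAN);
        buffer.put("GIOP".getBytes(StandardCharsets.US_ASCII)); // magic
        buffer.put((byte) 1);      // major version
        buffer.put((byte) 0);      // minor version
        buffer.put((byte) 0);      // byte-order flag: 0 means big-endian
        buffer.put(messageType);   // message type
        buffer.putInt(bodyLength); // size of the message body that follows
        return buffer.array();
    }

    // Reads the body length back, honoring the sender's byte-order flag.
    static int decodeBodyLength(byte[] header) {
        if (header.length < HEADER_LENGTH
                || header[0] != 'G' || header[1] != 'I' || header[2] != 'O' || header[3] != 'P') {
            throw new IllegalArgumentException("Not a GIOP header");
        }
        ByteOrder order = (header[6] & 1) == 0 ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN;
        return ByteBuffer.wrap(header).order(order).getInt(8);
    }

    public static void main(String[] args) {
        byte[] header = encode(REQUEST, 42);
        System.out.println(decodeBodyLength(header));
    }
}
```

Because the sender declares its own byte order in the header, the receiver can decode the rest of the message without the two sides agreeing on a hardware architecture in advance, which is one way the protocol supports platform interoperability.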

CORBA makes it possible to reap the software-reuse benefits of object-oriented computing and to create applications that run across heterogeneous platforms and languages.

3. An Overview of Bioinformatics and its Applications

In this section, we define bioinformatics and describe its origin and real-world applications. Bioinformatics is a new term referring to the discipline that employs computers to gather, store, retrieve, analyze and assist in understanding biological information. The need for this approach is driven by unprecedented growth in the quantity and diversity of biological information being produced in the molecular biology labs of universities, government institutions and pharmaceutical firms. Some of the most important applications are those directed to understanding:

Interest in bioinformatics has been fueled in recent years by international efforts underway to determine the sequence of all genes in a variety of organisms, including humans. Gene sequences are the codes which direct the production of proteins that in turn regulate all life processes. Thus, in principle, determination of those sequences can lead to a much fuller understanding of many mysterious biological processes. At this time, although we understand how the information contained in gene sequences is converted to specific proteins, our understanding of the role and function of most proteins is at best incomplete, and often non-existent. Thus, there is a great need to understand what protein each gene produces and to determine the role of each protein.

Other important bioinformatics applications include:

  1. Creation of software packages for general analysis, phylogeny inference and linkage analysis.
  2. Designing software tools for database searching and multiple alignment.
  3. Developing algorithms and implementations for modeling and 3D structure retrieval of proteins.
  4. Developing software tools for biological data visualization and browsing.

4. Some Java-based Tools and Packages

Here we give an overview of some bioinformatics tools, software agents and applications designed using Java.

In the past, web-based interfaces to the databases [5] and tools used in bioinformatics were greatly restricted in the interactivity they provided. A user might be able to request information and pose queries through a ``fill in the blanks'' form-based interface. The information would be returned either as another web page or as an email message. Interacting with the resulting data consisted of navigating through it in a web browser or scrolling through tens or hundreds of pages of results in a text editor.

Using Java, however, it is possible to implement widgets and have them run as applets on any PC equipped with a web browser. Since Java support is an integral part of all major web browsers, it is possible to implement almost any imaginable user interface in this manner. For this reason, among others, the bioWidget project has chosen an object-oriented approach to visualization component design and has standardized on the Java language for its specification and implementation efforts.

The bioWidget architecture is an adaptation of the Model-View-Controller (MVC) paradigm. A ``model'' is an instance of domain data (e.g. DNA sequence data); a ``view'' is a visual representation of the model (an application may include more than one view of a model); and a ``controller'' interprets external input (e.g. from a user) and updates a model accordingly. When the model is updated, the view receives notification and updates its representation accordingly.

The model is defined as a set of Java interfaces. Here is the interface that defines the model of ungapped sequence used by the sequence widget:

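public interface SequenceReadOnly
{
    // Presumably the positions of the first and last residues held by the model.
    public int getFirstChar();
    public int getLastChar();

    // Residues between the two given positions.
    public String getSubSeq(int first, int last);

    // Regions currently selected; Interval is not defined in this excerpt.
    public Interval[] getSelectedIntervals();
}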
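The notification cycle of the MVC paradigm described above can be sketched as follows. This is a simplified illustration under our own assumptions, not the bioWidget code: SequenceModel, SequenceListener, TextView and SequenceController are hypothetical names, and positions are taken to be zero-based and inclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Callback through which a view learns that the model has changed.
interface SequenceListener {
    void sequenceChanged(SequenceModel source);
}

// Model: holds the domain data and notifies registered listeners on every update.
class SequenceModel {
    private final StringBuilder residues = new StringBuilder();
    private final List<SequenceListener> listeners = new ArrayList<>();

    void addListener(SequenceListener listener) {
        listeners.add(listener);
    }

    int length() {
        return residues.length();
    }

    // Assumption: first and last are zero-based and inclusive.
    String getSubSeq(int first, int last) {
        return residues.substring(first, last + 1);
    }

    void append(String more) {
        residues.append(more);
        for (SequenceListener listener : listeners) {
            listener.sequenceChanged(this);
        }
    }
}

// View: keeps a textual rendering of the model up to date.
class TextView implements SequenceListener {
    String rendered = "";

    @Override
    public void sequenceChanged(SequenceModel source) {
        rendered = "5'-" + source.getSubSeq(0, source.length() - 1) + "-3'";
    }
}

// Controller: interprets external input and updates the model.
class SequenceController {
    private final SequenceModel model;

    SequenceController(SequenceModel model) {
        this.model = model;
    }

    void userTyped(String input) {
        model.append(input.toUpperCase(Locale.ROOT));
    }
}

class MvcDemo {
    public static void main(String[] args) {
        SequenceModel model = new SequenceModel();
        TextView view = new TextView();
        model.addListener(view);
        new SequenceController(model).userTyped("acgt");
        System.out.println(view.rendered);
    }
}
```

Because the view depends only on the listener callback and the read-only accessors, several views of the same model can be registered without the model or the controller knowing about them.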


The authors have constructed bioWidgets and applications for displaying the following kinds of data:

  • JaMBW (http://www.EMBL-Heidelberg.DE/JaMBW/), the Java based Molecular Biologist's Workbench, can perform the most common bioinformatics operations that a molecular biologist currently needs. Its salient features include point-and-click, drag-and-drop and plug-and-play operation. Most of these requirements are met by using the Java programming language, conforming to the JDK 1.0.2 specification.

  • Jmol (http://www.openscience.org/jmol/) is an open source Java/Swing based molecular dynamics viewer and editor. It is a collaboratively developed visualization and measurement tool for chemical scientists.

  • JaDis [2] is a Java application for computing evolutionary distances between nucleic acid sequences and G+C base frequencies. It allows specific comparison of coding sequences, of non-coding sequences or of a non-coding sequence with coding sequences.

  • CINEMA [6] is a new editor for manipulating and generating multiple sequence alignments. The program provides both an interface to existing databases of alignments on the Internet and a tool for constructing and modifying alignments locally. It has been coded in Java.

  • DINAMO (http://tito.ucsc.edu/dinamo/) is an interactive protein model building tool. It allows the user to build simple, three-dimensional models of proteins based on their sequence similarity or predicted fold similarity to proteins whose structures have been solved experimentally. The central parts of DINAMO are an interactive sequence alignment editor and a three-dimensional molecular graphics display of the protein being modeled. DINAMO is a web-based tool that may also be run locally. The alignment editor and assessment portions are written in Java.

  • Zomit [9] is a tool for biological data visualization and browsing that allows intuitive navigation in very large databases. It provides an application programming interface for developing servers for such navigation and visualization, and a generic architecture-independent client (a Java applet) that queries such servers.
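As an illustration of the kind of computation that a tool such as JaDis performs, the sketch below computes a G+C base frequency and one common evolutionary distance between two aligned nucleic acid sequences. The Jukes-Cantor correction is our choice for the example; we do not claim that it is the method JaDis uses, and the class name SequenceStats is hypothetical.

```java
import java.util.Locale;

// Illustrative statistics on aligned, ungapped nucleotide sequences.
class SequenceStats {
    // Fraction of bases that are G or C.
    static double gcFraction(String sequence) {
        if (sequence.isEmpty()) {
            throw new IllegalArgumentException("Empty sequence");
        }
        int gc = 0;
        for (char base : sequence.toUpperCase(Locale.ROOT).toCharArray()) {
            if (base == 'G' || base == 'C') {
                gc++;
            }
        }
        return (double) gc / sequence.length();
    }

    // Proportion of aligned sites at which the two sequences differ.
    static double pDistance(String a, String b) {
        if (a.isEmpty() || a.length() != b.length()) {
            throw new IllegalArgumentException("Sequences must be non-empty and of equal length");
        }
        int differences = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                differences++;
            }
        }
        return (double) differences / a.length();
    }

    // Jukes-Cantor distance: d = -(3/4) ln(1 - (4/3) p).
    static double jukesCantor(String a, String b) {
        double p = pDistance(a, b);
        if (p == 0.0) {
            return 0.0;
        }
        if (p >= 0.75) {
            return Double.POSITIVE_INFINITY; // the correction is undefined at saturation
        }
        return -0.75 * Math.log(1.0 - (4.0 / 3.0) * p);
    }

    public static void main(String[] args) {
        System.out.println(gcFraction("ATGC"));
        System.out.println(jukesCantor("ACGT", "ACGA"));
    }
}
```

The correction matters because the raw proportion of differing sites underestimates the number of substitutions once the same site has changed more than once.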

Apart from the tools described above, several bioinformatics projects have been carried out using Java.

    Yoshio Tateno and his team [12] have developed a genome information broker using a Java applet in order to facilitate graphic, dynamic and interactive processing of genome sequence data and to display them on the computer screen. The Java applet is employed for processing genome information that is retrieved from the DNA Data Bank of Japan(DDBJ) database by a CGI program.

    Andrey Rzhetsky et al. [11] have described two Java applets that are useful for insightful presentation of intermediate experimental data in gene discovery projects involving large-scale sequencing. One of these applets provides a physical map of a genomic region and gives easy access to the second applet, which furnishes a detailed map of the sequence contigs associated with clones on the physical map.

    Martin Senger (http://www.hgmp.mrc.ac.uk/CCP11/) has developed AppLab, a CORBA-Java based application wrapper. Bioinformaticians are dependent upon many vast databases and hundreds of applications to analyze their data. These analysis tools use sophisticated algorithms and data access methods but often suffer from a lack of factors necessary to provide a scalable, flexible and user-friendly distributed application environment. These factors are

    AppLab addresses most of the issues above. It is an automatically generated wrapper for command-line driven applications that provides a uniform graphical interface for almost any analysis tool and aims to use domain standards as soon as they are adopted.

    Andrei Grigoriev et al. [3] have designed a Java applet to display genomic maps and serve as an interactive WWW graphical user interface to databases containing positional data on mapped objects or analytical tools producing positional output. Its functions are as follows:

    There is a Bioinformatics Java/CORBA working group at (http://info.gdb.org/ letovsky/jcwg.html).

    A group of people is trying to implement a project called BioORB. The goals of this project are to understand some of the issues surrounding the task of defining standard CORBA objects for a particular domain and to learn and use Java as the programming language of choice for distributed objects. The resources required for a Java-CORBA (BioORB) implementation have been divided into three categories: computing, programming language and ORB. The computing resources are Sun workstations and X-terminals, with a Sparc20 server running Solaris 2.5. Java is the obvious choice of programming language, and the ORB is chosen from among Java IDL, OrbixWeb 2.0.1 and Visibroker for Java 3.0.

    DNA sequence chromatograms (traces) are the primary data source for all large-scale genomic and Expressed Sequence Tag (EST) sequencing projects. To provide efficient global access to DNA traces, Jeremy D. Parsons et al. (http://www.ebi.ac.uk/ jparsons) designed a client/server system based on flexible Java components integrated into other applications, including an applet for use in a WWW browser and a stand-alone trace viewer. Client/server interaction is facilitated by CORBA middleware. The Java trace viewing applet has been developed into a set of trace viewing tools, with each component filling a different software niche.

    Biojava-l (http://biojava.org/mailman/listinfo/biojava-l) is a general-purpose, unmoderated discussion list set up to discuss life-science-related Java programming efforts at http://www.cis.udel.edu/ vagrawal/bioinformatics/code/java/best4javabioresources.

    5. The Role of Java and CORBA for Interpretation of Biological data

    The vehicle of choice for distributing bio-information is currently the Internet, specifically via the World Wide Web. Access to data via Web browsers is now almost universal. Adding to the value of the `Information Super-highway' in shipping data, we are now seeing a shift of emphasis away from the dissemination of information per se to the use of that information in transmitting concepts. This is true, for example, in the pharmaceutical industry, where extraction of information about potential structural or functional sites from biological sequence data is now a vital component of drug-discovery protocols. Genetic and protein sequence data are set to become the source of most new drug targets in the next century. The need for tools with which to interpret those data in informative and readily accessible ways is, therefore, urgent.

    A particular power of the WWW is its ability to transmit, and for browsers to display, images. Several resources are now available that exploit images to visualize different types of biological information. However, these images are static. While linking information within different resources has revolutionized the way we access data, visualization and interactive manipulation of data are now seen as key goals in allowing users to get the most from their bio-information. For the sequence analyst, moreover, a vital tool is an alignment editor, but current alignment programs present problems because there is no standard format for output, storage and distribution of multiple sequence alignments.

    Information technology managers at the European Bioinformatics Institute (EBI) in Cambridge, England, are giving scientists worldwide direct online access to the largest database of DNA information in existence.

    Instead of the current practice of submitting queries and having workers at the institute sift through the information in the database, more than 10,000 scientists in academic, pharmaceutical and biotechnology laboratories will be able to fire up their browsers and do more detailed searches themselves. They will have access to the information because developers are using the Java programming language and CORBA, the cross-platform plumbing that connects databases, clients and servers, to make the legacy information easily available to users.

    The arrival of Java technology began to address a number of the aforementioned scientific needs. Java-capable browsers can run applets on a variety of platforms. To an extent, this obviates the need to distribute code, as software is loaded on the fly from the server and cached for that session by the browser.

    6. Interoperability of Biological Databases

    In this section, we discuss the interoperability of biological databases, why it is significant, and the role of Java and CORBA in achieving true interoperability.

    6.1 What is Interoperability of Databases?

    Biological sequence data are filling data sources at an exponential rate, and new projects are initiated every year. As a result, the number of databases used to store this type of data is growing worldwide. Currently there are approximately 100 molecular biology databases located all over the world. These databases are maintained on heterogeneous computer systems using heterogeneous database management systems, including relational DBMSs, object-oriented DBMSs, flat files and home-made systems, which often do not follow international standards. The data are of different kinds (mapping, sequence, function, proteins, metabolic pathways, ...). Moreover, these data exhibit a great number of internal relationships (e.g. orthology, gene mapping, gene regulation, ...).

    Although individual databases are valuable in their own right, interconnection of databases is providing a federated information infrastructure for molecular biology. This infrastructure is giving biologists an integrated view of diverse, synergistic sources of information and enabling them to answer questions that were previously laborious or impossible to tackle.

    The value of an integrated collection of distributed molecular biology databases is greater than the sum of the component databases for various reasons:

    1. Biological data are more meaningful in context, and no single database supplies a complete context for any datum.
    2. New biological theories and standards are derived by generalizing across a multitude of examples from different databases. Biological discovery can be achieved by integrating and analyzing existing data, as well as by generating new data.
    3. Integration of related data enables data validation and consistency checking.

    The goal of database interoperation is to allow a user to interact with every member of a collection of molecular biology databases, and with the collection as a whole, as seamlessly as it is currently possible to interact with any single database.

    There are a couple of approaches towards achieving true interoperability among various data sources:

    6.2 The Role of Java and CORBA in Interoperability

    In this section, we discuss the proposal for addressing the interoperability concerns:

    1. A common protocol that offers a transparent access to remote databases
    2. A common application programming interface
    3. A common schema describing all the data and their relationships.

    Emmanuel Barillot et al. [1] have given a viable proposal for a standard common CORBA interface for genome maps. This interface has been defined as a consensus among laboratories involved in maintaining databases for genome mapping. They have implemented and tested it on two distant and different genome map databases: the relational database RHdb and the object-oriented database HuGeMap. They have implemented two Java client programs, at EBI and at Infobiogen, that fetch the maps in the two databases through CORBA interfaces:

    Defining common access methods for biological data and interconnecting biological databases are essential to future genomic research. Any genome map database designed with the above-mentioned interface can easily interoperate with these databases.
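The benefit of a common application programming interface can be sketched in plain Java. The example below is not the CORBA interface of Barillot et al.: GenomeMapSource, InMemoryMapSource and MapClient are hypothetical names, the in-memory sources merely stand in for remote databases, and the marker names are made up. It shows only that a client written against one interface can query several differently implemented sources in the same way.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical common interface that every participating database would expose.
interface GenomeMapSource {
    String name();

    List<String> markersOn(String chromosome);
}

// Stand-in for a real database; a remote source would implement the same interface.
class InMemoryMapSource implements GenomeMapSource {
    private final String name;
    private final Map<String, List<String>> markersByChromosome = new HashMap<>();

    InMemoryMapSource(String name) {
        this.name = name;
    }

    InMemoryMapSource with(String chromosome, String... markers) {
        markersByChromosome.put(chromosome, Arrays.asList(markers));
        return this;
    }

    @Override
    public String name() {
        return name;
    }

    @Override
    public List<String> markersOn(String chromosome) {
        List<String> found = markersByChromosome.get(chromosome);
        return found == null ? Collections.<String>emptyList() : found;
    }
}

// Client code sees only the interface, never the storage technology behind it.
class MapClient {
    static SortedSet<String> mergedMarkers(List<? extends GenomeMapSource> sources, String chromosome) {
        SortedSet<String> merged = new TreeSet<>();
        for (GenomeMapSource source : sources) {
            merged.addAll(source.markersOn(chromosome));
        }
        return merged;
    }

    public static void main(String[] args) {
        GenomeMapSource first = new InMemoryMapSource("sourceA").with("21", "M1", "M2");
        GenomeMapSource second = new InMemoryMapSource("sourceB").with("21", "M2", "M3");
        System.out.println(mergedMarkers(Arrays.asList(first, second), "21"));
    }
}
```

In a CORBA setting the interface would be declared in IDL and the client would hold remote object references, but the client logic would keep this shape.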

    We will give the remaining details in the final version of this paper.

    6.3 The Importance of Componentry in Meeting the Bioinformatics Challenges

    Many investigators have been advocating the use of componentry as a technique for constructing genome informatics systems. Components are independently developed programs (such as databases, user interfaces, analysis programs, and programs associated with laboratory instruments) that are designed to be used as modular, ``plug-and-play'' building blocks. Component-based systems are informatics systems constructed in a modular fashion from components. Thus the time has come for a component-based revolution in bioinformatics.

    Software technology, including the World Wide Web, Java and its diverse facilities, and other object-based component architectures such as CORBA, will drive the effort. The growing abundance of data in need of analysis, the commonality of visualization needs across genomics applications and laboratory environments, and the limits of developer resources will combine to create an intense market for GUI components. Thus Java, with its power of object orientation, and CORBA, with its packaging and distribution capabilities, will serve as a solid platform for the emerging computer applications of molecular biology.

    7. Benefits of Java and CORBA for Bioinformatics Applications

    What benefits can bioinformatics expect from Java/CORBA? Most of the benefits are at the level of the computer scientist (developer), but they will also greatly affect biologists. As the use of the Internet continues to expand rapidly, applications are required to be instantly deployable and maintainable across a variety of different hardware and software platforms. Since Java has the features supporting these requirements, it has become the main programming language for various applications in different fields, especially in bioinformatics.

    A main feature of Java is its applet facility. The advantage here is that installation, upgrading and cross-platform compatibility are handled automatically. Another factor to consider when developing computational biology applications is the design of graphical user interfaces, for which Java offers easy-to-use toolkits such as the Abstract Window Toolkit (AWT) and the more recent Swing library.

    Design features such as simplicity, object-oriented programming, and security restrictions allow Java to expand the capabilities of HTML by offering a more versatile user interface that includes dynamic annotations and graphics. Java also allows the client to perform more sophisticated information processing and computation than is usually associated with Web applications.

    In addition, TCP/IP network programming and multithreaded programming, a form of multitasking, can be implemented easily in Java.
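A minimal sketch of both facilities together is given below: a server thread accepts one TCP connection on the loopback interface and returns the received line in upper case, while the calling thread acts as the client. The class and method names are our own, and the example assumes that the environment permits local sockets.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class EchoSketch {
    // Server side: runs on its own thread, answers one client, then returns.
    static void serveOnce(ServerSocket server) {
        try (Socket peer = server.accept();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(peer.getInputStream(), StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(peer.getOutputStream(), StandardCharsets.UTF_8), true)) {
            String line = in.readLine();
            if (line != null) {
                out.println(line.toUpperCase(Locale.ROOT));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Client side: sends one line to a freshly started server and returns the reply.
    static String roundTrip(String message) {
        // Port 0 asks the operating system for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            Thread worker = new Thread(() -> serveOnce(server));
            worker.start();
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
                 PrintWriter out = new PrintWriter(
                         new OutputStreamWriter(client.getOutputStream(), StandardCharsets.UTF_8), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8))) {
                out.println(message);
                String reply = in.readLine();
                worker.join(); // wait for the server thread to finish
                return reply;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("acgt"));
    }
}
```

The server and the client share one process here only to keep the sketch self-contained; the same socket calls apply when the two ends run on different machines.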

    Java has a relatively simple syntax and represents a number of improvements over similar languages. Java also provides a standard library of components (e.g. buttons, labels, menus) for user interfaces.

    Most recent CORBA products provide IDL to Java mappings and the OMG has been carrying out standardization of the IDL to Java mapping in order to cope with the growing need for distributed, powerful, structured, portable, fault tolerant and reusable components in heterogeneous environments.

    Java/CORBA brings the developer a high level of portability and reusability. Portability makes a program immediately available on all computer platforms. The reusability of the distributed objects that constitute a program also makes life much easier for developers, who need not reinvent the wheel for each program. Another benefit is that users who do not need a full package will be able to buy only the parts they need. Finally, the use of the Web as client/server will make resources instantaneously accessible over the Internet.

    Java/CORBA means more creativity, easier development at lower cost, instant Internet accessibility and a better fit to the needs of the user. The Internet is the extreme example of both distribution and heterogeneity, and is described as being host to the ``Object Web'', where Java and CORBA complement each other's abilities to create globally accessible interactive objects.

    8. Conclusion

    In the near future, several whole genomes, including the human genome, will be sequenced; more potent and easier-to-use genetic markers will be reduced to practice; systematic techniques for determining protein-protein interactions will be perfected; and systematic gene expression mapping will become possible. In addition, the invention of automated sequencers has triggered an avalanche of genome sequencing projects that generate large amounts of data daily. Handling, reviewing and making sense of those data has turned out to be a separate problem, one that is often compounded by the diversity of computers and operating systems in the different laboratories participating in a common research project. Thus, biologists and computer scientists are bound to face an explosion of new robust applications arising from this accumulation of biological data.

    We have discussed some of the bioinformatics applications accomplished through Java and explained that Java combined with CORBA can serve as a solid software development platform for enhancing the utilization, management and interoperation of biological resources in the ever-growing field of bioinformatics.

