Petaflops Computing for Molecular Biology Applications

Abstract

As we approach the year 2000, computing technology continues to forge great changes in society. However, all the changes from the past 50 years of computing are only a hint of even greater transformations to come. High-end machines today commonly reach billions of operations per second, and machines capable of trillions of operations per second are on the anvil. Beyond this short-term expectation, a group of leading figures in high performance computing has begun to discuss the scientific possibilities of machines a thousand times faster still, capable of one quadrillion operations per second, and their impending applications in different fields. In this article, we discuss some important applications in Molecular Biology that can greatly benefit from very high performance computing.

1. Introduction to Petaflops Computing

The paradigm of massively parallel processing with message-passing, concurrent processes has culminated in the goal of teraflops-scale computing, a paradigm shift from the earlier vector-pipelined supercomputers. In December 1996, a sustained rate of more than one teraflops (10^12 floating-point operations per second, or Tflops/s) was achieved by "ASCI Red", a system employing some 7,000 Intel Pentium Pro processors at Sandia National Laboratories in New Mexico. Following the custom of marking advances in computing by factors of 1,000, the next major milestone is a sustained rate of one petaflops (10^15 floating-point operations per second, or Pflops/s).

In addition to prodigiously high computational performance, such systems must of necessity feature very large main memories, ranging from terabytes (10^12 bytes) to petabytes (10^15 bytes) depending on the application, as well as commensurate I/O bandwidth and huge mass storage facilities.

The current consensus of scientists who have performed initial studies in this field is that affordable petaflops systems may be feasible by about the year 2010, assuming that certain key technologies continue to progress at current rates.

To get some idea of the scale of these systems, a Pflops/s computer could dispatch in seconds a computation that a current desktop workstation would require a full year to perform. One petaflops exceeds the combined computing power of all the machines in the United States today. A Pbyte of memory could contain the text of approximately one billion books, roughly 1,000 times the size of a typical university library.
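
These claims can be checked with back-of-the-envelope arithmetic. The short Python sketch below assumes a workstation sustaining about 100 Mflops and about 1 Mbyte of text per book; both figures are illustrative assumptions, not from the article.

    # Back-of-the-envelope scale check for the claims above.
    WORKSTATION_FLOPS = 1e8          # assumed sustained rate of a desktop workstation
    PETAFLOPS = 1e15                 # one Pflops/s machine
    SECONDS_PER_YEAR = 365 * 24 * 3600

    ops_in_a_workstation_year = WORKSTATION_FLOPS * SECONDS_PER_YEAR
    print(ops_in_a_workstation_year / PETAFLOPS)   # ~3 seconds on a Pflops machine

    PETABYTE = 1e15
    BYTES_PER_BOOK = 1e6             # assumed ~1 Mbyte of text per book
    LIBRARY_VOLUMES = 1e6            # assumed ~1 million volumes per library
    books_per_pbyte = PETABYTE / BYTES_PER_BOOK
    print(books_per_pbyte)                          # ~1e9 books
    print(books_per_pbyte / LIBRARY_VOLUMES)        # ~1,000 libraries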

There are compelling applications that need that level of power. But substantial research is needed in architectures, applications and algorithms, system software and tools, and component technology to attain petaflops computing power. Simply scaling up from teraflops machines is not realistic; a viable paradigm shift is needed. Opportunities abound for more radical paradigm shifts as well: holographic memory, analog computers for solving differential equations, and molecular computing through processes such as DNA recombination.

To give an idea of the sheer scale of petaflops computing, many applications have been identified.

2. Some Applications of Petaflops Computing

On the applications side, with petaflops computing power, biologists will be able to combine computational models of biological systems. Electrochemical, anatomic, and fluid dynamics models of the heart will be integrated into a single model. Repositories of biocomputing models in data archives will let biologists use existing results to design new models. Astronomers will compile billion-element digital sky surveys, neuroscientists will compile atlases of brain imaging data, and Earth scientists will collect and analyze incoming streams of remote-sensing data in real time.

In addition, technological advances such as digital libraries (complete with publication, search, and analysis capabilities), transparent access to distributed resources, and high-speed networks will open up high performance computing to new communities of users from the social sciences, museums, and education.

Before discussing some of the interesting applications of Molecular Biology, here is an overview of biological sequences.

3. The origin of genetic and protein sequence data

A DNA molecule contains the complete genetic information that defines the structure, function, development, and reproduction of an organism. Genes are regions of DNA, and proteins are the products of genes. DNA is usually found in the form of a double helix, with two chains wound in opposite directions around a central axis. Each chain consists of a sequence of nucleotide bases, each of which is one of four types: adenine (A), cytosine (C), guanine (G), and thymine (T). The two chains of a DNA molecule carry complementary bases, with A-T and C-G being the only pairings that occur, so the sequence of the second chain can be determined if the sequence of the first chain is known. In replication, the chains unwind and each chain is used as a template to form a new companion chain.
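
Because the A-T and C-G pairings are fixed, recovering the second chain from the first is a purely mechanical operation. A minimal Python sketch (the function name and example sequence are illustrative):

    # Complementary base pairing in DNA: A<->T, C<->G.
    COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def complement_strand(chain):
        """Return the companion chain of a DNA chain."""
        # The two chains run in opposite directions, so the companion
        # chain is read in reverse order, with each base complemented.
        return "".join(COMPLEMENT[base] for base in reversed(chain))

    print(complement_strand("ATTGACG"))   # -> CGTCAAT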

The production of a protein from a segment of DNA occurs in two major steps: transcription and translation. A protein is a polymer consisting of a sequence of amino acids; twenty different amino acids are commonly found in proteins. The DNA itself does not act as a template for protein synthesis. In transcription, a complementary RNA copy of one of the two DNA chains is formed from ribose nucleotides. RNA is a single-stranded molecule similar to DNA, except that its sugar backbone contains ribose instead of deoxyribose. During transcription the base thymine (T) is replaced by uracil (U), while the three other bases remain the same; as a result, an RNA sequence is composed of the bases A, C, G, and U. This RNA sequence is translated into a sequence of amino acids that combine to form a protein. During translation, three bases at a time, referred to as a codon, are read and translated into one amino acid. A hypothetical DNA molecule sequence is

ATTGACGTAGTCATGACGAATGGACCC
TAACTGCATCAGTACTGCTTACCTGGG
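
The two steps can be mimicked in a few lines of Python. This is only a sketch: it makes the simplifying assumption that the first chain shown above is the coding strand (so transcription amounts to replacing T with U), and the codon table lists only the nine codons occurring in this example, not the full 64-entry genetic code.

    # Partial codon table (standard genetic code, restricted to the
    # codons occurring in the example; the full table has 64 entries).
    CODON_TABLE = {
        "AUU": "Ile", "GAC": "Asp", "GUA": "Val", "GUC": "Val",
        "AUG": "Met", "ACG": "Thr", "AAU": "Asn", "GGA": "Gly",
        "CCC": "Pro",
    }

    def transcribe(dna):
        """Transcription: produce the RNA copy (T is read as U)."""
        return dna.replace("T", "U")

    def translate(rna):
        """Translation: read three bases (one codon) at a time."""
        return [CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3)]

    rna = transcribe("ATTGACGTAGTCATGACGAATGGACCC")
    print(translate(rna))
    # -> ['Ile', 'Asp', 'Val', 'Val', 'Met', 'Thr', 'Asn', 'Gly', 'Pro']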

Once synthesized, the protein chain folds according to the laws of physics into a specialized form, based on the particular properties and order of its amino acids (some of which are hydrophobic, some hydrophilic, some positively charged, and some negatively charged). Although this basic coding scheme is well understood, biologists cannot yet accurately predict the folded shape of a protein.

4. The role of Computing in Molecular Biology

We are in the midst of a long-range, worldwide race to map and sequence the genomes of humans and other species. More than two hundred million nucleotides and amino acids of mammals, bacteria, and other life forms have been classified and stored in publicly available databases such as GenBank. Before the end of the century, scientists will be sequencing more than a billion bases a year, enormously increasing the size of these databases, eventually toward petabytes.

As these databases grow in size, biomedical researchers need computational tools to retrieve biological information from them, analyze the sequence patterns they contain, predict the three-dimensional structure of the molecules the sequences represent, reconstruct evolutionary trees from the sequence data, and track the inheritance of chromosomes based on the likelihood of specific sequences occurring in different individuals. These tools will be used to learn basic facts about biology, such as which sequences of DNA code for proteins and which do not. They will also be used to understand genes and how they influence diseases. Most of these biomedical research activities benefit from very high performance computing.

DNA Sequencing and Sequence Analysis

Genome projects aim to delineate genetic and physical maps of the total DNA complement of a given organism, ultimately yielding the total nucleotide sequence of this DNA. A DNA sequence is a string over the alphabet {A, C, G, T}. For a human being, such a sequence has about 3 billion characters, distributed among 23 chromosomes, each containing about 50 to 250 million nucleotides and together encoding tens of thousands of genes.

In order to sequence a DNA molecule, a biochemist cuts this long string (copied many times) into small fragments. Each time, a randomly chosen fragment of at most about 500 characters can be sequenced, i.e., such a fragment can be completely identified. Then, from a huge number of these fragments, a biochemist must reconstruct the shortest superstring representing the whole molecule. Incidentally, finding the shortest superstring of a finite set of strings over an alphabet is NP-hard.
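
Since the exact problem is NP-hard, practice falls back on heuristics. A common one is the greedy merge: repeatedly join the two fragments with the largest overlap until one string remains. The sketch below is illustrative, not any particular assembler; the fragments are drawn from the hypothetical sequence above.

    def overlap(a, b):
        """Length of the longest suffix of a that is a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_superstring(fragments):
        """Greedy approximation to the shortest common superstring."""
        frags = list(fragments)
        while len(frags) > 1:
            # Find the pair with the maximum overlap and merge it.
            k, a, b = max(((overlap(a, b), a, b)
                           for a in frags for b in frags if a is not b),
                          key=lambda t: t[0])
            frags.remove(a)
            frags.remove(b)
            frags.append(a + b[k:])
        return frags[0]

    print(greedy_superstring(["ATTGAC", "GACGTA", "GTAGTC"]))
    # -> ATTGACGTAGTC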

With the extraordinary advances in molecular biology over the past 20 years or so, it is now possible to read the specific sequences of individual genes and to predict, by means of the genetic code, the sequences of the proteins they encode. A major challenge for molecular biology in the next decade will be to use this information to predict the actual biological function of these proteins.

Computer Scientists seeking practical applications for their fast string matching algorithms will find exciting opportunities in sequence analysis.

The essence of the problem is that a given set of DNA sequences (each element represented by the letters A, C, T, and G) requires efficient alignment algorithms that deal gracefully with insertions, deletions, substitutions, and even gaps.
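
For two sequences, the classical answer is dynamic programming, as in the Needleman-Wunsch algorithm sketched below. The scoring scheme (match +1, mismatch -1, gap -1) is an illustrative assumption, not from the text.

    def global_alignment_score(s, t, match=1, mismatch=-1, gap=-1):
        """Needleman-Wunsch: best score for aligning s against t,
        allowing substitutions, insertions, and deletions."""
        m, n = len(s), len(t)
        # dp[i][j] = best score for aligning s[:i] with t[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            dp[i][0] = i * gap
        for j in range(1, n + 1):
            dp[0][j] = j * gap
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                diag = dp[i-1][j-1] + (match if s[i-1] == t[j-1] else mismatch)
                dp[i][j] = max(diag,               # match or substitution
                               dp[i-1][j] + gap,   # deletion from s
                               dp[i][j-1] + gap)   # insertion into s
        return dp[m][n]

    print(global_alignment_score("ATTGACG", "ATGACG"))   # -> 5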

A typical problem from the area known as multiple-sequence alignment seeks to globally optimize the matches among 20 sequences, each some 200 nucleotides long. For example, a partial sequence of a human protein is compared and contrasted with similar proteins from humans, insects, plants, and yeast for clues to its function and structure. Incidentally, many such problems can be shown to be NP-complete; exact dynamic programming over k sequences of length n takes time roughly proportional to n^k, so even present-day supercomputers find it very difficult to deliver results in reasonable time.

Storing and retrieving biological information

Computing methods that allow the efficient and accurate processing of experimentally gathered data will play a crucial role in almost all future biological research. Biological sequence databases will serve as fundamental research tools for biologists over the next decade, so storing the data and its associated information effectively will constitute a major project for biologists. These databases continue to be used principally for comparing DNA sequences (so-called homology searching), and protein databases have supported many important discoveries. Apart from the sequence data itself, information such as gene names, protein information (for example, constituent amino acids), and pointers to other interesting sequences near particular chromosome locations is needed. Databases could also allow relevant articles and references to be easily associated with specific DNA and protein sequences.
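
Homology-search programs in the spirit of BLAST speed up such comparisons by indexing short subwords (k-mers) of the database, then using exact word hits as seeds for more careful alignment. A toy illustration of the indexing step follows; the parameter k=4 and the sequences are made up.

    from collections import defaultdict

    def build_kmer_index(sequences, k=4):
        """Map each k-letter word to the (sequence id, offset) pairs
        where it occurs; exact word hits then seed finer alignment."""
        index = defaultdict(list)
        for seq_id, seq in sequences.items():
            for i in range(len(seq) - k + 1):
                index[seq[i:i + k]].append((seq_id, i))
        return index

    db = {"seq1": "ATTGACGTAGTC", "seq2": "GGATTGACCA"}
    index = build_kmer_index(db)
    print(index["ATTG"])   # -> [('seq1', 0), ('seq2', 2)]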

Making all this information easily accessible to distributed users while effectively dealing with errors, conflicts, and updates presents a research problem of the utmost urgency.

There are three specific types of database applications: collaborative, repository, and laboratory. Collaborative databases must be capable of combining databases from several laboratories working together on a single problem, and must be designed so that they can serve as a public resource for an entire research community.

Repository databases will be created as public resources to contain data from many sources. One of the major repositories of sequence data is GenBank, which stores discrete collections of facts. The challenge is to design a large, distributed database in which the facts are interrelated, that is, to relate the information in a fashion that allows users to extract meaningful knowledge even as the information itself is updated and changed.

Laboratory databases that support a single laboratory are also necessary, since researchers must be able to change database structures easily to accommodate constantly evolving data, little of which will be standard.

Meanwhile, the amount of all types of information is growing exponentially.

Protein Structure Prediction

Although the information needed to determine how a protein folds resides completely within its amino acid sequence, the problem of predicting protein folding remains one of the most important unsolved problems. Understanding the 3D structure of proteins is vital to studying their function in living systems and to designing new proteins for biological and medical purposes.

The amino acid sequences of proteins are being discovered at an explosive rate. However, experimental procedures for determining their 3D structure, such as X-ray crystallography and NMR spectroscopy, are slow, costly, and complex, so a need exists for theoretical and computational techniques that can help. Determining the three-dimensional structure of molecules is extremely difficult. In the early days, mapping molecular shapes required tedious calculations done by hand; the biologists who studied molecular shapes (crystallographers) then turned to electronic computers, increasing the speed of X-ray diffraction calculations more than 300-fold.

Yet, even with the most modern technology, deducing molecular structures is undeniably labor intensive.

The protein folding problem remains unsolved because not all of the biochemical rules that govern the folding and stability of proteins are yet known. If these rules were known, a computer program could be written to simulate the folding of a protein. In conjunction with the scientific work being done to understand the forces involved in protein folding, an alternative computational approach is to write a program that searches through all possible protein conformations to find the ideal one. However, a search through the entire conformation space would require a prohibitive amount of computer time.
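
A back-of-the-envelope count shows why exhaustive search is hopeless. This is the well-known Levinthal-style argument; the figures of 3 conformations per residue and 100 residues are illustrative assumptions.

    # Levinthal-style estimate of the size of conformation space.
    CONFS_PER_RESIDUE = 3      # assumed conformational choices per amino acid
    RESIDUES = 100             # a modest-sized protein
    EVALS_PER_SECOND = 1e15    # optimistically, one conformation per flop
                               # on a petaflops machine

    total_conformations = CONFS_PER_RESIDUE ** RESIDUES   # 3**100 ~ 5e47
    seconds = total_conformations / EVALS_PER_SECOND
    print(seconds / (365 * 24 * 3600))   # ~1.6e25 years, even at a petaflops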

Thus, petaflops computing can be beneficial for simulating the protein folding process, allowing more candidate conformations to be considered and a more realistic energy function to be computed. This work involves strategies for searching through a large number of possible structures representing different energy states. The computationally intensive parts of a simulation are the long search through the great number of possible conformations and the computation of the free energy of the candidate structures.
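
One standard search strategy of this kind is Metropolis Monte Carlo: perturb the current conformation, always accept moves that lower the energy, and accept uphill moves with probability exp(-dE/T). The sketch below uses a deliberately toy energy function over a vector of backbone angles; the energy, step size, and temperature are all illustrative assumptions, not a real force field.

    import math
    import random

    def toy_energy(angles):
        """Stand-in for a real free-energy function of a conformation."""
        return sum(math.cos(a) + 0.1 * a * a for a in angles)

    def metropolis_search(n_angles=20, steps=100000, temperature=1.0):
        """Metropolis Monte Carlo over a vector of backbone angles."""
        angles = [random.uniform(-math.pi, math.pi) for _ in range(n_angles)]
        energy = toy_energy(angles)
        for _ in range(steps):
            i = random.randrange(n_angles)
            old = angles[i]
            angles[i] += random.gauss(0.0, 0.3)   # small random perturbation
            new_energy = toy_energy(angles)
            delta = new_energy - energy
            # Accept downhill moves always, uphill with prob exp(-dE/T).
            if delta <= 0 or random.random() < math.exp(-delta / temperature):
                energy = new_energy
            else:
                angles[i] = old                   # reject: restore the angle
        return angles, energy

    _, best_energy = metropolis_search()
    print(best_energy)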

Other computational problems in Molecular Biology

Reconstructing evolutionary trees from sequence data is another key computational challenge. One technique being actively pursued in this area is linkage analysis, which lets researchers determine the approximate location of genes based on the characteristics of genetic recombination. Variation in specific DNA sequences allows researchers to distinguish the DNA of different individuals: the same gene is not precisely the same DNA sequence in every individual.

Also, computers are essential in tracking the inheritance of chromosomes based on the likelihood of specific sequences occurring in different individuals. The algorithms used to compute these likelihoods employ heuristic and deterministic techniques, among others, and all are computationally intensive.
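
For instance, in two-point linkage analysis the quantity computed is a LOD score: the log10 ratio of the likelihood of the data at a candidate recombination fraction theta to its likelihood under free recombination (theta = 0.5). A minimal sketch for fully informative meioses follows; the counts are made-up illustrative data.

    import math

    def lod_score(recombinants, meioses, theta):
        """LOD = log10( L(theta) / L(0.5) ) for r recombinants in n meioses."""
        r, n = recombinants, meioses
        likelihood = (theta ** r) * ((1 - theta) ** (n - r))
        null = 0.5 ** n
        return math.log10(likelihood / null)

    # Illustrative data: 2 recombinants observed in 20 informative meioses.
    print(lod_score(2, 20, theta=0.1))   # ~3.2; a LOD above 3 suggests linkage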

A variety of problems, such as gene identification, sensitive database searches, sequence classification, and streamlining the annotation of genomic sequences, need novel computational methods and high-end machines.

The organism reconstruction problem, i.e., given the complete genome sequence of an organism, computationally predicting the development of the adult from a single cell and its continual function as a biological organism, likewise needs very high performance computers.

Conclusion

Very high performance computing provides the computational rates necessary for advanced computing problems in Molecular Biology. Molecular biologists can greatly reduce the time it takes to complete computationally intensive tasks and can take new approaches to processing their data. This advantage may allow the inclusion of more data in a calculation, the determination of a more accurate result, or the implementation of a new algorithm or a more realistic model.

As seen in our discussion, petaflops computing, the exemplar of very high performance computing, can provide the performance needed to make full use of these large and growing databases.
