Bioinformatics-Homology Searching Algorithms

Bioinformatics: Tools and Methods

Computers have become an essential component of modern biology. They help to manage the vast and increasing amount of biological data and continue to play an integral role in the discovery of new biological relationships. This in silico approach to biology has helped to reshape the modern biological sciences. Bioinformatics is a scientific discipline that encompasses all aspects of biological information acquisition, processing, storage, distribution, analysis and interpretation and combines the tools and techniques of biology, physics, chemistry, computer science, information technology and mathematics. Bioinformatics has helped to make possible the current revolution in modern molecular biology.

This article is all about algorithms and implementations, packages of software tools and modules, libraries of utilities etc.

I have discussed quite a few popular sequence analysis packages, such as GCG, Omiga, and BioTools' PepTool and GeneTool, in detail here. Apart from these, there are a number of platform-dependent sequence analysis packages, including MacVector for Macintosh systems and DNASTAR's Lasergene, which runs on both Windows 98/NT and Macintosh computers. The widely used Staden package, which is also available free for academics, has been explained. This package consists of a fully integrated set of sequence analysis and assembly software tools. There is also a collection of software resources used to address some of the basic tasks of bioinformatics. I have briefly explained some of them here. The homology searching algorithms BLAST and FASTA, multiple sequence alignment with ClustalW, and phylogenetic analysis receive brief attention. I have also discussed Genotator, a very powerful sequence annotation and presentation suite that integrates the output of multiple and varied analyses into a format suitable for publication.

The Wisconsin Package of Sequence Analysis Programs

The GCG programs, also called the "Wisconsin Package", comprise a powerful suite of tools for manipulating, analyzing, and comparing nucleotide and protein sequences. The initials GCG stand for Genetics Computer Group. This package includes more than 130 programs, each of which functions as a tool for performing a specific task, such as translating a nucleotide coding sequence or determining restriction enzyme cutting sites. Most GCG programs use one file as input and write the results to another file. The output files from many GCG programs are suitable as input to other GCG programs or to other software. In many cases, complex problems can be solved by using several GCG programs in succession.

The Wisconsin Package is commonly installed on a shared computer on a network, such as a UNIX server, so that individuals may access the programs from remote locations, such as their own personal computers or other kinds of terminals. There are several different methods for operating the GCG programs. Two methods are included with the package. The first is the command line interface, the traditional method in which users type the name of a GCG program to initiate an interactive program session. The second is a graphical user interface called SeqLab, in which users open a set of windows to the GCG programs and interact graphically to select sequences and program functions. SeqLab also includes a powerful color-coded graphical multiple sequence editor. With either interface, all programs operate similarly to each other.

Also, a recently introduced Web-based interface, called SeqWeb, is available from GCG and allows users to run GCG programs and manipulate sequence files through a Web browser such as Netscape Communicator or Internet Explorer. The complete GCG package also includes a full set of nucleic acid and protein sequence databases. The sequences in the databases are suitable for direct submission into the GCG programs for analysis, manipulation, or comparison. Also included are complete sets of user manuals, in both printed and online Web-based versions. A minimum of 15 gigabytes of hard disk space is needed to install and maintain the Wisconsin Package with its entire set of databases. The disk space requirement is increasing rapidly with the expansion of the databases. It is also suggested that a minimum of 128 megabytes of core memory be provided with 200 megabytes of virtual memory.

The GCG programs can be operated directly on the console of the Unix computer or from remote workstations. To operate the programs from the command line, a terminal or PC running telnet software with VT100 terminal emulation is suggested. To operate SeqLab, X-windows terminals or personal computers running X-Windows server software are needed. Most GCG program results are written to ordinary text (ASCII) files. The text files can be imported into any text or word processor for further manipulation. In addition, many GCG programs result in graphical output, such as restriction maps or RNA secondary-structure predictions.

Comparison Methods

Pairwise Comparison

These programs compare one sequence with a second sequence. The choices available include creating the best overall (i.e., global) alignment of the two sequences (Gap), finding the best segment of similarity between two sequences (BestFit), or creating an X/Y plot of sequence similarity (Compare/DotPlot).
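The difference between a global alignment and a best local segment can be illustrated with a small dynamic-programming sketch. This is not the GCG implementation of Gap or BestFit; the function name and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen only for illustration.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) score.

    Scoring values are illustrative assumptions, not GCG defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for leading gaps; a local one does not.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]
```

With two sequences that share only a short common core, the local score reflects that core alone, while the global score is pulled down by the unrelated flanking residues.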

Multiple Comparison

The PileUp program creates multiple sequence alignments from groups of related sequences using progressive, pairwise alignments. Other programs in this group allow manual editing of the aligned sequences (SeqLab), display various attributes of the aligned sequences, or create profiles from the aligned sequences that can be used for database searching.

Database Searching and Reference Searching

These programs (LookUp, StringSearch) can identify sequences by name, accession number, author and other kinds of key words.

Sequence Analysis

The programs in this group (BLAST, NetBLAST, FASTA, and so on) allow searches for similarity of a query sequence to those in a database. NetBLAST directly searches the databases at the National Center for Biotechnology Information. The others search the locally installed databases.

Editing and Publication

Programs in this group allow editing of single (SeqEd) or multiple (LineUp, SeqLab) sequence files, as well as preparation of sequence data for publishing or preparation of plasmid maps.

Evolution

The programs PAUPSearch, PAUPDisplay, Distances, GrowTree, and Diverge allow comparison of multiply aligned sequences for sequence similarity and phylogenetic relatedness.

Fragment Assembly

The GCG fragment assembly system is a set of programs that allow entry of sequence data from a sequencing project and assembly of those data into a contiguous sequence.

Gene Finding and Pattern Recognition

More than a dozen programs are included in this group (TestCode, Frames, Motifs, and so on). They assist in identifying protein-coding regions, protein-binding motifs, direct repeats, and other patterns.

Importing & Exporting

Fifteen programs in this group assist in entering sequence data and converting the data between the various sequence file formats, including formats for GCG, Staden, EMBL, GenBank, IntelliGenetics, PIR, and FASTA.

Mapping

The mapping programs (Map, MapPlot, MapSort, and so on) can create and display restriction maps, open reading frame maps, peptide digestion maps, T1 ribonuclease digestion maps, plasmid maps, and so on.

Primer Selection

The Prime program selects oligonucleotide primers for polymerase chain reaction (PCR) experiments and for DNA sequencing.

Protein Analysis

Programs included in the protein analysis group (PeptideMap, PepPlot, PeptideStructure and so on) assist in determining information about protein amino acid sequences, such as plotting the isoelectric point, locating functional motifs, and predicting various aspects of protein secondary structure, including antigenicity and secretory signals.

RNA Secondary Structure

Programs in this group (Mfold, StemLoop, and so on) can predict and display in multiple formats information about RNA secondary structure as well as locate inverted repeat sequences.

Translation

The translation programs (Translate, BackTranslate, PepData, and so on) translate nucleotide sequences into peptide sequences or vice versa.
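As a minimal sketch of what a translation tool does, the following translates a nucleotide sequence in a chosen reading frame using the standard genetic code. It is not the GCG Translate program; the function name, the `frame` parameter, and the use of "X" for codons containing ambiguous bases are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, frame=0):
    """Translate DNA (or RNA) starting at offset `frame` (0, 1 or 2).

    Stop codons are written as '*'; codons with ambiguous bases become 'X'.
    """
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

The reverse operation is not unique, because most amino acids are encoded by several codons. A back-translation tool therefore has to choose among codons, for example by using a codon frequency table.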

Utilities

Sequence Utilities

These include several useful programs (Reverse, Shuffle, Simplify and so on) for reversing a nucleotide sequence, randomizing sequences, or replacing low complexity regions with X characters among others.

Database Utilities

Within these programs, one can create a GCG personal database from any set of sequences in GCG format, combine any set of GCG sequences into a database that can be searched with BLAST, or extract sequence fragments randomly from sequences.

Printing/Plotting Utilities

These programs (Lprint, ListFile, Figure, and so on) are used for displaying, printing, or plotting GCG results files, either text or graphic files, to various kinds of display, printing, or plotting devices.

File and Miscellaneous Utilities

A number of other utility programs (ChopUp, Replace, Reformat etc.) assist in manipulating text files, printing GCG documentation and other tasks.

A comprehensive set of sequence databases is included with the Wisconsin Package. These include the GenBank and EMBL nucleotide sequence databases and the PIR and SwissProt protein sequence databases. The sequences in the databases are in GCG file format so that they can be used directly as input for the GCG programs. Also included are various supporting databases and data files, including restriction enzymes, scoring matrices, proteolytic enzymes and reagents, protein analysis data files, a transcription factor database, codon frequency tables, translation tables, and the PROSITE dictionary of protein sites and patterns.

As noted above, SeqWeb is a Web-based interface for operating the GCG programs. The SeqWeb product includes Web server software and runs only on Unix-based computers. BioPortal, developed at the National University of Singapore, is another Web-based interface for operating the GCG programs. SeqWeb provides access to some of the most frequently used Wisconsin Package programs in categories such as comparison, database searching, evolution, gene finding and pattern recognition, mapping, primer selection, protein analysis, RNA secondary structure, and translation. BioPortal also offers access to a core set of some of the most frequently used GCG programs.

In addition to the interface for GCG programs, BioPortal also comes with a suite of other useful sequence analysis programs. These include CLUSTALW, PHYLIP, Primer, and ReadSeq. There are other Web-based interfaces available to GCG programs such as WWW2GCG and W2H.

Omiga

Computer-based sequence analysis, notation, and manipulation are a necessity for all molecular biologists working with any but the simplest DNA sequences. As sequence data becomes increasingly available, tools that can be used to manipulate and annotate individual sequences and sequence elements will become an even more vital part of the molecular biologist's arsenal.

The Omiga DNA and Protein Sequence Analysis Software provides an effective and comprehensive tool for the analysis of both nucleic acid and protein sequences and runs on the ubiquitous standard PC. Omiga allows the import of sequences in several common formats. Upon importing sequences and assigning them to various projects, Omiga allows the user to produce, analyze, and edit sequence alignments. Sequences may also be queried for the presence of restriction sites, sequence motifs, and other sequence features, all of which can be added into the notations accompanying each sequence. Finally, Omiga allows rapid searches for putative coding regions as well as PCR and sequencing primers. Omiga uses a project concept to organize sequences, their individual notations, and any additional results and information generated for those sequences by the program. A project represents a simple way to organize an individual investigator's data.

Different projects can be localized onto different sectors of the hard drive. Multiple users can also use several separate projects to organize unrelated sequences. A primary task in Omiga is that of importing sequences. Obviously, importing a sequence is far easier and less prone to error than inputting a sequence by hand. Omiga supports several formats, including ASCII, EMBL, FASTA, GCG, GenBank, PC-Gene, and Swiss-Prot. Imported sequences are converted to the Omiga format. The Omiga format includes any additional features and information that were in the original sequence file, such as coding regions, transcription start sites, termination codons, polyadenylation signals, and so on. Upon importing new sequences, many such features may be identified based on primary sequence data alone.

However, it is both useful and time-saving to have these items already identified. Further, many of the features identified in GenBank sequences are not based on primary sequence data but on experimental data. Thus, a transcription start site or an intron-exon boundary may be determined experimentally by comparing genomic and cDNA sequences.

The presence of such elements cannot always be predicted with certainty simply by inspecting the sequence data alone, and thus these elements may not be identifiable using Omiga. For example, importing a large nucleotide sequence from GenBank, such as the human protamine gene cluster, would also import the additional sequence features identified in the original GenBank accession, including experimentally defined TATA and CAAT boxes, transcription start sites, exons, start and termination codons, polyadenylation signals, and exon-intron boundaries. Whereas Omiga could be used to identify ATG or TATA motifs, only experimental evidence can determine whether these are the actual start codons and transcription start sites utilized for a particular gene. Omiga also imports data on repetitive elements when they are provided with the original sequence being imported. Clearly, this is an important functionality that adds considerably to the usefulness of Omiga. As well as importing sequences, Omiga can also be used to export sequences, with their dependent feature information where appropriate, into several formats, including ASCII, EMBL, FASTA and GenBank.

This functionality is useful when submitting sequences to the various databases. Thus, sequence data generated in a laboratory can be mined for information using Omiga. Then, upon submission of a manuscript, the sequence data, and all of the characteristic features identified using Omiga can be exported to a GenBank or EMBL format for submission to one of these databases. Omiga utilizes the Clustal W algorithm for multiple sequence alignments. Alignments can be created in Omiga by two methods. Either two or more sequences selected from the project view can be aligned or sequences can be added to existing alignments within the alignment view.

Omiga has a number of preset parameters for performing alignments, which may be changed at the user's discretion. Groups of specific parameters may be saved as alignment protocols. Thus gap penalties, weighted mismatches, and divergent sequence delays can be selected as desired and saved as a protocol for use when aligning numerous groups of sequences, without customizing the alignment parameters each time. Omiga also offers a choice of scoring matrix, including BLOSUM, MD, PAM and Identity, for protein sequence alignments.

Editing Alignments

The alignment view allows alignments to be edited and additional notations to be added. Sequences within alignments can be grouped such that changes to one sequence are performed automatically on the other sequences within that group. This function is useful when introducing a gap at the same location in more than one sequence in an alignment that is suboptimal. It is also possible to pin a sequence, causing it to remain fixed at the top of the alignment view window regardless of scrolling.

This function is useful when visually comparing an individual sequence in an alignment with each other sequence in turn. Omiga can display either an individual sequence of particular interest, an identity/similarity line, or a consensus line. The identity/similarity line marks positions where all sequences are identical with a colon, whereas positions with very few, or very conservative deviations from the consensus are indicated with a period.

Thus, in a pairwise alignment of peptide sequences, a lysine and an arginine will be marked by a period, because they are both basic residues, even if the codons used in the nucleotide sequence were divergent, such as AAG and CGC. Finally, in the alignment view the way in which alignments are visualized can be modified, including the use of user-selected color schemes as well as boxing and/or shading of regions of sequence conservation. This function, along with a color printer, provides for the generation of visually informative documentation for both nucleotide and peptide sequence alignments.
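The identity/similarity line described above can be sketched as follows. The residue groups used to decide what counts as a conservative substitution are an assumption made for this example; Omiga's actual rules (including its treatment of positions with very few deviations) are not given here and may differ.

```python
# Assumed residue classes for "conservative" substitutions (illustrative only).
GROUPS = ["KRH", "DE", "STNQ", "AVLIM", "FYW"]
GROUP_OF = {res: idx for idx, members in enumerate(GROUPS) for res in members}

def similarity_line(aligned):
    """Return ':' for identical columns, '.' for conservative ones, ' ' otherwise.

    `aligned` is a list of equal-length, already aligned peptide sequences,
    with '-' marking gaps.
    """
    line = []
    for column in zip(*aligned):
        if "-" in column:
            line.append(" ")
        elif len(set(column)) == 1:
            line.append(":")
        elif column[0] in GROUP_OF and len({GROUP_OF.get(r) for r in column}) == 1:
            line.append(".")
        else:
            line.append(" ")
    return "".join(line)
```

In a pairwise alignment of MKLA and MRLG, the lysine/arginine column would be marked with a period under these assumed groups, while the identical columns receive a colon.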

Searching for Sequence Motifs

Identifying Nuclease and Protease Sites

Whereas sequence features and notations provided with sequences from databases are imported into Omiga along with the actual sequence itself, it is often desirable to search for and add additional features and notations. This is perhaps the primary functionality of analysis tools such as Omiga. Omiga allows the user to identify restriction sites and proteolytic sites in nucleotide and peptide sequences respectively. When searching for restriction endonuclease cleavage sites, as with searches for proteolytic sites, the user may search for each type of site in turn by inputting the actual sequence recognized by the endonuclease or protease in question.

Alternatively, Omiga contains the REBASE database of all common restriction endonuclease sites as well as a PABASE database of proteolytic sites, either of which can be used to identify all sites within a sequence. This is done by selecting either the restriction sites or proteolytic sites option from the pull-down menu generated by selecting the search feature from the tool bar. The output thus generated and displayed in the search results view can then be filtered to identify sites ideal for subcloning reactions.
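A site search followed by a filter step might look like the following sketch. The three-enzyme dictionary is a tiny stand-in for a database such as REBASE, and the function names and the "cuts exactly once" filter are assumptions for illustration, not Omiga's actual behaviour. Because the three recognition sequences used here are palindromic, scanning one strand is sufficient.

```python
# Tiny stand-in for a restriction enzyme database (recognition sequences only).
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_sites(dna, sites=SITES):
    """Map each enzyme to the 1-based start positions of its recognition site."""
    dna = dna.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start + 1)
            start = dna.find(site, start + 1)  # step by one to allow overlaps
        if positions:
            hits[enzyme] = positions
    return hits

def single_cutters(hits):
    """Filter: enzymes that cut the sequence exactly once."""
    return sorted(enzyme for enzyme, positions in hits.items() if len(positions) == 1)
```

Enzymes that cut a construct only once are often the ones of interest for subcloning, which is why the filter is written this way here.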

Searching for User-Defined Sequence Motifs

Omiga also performs searches for user-defined sequence motifs, and allows users to store motifs and search parameters in protocols that can be used for additional queries on other sequences. Searches for user-defined motifs are performed in a similar fashion to searches for restriction and proteolytic sites, except that the nucleic acid motifs or protein motifs selections are chosen from the search pull-down menu. User-defined motifs may be as simple as searching for all ATG trinucleotides independent of reading frame, or as complex as identifying segments with a percentage identity to a complex hormone-receptor-binding site that lie within a predetermined distance from a previously identified transcription start site.
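The simple end of that range, finding every occurrence of a motif regardless of reading frame, can be sketched with a regular expression. The function name is hypothetical, and the use of IUPAC ambiguity codes to express a degenerate motif is an assumption about how a user-defined motif might be written, not a description of Omiga's motif syntax.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_motif(dna, motif):
    """Return 1-based start positions of every (possibly overlapping) match."""
    pattern = "".join(IUPAC[code] for code in motif.upper())
    # A zero-width lookahead reports overlapping occurrences as well.
    return [m.start() + 1 for m in re.finditer(f"(?={pattern})", dna.upper())]
```

For example, `find_motif(seq, "ATG")` lists all ATG trinucleotides in any frame, and `find_motif(seq, "TATAWA")` matches both TATAAA and TATATA.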

Upon initiating the motif search, the user is prompted for information regarding the search parameters using the nucleic acid or protein motif search parameters box. Individual user-defined search protocols can be saved and stored in databases for later use, without the need to redefine the search parameters each time. In this way, segments with a high degree of identity to reported promoters, regulatory sequences, and other common cis elements can be identified. Another potential use for motif identification involves a search for restriction endonuclease sites within PCR products. By searching for and identifying the previously designed primer sequences, the user could delimit the search for restriction fragments to show only those sites that occur between the primers.

This would be useful for identifying restriction patterns indicative of the successful amplification of a particular amplicon.

Searching for Coding Regions

Another function provided by Omiga 1 is the identification of putative coding regions within nucleic acid sequences. Searching for open reading frames is carried out by selecting the open reading frames selection from the search pull-down menu. This function is particularly useful when large pieces of genomic sequence data are being analyzed. As with the sequence alignments, individual protocols can be generated for use when searching sequences for potential coding regions. Omiga 1 comes with an internal copy of the GENMOTIFS database.
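A basic open reading frame scan can be sketched as below. This is only the textbook ATG-to-stop search on the forward strand; the function name and the `min_codons` cutoff are assumptions, and it does not reproduce whatever additional criteria Omiga applies.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Scan the three forward reading frames for ATG...stop stretches.

    Returns (start, end, frame) tuples with 1-based inclusive coordinates;
    `end` includes the stop codon and `frame` is 1, 2 or 3.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None                   # look for the next ATG
    return orfs
```

To cover both strands, the same function can be run on the reverse complement of the sequence. For large genomic sequences a much higher `min_codons` value would normally be used to suppress short chance ORFs.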

In addition, sequences characteristic of genes can be added by the user. For example, it has been suggested that uneven positional base preference can be used as an accurate indicator of expressed segments within large genomic sequences. Whereas this may be difficult to fashion into a gene-associated motif, it is conceivable that other motifs may be identified that are characteristic of coding regions. Omiga appears to search for the occurrence of statistically significant clusters of gene-associated motifs within the sequences being queried.

Primer Identification

Both sequencing and PCR primers can be designed using user-defined criteria. Default protocols are provided; however, the user may also define search parameters of his or her own and has the option of saving these as additional protocols. When searching for PCR primers, the parameters over which the user has control include primer length, GC content, melting temperature, salt concentration, and primer concentration. The user may specify particular regions of the template sequence to be searched or omitted from the search.

The user may specify a 3' clamp, as well as a number of ambiguous nucleotides for degenerate primers. Another feature available for primer design is the ability to omit duplicate end points. However, it is not possible to design primers that are compatible with an already defined primer.
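The kind of filtering that these parameters imply can be sketched as follows. The melting temperature here uses the rough rule of 2 degrees per A/T and 4 degrees per G/C, which is only a crude estimate for short oligonucleotides and ignores the salt and primer concentrations mentioned above. The function names, default ranges, and the definition of the 3' clamp as a terminal G or C are all assumptions for illustration, not Omiga's method.

```python
def gc_content(primer):
    """Fraction of G and C bases in the primer."""
    return (primer.count("G") + primer.count("C")) / len(primer)

def rough_tm(primer):
    """Crude melting temperature estimate: 2 per A/T plus 4 per G/C."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc

def candidate_primers(template, length=18, gc_range=(0.4, 0.6),
                      tm_range=(50, 60), clamp=True):
    """Return (1-based position, primer) pairs that pass every filter."""
    template = template.upper()
    found = []
    for i in range(len(template) - length + 1):
        primer = template[i:i + length]
        if clamp and primer[-1] not in "GC":
            continue  # require a G or C at the 3' end
        if (gc_range[0] <= gc_content(primer) <= gc_range[1]
                and tm_range[0] <= rough_tm(primer) <= tm_range[1]):
            found.append((i + 1, primer))
    return found
```

A real design tool would use a thermodynamic model for the melting temperature and would also check for hairpins and primer-dimer formation.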

PepTool and GeneTool: Platform-Independent Tools for Biological Sequence Analysis

PepTool and GeneTool are two new bioinformatics software packages. PepTool is designed for protein sequence analysis and GeneTool is for DNA sequence analysis. Both are comprehensive, integrated programs that offer the full range of analytical and graphical features typically found in many advanced commercial bioinformatics products. They also bring some much-needed advances into the bioinformatics arena in algorithm design, graphical-interface implementation, data compression, networked parallelism and Internet communication.

Notably, both are platform-independent software packages. The following sections highlight some of the useful features offered by PepTool and GeneTool.

PepTool-Specific Program Features

Depending on the platform being used, the program may be started from either the Finder or MultiFinder for MacOS, by clicking on the Windows Start Button (for Win98/NT), or by typing peptool for Unix. After starting, an application "Launcher" appears at the top of the screen along with a Sequence Editor window at the center of the screen.

The PepTool Launcher allows the user to launch additional windows, to access Help files, to change program preferences or to contact BioTools electronically. In fact, PepTool has at least a dozen different views or windows accessible through either the PepTool Launcher or the Sequence Editor, including: a Sequence Editor; an Alignment Editor; a simple Text Editor; a Graph Viewer/Editor; a DotPlot Viewer/Editor; a Helical Wheel Viewer/Editor; a Structure Viewer/Editor; a Sequence Motif Viewer/Editor; a Sequence Statistics Viewer; a Help Viewer; a Preference Editor; and a Bug Reporter.

Text files, folders or image files created with these different windows can be saved and are automatically marked with an icon and a three-letter extension in a format specific to that window.

The Sequence Editor

The function of the Sequence Editor is to serve as a central workspace from which to enter, edit, retrieve, graph, or analyze protein sequences. As such, most of PepTool's functionality is accessible through this particular window.

The Sequence Editor contains a standard set of menu items including: File (for file handling and printing functions), Edit (for editing the viewed sequence), Transfer (for transferring the sequence or selected portions thereof to other applications or windows), Search (for finding or retrieving sequences in the database), Analyze (for performing statistical or structural predictions), Graph (for plotting physicochemical properties or sequence similarities), and Help (for accessing the context-dependent hyperlinked Help System). Sequences automatically loaded or manually entered into the Sequence Editor can be saved in Swiss-Prot, PIR, PepTool, or ASCII format.

The Editor also has the capacity to read "Foreign Format" files including GCG, IntelliGenetics, FASTA, Swiss-Prot, and NBRF-PIR as well as other common file types. The Foreign Format reader is both intelligent and general, meaning it does not require the user to know or to predesignate a given sequence file format.

Similarly, if the Foreign Format reader encounters a file format it has not seen before, it is usually capable of making a reasonable choice about how to parse the sequence from superfluous text. The PepTool Sequence Editor also supports auto spacing, auto wrapping and mouse-driven text selection for the usual cutting, pasting, copying and segment-deletion operations. It also has a text entry filter, a sequence ruler, a real-time sequence-length monitor and an editable cursor position that is updated instantly when the cursor position is changed by a mouse-click or text-entry operation. Information about the sequence and the sequence file is displayed at the top of the window, and additional data (such as the accession no., journal reference, date and so on) can be read or entered from a pop-up sequence reference card accessed by the Reference button on the lower right corner of the window.

A particularly useful feature of the Sequence Editor is its support of color-coded secondary structure display and editing. The buttons located on the right side of the window allow users to paint secondary structure directly onto a sequence or to precluster certain residues together when performing pairwise sequence alignments.

Database Searching

PepTool permits several kinds of sequence database searches from a variety of databases, all of which are launched from the Sequence Editor. Results from database searches can be viewed, saved or transferred using a Data Browser.

PepTool supports database queries and sequence retrieval on the basis of keywords (such as organism, protein name, accession no., partial name, or logical combinations of the above); sequence patterns; subsequence similarity (short stretches of similar sequences); and, most importantly, global sequence homology. PepTool provides the option of conducting two kinds of global homology searches: a fast one and an exhaustive one.

The fast search (FASTALIGN), which typically takes less than five minutes on a personal computer, is based on techniques similar to those described for FASTDB, FASTA, and BLAST, although it uses a specially developed scoring matrix and produces a global alignment instead of a partial local alignment, which is normally done by BLAST.

Side-by-side comparisons of FASTALIGN to FASTDB have indicated that FASTALIGN is slightly faster and more sensitive than FASTDB. The exhaustive search (NWALIGN), which typically takes several hours on a personal computer, is based on the Needleman-Wunsch algorithm.
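The reason a fast search can be so much quicker than an exhaustive one is that word-based heuristics avoid filling in a full dynamic-programming matrix for every database entry. The sketch below shows the general k-tuple idea used by FASTA-style methods: index the short words of the query, then count word hits per diagonal of the comparison. It is not the FASTALIGN algorithm, whose details are not given here; the function name and the word length `k=2` are assumptions.

```python
from collections import Counter, defaultdict

def best_diagonal(query, subject, k=2):
    """Return (diagonal, hit_count) for the diagonal with the most shared k-tuples.

    The diagonal is query_position - subject_position (0-based), so a
    negative value means the match starts later in the subject.
    """
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)      # word -> positions in the query

    hits = Counter()
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            hits[i - j] += 1                 # one word hit on this diagonal

    if not hits:
        return None, 0
    # Prefer the highest count; break ties toward the main diagonal.
    return max(hits.items(), key=lambda item: (item[1], -abs(item[0])))
```

Only database sequences with a well-populated diagonal would then be passed on to a more expensive alignment step, whereas an exhaustive Needleman-Wunsch search aligns the query against every entry in full.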

The Alignment Editor

The Alignment Editor is an intuitive tool designed to permit the viewing, editing, and automatic generation of both pairwise and multiple sequence alignments. Typically data is transferred into this window from a Data Browser or Sequence Editor. Sequences may be transferred either individually or in groups. The alignment is computed automatically when the user presses the Compute Alignment button on the lower right corner. For this operation, PepTool uses the XALIGN algorithm, which is capable of quickly aligning several hundred sequences using both sequence clustering and secondary structure information in the alignment process. A consensus sequence is automatically generated in the window above the alignment view using the threshold indicated in the Consensus Threshold box.
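How a consensus threshold shapes the consensus sequence can be shown with a short sketch. The interpretation of the threshold as the minimum fraction of sequences that must share the most common residue, the "X" placeholder for unresolved columns, and the function name are assumptions; PepTool's exact rule is not described here.

```python
from collections import Counter

def consensus(aligned, threshold=0.6, unknown="X"):
    """Consensus of equal-length aligned sequences.

    A column contributes its most common residue if that residue occurs in
    at least `threshold` of the sequences; otherwise `unknown` is written.
    """
    result = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(column) >= threshold:
            result.append(residue)
        else:
            result.append(unknown)
    return "".join(result)
```

Raising the threshold makes the consensus stricter: columns that are only moderately conserved drop out and are replaced by the placeholder.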

Manual alignment and manual editing of an automatically generated alignment can also be performed by selecting or painting over a sequence block.

The Structure Viewer

The Structure Viewer displays predicted secondary structure using specially shaded and color-coded helix and beta-sheet icons. Six different predictions are generated. A consensus result is produced based on the weighted average of all six predictions. The consensus result is typically 70% correct based on a simple three-state scoring system. The presence and location of membrane-spanning helices is also predicted. The order of the individual predictions can be rearranged by toggling a check-box at the bottom of the window and dragging the predicted structures to different locations.
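The idea of combining several predictions and scoring the result with a three-state measure can be sketched as follows. The per-residue weighted vote is one plausible reading of a "weighted average" of predictions and is an assumption, as are the function names and the H/E/C state letters; the individual prediction methods used by PepTool are not reproduced.

```python
from collections import defaultdict

def consensus_structure(predictions, weights=None):
    """Weighted per-residue vote over equal-length H/E/C prediction strings."""
    weights = weights or [1.0] * len(predictions)
    result = []
    for states in zip(*predictions):
        votes = defaultdict(float)
        for state, weight in zip(states, weights):
            votes[state] += weight
        # sorted() makes tie-breaking deterministic (alphabetical order).
        result.append(max(sorted(votes), key=votes.get))
    return "".join(result)

def q3(predicted, observed):
    """Three-state accuracy: fraction of residues with the correct state."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)
```

A Q3 value of 0.7 under this definition corresponds to the "70% correct" figure quoted for a three-state scoring system.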

The Graph Viewer

This Graph Viewer/Editor shares many features with other windows including the Helical Wheel Viewer and the DotPlot Viewer.

All three support fully scrollable displays, stepwise or regio-selective zooming and auto-scaling. Furthermore, all three permit the addition or deletion of text, lines, arrows, boxes, or circles to the displayed graph using a graphical palette located on the left side of the window. The Graph Viewer is specifically designed to display such functions as hydrophobicity, hydrophobic moments, and predicted flexibility. These protein property graphs may be further edited through the Graph menu, where the user may adjust the graph color, line width, graph title, and axis titles as well as turn the grid lines and residue labels on or off. Through the Annotation menu, the color, line width and line style for any graphical annotation except text can also be interactively selected and adjusted.
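A hydrophobicity graph of this kind is typically a sliding-window average of a per-residue scale. The sketch below uses the widely cited Kyte-Doolittle hydropathy values; whether PepTool uses this scale, and the window size of 7, are assumptions made for the example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=7):
    """Mean hydropathy of each window of `window` consecutive residues.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    values = [KYTE_DOOLITTLE[residue] for residue in protein.upper()]
    return [
        round(sum(values[i:i + window]) / window, 3)
        for i in range(len(values) - window + 1)
    ]
```

Plotting the returned list against the window position gives the familiar hydropathy curve, in which sustained positive stretches suggest buried or membrane-spanning regions.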

The DotPlot Viewer

Dot Matrix or Dot Plot sequence comparisons can be displayed, edited, annotated, and evaluated using PepTool's DotPlot Viewer. Pairwise comparisons between two different sequences as well as simple self-sequence comparisons are possible. The number and length of plotted diagonals can be adjusted using the editable "Stringency", "Window Size", and "Diagonal Filter" boxes.

The DotPlot viewer permits the usual zooming and annotation operations found in PepTool's other graphical viewers although, unlike the others, it does allow the sequence for selected diagonals to be viewed in the lower sequence window.
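The effect of the window size and stringency settings can be seen in a minimal dot matrix sketch. Here a dot is recorded wherever at least `stringency` of the `window` aligned residues are identical. This identity-based rule and the function name are assumptions; PepTool may score windows differently, and the diagonal filter is not modelled.

```python
def dot_plot(seq_a, seq_b, window=3, stringency=3):
    """Return (i, j) pairs where the windows starting at seq_a[i] and
    seq_b[j] share at least `stringency` identical positions (0-based)."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i, j))
    return dots
```

Comparing a sequence with itself always yields the main diagonal; additional off-diagonal runs indicate internal repeats. Lowering the stringency relative to the window size admits more dots and therefore more background noise.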

GeneTool-Specific Program Features

GeneTool shares many basic design and layout features with PepTool. However, it also has a number of important enhancements.

In particular, GeneTool supports resizable windows, resizable fonts, multifeature display, multifeature editing, print-preview annotation, and audio playback. It also handles database searching, preference selection, reference information, window zooming, and window management in a more intuitive fashion. GeneTool may be started just as with PepTool.

GeneTool has over 20 different views or windows accessible through either the GeneTool Launcher or its Sequence Editor. Several of them are described below.

The Sequence Editor

The GeneTool Sequence Editor serves as GeneTool's central operation window or central sequence worksheet. Consequently, most sequence-specific operations can be launched from this window. Like the PepTool Editor, the GeneTool Editor maintains a similar arrangement of menu options (File, Edit, Format, Analyze, View, Transfer) and permits a similarly wide choice of sequence formats to be read or saved, including EMBL, GenBank, and DNA Data Bank of Japan (DDBJ).

To limit the proliferation of file types, the designers of GeneTool have consolidated many of the multiple file types typically generated from a given sequence analysis into a single sequence file. The previously calculated graphs, plots, simulations, or other analysis functions associated with a given sequence file can be selected and viewed using the View menu.

The GeneTool Editor permits variable character grouping (1, 3, 5, 10, etc.), single- or double-strand display, DNA-to-RNA conversion, strand complementation, upper and lower case display, audio playback, auto spacing, auto wrapping, and mouse-driven text selection for cutting, pasting, copying, and segment deletion operations. It also supports the degenerate DNA alphabet as well as continuously updated sequence length, reading-frame, and cursor position boxes. The GeneTool Editor supports a sophisticated feature display and mark-up system using an editable, scrollable Feature Legend box. With this system, GenBank, EMBL or DDBJ sequences can be loaded and their feature tables automatically displayed using color-coded text selectors.

The Chromatogram Viewer

Raw sequence data generated from automated DNA sequencers can be read, edited, and saved in a variety of formats using GeneTool's Chromatogram Viewer. In particular, data can be read directly from ABI- or SCF- formatted chromatogram files as well as GeneTool's own chromatogram format. This Viewer also supports two types of Find functions, one designed to locate ambiguous base calls and the other to locate specific subsequences.

The Exon Finder

GeneTool uses a unique method for identifying exon/intron locations in eukaryotic DNA based on the reference point logistic (RPL) method. RPL is similar to a sophisticated neural network and can be trained to recognize very complex patterns and signals, such as those found at exon/intron boundaries. This method is reported to perform far better than most other gene-finding algorithms, including GRAIL. Furthermore, an RPL prediction takes only a few seconds on a standard desktop machine.

BioTools has enhanced this RPL technique by adding a database search method to fine-tune the initial exon/intron predictions.

The PCR Primer Designer

The Primer Designer is both an interactive and an automated tool for PCR primer selection and design. It may be launched either from within the Sequence Editor or from the GeneTool Launcher.

To simplify primer analysis, sequence data is always presented in a double-stranded format, with an option to display the amino acid translation between the two strands. PCR Primers may be created manually by clicking and dragging on the upper strand or the lower strand. During this operation, a primer sequence is automatically generated above or below the selected region while the primer length, product length, melting temperature, and primer score are calculated and updated in real time in the parameter boxes below. The primer score is an indication of the potential of the primer to form a good PCR oligo.

High scores indicate a good primer, whereas low scores with asterisks indicate the presence of potential false-priming sites, hairpin turns, or incompatible melting temperatures. Primers generated through this interactive mode can be subsequently edited to introduce point mutations in the same manner one would edit characters in a standard text editor. Changes to a primer sequence automatically cause a corresponding change in the translated amino acid sequence including a change in color and an update to the primer's calculated melting temperature and PCR score. This Designer also supports functions to find sequences or subsequences in both the upper and lower strands; to sort identified primers by their length, position, melting temperature, or score; to check primers for specific problems; to rename primers and to save selected primers to a text file.
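The live recalculation described above can be illustrated with a small Python sketch. The formula GeneTool uses for melting temperature is not stated here, so the sketch assumes the simple Wallace rule of thumb for short oligos, Tm = 2(A+T) + 4(G+C) degrees Celsius; the function names are hypothetical and no primer score is attempted.

```python
# Translation table for complementing the four unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the lower-strand sequence read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def wallace_tm(primer):
    """Approximate melting temperature (deg C) by the Wallace rule.
    Assumption: this is NOT necessarily the formula GeneTool uses."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def point_mutate(primer, position, base):
    """Replace one base, as one would edit a character in a text editor."""
    return primer[:position] + base + primer[position + 1:]


primer = "ATGCATGCATGCATGC"
tm_before = wallace_tm(primer)
mutated = point_mutate(primer, 0, "G")   # A -> G at the first position
tm_after = wallace_tm(mutated)           # recomputed after the edit
```

Each edit to the primer simply triggers a recomputation of the derived values, which is all that "updated in real time" requires for quantities this cheap to calculate.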

Restriction Map Viewer

Essentially every gene sequence analysis package has some kind of graphical restriction map viewer, and GeneTool is no exception. Restriction digests are normally performed from the Sequence Editor, although they may be initiated from the Layout Editor and the Gel Simulation Viewer as well. Both linear and circular DNA can be processed and presented. GeneTool comes with a database of some 400 restriction enzymes, and users can create their own sublibraries of enzymes as well as add new enzymes. Once a restriction digest has been performed, a graphical map is generated.

If sequence features have been identified previously, they are displayed as colored bars or semicircles. Clicking on any colored feature leads to that feature's information being displayed in a status bar at the top of the window. Once activated, that same feature may also be transferred to the Sequence Editor for further analysis. In addition to the sequence feature display, enzyme cut-sites are also displayed. Clicking on any restriction enzyme label leads to a pop-up box displaying a zoomed-in region of the sequence with the enzyme recognition sequence highlighted in red. An enzyme label with its attached site line may be moved or dragged to any position on the screen to make for a more readable or symmetric presentation.

Clicking on two enzyme names, while holding down the shift key, allows one to select the DNA sequence between the two cut sites. This graphical digest fragment may then be cut, copied, or pasted into another sequence or into another Sequence Editor. Additional annotation such as lines, circles, arcs, arrows, text etc. can be added to the map using the annotation icons on the left side of the window.
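The digest itself is easy to sketch. The Python below is a simplified illustration rather than GeneTool's code: it handles linear DNA only, searches the top strand only (sufficient for palindromic sites), and uses a hypothetical two-enzyme table in place of GeneTool's database of some 400 enzymes. The EcoRI (G^AATTC) and BamHI (G^GATCC) recognition sites are standard.

```python
# Recognition site and cut offset within the site (cut falls after `offset` bases).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),   # G^AATTC
    "BamHI": ("GGATCC", 1),   # G^GATCC
}


def cut_positions(seq, site, offset):
    """Return every top-strand cut position for one recognition site."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + offset)
        start = seq.find(site, start + 1)
    return positions


def digest(seq, enzyme_names):
    """Cut a linear sequence with the named enzymes and return the fragments."""
    cuts = sorted({p for name in enzyme_names
                   for p in cut_positions(seq, *ENZYMES[name])})
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


dna = "AAGAATTCTTGGATCCAA"
fragments = digest(dna, ["EcoRI", "BamHI"])
# The piece between the two cut sites, as selected by shift-clicking two labels:
between = dna[cut_positions(dna, *ENZYMES["EcoRI"])[0]:
              cut_positions(dna, *ENZYMES["BamHI"])[0]]
```

A circular molecule would need the first and last fragments joined, which this sketch deliberately leaves out.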

The Layout Editor

The Layout Editor offers the user the opportunity to create textually complex layouts or text figures. These complex textual representations of DNA sequence data are commonly presented in published manuscripts, but typically require many tedious hours on a word processor. To reduce the difficulty of generating these kinds of text figures, BioTools has developed a specific Layout Editor to accelerate and simplify the editing process.

By selecting sections of DNA sequence to be formatted using the mouse and then clicking either the Grouping, CAPITALIZATION, Double Stranded, Translation, or Show Restriction sites buttons, it is possible to alter or annotate the highlighted sequence. The Translation button permits multiframe translation using either the single letter IUPAC amino acid code or the three-letter code. Similarly, the Restriction Digest button permits a textually annotated representation of restriction enzyme cut-site locations using the same dialog box and selection procedure found in the Restriction Map Viewer. Both PepTool and GeneTool offer a unique speed-up feature called networked parallelism.

Networked parallelism allows a user to run a single program or a process simultaneously on several networked computers. The advantage to running a program on many computers as opposed to a single computer is that the program execution time can be accelerated by a factor roughly equal to the number of computers being used.

Database Compression

Protein and gene sequence databases are growing faster than hard drive capacity. It takes more than 12 CDs to hold all of the sequence data in GenBank. Fortunately, the Internet provides access to these huge databases without having to find a place to store more than 10 GB of data or to read a dozen CDs at a time.

However, these public servers are somewhat restricted in the types of searches that can be performed and the way that data can be saved, presented or downloaded. Furthermore, a growing number of university researchers and private companies are becoming increasingly concerned about Internet security and firewall breaches. The question is: How do you permit flexible database access and maintain security without the headache of purchasing a new hard drive every six months or a new CD every week? One compact answer is to use data compression technology.

BioTools has made use of the fact that most biological sequence data uses only a restricted alphabet: four letters for DNA or 20 letters for proteins. This means that the storage for each character can be reduced from the 8 bits of an ASCII character to roughly 2.3 bits for DNA sequence data and 5 bits for protein.
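The idea of a reduced alphabet can be sketched in a few lines of Python. This is not BioTools' compression scheme: it assumes only the four unambiguous bases A, C, G and T, so it reaches exactly 2 bits per base and makes no attempt to reproduce the roughly 2.3 bits quoted above. The function names are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_dna(seq):
    """Pack four bases into each byte (2 bits per base).
    Returns the packed bytes and the original length, which is needed
    to discard padding when unpacking."""
    packed = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    return bytes(packed), len(seq)


def unpack_dna(packed, length):
    """Reverse pack_dna, dropping any padding bases at the end."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


packed, n = pack_dna("ACGTACGTAC")      # 10 bases stored in 3 bytes
restored = unpack_dna(packed, n)
```

Ten bases that would occupy 10 bytes as ASCII text fit in 3 bytes here, a fourfold saving before any further compression of the annotation text.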

Further, by removing blanks, empty spaces, or redundant information from the database text fields and replacing common words with special characters, a good deal more compression can be achieved without significant loss of information. Finally, combining multiple databases with duplicate entries into a single nonredundant database saves further space. These steps are reported to shrink one database from 300 MB to 60 MB and the GenBank database from 12 GB to 3.2 GB. This means that the complete set of databases can be delivered on 2 CDs and easily stored on a regular 4 GB hard drive.

Although maintaining a local sequence database offers considerably more convenience, flexibility, and security than a remotely accessible database, it is likely that researchers will continue to demand regular access to the NCBI's or EBI's super-fast facilities and highly integrated database features. Although many commercial packages exist for molecular sequence analysis, they are very expensive. Many Web-based applications are also available for sequence analysis, but the data are held on remote servers. This client/server model is time-consuming and depends on the quality of the network, because of the large amount of data being transmitted.

Thus a good alternative is to build a sequence analysis facility with all the databases stored on a local server.

Computational Approaches for Gene Identification

Genetics is gaining increasing significance as the discovery of new genes continues to have considerable impact in the field of medical sciences.

The Human Genome Project is a multidisciplinary endeavor that aims at learning the identity of every single base stored in the human genome. The genome stores the blueprints for the synthesis of a variety of proteins, the macromolecules that enable an organism to be structurally and functionally viable. The blueprint or the program for the synthesis of a single protein is called a gene, a unit of the DNA sequence that is generally between 1,000 and 1,000,000 base pairs in length, depending on the complexity of the protein that it codes for.

A higher-level eukaryote contains as many as 30,000 to 40,000 genes. It has been estimated that gene coding regions account for only 10-20% of the genome. The gene identification problem is to recognize these regions in an anonymous sequence of DNA. The earlier phases of genomic research focused on the construction of physical maps. Currently the emphasis is on intensive sequencing. This helps in studying the structure and function of eukaryotic genes that may span tens or hundreds of kilobases. Only a few percent of the total gene span actually codes for protein. This makes the detection of eukaryotic genes using traditional approaches, such as those based on cDNA selection, exon trapping, and the random cloning of cDNA, quite laborious for sequences larger than a few kilobases. Consequently, genome sequencing centers routinely use computational approaches for exon prediction in addition to other means of detecting genes.

Gene-finding tools analyze the DNA sequence and label a region as potentially coding based upon its local codon usage, the presence of ancient conserved patterns, or its significant deviations from the composition of a random sequence.

What follows is a compendium of the currently used software systems for the identification of genes in anonymous segments of DNA.

1. Analysis and Annotation Tool (AAT). This tool identifies genes in a DNA sequence by comparing the sequence against protein and cDNA sequence databases. AAT includes two pairs of programs, each pair consisting of a database search program and an alignment program. The first program pair is designed to compare the query sequence to the protein database, whereas the second pair performs a similar comparison against cDNA databases. The alignment programs combine all of the sequence database alignments into a multiple sequence alignment to enhance the prediction of splice junctions. Low-scoring sequence alignments are filtered out of the results, and the final protein and cDNA alignments are combined and presented to the client.

The first program pair compares a query DNA sequence against a protein database using two programs called DPS and NAP. The DPS program is used for computing high-scoring chains of segment pairs between the query DNA sequence and a protein database. The global alignment program NAP finds the optimal alignment between a DNA and the matching protein sequence. The alignment model for NAP accommodates introns and frame shifts within codons, and is thus able to identify the exact locations of introns using the (GT) and (AG) consensus for splice-site identification.

The second program pair, consisting of DDS and GAP, is used for comparing the query DNA sequence against a cDNA library. The DDS program is an improvement over the BLASTN program. The GAP program is a global alignment program that is sufficiently powerful for aligning a DNA sequence containing introns to a cDNA sequence. One of the goals of AAT is to help in the automatic annotation of DNA sequences. This task has traditionally been done manually: the alignments between the coding regions of a DNA sequence and the existing proteins are established by BLASTX and linked to the sequence as an annotation in a post hoc manner.

This helps in providing a clue for the functional significance of a given gene as is evident in the function of the related protein sequences. But AAT performs such an alignment and is able to display it as the basis for predicting genes. Also the alignment produced by BLASTX is prone to frame shift errors. This shortcoming is overcome by AAT by the development of a customized program for DNA-protein sequence alignment.

2. Michael Zhang's Exon Finder (MZEF). This is an internal coding exon prediction program. It uses quadratic discriminant analysis (QDA) to describe the distributions of exons and pseudoexons. By fitting a QDA, the surface that separates the distributions of exons and pseudoexons can be more accurately approximated. A brief outline of the algorithm is as follows: each potential exon that matches the template AG -> ORF -> GT is analyzed. The candidates that meet a minimum length criterion are then considered putative exons and must be separated from the pseudoexons. The putative exons are represented using a nine-value feature vector, comprising parameters such as exon length, branch score, and various differences between the hexamer frequency preferences on the two sides of the donor and acceptor.
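The first step of that outline, enumerating segments that fit the AG -> ORF -> GT template, can be sketched in Python as follows. This is only an illustration of the template and the length filter, not MZEF itself: the quadratic discriminant scoring of the nine-value feature vector is omitted, and the function names and the default minimum length are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def has_open_frame(exon):
    """True if at least one of the three reading frames has no in-frame stop codon."""
    for frame in range(3):
        codons = (exon[i:i + 3] for i in range(frame, len(exon) - 2, 3))
        if not any(codon in STOPS for codon in codons):
            return True
    return False


def candidate_exons(seq, min_length=6):
    """Return (start, end) pairs such that seq[start:end] is preceded by AG,
    followed by GT, open in some frame, and at least `min_length` long."""
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    result = []
    for start in acceptors:
        for end in donors:
            if end - start >= min_length and has_open_frame(seq[start:end]):
                result.append((start, end))
    return result


seq = "CCAGATGGCCAAAGTCC"
candidates = candidate_exons(seq)
exon = seq[candidates[0][0]:candidates[0][1]]
```

In a real sequence this enumeration produces far more pseudoexons than true exons, which is why a discriminant step is needed afterwards.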

3. GENSCAN. GENSCAN works by building a probabilistic model of the gene structure of human genomic sequences and applying this model to the problem of gene prediction. The probabilistic model of a gene includes the specific compositional and functional units of a eukaryotic gene, including exons, introns, splice sites, promoters, and the polyadenylation signals. The occurrence of a partial set of these units and the representation of a partial gene is supported by the implementation of the model search algorithm.

Also, the predictions made by the program are not a mere reflection of the types of genes that are found in the protein databanks, but rather an independent evaluation that provides information that complements our existing knowledge. The modeling of a DNA sequence by GENSCAN is based on a generalized hidden Markov model (GHMM) that models double-stranded DNA and can find occurrences of multiple genes in a single sequence on either one or both DNA strands.

The program's ability to model functional signals and their interrelationships in a natural manner using the maximal dependence decomposition method is instrumental in providing it the strength for a generalized gene detection task. The text output of the program is a list of one or more predicted genes and peptide sequences, whereas the graphical outputs provide a representation of the relative locations of the predicted exons.

4. VEIL. VEIL (Viterbi Exon-Intron Locator) is based on the observation that hidden Markov models (HMMs) provide a precise probabilistic method for modeling sequences of discrete data.

Consequently, it uses a custom-designed HMM to segment uncharacterized genomic DNA sequences into exons, introns, and intergenic regions. The exon-HMM module is designed to capture the regularities in codon usage and periodicity that appear in exons, as well as to rule out in-frame stop codons. A similar module represents the intron-HMM. The HMMs for the probabilistic representation of the splice sites resemble a pipeline, as these signals are of a well-defined length. Other HMMs include those for the start codon, the polyadenylation signal AATAAA, and the intergenic regions that lie upstream of the start codon.

These simpler models are put together into the overall gene model. Once the model has been assembled, the Viterbi algorithm is used to parse the query sequence into its component exons and introns. The Viterbi algorithm also yields the probability of the most likely path by which the model produces a given sequence. This serves as a measure of the likelihood that a given DNA sequence contains a gene.
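The parsing step can be illustrated with a minimal Viterbi decoder in Python. The two-state model below is a toy: the state names, the transition and emission probabilities, and the GC-rich versus AT-rich distinction are all invented for illustration and bear no relation to VEIL's actual modules. Probabilities are handled in log space, and all of them are assumed to be non-zero.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its log probability.
    Assumes every probability used is strictly positive."""
    # best[t][s]: log probability of the best path ending in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])  # follow back-pointers to the start
    path.reverse()
    return path, best[-1][last]


# Hypothetical toy parameters: "exon" favors G/C, "intron" favors A/T.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path, log_prob = viterbi("GCGCATAT", STATES, START, TRANS, EMIT)
```

Note that the value returned is the log probability of the single best path; summing over all paths would require the forward algorithm instead.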

5. MORGAN. MORGAN (Multiframe Optimal Rule-Based Gene Analyzer) distinguishes itself from the other gene-finding systems by using a decision tree classifier. Decision trees are used for a variety of classification tasks, such as cancer diagnosis, speech and image understanding, and optical character recognition. They are applied to objects represented in terms of their features. The representation of an object in a d-dimensional feature space may be denoted as f1, f2, ..., fd. The knowledge about the classification process is embedded into a tree-like structure, and the identity of an unknown object is established by performing a series of tests.

Thus, each question node in the decision tree corresponds to a linear discriminant, and helps in partitioning the search space into a set of compartments or leaves, each of which represents a class that we are interested in identifying. For gene classification, MORGAN creates two partitions, labeled C for coding and N for non-coding.
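A decision tree whose question nodes are linear discriminants can be sketched in Python as follows. The tree shape, the two features, the weights and the thresholds are all hypothetical; MORGAN's trees are learned from training data and use its own feature set.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class Leaf:
    label: str  # "C" for coding, "N" for non-coding


@dataclass
class Node:
    weights: Sequence[float]        # one weight per feature f1..fd
    threshold: float
    below: Union["Node", Leaf]      # taken when the weighted sum <= threshold
    above: Union["Node", Leaf]      # taken when the weighted sum > threshold


def classify(tree, features):
    """Walk from the root to a leaf, testing one linear discriminant per node."""
    node = tree
    while isinstance(node, Node):
        score = sum(w * f for w, f in zip(node.weights, features))
        node = node.above if score > node.threshold else node.below
    return node.label


# Hypothetical two-feature tree (d = 2).
tree = Node(
    weights=(1.0, 0.0), threshold=0.5,
    below=Leaf("N"),
    above=Node(weights=(0.5, 0.5), threshold=0.6,
               below=Leaf("N"), above=Leaf("C")),
)

labels = [classify(tree, f) for f in [(0.9, 0.8), (0.2, 0.9), (0.6, 0.2)]]
```

Each test halves the region of feature space still under consideration, so the leaves correspond to the compartments described above.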

Homology Searching

The central goals of the human and model organism genome projects are to completely map and sequence the genes of these organisms. As work progresses, identification of the biochemical function of newly sequenced genes becomes a major challenge. Identification of gene function using traditional biochemical methods can be an extremely slow and laborious task that can take years of effort even for a single gene. Also, as the rate of DNA sequencing increases manifold, analysis by sequence similarity search will need to become much more efficient in terms of sensitivity, automation potential and consistency in annotation. Fortunately, computational methods are available that can greatly facilitate the identification of gene function. When a gene is isolated and sequenced, it can be matched against one or more of the publicly available sequence databases, such as GenBank. If a similar gene of known function can be identified in such a data base search, then the function of the newly sequenced gene can be surmised by analogy.

The biochemical functions of a growing number of genes, including a number of inherited human disease genes are being determined in this way. Currently, high-speed heuristic methods, such as the hash-coding (k-tuple) algorithm employed by FASTA and the approximate word match algorithm employed by BLAST are the most commonly used sequence data base search programs. They are very effective and reliable computational tools for exon identification and gene prediction.
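The hash-coding (k-tuple) idea can be shown with a short Python sketch. It covers only the lookup stage: every k-tuple of the data base sequence is indexed once, query k-tuples are looked up in the index, and matches are tallied by diagonal. The rescoring and alignment stages that FASTA performs afterwards are not shown, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(seq, k):
    """Map each k-tuple in `seq` to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k=2):
    """Count k-tuple matches per diagonal (target position minus query position).
    A diagonal with many hits suggests a region of similarity."""
    index = build_index(target, k)
    hits = Counter()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], ()):
            hits[t - q] += 1
    return hits


hits = diagonal_hits("GATTACA", "CCGATTACACC", k=3)
best_diagonal, count = hits.most_common(1)[0]
```

Because the index is built once and each lookup is a dictionary access, the cost grows with the number of matching k-tuples rather than with the product of the two sequence lengths, which is the source of the speed of such heuristics.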

These programs produce a list of the sequence identifiers (e.g. locus names and accession numbers) and title lines of statistically significant matches followed by a display of the alignments of the query with each of the matched sequences. But the rapidly growing number of sequences in GenBank, as well as the size and complexity of genomic query sequences, have strained the capabilities of the traditional BLAST search interface and network server.

Some of the difficulties are as follows:

The presence of repetitive elements in the query greatly complicates interpretation
Large query sequences exceed practical memory limitations imposed by the BLAST server
GenBank now contains sequences from so many different organisms that output can be extremely complex and/or redundant
GenBank contains so many sequences that hit lists are much too large for manual browsing
Truncated definition lines of matching database sequences can be ambiguous and uninformative

PowerBLAST was developed to meet the increasing demand for a more powerful tool that facilitates efficient and sophisticated analysis and automation of annotation. This program performs various types of query masking to reduce or eliminate spurious or misleading results. PowerBLAST postprocesses the BLAST search results to generate organism-specific results and more sensitive gapped alignments.

It offers a flexible and convenient user interface that supports

1. Batch submission of query sequences
2. Search against multiple databases
3. Simultaneous searches with multiple BLAST programs.

The results are displayed as multiple alignments with annotated features, derived from the GenBank records of matching sequences. All of the results may be viewed as text files in ASCII format or as web pages with HTML links to various database records.

For a sequence data base search result to be informative, two criteria must be met:

1. The query sequence must have a statistically significant match to a data base sequence (a score greater than one expected by chance alone).
2. There must be information available about the function of the sequence matched.

It is quite common that the functions of matched sequences are not obvious from the search results. Often sequence titles are uninformative and one must laboriously retrieve and scan the full sequence data base reports to look for annotations that may identify the biological functions of the matched sequence. Also, functionally important conserved domains such as enzyme active sites are not noted as such in sequence data base records.

BEAUTY addresses this latter aspect of the sequence identification problem, providing information about the function of the data base sequences matched in BLAST searches. BEAUTY incorporates information on sequence family membership, the location of the conserved regions, and annotated domains and sites directly into BLAST search results. These enhancements make it much easier to identify the functions of matched sequences, which is particularly important when trying to analyze the biological significance of weak data base hits. BEAUTY performs a BLAST search of sequence databases for which compiled functional information on each sequence is available. BEAUTY incorporates this information directly into BLAST search results by adding several new tables and figures to the standard BLAST output files.

First, a table is added that lists for each data base hit

1. the sequence family to which the data base sequence belongs
2. the number of sequences within each family matched in the search
3. the total number of sequences in the family.

This table allows one to quickly assess the number of different families matched in the search as well as the number of family members matched for each sequence family identified. Matches to only a single member of a sequence family are much more likely to indicate a random or spurious similarity, whereas data base hits that include all or most members of a sequence family provide more convincing evidence that the query sequence is indeed related to that sequence family.
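The family summary table is essentially a pair of tallies, as the following Python sketch shows. The sequence identifiers, family names and the `family_of` mapping are made-up illustrative data, not BEAUTY's data base format.

```python
from collections import Counter


def family_table(hit_ids, family_of):
    """For each family hit in a search, report
    (family, members matched in the search, total members in the family)."""
    family_sizes = Counter(family_of.values())
    matched = Counter(family_of[h] for h in set(hit_ids) if h in family_of)
    return [(fam, matched[fam], family_sizes[fam]) for fam in sorted(matched)]


# Hypothetical family assignments for five data base sequences.
family_of = {"P1": "kinase", "P2": "kinase", "P3": "kinase",
             "P4": "globin", "P5": "globin"}

table = family_table(["P1", "P2", "P3", "P4"], family_of)
```

Here all three kinase family members are matched but only one of two globins, so by the reasoning above the kinase relationship would be the more convincing of the two.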

Second, a new figure is added for each data base sequence matched. This figure shows the locations of each of the local BLAST hits within the query sequence and allows one to quickly assess if the hits occur primarily in one or a few local regions or if the hits are scattered throughout the sequence. Multiple hits within the same region of a query sequence may indicate a functionally important domain within that region. The query sequence is also compared with the PROSITE pattern data base, and the locations of any matched patterns are displayed in the same figure. This allows one to immediately correlate the locations of hits with the locations of potential functional domains identified by PROSITE motif matches.

Third, for each data base sequence matched in a search, a figure is added that shows the locations of the local BLAST hits with respect to the positions of the conserved regions and any annotated sites and domains for that data base sequence. These figures allow one to quickly assess if any potential functionally important regions have been hit during the data base search. Hits within all or most of the known conserved domains within a data base sequence are much more likely to be functionally important than hits within nonconserved regions. Also, query sequences with weak matches against some or all of the conserved domains within a data base sequence are much more likely to be related than cases where only nonconserved regions are matched.

Matches to conserved regions can also help identify potential functionally important domains in those cases where no annotation of functional domains is provided in the data base reports. When annotated domain information is available, this figure also allows a researcher to directly correlate the locations of such domains with the positions of all local BLAST hits within the sequence. In addition, hits matching known domains and sites are readily discernible without looking up the individual data base reports for each sequence. Apart from these improvements, BEAUTY search results returned by our WWW search interface include hypertext links to a number of in-house and external on-line resources. Using these links, additional functional information on matched sequences can be accessed immediately.

For example, links are provided to the NCBI's WWW Entrez interface, allowing MEDLINE literature abstracts referenced in sequence reports to be retrieved immediately and browsed for more detailed information on matched data base sequences. Links to the SRS WWW interface allow information cross-referenced from more than 30 linked data bases (EMBL, GenBank, SWISSPROT, PIR, PDB, ...) to be obtained similarly. Using the WWW interface, all of this linked information can be easily browsed for biological meaning without the distraction of performing keyword searches separately for each of these individual on-line resources and then storing each of the results. As a result, thoroughly analyzing BEAUTY search results can take significantly less time than analyzing a corresponding standard BLAST search.

BEAUTY (BLAST Enhanced Alignment UtiliTY)

BEAUTY is an enhanced version of NCBI's BLAST data base search tool that facilitates identification of the functions of matched sequences.

There are data bases of conserved regions and functional domains for protein sequences in NCBI's Entrez database. BEAUTY allows this information to be incorporated directly into BLAST search results. A Conserved Regions Data Base, containing the locations of conserved regions within Entrez protein sequences, has been built by clustering the entire database into families, aligning each family using PIMA multiple sequence alignment, and scanning the multiple alignments to locate the conserved regions within each aligned sequence. A separate Annotated Domains Data Base was constructed by extracting the locations of all annotated domains and sites from sequences represented in the Entrez, PROSITE, BLOCKS and PRINTS data bases.

BEAUTY performs a BLAST search of those Entrez sequences with conserved regions and/or annotated domains. BEAUTY then uses the information from the Conserved Regions and Annotated Domains databases to generate, for each matched sequence, a schematic display that allows one to directly compare the relative locations of

the conserved regions,
annotated domains and sites,
the local aligned regions matched in the BLAST search.

In addition, BEAUTY search results include WWW hypertext links of matched sequences. This convenient integration of protein families, conserved regions, annotated domains, alignment displays, and WWW resources greatly enhances the biological informativeness of sequence similarity searches.

PowerBLAST

PowerBLAST is a software tool for addressing the efficient analysis of similarity search. It includes a number of options for masking repetitive elements and low complexity subsequences. It also has the capacity to restrict the search to any level of NCBI's taxonomy index, thus supporting "comparative genomics" applications. Postprocessing of the BLAST output using the SIM series of algorithms produces optimal, gapped alignments, and multiple alignments when a region of the query sequence matches multiple database sequences.

PowerBLAST is also capable of processing sequences of any length because it divides long query sequences into overlapping fragments and then merges the results after searching. The results may be viewed graphically, as a textual representation, or as an HTML page with links to GenBank and Entrez. For matching database sequences, annotated features are superimposed on the aligned query sequence in the output, thus greatly increasing the ease of interpretation. Such features may be used for automated annotation of new sequence because PowerBLAST output in ASN.1 form may be dragged and dropped into NCBI's Sequin program for sequence annotation and submission. PowerBLAST is capable of analyzing and annotating a 100-kb query in 60 min on NCBI's BLAST server.
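The split-and-merge strategy can be sketched in Python as below. This is not PowerBLAST's code: the fragment size and overlap are arbitrary illustrative values, the function names are hypothetical, and the "search" itself is left out, so the merge step simply translates per-fragment hit coordinates back to the full query and removes duplicates found in the overlaps.

```python
def overlapping_fragments(length, size, overlap):
    """Return (start, end) windows of at most `size` that cover [0, length),
    with consecutive windows sharing `overlap` positions."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    fragments = []
    start = 0
    while True:
        end = min(start + size, length)
        fragments.append((start, end))
        if end == length:
            break
        start += step
    return fragments


def merge_hits(hits_by_fragment_start):
    """Translate fragment-relative (start, end) hits to query coordinates
    and drop duplicates reported by two overlapping fragments."""
    merged = set()
    for frag_start, hits in hits_by_fragment_start.items():
        for hit_start, hit_end in hits:
            merged.add((frag_start + hit_start, frag_start + hit_end))
    return sorted(merged)


fragments = overlapping_fragments(length=25, size=10, overlap=2)
merged = merge_hits({0: [(3, 6), (8, 10)], 8: [(0, 2), (1, 4)]})
```

The overlap ensures that a match straddling a fragment boundary is still seen whole by at least one fragment, provided it is no longer than the overlap.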

Major Internet Resources for current information on Bioinformatics and Computational Biology

Staying current with the computational biology literature has become appreciably easier over the last few years, as information providers release more of their information via the Web. The MEDLINE searchable database, the most accessible and most used source for locating the biomedical literature, contains bibliographic citations with abstracts from approximately 3900 biomedical journals published all over the world. MEDLINE databases are made available through a variety of different search interfaces and in a variety of formats such as CDs.

MEDLINE is also freely available via the Web through two different interfaces, Internet Grateful Med and PubMed. Entrez was designed by NCBI to integrate access to DNA and protein sequence databases along with taxonomy, genome and protein structure information. Entrez also provides direct access to MEDLINE articles describing sequences. Apart from MEDLINE, there are a couple of other searchable databases.

The Science Citation Index (SCI) is a multidisciplinary database of bibliographic information provided by the Institute for Scientific Information (ISI). It provides an important search enhancement to the literature that can be obtained from the MEDLINE database. It includes many biomedical sciences journals not indexed in MEDLINE. Subject coverage in SCI includes all scientific and technical disciplines.

Approximately 3500 of the world's leading journals are included in SCI. The electronic version of SCI is available in a number of forms, including CDs, published monthly, and magnetic tape editions, updated weekly. Searching SCI on the Web of Science site is not as flexible as searching MEDLINE through PubMed. SCI also lacks PubMed's convenient links to molecular and genetic information in the Entrez databases, and it is not free. The initial screen for the Web of Science search interface presents two options for searching SCI. The last such database is Current Contents (CC), which provides the tables of contents from more than 7000 journals and 2000 books and conference proceedings on different subjects, including bioinformatics and computational biology.

The database is available in a variety of formats: ftp deliveries, floppy disks, and CDs. CC can be accessed on the Web through the Current Contents Connect system provided by ISI.

Electronic journals

1. Nature has become available in full text on the Internet. It publishes a great deal of current information, research articles and results on bioinformatics and computational biology.
2. Science, available in full text on the Internet, is another valuable magazine with a number of resources on computational biology.
3. Proceedings of the National Academy of Sciences (PNAS) carries a number of valuable articles and research results on computational biology.
