Computational Biology - Applications and Approaches

There has been a dramatic increase in the number of completely sequenced genomes as a result of the efforts both of public genome agencies and the pharmaceutical industry. The availability of completely sequenced genomes permits more systematic analyses of genes, evolution and genome function than was otherwise possible. Using computational methods - which are used to identify genes and their functions including statistics, sequence similarity, motifs, profiles, protein folds and probabilistic models - it is possible to develop characteristic genome structures, assign functions to genes, identify pathogenic genes, identify metabolic pathways, develop diagnostic probes and discover potential drug-binding sites. All of these directions are critical to understanding bacterial growth, pathogenicity and host-pathogen interactions.

Introduction

Deoxyribonucleic acid (DNA) is a macromolecular chain of nucleotides that serves as a basic carrier of genetic information and is able to self-replicate. DNA can be represented as a sequence of nucleotide bases. DNA sequences are typically from thousands to millions of bases long. DNA usually consists of two strands of complementary nucleotide sequences that are base paired to each other. DNA in humans forms a linear chain, but DNA can also form a circular molecule. A hypothetical double-stranded DNA molecule can be represented as

ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA
TGCACCATCTCTGGGACACACTATCTGGTGCCCAT

As A pairs with T and C pairs with G and vice versa, the above molecule can be represented as \smallskip ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA \smallskip Here A - Adenine, C - Cytocine, G - Guanine, T - Thymine. Thus a DNA molecule can be thought of as a string over an alphabet of four characters {A, T, C, G} (nucleotides). We give below brief definitions for some more important terms in molecular biology.

The RNA is the same as above DNA with the exception that T is replaced by U, which represents uricil nucleotide.

An organism is further classified into two types.

  1. Eukaryotes - higher-order organisms whose DNA is enclosed in a cell nucleus. E.g. humans
  2. Prokaryotes - organisms such as bacteria whose DNA is not enclosed in a nucleus. E.g. bacteria.

mRNA (messenger RNA) is the mediating template between DNA and proteins. The information from a particular gene is transferred from a strand of DNA by the construction of a complementary strand of RNA through a process known as transcription. Next three nucleotide segments of RNA, called tRNA (transfer RNA), which are attached to specific amino acids, match up with the template strand of mRNA to order the amino acids correctly. These amino acids are then bonded together to form a protein. This process, called translation occurs in the ribosome, which is composed of proteins and the third kind of RNA, rRNA (ribosomal RNA)

cDNA represents complementary or copy DNA. This is DNA made from messenger RNA through reverse transcription.

transfer RNA (tRNA) is the RNA molecule that transports a specific amino acid to a growing amino acid chain, as directed by a specific codon in the mRNA.

Translation is the process by which a protein is synthesized, according to the blue print given by a messenger RNA molecule.

A gene is a contigous interval of DNA that contains the information needed to code for a protein. Genes form the basic units of heredity.

Transcription is the process by which an RNA molecule is synthesized complementary to the DNA in a gene. In eukaryotes, both the introns and the exons are transcribed into the RNA. The introns are later spliced out, creating an mRNA. According to the genetic determinism principle, it is feasible to predict the function of every gene in the genome by its sequence information. Implicitly this assumes that the environment of each gene is also computable from the complete genome sequence because the function of a molecule can only become meaningful in relation to its environment. Therefore, the entire molecular architectures and molecular reaction pathways in a cell may be computable from the genomic sequence. Thus it is true that the form and function of an organism are represented in the nucleus. Hence, in a way, the genome is simply a warehouse of parts, or building blocks of life, and a real blueprint of life is written in the entire cell, as a network of molecular interactions.

There are two standard assumptions:

These two assumptions allow to realize the importance of sequence-level investigations. Below comes some of the standard quotations.

The digital information that underlies biochemistry, cell biology, and development can be represented by a simple string of G's, A's, T's, and C's.

In a very real sense, molecular biology is all about sequences. It tries to reduce complex biochemical phenomena to interactions between defined sequences.

The ultimate rationale behind all purposeful structures and behavior of living things is embodied in the sequence of residues of polypeptide chains.

Computational Applications in Molecular Biology

So, without delving too much into the more difficult chemical and biological aspects of DNA and protein, computer scientists were empowered to consider a variety of biologically important problems defined primarily on sequences, or (more in the computer science vernacular) on strings;

We below give some molecular biology applications that can be represented and solved by a variety of concepts and algorithmic techniques currently available in theoretical computer science.

A protein can be thought of as a string over an alphabet of twenty characters (amino acids). A gene, which is physically embedded in a DNA molecule, typically encodes the amino acid sequence for a particular protein. This is done as follows: Starting at a particular point in the DNA string, every three consecutive DNA characters encode a single amino acid. Such a coding triple is called a codon, and the full association of codons to amino acids is called the genetic code. For example, the codon ttt codes for the amino acid Phenylalanine (abbreviated in the single character amino acid alphabet as F), and the codon gtt codes for the amino acid Valine (abbreviated as V). Since there are 4 times 4 times 4 = 64 possible triples but only twenty amino acids, there is a possibility that two or more triples form codons for the same amino acid and that some triples do not form codons. The question now is as follows:

Suppose one is given a DNA string of n nucleotides, but you do not know the correct "reading frame". That is, you do not know if the correct decomposition of the string into codons begins with the first, second, or third nucleotide of the string. Each such "frameshift" potentially translates into a different amino acid string. The task is to produce, for each of the three reading frames, the associated amino acid string. For example, consider the string atggacgga. The first reading frame has three complete codons, atg, gac, and gga, which in the genetic code specify the amino acids Met, Asp, and Gly. The second reading frame has two complete codons, tgg and acg coding for amino acids Trp and Thr. The third reading frame has two complete codons, gga and cgg, coding for amino acids Gly and Arg. The goal is to produce the three translations, using the fewest number of character examinations of the DNA string and the fewest number of indexing steps (when using the codons to look up amino acids in a table holding the genetic code). Clearly, the three translations can be done with 3n examinations of characters in the DNA and 3n indexing steps in the genetic code table. Find a method that does the three calculations in at most n character examinations and n indexing steps.

The second one is as follows. The input is a protein string S_1 over the amino acid alphabet of length n and another protein string of length m > n. Determine if there is a string specifying a DNA encoding for S_2 that contains a substring specifying a DNA encoding of S_1 allowing the encoding of S_1 to begin at any point in the DNA string for S_2 (i.e., in any reading frame of that string).

The third one is given below. Let T be a text string of length m and let S be a multiset of n characters. The problem is to find all substrings in T of length n that are formed by the characters of S. For example, let S = a,a,b,c and T = abahgcabah. The caba is a substring in T formed from the characters of S. To find a solution to this problem that runs in O(m) time.

The above problem may become useful in sequencing protein from a particular organism after a large amount of the genome of that organism has been sequenced. In prokaryotes, the amino acid sequence for a given protein is encoded in a contiguous segment of DNA - one DNA codon for each amino acid in the protein. So we assume that we have the protein molecule but do not know its sequence or the location of the gene that codes for the protein. Presently, chemically determining the amino acid sequence of a protein is very slow, expensive, and somewhat unreliable. However finding the multiset of amino acids that make up the protein is relatively easy. Now suppose that the whole DNA sequence for the genome of the organism is known. One can use that long DNA sequence to determine the amino acid sequence of a protein of interest. First translate each codon in the DNA sequence into the amino acid alphabet (this may have to be done three times to get the proper time) to form the string T; then chemically determine the multiset S of amino acids in the protein; then find all the substrings in T of lenght |S| that are formed from the amino acids in S. Any such substrings are candidates for the amino acid sequence of the protein although it is unlikely that there will be more than one candidate. The match also locates the gene for the protein in the long DNA string.

Importance of exact matching problem in molecular biology

Given a string P called the pattern and a longer string T called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text T.

For example, if P = aba and T = bbabaxababay the P occurs in T starting at locations 3, 7, and 9. Note that two occurrences of P may overlap as illustrated by the occurrences of P at locations 7 and 9. \smallskip A string S is an ordered list of characters written contiguously from left to right. For any string S, S[i..j] is the contiguous substring of S that starts at position i and ends at position j of S. S[i..i] is the prefix of string S that ends at position i, and S[i..|S|] is the suffix of string that begins at position i. For example, university is a string, ver is a substring, univ is a prefix and sity is a suffix.

Both string and sequence represents the same in the biological context. But subsequence is altogether different from substring as characters in a subsequence might be interspersed with characters not in the subsequence.

There is a naive method for exact matching and are ideas for speeding up the naive method. In addition, there is the preprocessing approach with linear time complexity for exact matching.

Apart from these, there are classical comparison-based methods such as the BM algorithm and its several variants, the KMP algorithm,

The practical importance of the exact pattern (mostly strings) matching problem should be obvious to many. The problem arises in widely used varying applications such as in word processors; in utilities such as grep on Unix; in textual information retrieval programs such as Medline and in numerous databases.

In molecular biology, there are several hundred specialized databases holding raw DNA, RNA, and amino acid strings, or processed patterns (called motifs) derived from the raw string data.

The currently available algorithms for this problem are a little bit slow and erroneous due in the contexts that billions of DNA sequences are stored in present day DNA and protein databases available world wide and the methods for DNA mapping and sequencing are getting advanced due to the arrival of DNA Recombinant technology. Certainly there are faster, common database searching programs (for example, BLAST, FASTA 3, FLASH, ..) and there are faster machines. There are liner time algorithms for this problem.

But the question here is that exact matching will remain a problem of interest as the size of the databases grow exponentially and also because it will continue to be a subtask needed for more complex searches that will be devised in the near future to fulfill the various and advanced requirements of molecular biologists.

Exact set matching problem

Thus far, we discussed about many algorithms for exact matching and their implications for molecular biology applications. An immediate and important generalization of the exact matching problem is to find all occurrences in text T of any pattern in a set of patterns P = P_1, P_2, ...P_k. This is called the exact set matching problem. Let n denote the total length of all the patterns in P and m be, as before, the length of T.

The Aho-Corasick method helps to solve this problem in O(n+m+k), where k is the number of occurrences in T of the patterns from P. There is also an equally efficient, but more robust, method fro the exact set matching problem using suffix trees.

dictionary problem

There arise a special case of set matching called the dictionary problem. In the dictionary problem, a set of strings forming a dictionary is initially known and preprocessed. Then a sequence of individual strings will be presented; for each one, the task is to find if the presented string is contained in the dictionary. Let us see a couple of applications of exact set matching.

Matching against a DNA or protein library of known patterns

The concept of a Sequence-tagged-site (STS) is one of the most useful by-products that has come out of the Human Genome Project. An STS is intuitively a DNA string of length 200-300 nucleotides whose right and left ends, of length 20-30 nucleotides each, occur only once in the entire genome. Thus each STS occurs uniquely in the DNA of interest. An early goal of the Human Genome Project was to select and map (locate on the genome) a set of STSs such that any substring in the genome of length 100,000 or more contains at least one of those STSs. A more refined goal is to make a map containing ESTs (expressed sequence tags), which are STSs that come from genes rather than parts of intergene DNA. ESTs are obtained from mRNA and cDNA and typically reflect the protein coding parts of a gene sequence.

With an STS map, one can locate on the map any sufficiently long string of anonymous but sequenced DNA - the problem is just one of finding which STSs are contained in the anonymous DNA. Thus with STSs, map location of anonymous sequenced DNA becomes a string problem, an exact set matching problem. The STSs or the ESTs provide a computer-based set of indices to which new DNA sequences can be referenced. Presently, hundreds of thousands of STSs and tens of thousands of ESTs have been found and placed in computer databases. Note that the total length of all the STSs and ESTs is very large compared to the typical size of an anonymous piece of DNA. Consequently, the keyword tree and the Aho-Corasick method with a search time proportional to the length of the anonymous DNA are of direct use in this problem for they allow very rapid identification of STSs and ESTs that occur in newly sequenced DNA.

Exact matching with wide cards

Here we modify the exact matching problem by introducing a character \, called a wild card, that matches any single character. Given a pattern P containing wild cards, we want to find all occurrences of P in a text T. For example, the pattern ab\ \c\ occurs twice in the text xabvccbababcax. Note that in this version of the problem no wild cards appear in T and that each wild card matches only a single character rather than a substring of unspecified length.

One very important case where simple wild cards occur is in DNA transcription factors. A transcription factor is a protein that binds to specific locations in DNA and regulates, either enhancing or suppressing, the transcription of the DNA into RNA. In this way, production of the protein that the DNA codes is regulated.

Regular expression pattern matching

A regular expression is a way to specify a set of related strings, sometimes referred to as a pattern. Many important sets of substrings (patterns) found in biosequences particularly in proteins can be specified as regular expressions and several databases have been constructed to hold such patterns. The PROSITE database is the major regular expression database for significant patterns in proteins.

The following PROSITE expression specifies a set substrings, some of which appear in a particular family of granin proteins:
[ED]-[EN]-L-[SAN]-x-x-[DE]-x-E-L
Every string specified by this regular expression has ten positions, which are separated by a dash. Each capital letter specifies a single amino acid and a group of amino acids enclosed by brackets indicates that exactly one of those amino acids must be chosen. A small x indicates that any one of the twenty amino acids from the protein alphabet can be chosen for that position. This regular expression describes 192,000 amino acid strings, but only a few of these actually appear in any known proteins.

All of the exact matching methods discussed so far are examples of comparison-based methods. The main primitive operation in each of those methods is the comparison of two characters. There are, however, string matching methods based on bit operations or on arithmetic, rather than character comparisons. These methods have a different flavor than the earlier approaches. Some of the examples are the Shift-And method, agrep and Karp-Rabin fingerprint method.

Suffix Trees

A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. Each internal node, other than the root, has at least two children and each edge is labeled with a non-empty substring of S. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i. That is, it spells out S[i..m].

The classical application for suffix trees is the substring problem. One is first given a text T of length m. After O(m), or linear, preprocessing time, one must be prepared to take in any unknown string S of length n and in O(n) time either find an occurrence of S in T

The label of a path from the root that ends at a node is the concatenation, in order, of the substrings labeling the edges of that path. The path-label of a node is the label of the path from the root of T to that node. For any node v in a suffix tree, the string-depth of v is the member of characters in v's label.

A path that ends in the middle of an edge (u,v) splits the label on (u,v) at a designated point. Define the label of such a path as the label of u concatenated with the characters on edge (u,v) down to the designated split point.

We have already seen a number of efficient and fast algorithms for the exact matching problem. Suffix tree provides another approach.
Build a suffix tree \Gamma for text T in O(m) time. Then, match the characters of P along the unique path in \Gamma until either P is exhausted or no more matches are possible. In the latter case, P does not appear anywhere in T. In the former case, every leaf in the subtree below the point of the last match is numbered with a starting location of P in T, and every starting location of P in T numbers such a leaf.

There is a naive algorithm to build suffix tree, but there are methods such as Ukkonen's method and Weiner's method for linear-time construction of suffix trees. Several years after Weiner published his linear-time algorithm to construct a suffix tree for a string S, McCreight gave a different method that also runs in linear time but is more space efficient in practice.

These methods are to build a suffix tree for a single string in linear time. These methods can be easily extended to represent the suffixes of a set \S_1, S_2, ..., S_z\ of strings. These suffixes are represented in a tree called a generalized suffix tree.

Both exact matching and exact set matching problems can be solved by suffix trees in O(m) preprocessing time and space (building a suffix tree of sice O(m) for the text T) and O(n+k) search time where n is the length of the pattern and k is the number of occurrences. In the next section, let us see a couple of fine applications of suffix tree here even though there are many applications that can be solved by suffix trees.

The substring problem for a database of patterns

A set of strings or a database is first known and fixed. Later, a sequence of strings will be presented and for each presented string S, the algorithm must find all the strings in the database containing S as a substring

The DNA database contains a collection of previously sequenced DNA strings. When a new DNA string is sequenced, it could be contained in an already sequenced string. The total length of all the strings in the database, denoted by m, is assumed to be large. What constitutes a good data structure and lookup algorithm for the substring problem. The two constraints are that the database should be stored in a small amount of space and that each lookup should be fast. The third feature is that the preprocessing of the database should be relatively fast.

Suffix trees yield a very attractive solution to this database problem. A generalized suffix tree T for the strings in the database is built in O(m) time and, more importantly, requires only O(m) space. Any single string S of length n is found in the database, or declared not to be there, in O(n) time. As usual, this is accomplished by matching the string against a path in the tree starting from the root. The full string S is in the database if and only if the matching path reaches a leaf of T at the point where the last character of S is examined. Moreover, if S is a substring of strings in the database then the algorithm can find all strings in the database containing S as a substring. This takes O(n+k) time, where k is the number of occurrences of the substring.

Longest common substring of two strings

A classic problem in string analysis is to find the longest substring common to two given strings S_1 and S_2. This is the longest common substring problem. For example, if S_1 = KyotoUniversity and S_2 =Osacakauniversalmuseum, then the longest common substring of S_1 and S_2 is univers. The longest common substring of two strings can be found in linear time O(m + n) using a generalised suffix tree.

Recognizing DNA contamination

Often the various laboratory processes used to isolate, purify, clone, copy, maintain, probe, or sequence a DNA string will cause unwanted DNA to become inserted into the string of interest or mixted together with a collection of strings. Contamination is often caused by a fragment (substring) of a vector (DNA string) used to incorporate the desired DNA in a host organism or the contamination is from the DNA of the host itself. Contamination can also come from very small amounts of undesired foreign DNA that gets physically mixed into the desired DNA and then amplified by PCR (Polymerase chain reaction) used to make copies of the desired DNA.

Thus the DNA contamination problem can be represented as follows. Given a string S_1 (the newly isolated and sequenced string of DNA) and a known string S_2 (the combined sources of possible contamination), find all substrings of S_2 that occur in S_1 and that are longer than some given length l. These substrings are candidates for unwanted pieces of S_2 that have contaminated the desired DNA string.

This problem can be easily solved in linear time by extending the approach discussed above for the longest common substring of two strings.

Common substrings of more than two strings

In biological strings (DNA, RNA, or Protein) the problem of finding substrings common to a large number of distinct strings arises because mutations that occur in DNA after two species diverge will more rapidly change those parts of the DNA or protein that are less functionally important. The parts of the DNA or protein that are critical for the correct functioning of the molecule will be more highly conserved, because mutations that occur in those regions will more likely be lethal. Therefore, finding DNA or protein substrings that occur commonly in a wide range of species helps point to regions or subpatterns that may be critical for the function or structure of the biological string.

The problem can be formally defined as follows:

Suppose we have K strings whose lengths sum to n. For each k between 2 and K and l(k) to be the length of the longest substring common to atleast k of the strings. Then the problem can be solved in linear, O(n) time.

All-pairs suffix-prefix matching

Given two strings S_i and S_j, any suffix of S_i that matches a prefix of S_j is called a suffix-prefix match of S_i, S_j.

Given a collection of strings S = S_1, S_2,...,S_k of total length m, the all-pairs suffix-prefix problem is the problem of finding, for each ordered pair S_i, S_j in S, the longest suffix-prefix match of S_i, S_j.

The following application by Green et al. motivates this problem. In that research, a set of around 1,400 ESTs from the organism C.elegans were analysed for the presence of highly conserved substrings called ancient conserved regions (ACRs). One of the main objectives of the research was to estimate the number of ACRs that occur in the genes of C.elegans. Their approach was to extrapolate from the number of ACRs they observed in the set of ESTs.

The goal is to use the ACR data observed in the ESTs to estimate the number of ACRs in the entire set of genes. A sample extrapolation would be justified if the ESTs were essentially random samples selected uniformly from the entire set of C.elegans genes. However, genes are not uniformly sampled, so a simple extrapolation would be wrong if the prevalence of ACRs is systematically different in ESTs from frequently or infrequently expressed genes. The approach taken by Green and his team is to compute the overlap between each pair of ESTs. Since all the ESTs are of comparable length, the heart of that computation consists of solving the all-pairs suffix-prefix problem on the set of ESTs. Because there may be some sequencing errors, and because substring containment is possible among strings of unequal length, one should also solve the all-pairs longest common substring problem. After categorizing the ESTs in this way, it was indeed found that ACRs occur more commonly in ESTs from frequently expressed genes, that is, from ESTs that overlap other ESTs. This problem can be solved in linear time using suffix tree when the alphabet is fixed.

Repetitive structures in molecular strings

There are a number of occurrences of important repetitive structures in biological strings (DNA, RNA and protein). There are several efficient algorithms and methods for finding various types of repetitive structures in stings. One of the striking features of DNA and Protein is the extent to which repeated substrings occur in the genome. This is particularly true of eukaryotes. Most of the human Y chromosome consists of repeated substrings and over all families of reiterated sequences account for about one third of the human genome.

There are three varieties of repetitive structures in DNA and protein.

Small-scale local repeats whose function or origin is partially understood include: complemented palindromes in both DNA and RNA, which act to regulate DNA transcription; nested complemented palindromes in tRNA that allow the molecule to fold up into a cloverleaf structure by complementary base pairing; tandem arrays of repeated RNA that flank retroviruses and facilitate the incorporation of viral DNA (produced from the RNA sequence by reverse transcription) into the host's DNA; single copy inverted repeats that flank transposable (movable) DNA in various organisms and that facilitate that movement or the inversion of the DNA orientation; short repeated substrings (both palindromic and nonpalindromic) in DNA that may help the chromosome fold into a more compact structure; repeated substrings at the ends of a viral DNA in a linear state that allow the concatenation of many copies of the viral DNA (a molecule of this type is called a concatamer); copies of genes that code for important RNAs (rRNAs and tRNAs) that must be produced in large number; clustered genes that code for important proteins such as histone that regulate chromosome structure and must be made in large number; families of genes that code for similar proteins (hemoglobins and myoglobins); similar genes that probably arose through duplication and subsequent mutation including pseudogenes that have mutated to the point that they no longer function; common exons of eukaryotic DNA that may be basic building blocks of many genes; and common functional or structural subunits in protein (motifs and domains).

Restriction enzyme cutting sites illustrate another type of small-scale, structured, repeating substring of greater importance to molecular biology. A restriction enzyme is an enzyme that recognizes a specific substring in the DNA of both prokaryotes and eukaryotes and cuts or cleaves the DNA every place where that pattern occurs exactly where it cuts inside the pattern varies with the pattern.

Restriction enzyme cutting sites are interesting examples of repeats because they tend to be complemented palindromic substrings. For example, the restriction enzyme EcoRI recognizes the complemented palindrome GAATTC and cuts between the G and the adjoining A (the substring TTC when reversed and complemented is GAA). ,p> Simple repeats that are less well understood often arise as tendem arrays (consecutive repeated strings) of repeated DNA. For example, the string TTAGGG appears at the ends of every human chromosome in arrays containing one to two thousand copies. A number of genetic diseases (Fragile X syndrome, Huntington's disease, Kennedy's disease, myotonic dystrophy, ataxia) are now understood to be caused by increasing numbers of tandem DNA repeats of a string three bases long. These triplet repeats somehow interfere with the proper production of particular proteins. Moreover, the number of triples in the repeat increases with successive generations, which appears to explain why the disease increases in severity with each generation.

Other long tandem arrays consisting of short strings are very common and are widely distributed in the genomes of mammals. These repeats are called satellite DNA. Repetitive DNA that is interspersed throughout mammalian genomes, and whose function and origin is less clear, is generally divided into SINEs (short interspersed nuclear sequences) and LINEs (long interspersed nuclear sequences). The classic example of a SINE is the Alu family. The Alu repeats occur about 300,000 times in the human genome and account for as much as 5% of the DNA of human and other mammalian genomes. Alu repeats are substrings of length around 300 nucleotides and occur as nearly identical copies widely dispersed throughout the genome. Moreover, the interior of an Alu string itself consists of repeated substrings of length around 40, and the Alu sequence is often flanked on either side by tandem repeats of length 7-10. Those right and left flanking sequences are usually complemented palindromic copies of each other.

One of the most fascinating discoveries in molecular genetics is a phenomenon called genomic or gametic imprinting, whereby a particular allele of a gene is expressed only when it is inherited from one specific parent. Sometimes the required parent is the mother and sometimes the father. The allele will be unexpressed, or expressed differently, if inherited from the incorrect parent. In mice and humans, around sixteen imprinted gene alleles have been found. Five of them require inheritance from the mother, and the rest from the father. The DNA sequences of these imprinted genes all share the common feature that they contain or closely associated with, a region rich in direct repeats. These repeats range in size from 25 to 120 base pairs, are unique to the respective imprinted regions, but have no obvious homology to each other or to highly repetitive mammalian sequences.

Uses of repetitive structures in molecular biology

Genetic mapping requires the identification of features or markers in the DNA that are highly variable between individual and that are intersperced frequently throughout the genome. Tandem repeats are just such markers. What varies between individuals is the number of times the substring repeats in an array. Hence the term used for this type of marker is variable number of tandem repeats (VNTR). VNTRs occur frequently and regularly in many genomes, including the human genome, and provide many of the markers need for large scale genetic mapping. The VNTR markers are used during the genetic-level as opposed to the physical-level) search for specific defective genes and in forensic DNA fingerprinting since the number of repeats is highly variable between individuals, a small set of VNTRs can uniquely characterize individuals in a population. Tandem repeats consisting of a very short substring, often only two characters long, are called microsatellites and have become the preferred marker in many genetic mapping efforts.

The existence of highly repetitive DNA, such as Alu, makes certain kinds of large-scale DNA sequencing more difficult, but their presence can also facilitate certain cloning, mapping and searching efforts. For example, one general approach to low-resolution physical mapping or to find genes causing disease involves inserting pieces of human DNA that may contain a feature of interest into the hamster genome. Each resulting hybrid-hamster cell incorporates different parts of the human DNA and these hybrid cells can be tested to identify a specific cell containing the human feature of interest. In this cell, one then has to identify the parts of the hamster's hybrid genome that are human. To find the distinguishing feature between human and hamster DBA, the better approach is to exploit the Alu sequences. Alu sequences specific to human DNA are so common in the human genome that most fragments of human DNA longer than 20,000 bases will contain an Alu sequence. Therefore, the fragments of human DNA in the hybrid can be identified by probing the hybrid for fragments of Alu.

Finding all maximal repetitive structure

Before entering into the algorithm details, here are some important definitions.

A maximal pair or a maximal repeated pair in a string S is a pair of identical substrings \alpha and \beta in S such that the character to the immediate left (right) of \alpha is different from the character to the immediate left (right) of \beta. That is, extending \alpha and \beta in either direction would destroy the equality of the two strings.

A maximal pair is represented by the triple (p_1, p_2,n'), where p_1 and p_2 give the starting positions of the two substrings and n' gives their length. For a string S, R(S) represents the set of all triples describing maximal pairs in S.

For example, consider the string S = xabcykkkzabcqabcyrxar, where there are three occurrences of the substring abc. The first and second occurrences of abc form a maximal pair (2,10,3), and the second and third occurrences also form a maximal pair (10, 14, 3), whereas the first and third occurrences of abc do not form a maximal pair. The two occurrences of the string abcy also form a maximal pair(2, 14, 4).

A maximal repeat \alpha is a substring of S that occurs in a maximal pair in S. That is, \alpha is a maximal repeat in S if there is a triple (p_1,p_2, |\alpha |) \in R(S) and \alpha occurs in S starting at position p_1 and p_2. Let R'(S) denote the set of maximal repeats in S. For example, both strings abc and abcy are maximal repeats.

A supermaximal repeat is a maximal repeat that never occurs as a substring of any other maximal repeat.

All the maximal repeats in S can be found in O(n) time, and a tree representation for them can be constructed from suffix tree \Gamma in O(n) time as well. Similarly, all supermaximal repeats can be identified in linear time. It is proved that all the maximal pairs can be found in O(n+k) time, where k is the number of maximal pairs. If there are only k_m maximal pairs of length above a given threshold, then all those can be found in O(n=k_m) time.

Circular string linearization

We discussed a variety of problems on linear strings but there are some occasions where we need to think about the existence of circular strings in DNA molecules. Bacterial and mitochondrial DNA is typically circular. Consequently software tools for handling circular strings may someday be of use in those organisms. Discovering a linear-time algorithm for finding all the occurrences of pattern P in circular string T is a goal.

A circular string S of length n is a string in which character n is considered to precede character 1.

The characters of S are initially numbered sequentially from 1 to n starting at an arbitrary point in S.

Given an ordering of the characters in the alphabet, a string S_1 is lexically or lexicographically smaller than a string S_2 if S_1 would appear before S_2 in a normal dictionary ordering of the two strings. That is, starting from the left end of the two strings, if i is the first position where the two strings differ, then S_1 is lexically less than S_2 if and only if S_1(i) precedes S_2(i) in the ordering of the alphabet used in those strings.

The circular string linearization problem for a circular string S of n characters is the following: Choose a place to cut S so that the resulting linear string is the lexically smallest of all the n possible linear strings created by cutting S. This problem can be solved busing suffix trees in O(n) linear time.

Suffix arrays

Manber and Myers proposed a new data structure called suffix array that is very space efficient and yet can be used to solve the exact matching or the substring problems almost as efficiently as with a suffix tree.

Given an m-character string T, a suffix array for T, called Pos, is an array of the integers in the range 1 to m, specifying the lexicographic order of the m suffixes of string T. That is, the suffix starting at position Pos(i) of T is lexically smaller than suffix Pos(i+1).

As an example of a suffix array, if T is mississippi, then the suffix array Pos is 11, 8, 5, 2, 1, 10. 9, 7, 4, 6, 3.

A suffix array for T can be obtained from the suffix tree \Gamma for T by performing a lexical depth-first traversal of \Gamma. Once the suffix array is built, the suffix tree is discarded.

An edge (v,u) is lexically less than an edge (v,w) if and only if the first character on the (v.u) edge is lexically less than the first character on (v,w).

The suffix array Pos for a string T of length m can be constructed in (m) time.

The suffix array for string T allows a very simple algorithm to find all occurrences of any pattern P in T. The key is that if P occurs in T then all the locations of those occurrences will be grouped consecutively in Pos. For example, P = issi occurs in mississippi starting at locations 2 and 5, which are indeed adjacent in Pos. So to search for occurrences of P in T simply do binary search over the suffix array.

By using binary search an array Pos, all the occurrences of P in T can be found in O(n log m) time.

There are a couple of enhancements for this method. As the binary search proceeds, let L and R denote the left and right boundaries of the current search interval. At the start, L equals 1 and R equals m. Then in each iteration of the binary searc, a query is made at location M = \lceil(R = L)\rceil of Pos. The search algorithm keeps track of the longest prefixes of Pos(L) and Pos(R) that match a prefix of P. Let l and r denote those two prefix lengths, respectively, and let mlr = min(l,r). Myers and Manber report that use of mlr alone allows the search to run as fast in practice as the worst-case time is O(n + log m).

Lcp(i,j) is the length of the longest common prefix of the suffixes specified in positions i and j of Pos. That is, Lcp(i,j) is the length of the longest prefix common to suffix Pos(i) and suffix Pos(j). The term Lcp stands for longest common prefix. For example, when T = mississippi, suffix Pos(3) is issippi, suffix Pos(4) is ississippi and Lcp(3,4) is four.

Using the Lcp values, the search algorithm does at most O(n + log m) comparisons and runs in that time.

A large part of the motivation for suffix arrays comes from problems that arise in using suffix trees when the underlying alphabet is large. String and substring matching problems where the alphabet contains numbers, and where P and T are large, arise in computational problems in molecular biology. One example is the map matching problem. A restriction enzyme map for a single enzyme specifies the locations in a DNA string where copies of a certain substring (a restriction enzyme recognizition site) occurs. Each such site may be separated from the next one by many thousands of bases. Hence, the restriction enzyme map for that single enzyme is represented as a string consisting of a sequence of integers specifying the distances between successive enzyme sites. Considered as a string, each integer is a character of a huge underlying alphabet. More generally, a map may display the sites of the many different patterns of interest whether or not they are restriction enzyme sites, so the string (map) consists of characters from a finite alphabet (representing the known patterns of interest) alternating with integers giving the distances between such sites. It often happens that a DNA substring is obtained and studied without knowing where that DNA is located in the genome or whether that substring has been previously researched. If both the new and the previously studied DNA are fully sequened and put in a database, then the issue of previous work or locations would be solved by exact string matching. But most DNA substrings that are studied are not fully sequenced - maps are easier and cheaper to obtain than sequences. Consequently, the following matching problem on maps arises and translates to an matching problem on strings with large alphabets:

Given an established (restriction enzyme) map for a large DNA string and a map from a smaller string, determine if the smaller string is a substring of the larger one. Since each map is represented as an alternating string of characters and integers, the underlying alphabet is huge. This provides one motivation for using suffix arrays for matching or substring searching in place of suffix trees.

Thus suffix trees , generalised suffix trees and suffix arrays are bound to play a very important roles as central data structures in various genome-level projects as we see below.

Suffix trees in genome-scale projects

The Michigan state university and the university of Minnesota took the project called Arabidopsis thaliana genome project. It concerns with creating an EST map of the Arabidopsis genome. Firstly, each newly sequenced fragment is checked to catch any contamination by known vector sequences. The vector sequences are kept in a generalised suffix tree. Then each newly sequenced fragment is checked against fragments already sequenced to find duplicate sequences or regions of high similarity. The fragment sequences are kept in an expanding generalized suffix tree for this purpose. Since the project will sequence about 36,000 fragments, each of length about 400 bases, the efficiency of the searches for duplicates and for contamination is important. Finally suffix trees are used in the search for biologically significant patterns in the obtained Arabidopsis sequences. Patterns of interest are often represented as regular expressions, and generalised suffix trees are used to accelerate regular expression pattern matching, where a small number of errors in a match are allowed.

Suffix trees are also central datastructure in genome-scale analysis of Saccharomyces cerevisiae, done at the Max-Plank Institute. In this project, highly optimized suffix trees called hashed position trees are used to solve problems of clustering sequence data into evolutionary related protein families, structure prediction, and fragment assembly.

Borrelia burgdorferi is the bacterium causing Lyme disease. Its genome is about one million bases long, and is currently being sequenced at the Brookhaven National Laboratory using a directed sequencing approach to fill in gaps after an initial shotgun sequencing phase. Chen and Skiena developed methods based on suffix trees and suffix arrays to solve the fragment assembly problem for this project. In fragment assembly, one major bottleneck is overlap detection, which requires solving a variant of the suffix-prefix matching problem allowing some errors for all pairs of strings in a large set. The Borrelia work consisted of 4612 fragments (strings) totalling 2,032,740 bases. Using suffix trees and suffix arrays, the needed overlaps were computed in about fifteen minutes. To compare the speed and accuracy of the suffix tree methods to pure dynamic programming methods for overlap detection, Chen and Skiena closely examined cosmid-sized data. The test established that the suffix tree approach gives 1000 times speedup over the slightly more accurate dynamic programming approach, finding 99% of the of the significant overlaps found by dynamic programming.

Guan and Uberbache Guan have designed a fast linear look-up algorithm for detecting repetitive DNA sequences. It utilizes indices calculated from non-contiguous and overlapping k-tuples so that tandem repeats with insertions and deletions can be recognized. It has been made available through the GRAIL and GENQUEST Internet servers.

Compression of biological sequences

Large text or graphics files are often compressed in order to save storage space or to speed up transmission when the file is shipped. Most operating systems have compression utilities, and some file transfer programs automatically compress, ship and uncompress the file without user intervention. There is a popular compression method by Ziv-Lempel that uses suffix trees.

Recently, several molecular biology and computer science research groups have started to use this method to compress DNA strings not only for the purpose of obtaining efficient storage, but also to compute a measure of the complexity or information content of the strings. The basic idea is that substrings of greatest biological significance should be more compressable than substrings that are essentially random. One expects that random strings will have too little structure to allow high compression, since high compression is based on finding repetitive segments in the string, one may be able to find strings that have a definite biological function.

Compression has also been used to study the relatedness of two strings S_1 and S_2 of DNA. Essentially, the idea is to build a suffix tree for S_1 and then compress string S_2 using only the suffix tree for S_1. That compression of S_2 takes advantage of substrings in S_2 that appear in S_1, but does not take advantage of repeated substrings in S_2 alone. Similarly, S_1 can be compressed using only a suffix tree for S_2. These compressions reflect and estimate the relatedness of S_1 and S_2. If the two strings are highly related, then both computations should significantly compress the string at hand.

Another biological use of compression algorithms involves estimating the entropy of short strings in order to discriminate between exons and introns in eukaryotic DNA. Farach et al. report that the average compression of exons, and hence compression by itself does not distinguish exons from introns.

String Edits, Sequence Alignments and Dynamic Programming

The most classic inexact matching problem solved by dynamic programming is the edit distance problem, which plays a very important role in studying of evolutionary, structural or functional characteristics of biological strings. It focuses on transforming one string into the other by a series of edit operations on individual characters. The permitted edit operations are insertions of a character into the first string, the deletion of a character from the first string, or the substitution or replacement of a character in the first string with a character in the second string. For example, the string ``vintner'' can be edited to become ``writers''as follows:

v intner
wri t ers

Thus the edit distance between two strings is defined as the minimum number of edit operations needed to transform the first string into the second.

Sequence alignment is ubiquitous in molecular biology. Newly sequenced DNA fragments or putative protein sequences deduced from open reading frames are routinely aligned with databases of known sequences. The aim is to infer biological homology from sequence similarity. When two or more homologous sequences are aligned, the details of base or residue changes and of inserted or deleted regions can yield information on higher order structure. Multiple alignments of related structural RNA have been used to infer RNA secondary structure by analysing pairs of positions that vary while conserving the ability to form a base-pair. Multiple alignments of protein sequences can be used to strengthen secondary structure predictions. An alignment of two protein sequences can be used as the first step in three dimensional modeling by homology if the structure of one protein is known.

Alignment algorithms are designed to optimize a mathematical score. Any two sequences can be aligned. The first question is whether or not an optimal alignement is significant. What are the factors that decides about the significance. In the absence of reliable experimental data, statical model is the best answer for settling the question whether the alignment is mathematically significant or not.

If the alignment score of the original pair of sequences is significantly better than the expected score, one accepts the computed alignment as significant. It has been realised that the choice of match weights for comparing pairs of bases or residues is important.

Alignment can be subdivided into two types.

A global alignment of two strings S_1 and S_2 is obtained by first inserting chosen spaces or dashes, either into or at the ends of S_1 and S_2 and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.

Dynamic programming technique turned out to be a very good algorithm for calculating the edit distance of two strings. The dynamic programming table for computing the edit distance between a string of length n and a string of length m can be filled in with O(nm) work. Hence, using dynamic programming, the edit distance D(n,m) can be computed in O(nm) time.

An easy, yet crucial, generalization of edit distance is to allow an arbitrary weight or cost or score to be associated with every edit operation, as well as with a match. Thus an operation-weight alignment is one where each mismatch costs r, each match costs e, and each space costs d. For example, if each mismatch has a weight of 2, each space has a weight of 4 and each match a weight of 1, then the alignment

w r i t \_ e r s
v i n t n e r \_

has a total weight of 17 and is an optimal alignment. The operation-weight edit distance problem can be solved in O(mn) time by the dynamic programming with a small modification on the edit distance. This can also be represented and solved as a shortest path problem on a weighted edit graph.

Another vital generalization of edit distance is to allow the weight or score of a substitution to depend on exactly which character in the alphabet is being removed and which is being added. For example, it may be more costly to replace an A with a T than with a G. Similarly, one may want the weight of a deletion or insertion to depend on exactly which a character in the alphabet is being deleted or inserted. This form of edit distance is called as the alphabet-weight edit distance.

When comparing proteins, the edit distance almost means the alphabet-weight edit distance over the alphabet of amino acids. There are a lot of research papers that talk about what scores are good and optimal for operations on amino acid characters and how the optimal values for the scores are identified. The dominant amino acid scoring schemes are now the PAM of Dayhoff and the newer BLOSUM scoring matrices of Henikoffs.

When comparing DNA strings, unweighted or operation-weight edit distance is more often computed. For example, the popular database searching program, BLAST, scores identities as +5 and mismatches as -4.

Given two strings S_1 and S_2, a common subsequence is a subsequence that appears both in S_1 and S_2. The longest common subsequence problem is to find a longest common subsequence (LCS) of S_1 and S_2. The longest common subsequence can be found in O(n \times m)

Local Alignment: finding substrings of high similarity

In many applications, two strings may not be highly similar in their entirety but may contain regions that are highly similar. The task is to find and extract a pair of regions, one from each of the two given strings, that exhibit high similarity. Thus the local alignment problem can be defines as follows:

Given two strings S_1 amd S_2, find substrings \alpha and \beta of S_1 and S_2, respectively, whose similarity (optimal global alignment) is maximum over all pairs of substrings from S_1 and S_2. For example,

consider the strings S_1 = pqraxabcstvq and S_2 = xyaxbacsll. If one gives each match a value of 2, each mismatch a value of -2, each space a value of -1, then the two substrings \alpha = axabcs and \beta = axbacs of S_1 and S_2, respectively, have the following optimal(global) alignment

a x a b \_ c s
a x \_ b a c s

which has a value of 8.

The local alignment, where matches contribute positively and mismatches and spaces contribute negatively, is more likely to find more meaningful regions of high similarity.

Global alignment of protein sequences is often meaningful when the two strings are members of the same protein family. For example, the protein cytochrome c has almost the same length in most organisms that produce it and one expects to see a relationship between two cytochromes from any two different species over the entire length of the two strings. In this case, global alignment is meaningful. When trying to deduce evolutionary history by examining protein sequence similarities and differences, one usually compares proteins in the same sequence family and so global alignment is typically meaningful and effective in those applications.

However, in many applications, local alignment is far more meaningful than global alignment. This is particularly true when long stretches of anonymous DNA are compared, since only some internal sections of those strings may be related. When comparing protein sequences, local alignment is also critical because proteins from very different families are often made up of the same structural or functional subunits (motifs or domains), and local alignment is appropriate in searching for these unknown subunits.

Refining core string edits and alignments

The dynamic programming tables use O(nm) space when the input strings have length n and m. Hirschberg developed an elegant and practical space-reduction method that works for many dynamic programming problems. For several string alignment problems, this method reduces the required space from O(nm) to O(n) (for n < m) while only doubling the worst-case time bound.

{Homology Searching

The central goals of the human and model organism genome projects are to completely map and sequence the genes of these organisms. As work progresses, identification of the biochemical function of newly sequenced genes becomes a major challenge. This process of knowing the function ensures is being accomplished by comparing the sequence of the newly identified gene with all the sequences in the database. The sequence comparison occurs at many levels, from the shotgun operation to phylogeny treatment apart from the database scanning. An unusual pattern in a nucleic acid or protein sequence or a region of strong similarity shared by two or more sequences may have biological significance. It is therefore desirable to know whether such a pattern can have arisen simply by chance. To identify interesting sequence patterns, appropriate scoring values can be assigned to the individual residues of a single sequence or to sets of residues when several sequences are compared.

Identification of gene function using traditional biochemical methods can be extremely slow and laborious task that can take years of effort even for a single gene. Also, as the rate of DNA sequencing increases manifold, analysis by sequence similarity search will need to become much more efficient in terms of sensitivity, automation potential and consistency in annotation. Depending both on the volume of data to be processed and the accuracy of the comparisons, the computation can be time consuming and even a performance bottleneck. Usually, this constraint is being met by different techniques: software heuristics, parallelization on massively parallel computers, distribution on a network of workstations, or dedicated hardware.

These can greatly facilitate the identification of gene function. When a gene is isolated and sequenced, it can be matched against one or more of the publicly available sequence databases, such as GenBank. If a similar gene of known function can be identified in such a data base search, then the function of the newly sequenced gene can be surmised by analogy. The biochemical functions of a growing number of genes, including a number of inherited human disease genes are being determined in this way. Currently, high-speed heuristic methods, such as the hash-coding (k-tuple) algorithm employed by FASTA and the approximate word match algorithm employed by BLAST are the most commonly used sequence data base search programs.

Proteins that share a common ancestor are called homologous. Sequence comparison is most informative when it detects homologous proteins. Homologous proteins always share a common three-dimensional folding structure and they often share common active sites or binding domains. Frequently, homologous proteins share common functions, but sometimes they do not. The biological properties of a protein based on sequence data alone stems almost exclusively from properties conserved through evolutionary time. Preditions of common properties for non-homologous proteins - similarities that have arisen by convergence - are much less reliable. \subsection{Clustering of protein sequences} The genome sequencing projects and the worldwide determination of individual sequences are producing a vast number of protein sequences. To introduce some order into these data, it is often desirable to group the sequences in a database, in a genome or in a set of genomes, into protein families. This provides evolutionary, functional and structural information on the sequences. The first step in creating protein families is to match all the sequences in the database to each other. The most sensitive single-sequence matching procedure is the Smith-Waterman algorithm. Using the match scores produced by this algorithm, protein sequences can be placed in families by single-linkage clustering: all sequences connected by scores that are believed to indicate a significant relationship are clustered together.

Protein folding problem

The protein folding problem has been one of the the grand challenges in computational molecular biology. The problem is to predict the native three-dimensional structure of a protein from its amino acid sequence. It is widely believed that the amino acid sequence contains all the necessary information to make up the correct three-dimensional structure, since the protein folding is apparently thermodynamically determined; namely, given a proper environment, a protein would fold up spontaneously.

While this principle is well established in selected proteins under in vitro experimental conditions, protein folding in vivo is a more complex and dynamic process involving a number of other molecules. The environment has to be considered as a collection of various interactions with molecules rather than a smooth thermodynamic environment. The algorithms developed for secondary structure prediction help to solve this problem but the success rate will be limited as long as only the short-range interactions are considered. Similarly, however good the algorithms developed for the three-dimensional structure prediction are, the success rate will be limited as long as only the information of a single molecule is examined.

Organism reconstruction problem

In the era of whole-genome sequencing, we are faced with another grand challenge problem, which may be called the organism reconstruction problem. Given a complete genome sequence, the problem is to predict computationally the development of the adult from a single cell and its continual function as a biological organism. Here again, a traditional view is that the genome is a blueprint of life containing all the necessary information that would make up an organism. A clone can be made by replacing the nucleus, which is the localized area containing all genetic information. Thus, this might be called Dolly's cloning principle.

Maps, Mapping, Sequencing, and Superstrings

In this section, we are to see a number of theoretical and practical issues in creating and using genome maps and in large-scale genomic DNA sequencing.

The ultimate goal of the Human Genome Project (HGP) is to sequence the entire human genome, producing the complete DNA transcript - the full three billion long nucleotide string. The initial goal of the HGP and other genome projects is to produce maps of the genome showing the locations of landmarks such as STSs and ESTs at regular intervals over the genome. Genome maps are critical in hunting for specific genes of interest; they are used in the first stages of most large-scale DNA sequencing projects and they are useful for the further physical examination of DNA that is required for other parts of the genome project.

Physical and genetic maps

There are two kinds of genome maps: genetic and physical maps. The term 'physical mapping' refers to establishing the true physical location of landmarks (markers) or of features of interest often genes or known patterns in the genome. The metric used is based on observable or countable physical features, usually the number of nucleotides between two points. The earliest physical maps consisted of the banding patterns that appear on chromosomes when various stains are applied. These bands are reproducible and visible under a microscope and so provide a low-resolution physical map. This map has been used to locate the rough position of genes whose deletion or expansion causes disease, when those chromosome changes are large enough to change the observable banding patterns.

In current high-resolution physical mapping, the goal is to locate features of interest in terms of their actual base-pair location on a chromosome. That is, locations are measured and described in terms of the number of nucleotides (base pairs) between the feature and some other established reference point. The sequence tagged site (STS) is one such feature of particular importance, and one of the initial goals of the HGP was to establish a map of STSs spaced roughly every 100,000 bases throughout the entire genome. Compared to genetic mapping, high-resolution physical mapping on a larger scale is relatively new, is very dependent on recombinant DNA technology.

In contrast, genetic mapping establishes the relative order of features of interest on a chromosome or establishes that two features are likely on different chromosomes. Genetic mapping uses a unit called a centimorgan that is only monotonic with physical distance. Genetic mapping is based on observing how frequently the states (alleles) of two loci on the DNA (possibly genes) are inherited together or are changed by an odd number of cross-over or recombination events in meiosis. The higher the frequency of coinheritance, the closer the two loci are deduced to be. To establish a linkage map, one therefore wants to find a set of uniformly dispersed loci in the DNA where the alleles of those loci are highly variable (or polymorphic). Originally, the loci were genes responsible for observable features at the organism or phenotypic level, but in the past decade, features observed directly in the DNA have become the critical markers. The first markers of this type were restriction-fragment-length polymorphisms, but increasingly microsatelllites have become the markers of choice.

Thus, the idea is to use genetic map to analyze patterns of disease inheritance in several generations of an affected family. By seeing which loci have alleles that are frequently coinherited with the disease, one can pin down the disease-causing gene between two locations on the genetic map. Then the physical maps of that region allow one to pull out substrings of DNA for closer examination and/or sequencing. Further study is done on those substrings that have the look and feel of a gene for example, an open reading frame, a large concentration of Gs and Cs, or a codon usage that is typical of genes in that organism. The goal is to find a small substring of DNA (often only a single nucleotide) from a putative gene that exhibits a systematic difference between disease-afflicted and unafflicted individuals. The gene containing that substring is then the prime candidate for the disease-causing gene.

Linguistic approaches to biological sequences

Biologists have long made use of linguistic metaphors in describing and naming cellular processes involving nucleic acid and protein sequences\cite{Searls, Yokomori}. In the recent past has the long-standing metaphor of DNA as language been rigorously examined. This metaphor arose early in molecular biology,when nucleic acids were recognized as strings of nucleotide bases comprising the famous four-letter alphabet.

The central dogma of molecular biology is that the fundamental two-step process that first transcribes a subsequence of DNA into RNA, and then translates successive triplets of RNA bases to amino acids according to the mapping called the genetic code. Later, it was discovered that the RNAs that are translated, called messenger or mRNAs, are further processed in higher organisms so as to splice out intervening untranslated regions, called introns, leaving only the exons of the ultimate transcript. These biological transformations can be seen as analogous at a number of levels to mechanisms of processing other kinds of languages such as natural languages and computer languages particularly.

Formal language theory and biological sequences

Noam Chomsky formulated a hierarchy of formal languages. This hierarchy classifies the linguistic complexity of languages (viewed purely as sets of strings) and relates them to species of automata required to recognize or generate the languages. Similarly, each level of the hierarchy corresponds to a particular type of grammar, or rule-based system for formally specifying languages.

The process of transcription and translation from strings of one kind to strings of a different kind by processive cellular machinery suggested to some the behavior of certain kinds of finite state automata (FSAs). It has been exhibited that FSAs have the capability of modeling the processive processes implied by the central dogma. The FSA, according to Chomsky, occupies the lowest level in the hierarchy. There are certain other languages such as palindromes, that can not be accepted or generated by FSA but can be done by context-free languages, second in his hierarchy. Neverthless, there are yet other languages that elude even the context-free formalism. These languages can be captured with context-sensitive grammar.

Thus the following context-free grammar rule

S \rightarrow gSc S \rightarrow cSg, S \rightarrow aSt S \rightarrow tSa S \rightarrow \emptyset constitutes a variety of biological palindromes.

Also, pseudoknots are forms of secondary structure that require context-sensitive expression in the general case, as do repeated sequences that are common in DNA. Thus the hierarchy of formal languages can capture the resulting biological languages after a series of editing operations, mutations, and manipulations on initial DNA sequences.

In the study of theoretical analysis on DNA sequences in molecular biology, Tom Head introduced the notion of splicing system as a mathematical model of restriction enzyme digestion and subsequent religation in the recombination of DNA molecules. He showed an interesting relationship between splicing languages generated by splicing systems and regular languages. In particular, one of the most significant results for our purpose is the equivalence between a certain type of splicing languages, called persistent splicing languages, and a subclass of regular languages called locally testable languages.This result is crucially important as this can bridge the gap between mathematical analysis in molecule biology and formal language theory in computer science.

Thus nowadays, the notion of a locally testability in string sequences has been widely recognized to be of great importance in computer science. On the other hand, in molecule genetics that a certain region of DNA sequence often bears a similar kind of locality for example, the existence of specific DNA sequences of fixed lengths such as TATA box and CCAAT box in eukaryotic promoter regions.

possibly very far away from the original site, and then modified. Other genome-level mutations of importance include inversions, where a segment of DNA is reversed; translocations, where the ends of two chromosomes (telemeres) are exchanged; and transpositions, where two adjacent segments of DNA exchange places. These kinds of large-scale mutations are collectively called genome rearrangements.

Studying genome rearrangements may allow certain evolutionary deductions that are not possible from gene-level comparisons. In many organisms, genome-level mutations occur more slowly than gene-level mutations. This allows one to compare molecular data from species that diverged a very long time ago. For example, to deduce the divergence order of three species, one traditionally does three pairwise similarity computations using a protein common to the three species. Then, provided with the similarity values are consistent, one reasons that the pair with the highest similarity diverged from a common ancestor, which had earlier diverged from the third species. But if divergence times are far enough in the past, the accumulated gene mutations may be so great that all the three pairwise similarity values may be essentially indistinguishable and unusable to reliably determine a fine-grain history of divergence. In contrast, more slowly occurring genome rearrangements may still allow detailed historical deductions. Even though a gene has mutated at the DNA level, one can still recognize it as the same gene in two different species from its protein product and/or function. This allows one to compare the order that genes appear on a chromosome in the different species and to use differing gene order to deduce evolutionary history far back in time.

Summary

Biologists have developed several means to characterize the similarity between two DNA sequences. One intuitively appealing measure is edit distance, which is defined to be the minimum cost of transforming one sequence into another through a series of the following operation: deletion of a nucleotide, insertion of a nucleotide, and substitution of one nucleotide for another. Each operation has an associated cost, which is a function of the nucleotides involved in the operation. The cost of the transformation is then the sum of the costs of the individual operations. Sequence comparison problem can be referred to as the problem of computing the edit distance between two sequences.

Back to My Home Page