lec05.448

Upload: mukul-bhalla

Post on 03-Jun-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/12/2019 lec05.448

    1/8

    . .CSC 448 Bioinformatics Algorithms Alexander Dekhtyar. .

    Repeated Sequence and Palindrome Detection. . .

    Problem Specifications

    Following string matching, two more string analysis problems occur commonlyin bioinformatics applications: repeated sequence detection and palindrome de-tection. Both problems are introduced below.

    Maximal repeated pairs. A maximal pair of repeated strings or a maximalrepeated pair in a string S = s1. . . sn, is a pair of identical substrings P1 =si. . . si+m and P2 = sj. . . sj+m, P1 = P2, which start at different positions inS(i.e., i =j ), such that si1=sj1 and si+m+1 =sj+m+1.

    Amaximal repeated paircan be represented as a triple i,j,mwhere i and jare starting positions of the substringsP1 andP2 and m is the length ofP1 andP2. Given a string S, the set of all maximal repeated pairs is denoted R(S).

    Palindromes. A stringP = p1. . . pk of even length k is called a palindromeif p1 = pk, p2 = pk1, . . .pk/2 = p1+k/2. A string of odd length P = p1. . . pkis apalindrome ifp1. . . p(k1)/2p(k1)/2+1. . . pk (i.e., an even-length string con-structed out of P by taking out the mid-point character) is an even-lengthpalindrome.

    Meaningful palindromes exist in all human languages. Examples of palin-dromes in English are"dud","madam","never odd or even","some men interpretnine memos", "dont nod", "may a moody baby doom a yam" and "no, itnever propagates if I set a gap or prevention".

    DNA palindromes. In a DNA, a palindrome definition is somewhat dif-ferent. A complimented DNA palindrome is a string S = s1. . . sm in the{A , T, C , G}alphabet with thecomplimentrelation defined ascompliment(A) =T; compliment(T) = A; compliment(C) = G; compliment(G) = C, where:

    ifm is even: s1 = sm, s2 = sm1, . . .pm/2 = p1+m/2.

    ifm is odd: s1 = sm, s2 = sm1, . . .p(m1)/2= p1+(m1)/2.

    Complimented palindromes play an important role in DNA: such sequencesappear often in various regulatory DNA sequences. Also, complimented palin-dromes may form hairpin structures on a single DNA strand: nucleotides on

    1

  • 8/12/2019 lec05.448

    2/8

    each strand, rather than binding to the complimentary nucleotide on the oppo-site strand bind to the complimentary nucleotide of the complimentary palin-drome.

    Often, palindromes in DNA come with gaps. A gapped complimentary DNApalindrome is a string S = P1QP2, such that P1P2 is a complimentary palin-drome, andQ = q1. . . q k, and q1 =compliment(qk).

    A maximal palindrome substring in string S = s1. . . sn, is a string P =si. . . sj , such that Pis a palindrome, and si1 =sj+1.

    Amaximal (gapped) complimentary DNA palindrome substringin stringS=s1. . . sn from the {A , T, C , G} alphabet is a string P =si. . . sj, such that P isa (gapped) complimentary DNA palindrome, andsi1 =sj+1.

    Maximal repeat detection problem. Given a stringS=s1. . . sn find allstrings Pthat are maximal repeated strings in S.

    Example. Consider a stringS=ATTGATTCATTC. This string has two maximalrepeated strings: ATT and ATTC. ATT is represented by two triples: 1, 5, 3and1, 9, 3. ATTC is represented by a single triple 5, 9, 4. Note, that according toour definition of a maximal repeat, 5, 9, 3 does not form a maximal repeated

    sequence.Note also, that despite the fact that ATT is a substring ofATTC, the output of

    an algorithm solving the maximal repeat detection problem must containboth.

    Maximal palindrome detection problem. Given a string S = s1. . . sn,find all maximal palindromes in it.

    Maximal DNA palindrome detection problem. Given a string S =s1. . . snin the{A , T, C , G}alphabet,find allmaximal complimentary DNApalindromes in it.

    Efficient Algorithm for Maximal Repeated String

    Detection

    We use suffix trees to construct an efficient algorithm for determining allmaximal repeated strings. The algorithm is based on the following observation:

    Lemma (Maximal repeats). Let T(S) be a suffix tree of a string S. LetPbe a maximal repeated string in S. Then there exists an internal node v inT(S) whose path label is exactly P.

    The corollary to this lemma is useful for evaluation of our algorithm:

    Theorem. A stringSof lengthn can have no more than n maximal repeats.

    This is so, because there are no more than n internal nodes in T(S).

    2

  • 8/12/2019 lec05.448

    3/8

  • 8/12/2019 lec05.448

    4/8

  • 8/12/2019 lec05.448

    5/8

    1:4

    T A T T

    A T A T +T A T T

    r

    A

    A

    T

    T$

    $

    T

    $

    AT

    $ $

    T

    2:1

    1:11 4

    r

    A

    A

    T

    T$

    $

    T

    A

    T$

    $

    A T A T

    3 2 1:3

    1:2

    1:4

    rAT

    T$

    AT$

    $

    T

    $

    T

    2:2

    1:1

    1:3

    $2:3 A

    T$

    $

    T

    1:22:1

    2:4

    A T A T +

    Figure 2: Transforming a suffix tree into a generalized suffix tree.

    Example. Figure 2 shows the construction of the generalized suffix tree for apair of strings S1 = ATAT and S2 = TATT. The top left tree is T(S1$). On thetop right tree, we performed insertion of the first suffix from S2: TATT. (Arrowindicates new addition to the tree). The bottom tree is a full generalized suffixtree for the pair of strings S1 and S2.

    Lowest Common Ancestor in Trees

    Definition. In a treeT, thelowest common ancestor (lca)of two nodes x andy is the deepest node z that is an ancestor to both x and y .

    Thelcadefinition is illustrated in Figure 3.

    Nave algorithm. Start from x and y, trace their ancestry in parallel untilyou hit a node thats the same.

    Efficient algorithm. With some special preparation, the lca between twonodes in a tree can be found in constant time.

    Forcomplete binary trees, this is a matter of clever marking of the nodes withbinary codes, and a bit-wise XOR operation between the codes of the two nodes.

    For arbitrary trees, a mappingIfrom the nodes of the tree to the nodes of acomplete binary treecan be developed, with the property thatI(lca(u),lca(v)) =lca(I(u), I(v)).

    Longest Common Extension

    Problem definition. Given two strings S1 = s1. . . sm, S2 = t1. . . tn, andtwo numbers i, j, the longest common extension of S1 at i and S2 at j is a

    5

  • 8/12/2019 lec05.448

    6/8

    z lca

    xy

    root

    Figure 3: Illustrating the definition of lowest common ancestor: node z is thelcafor nodes x and y .

    stringP =p1. . . pk, such that si. . . si+k1 = tj. . . tj+k1 = P, butsi+k =tj+k(or either i +k1 = m or j +k1 = n).

    Efficient algorithm. The longest common extension of two strings at twopositions can be computed in constant time with linear pre-processing using thefollowing algorithm:

    1. Create a generalized suffix tree T(S1, S2) and process it to allow lcaqueries.

    2. Find nodes (leaves)viandvj in T(S1, S2) representing the suffixessi. . . smand tj. . . tn.

    3. Find u = lca(vi, vj). Return the path label for u.

    Palindrome detection

    Idea. LetS= s1. . . sn. Consider the stringS =sn. . . s1, i.e.,S

    =reverse(S).The idea behind the linear-time algorithm for even-length palindrome de-tection is based on the following observation:

    LetScontain an even-length palindrome centered immediately aftercharacter sq. Let the radius of this pa lindrome be k. Then the kcharacters starting at position nq+ 1 in string S are identicalto sq. . . sq+k.

    Example. This is illustrated in figure 4. We consider a stringS=ATCAACTGAT.It has a palindrome T CAACT centered right after position q= 4. The reverseof S, S =TAGTCAACTA. Reverses preserve the palindrome. The three-letterextension ofSat positionq+1 = 5 isACT. Similarly, the three-letter extensionat position n q+ 1 = 7 of S is ACT - the reverse of the first half of thepalindrome in S.

    6

  • 8/12/2019 lec05.448

    7/8

    A T C A A C T G A TS:1 2 3 4 5 6 7 8 9 10

    S: T A G T C A A C T A

    q=4

    nq+1 = 7

    Figure 4: Illustrating the key idea of the palindrome detection algorithm foreven-length palindromes. A palindrome can be found as the maximal commonextension in Sand S =reverse(S) at positions q+ 1 and nq+ 1.

    S:

    S:

    A T C A T A C T G A T

    1 2 3 4 5 6 7 8 9 10 11

    q=5

    T A G T C A T A C T A

    nq+1 = 7

    Figure 5: Illustrating the key idea of the palindrome detection algorithm forodd-length palindromes. A palindrome can be found as the maximal common

    extension in Sand S =reverse(S) at positions qand n q+ 1.

    S:

    S:

    1 2 3 4 5 6 7 8 9 10 11

    q=5

    nq+1 = 7

    A T C A T T G A G A T

    A T C T C A A T G A T

    Figure 6: Illustrating the key idea of the palindrome detection algorithm forcomplimentary DNA palindromes. S =reverse(compliment(S)).

    7

  • 8/12/2019 lec05.448

    8/8

    Algorithm. We propose the following algorithm for even-length palindromedetection.

    1. Given a stringS, construct S =reverse(S).

    2. Construct the generalized suffix treeT(S, S).

    3. Forq:= 1 to n1 do:

    (a) Let x= LongestCommonExtension((S: q+ 1), (S :nq+ 1)).

    (b) Ifx >0, there is a palindrome of length 2x centered right after q inS.

    Odd-length palindromes. For odd-length palindromes, we note the if thereis a palindrome in S centered on position q, then the maximal extension of(S : q) and (S : n q+ 1) is going to be equal to the central character of thepalindrome followed by the wing of the palindrome (Figure 5.

    To detect odd-length palindromes, therefore, we check forLongestCommonEx-tension((S: q), (S :nq+ 1)) and ensure that its length is greater than 1.

    Complimentary DNA palindromes. To detect complimentary DNA palin-

    dromesinstead of takingS =reverse(S), we setS= reverse(compliment(S)).Then continue as before. Figure 6 illustrates this idea.

    References

    [1] Dan Gusfield, Algorithms on Strings, Trees and Sequences: ComputerScience and Computational Biology, Cambridge University Press, 1997.

    8