TRANSCRIPT
SERGEI L KOSAKOVSKY POND [[email protected]]CSE/BIMM/BENG 181, SPRING 2010
COMBINATORIAL PATTERN MATCHING
Tuesday, May 4, 2010
OUTLINE: EXACT MATCHING
Tabulating patterns in long texts
  Short patterns (direct indexing)
  Longer patterns (hash tables)
Finding exact patterns in a text
  Brute force (run time)
  Efficient algorithms (pattern preprocessing)
    Single pattern: Knuth-Morris-Pratt
    Multiple patterns: Aho-Corasick algorithm
  Efficient algorithms (text preprocessing)
    Suffix trees
    Burrows-Wheeler Transform-based
OUTLINE: APPROXIMATE MATCHING
Algorithms for approximate pattern matching
Heuristics behind BLAST
Statistics behind BLAST
Alternatives to BLAST: BLAT, PatternHunter etc.
STRING ENCODING
It is often necessary to index strings; a convenient way to do this is to first convert strings to integers.
Given a string s of length n on alphabet A with c = |A| characters (coded 0..c−1), we can define a map code(s) to the nonnegative integers as
A = 0, C = 1, G = 2, T = 3

code(s) = s[1]·c^(n−1) + s[2]·c^(n−2) + ... + s[n−1]·c + s[n]

3-mer   Computation        Code
AGT     0·16 + 2·4 + 3     11
ATA     0·16 + 3·4 + 0     12
TGG     3·16 + 2·4 + 2     58
There are c^L different L-mers, but at most n−L+1 different L-mers in a text of length n
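As a quick sketch (in Python, with the digit values from the table above), code(s) can be evaluated with Horner's rule:

```python
# Digit values for the DNA alphabet, as in the table above
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def code(s, c=4):
    """Map a string to its integer code: s[1]*c^(n-1) + ... + s[n],
    evaluated left to right by Horner's rule."""
    value = 0
    for ch in s:
        value = value * c + DIGIT[ch]
    return value
```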
TABULATING SHORT PATTERNS
If L is small (e.g. 3 or 4), i.e. the total number of patterns is not too large and many of them are likely to be found in the input text, then we can use direct indexing to tabulate/locate strings efficiently
The distribution of short strings in genetic sequences is biologically informative, e.g.
Synonymous codons (triplets of nucleotides, 64 patterns) are often used preferentially in organisms (transcriptional selection, secondary structure, etc)
The distribution of short nucleotide k-mers (e.g. L=4, 256 patterns) can be useful for detecting horizontal (from species to species) gene transfer and gene finding
The location of short amino-acid strings (e.g. L=3, 8000 patterns) is useful for finding seeds for BLAST
SHORT PATTERN SCAN
Cost of computing the code at each position:
O(L): naive
O(1): if using the previous code to compute the current one
Data: Alphabet A, Text T, pattern length p
Result: Frequency of each pattern in text

R ← array(|A|^p)
n ← len(T)
for i := 1 to n−p+1 do
    R[code(T[i : i+p−1])] += 1
end
return R
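The scan above can be made concrete as follows (a sketch for the DNA alphabet); the code of each successive p-mer is derived from the previous one in O(1), as noted on the slide:

```python
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def tabulate(text, p, c=4):
    """Count every p-mer of `text` by direct indexing into an array of
    size c^p, updating the integer code in O(1) per position."""
    counts = [0] * (c ** p)
    if len(text) < p:
        return counts
    value = 0
    for ch in text[:p]:              # code of the first p-mer: O(p)
        value = value * c + DIGIT[ch]
    counts[value] += 1
    high = c ** (p - 1)              # weight of the outgoing character
    for i in range(p, len(text)):    # each subsequent code: O(1)
        value = (value - DIGIT[text[i - p]] * high) * c + DIGIT[text[i]]
        counts[value] += 1
    return counts
```

For example, tabulate('ACGTA', 2) records one occurrence each of AC, CG, GT and TA.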
TABULATING/LOCATING LONGER PATTERNS
Finding repeats/motifs: ATGGTCTAGGTCCTAGTGGTC
Flanking sequences in genomic rearrangements
Motifs: promoter regions, functional sites, immune targets
Cellular immunity targets in pathogens (e.g. protein 9 mers)
There are too many patterns to store in an array, and even if we could, the array would be very sparse
E.g. there are ~512,000,000,000 amino-acid 9-mers, but in an average HIV-1 proteome (~3,000 amino acids long) there are at most ~3,000 unique 9-mers
HASH TABLES
Hash tables allow us to store and retrieve (in O(1) time on average) a small subset of a large universe of records. They implement associative arrays (dictionaries) in a variety of languages (Python, Perl, etc.)
The universe (records): e.g. 512,000,000,000 amino-acid 9-mers
The storage: a hash table (array) much smaller than the size of the universe
The hash function: record → hash key
Note: because there are more possible records than array indices, this function is NOT one-to-one
A SIMPLE HASH FUNCTION
A reasonable hash function (on integer records i) is:

    i → i mod P

P is a prime number and also the natural size of the hash table
Hash keys range from 0 to P−1
If the records are uniformly distributed, so will be their hash keys
P = 101

4-mer (256 possible)   Integer code   Hash key
ACGT                   27             27  ← collision
CCCA                   148            47
TGCC                   229            27  ← collision
COLLISIONS
Collisions are frequent even for lightly loaded hash tables
load level α = (number of entries in hash table)/(table size)
The birthday paradox: what is the probability that two people out of a random group of n (<365) people share a birthday (in hash table terms, what is the probability of a collision if people=records and hash keys=birthdays)?
P(n) = 1 − (1 − 1/365)(1 − 2/365) · · · (1 − (n−1)/365)

n    α      P(n)
10   0.027  0.117
23   0.063  0.507
50   0.137  0.97
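The birthday-paradox probability above is easy to evaluate directly (a small sketch):

```python
def collision_probability(n, keys=365):
    """P(at least one collision) after hashing n records uniformly
    into `keys` slots: 1 - (1 - 1/keys)(1 - 2/keys)...(1 - (n-1)/keys)."""
    p_none = 1.0
    for i in range(n):
        p_none *= (keys - i) / keys
    return 1.0 - p_none
```

This reproduces the table: n = 23 already gives a collision probability above 1/2.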
DEALING WITH COLLISIONS
There are several strategies for dealing with collisions; the simplest one is chaining
Each hash key is associated with a linked list of all records sharing the hash key
4-mer (256 possible)   Integer code   Hash key
AAAA                   0              0
AAAC                   1              1
CGCC                   101            0

Chained buckets:
Hash key 0 → CGCC → AAAA
Hash key 1 → AAAC
Hash key 2 → ∅
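Chaining can be sketched as an array of buckets, each a list of (key, value) pairs (using P = 101 as on the slide; `hash` here is Python's built-in, which is the identity for small integers):

```python
class ChainedHashTable:
    """Hash table with chaining: colliding records share a bucket."""

    def __init__(self, size=101):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.slots[hash(key) % self.size]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default
```

Inserting the integer codes 0 ('AAAA') and 101 ('CGCC') puts both records into bucket 0, chained together, exactly as in the example above.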
HASH TABLE PERFORMANCE
Retrieving/storing a record in a hash table of size m with load factor α:
Worst case (all records have the same key): O(m)
Expected run time is O(1), assuming uniformly distributed records and hash keys:
  Record is not in the table: E_N = e^(−α) + α + O(1/m)
  Record is in the table: E_S = 1 + α/2 + O(1/m)
This is because the probability of having many collisions with the same key is quite low (even though the probability of SOME collision is high)
EXACT PATTERN MATCHING
Motivation: Searching a database for a known pattern
Goal: Find all occurrences of a pattern in a text
Input: Pattern P = p[1]…p[n] and text T = t[1]…t[m] (n ≤ m)
Output: All positions 1 ≤ i ≤ m − n + 1 such that the n-letter substring T[i : i+n−1] starting at i matches the pattern P
Desired performance: O(n+m)
BRUTE FORCE PATTERN MATCHING
Text: GGCATC; Pattern: GCAT
Data: Pattern P, Text T
Result: The list of positions in T where P occurs

n ← len(P)
m ← len(T)
for i := 1 to m−n+1 do
    if T[i : i+n−1] = P then
        output i
    end
end
A substring comparison can take from 1 to n (left-to-right) character comparisons
Text GGCATC, pattern GCAT:
i=1: GGCATC vs GCAT — mismatch after 2 comparisons
i=2: GCAT matches (4 comparisons)
i=3: CATC vs GCAT — mismatch after 1 comparison
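The brute-force scan, written out (positions are 1-based, as on the slide):

```python
def brute_force_match(pattern, text):
    """Report every 1-based position where pattern occurs in text,
    comparing the pattern left to right at each offset."""
    n, m = len(pattern), len(text)
    return [i + 1 for i in range(m - n + 1) if text[i:i + n] == pattern]
```

For the example above, brute_force_match('GCAT', 'GGCATC') reports position 2.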
BRUTE FORCE RUN TIME
Worst case: O(nm). This can be achieved, for example, by searching for P = AA...AC in text T = AA...A, because each substring comparison takes exactly n steps
Expected on random text: O(m) overall, because each substring comparison takes on average

    (1 − q^n)/(1 − q)

character comparisons, where q = 1/(alphabet size)
For n = 20 and q = 1/4 (nucleotides), a substring comparison takes on average ≈ 4/3 operations
Genetic texts are not random, so the performance may degrade
IMPROVING THE RUN TIME
The search pattern can be preprocessed in O(n) time to eliminate backtracking in the text and hence guarantee O(n+m) run time
A variety of procedures, starting with the Knuth-Morris-Pratt algorithm in 1977, take this approach. They use the observation that if a string comparison fails at pattern position i, we can shift the pattern by i − b(i) positions, where b(i) depends only on the pattern, and continue comparing at the same or the next position in the text, thus avoiding backtracking
These algorithms are popular for text editors/mutable texts, because they do not require preprocessing of the (large) text
[Figure: the pattern aligned against the text at a failed comparison, and again after the SHIFT — the position in the text never moves backward]
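A standard sketch of Knuth-Morris-Pratt: b(i) is the length of the longest proper border (prefix that is also a suffix) of the first i pattern characters, and the position in the text only ever moves forward:

```python
def kmp_search(pattern, text):
    """Knuth-Morris-Pratt: preprocess the pattern in O(n), then scan the
    text in O(m) with no backtracking. Returns 1-based match positions."""
    n = len(pattern)
    # b[i] = length of the longest proper border of pattern[:i]
    b = [0] * (n + 1)
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = b[k]
        if pattern[i] == pattern[k]:
            k += 1
        b[i + 1] = k
    hits, k = [], 0
    for j, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = b[k]                 # shift the pattern, not the text
        if ch == pattern[k]:
            k += 1
        if k == n:                   # full match ending at text index j
            hits.append(j - n + 2)   # 1-based start position
            k = b[k]
    return hits
```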
EXACT MULTIPLE PATTERN MATCHING
The problem: given a dictionary of D patterns P1, P2, ..., PD (of total length n) and a text T, report all occurrences of every pattern in the text
Arises, for instance, when one is comparing multiple patterns against a database
Assuming an efficient implementation of individual pattern matching, this problem can be solved in O(Dm+n) time by scanning the text D times
Aho and Corasick (1975) showed how this can be done in O(m+n) time
Uses the idea of a trie (from the word retrieval), or prefix tree
Intuitively, we can reduce the amount of work by exploiting repetitions in the patterns.
PREFIX TRIE
Patterns: ‘ape’, ‘as’, ‘ease’. Constructed in O(n) time, one word at a time.
Properties of a trie
Stores a set of words in a tree
Each edge is labeled with a letter
Each node labeled with a state (order of creation)
Any two edges sharing a parent node have distinct labels
Each word can be spelled by tracing a path from the root to a leaf
[Figure: the trie after each insertion]
'ape':  Root −a→ 1 −p→ 2 −e→ 3
'as':   add 1 −s→ 4
'ease': add Root −e→ 5 −a→ 6 −s→ 7 −e→ 8
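A trie can be sketched with nested dictionaries, inserted one word at a time in O(total pattern length); here a '$' entry marks a node where a pattern ends:

```python
def build_trie(patterns):
    """Build a prefix trie as nested dicts; a '$' entry marks the end
    of a stored pattern."""
    root = {}
    for word in patterns:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root
```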
SEARCHING TEXT FOR MULTIPLE PATTERNS USING A TRIE: THREADING
Suppose we want to search the text ‘appease’ for the occurrences of patterns ‘ape’, ‘as’ and ‘ease’, given their trie.
The naive way to do it is to thread (i.e. spell the word using tree edges from the root) the text starting at position i, until either:
A leaf (or specially marked terminal node) is reached (a match has been found)
Spelling cannot be completed (no match)
[Figure: threading 'appease' through the trie]
i=1: spells a−p, then fails at the second 'p' — no match
i=4: spells e−a−s−e to leaf 8 — 'ease' matches
i=5: spells a−s to node 4 — 'as' matches
But we already knew this, because ‘as’ is a part of ‘ease’! If we take advantage of this, there is no need to backtrack in the text, and the algorithm runs in O(n+m). The Aho-Corasick algorithm implements exactly this idea, using a finite state automaton that starts with the trie and adds shortcut (failure) links
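A compact sketch of the Aho-Corasick automaton: the trie plus failure links (the "shortcuts" above), computed by breadth-first search, let the text be scanned once with no backtracking:

```python
from collections import deque

def aho_corasick(patterns, text):
    """Build a trie of the patterns, add failure links by BFS, then scan
    the text once. Returns (1-based end position, pattern) pairs."""
    goto, fail, out = [{}], [0], [[]]        # state 0 is the root
    for word in patterns:                    # phase 1: the trie
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())          # phase 2: failure links (BFS)
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit matches via the link
    hits, state = [], 0                      # phase 3: scan, no backtracking
    for j, ch in enumerate(text, 1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((j, word))
    return hits
```

Searching 'appease' for 'ape', 'as' and 'ease' reports 'as' ending at position 6 and 'ease' ending at position 7, in a single pass.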
SUFFIX TREES
A trie built on every suffix of a text T (length m), with all single-child interior nodes collapsed, is called a suffix tree.
A very powerful data structure: e.g. given a suffix tree and a pattern P (length n), all k occurrences of P in T can be found in O(n+k) time, i.e. independently of the size of the text (the text size enters only through the cost of constructing the tree)
A suffix tree can be built in linear time O (m)
BUILDING A SUFFIX TREE
Example ‘bananas#’. It is convenient to terminate the text with a special character, so that no suffix is a prefix of another suffix (e.g. as in banana). This guarantees that spelling any suffix from the root will end at a leaf.
Construct the suffix tree by inserting suffixes from the longest to the shortest, each in two phases:
Phase 1: Spell as much of the suffix from the root as possible
Phase 2: If spelling stopped in the middle of an edge, break the edge and add a new branch; spell the rest of the suffix along that branch. Label the new leaf with the starting position of the suffix.
[Figure: the suffix tree of 'bananas#' after each suffix is inserted]
bananas# : Root −bananas#→ leaf 1
ananas#  : add Root −ananas#→ leaf 2
nanas#   : add Root −nanas#→ leaf 3
anas#    : split 'ananas#' after 'ana' (new node N1); N1 −nas#→ 2, N1 −s#→ 4
nas#     : split 'nanas#' after 'na' (new node N2); N2 −nas#→ 3, N2 −s#→ 5
as#      : split 'ana' after 'a' (new node N3); N3 −na→ N1, N3 −s#→ 6
s# and # : add Root −s#→ 7 and Root −#→ 8
SUFFIX TREE PROPERTIES
Exactly m leaves for text of size m (counting the terminator)
Each interior node has at least two children (except possibly the root); edges with the same parent spell substrings starting with different letters.
The size of the tree is O(m)
Can be constructed in O(m) time
This uses the observation that during construction, not every suffix has to be spelled all the way from the root (which would lead to quadratic time); suffix links can short-circuit the process
It is also memory efficient (about 5m · sizeof(long) bytes for a text of length m, without too much difficulty)
[Figure: the complete suffix tree of 'bananas#']
MATCHING PATTERNS USING SUFFIX TREES
Consider the problem of finding pattern ‘an’ in the text ‘bananas#’
Two matches: positions 2 and 4
Thread the pattern onto the tree
Completely spelled: report the index of every leaf below the point where spelling stopped. This is because the pattern is a prefix of every suffix spelled by traversing the rest of the subtree.
Incompletely spelled: no match
Runs in O(n+k) time, where n is the length of the pattern, and k is the number of matches.
[Figure: threading 'an' into the suffix tree of 'bananas#'; spelling stops partway along the 'na' edge below node N3, and the leaves below are labeled 2 and 4]
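As an illustration, here is an uncompressed suffix trie (quadratic size — the real suffix tree collapses single-child paths to reach linear size) together with the threading search described above:

```python
def suffix_trie(text):
    """Uncompressed suffix trie of a terminated text; '$' entries store
    the 1-based starting position of each suffix."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["$"] = i + 1
    return root

def find_pattern(trie, pattern):
    """Thread the pattern from the root; if it is fully spelled, report
    the label of every leaf below the stopping point."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []                # spelling failed: no occurrence
        node = node[ch]
    hits, stack = [], [node]
    while stack:                     # collect all leaves in the subtree
        n = stack.pop()
        for key, child in n.items():
            if key == "$":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)
```

For 'bananas#', threading 'an' and collecting the leaves underneath reports positions 2 and 4.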
FINDING LONGEST COMMON SUBSTRINGS USING SUFFIX TREES
Given two texts T and U, find the longest continuous substring that is common to both texts
Can be done in O(len(T) + len(U)) time:
Construct a suffix tree on T%U$ (with two distinct terminators)
Find the deepest internal node whose children include suffixes starting in T and in U
E.g. T = ‘ACGT’, U = ‘TCGA’
[Figure: the generalized suffix tree of 'ACGT%TCGA$'. The deepest internal node with leaves from both texts spells 'CG' — the longest common substring of T and U]
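For comparison, a simple quadratic dynamic-programming solution (not the linear-time suffix-tree method above, but handy for checking small examples):

```python
def longest_common_substring(t, u):
    """Longest common contiguous substring by dynamic programming,
    in O(len(t) * len(u)) time and O(len(u)) space."""
    best, best_end = 0, 0
    # prev[j] = length of the common suffix of t[:i-1] and u[:j]
    prev = [0] * (len(u) + 1)
    for i in range(1, len(t) + 1):
        cur = [0] * (len(u) + 1)
        for j in range(1, len(u) + 1):
            if t[i - 1] == u[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return t[best_end - best:best_end]
```

For T = 'ACGT' and U = 'TCGA' it returns 'CG', matching the suffix-tree answer.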
SHORT READ MAPPING
Next generation sequencing (NGS) technologies (454, Solexa, SOLiD) generate gigabases of short (32-500 bp) reads per run
A fundamental bioinformatics task in NGS analysis is to map all the reads to a reference genome: i.e. find all the coordinates in the known genome where a given read is located
Can take a LONG time to map 15,000,000 reads to a 3 gigabase genome!
BURROWS-WHEELER TRANSFORM BASED MAPPERS
In 1994, Burrows and Wheeler described a lossless text transformation (block sorter), which makes the text easily compressible and is the algorithmic basis of BZIP2
Surprisingly, this transform is also very useful for finding all instances of a given (short) string in a large text, while using very little memory
A number of NGS read mappers now use BWT-transformed reference genomes to accelerate mapping by several orders of magnitude.
BWT
Given an input text T = t[1]...t[N], we construct the N left-shift rotations of the input text, sort them lexicographically, and map the input text to the last column of the sorted rotations:
E.g. input ABRACA is mapped to CARAAB
Note: sorted rotations make it very easy to find all instances of text in a string (also the idea behind suffix arrays)
Rotations   Sorted
ABRACA      AABRAC
BRACAA      ABRACA
RACAAB      ACAABR
ACAABR      BRACAA
CAABRA      CAABRA
AABRAC      RACAAB
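The transform itself is a few lines over all rotations (a direct O(N² log N) sketch; production implementations build a suffix array instead):

```python
def bwt(text):
    """Burrows-Wheeler transform: sort all left-shift rotations of the
    text and read off the last column."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)
```

bwt('ABRACA') returns 'CARAAB', as on the slide.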
WHY BOTHER?
The text output by BWT tends to contain runs of the same character and to be easily compressible by arithmetic, run-length or Huffman coders, e.g.
[Figure 1 of Burrows & Wheeler (1994): twenty consecutive rotations from the sorted list of rotations of a version of their paper, together with the final character of each rotation — the final characters form long runs of repeated letters]
INVERSE BWT
The beauty of BWT is that, knowing only the output and the position of the sorted row that contained the original string, the input can be reconstructed in no worse than O(N log N) time.
Step 1: reconstruct the first column of rotations (F) from the last column (L). To do so, we simply sort the characters in L.
Step 2: determine the mapping of predecessor characters and recover the input character by character from the last one
Rotations, sorted (M)    Predecessor rotations: right shifts of M (M'),
                         i.e. sorted starting with the 2nd character
A A B R A C              C A A B R A
A B R A C A              A A B R A C
A C A A B R              R A C A A B
B R A C A A              A B R A C A
C A A B R A              A C A A B R
R A C A A B              B R A C A A
Both M and M’ contain every rotation of input text T, i.e. permutations of the same set of strings.
For each row i in M, the last character (L[i]) is the cyclic predecessor of the first character (F[i]) in the original text
We wish to define a transformation, Z(i), that maps the i-th row of M’ to the corresponding row in M (i.e. its cyclic predecessor), using the following observations
M is sorted lexicographically, which implies that all rows of M’ beginning with the same character are also sorted lexicographically, for example rows 1,3,4 (all begin with A).
The row of the i-th occurrence of character ‘X’ in the last column of M corresponds to the row of the i-th occurrence of character ‘X’ in the first column of M’
Z: [0,1,2,3,4,5] → [4,0,5,1,2,3]
     M (first column F ... last column L)     M' (L ... F)
0    A A B R A C                              C A A B R A
1    A B R A C A                              A A B R A C
2    A C A A B R                              R A C A A B
3    B R A C A A                              A B R A C A
4    C A A B R A                              A C A A B R
5    R A C A A B                              B R A C A A

Z maps each row of M' (a predecessor rotation) to the row of M that contains the same rotation.
In the original string T, the character that preceded the i-th character of the last column L (BWT output) is L[Z[i]]
Z: [0,1,2,3,4,5] → [4,0,5,1,2,3]
Input T = ABRACA;  BWT(T) = L = CARAAB
For example, for R (i=2), the predecessor in T is L[Z[2]] = L[5] = B
For B (i=5), it is L[Z[5]] = L[3] = A
If we know the position of the last character of T in L, we can “unwind” the input by repeated application of Z.
Can use an inverse of Z to generate the input string forward
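The unwinding can be sketched directly. The map z below is the slide's Z (for L = CARAAB it comes out as [4, 0, 5, 1, 2, 3]), and row 1 is the row of the sorted rotations that holds the original text ABRACA:

```python
def inverse_bwt(last, row):
    """Invert the BWT given the last column L and the row of the sorted
    rotation matrix that contains the original text. A stable sort of L
    pairs the i-th occurrence of each character in L with its i-th
    occurrence in the sorted first column F, yielding the map Z."""
    n = len(last)
    order = sorted(range(n), key=lambda i: last[i])  # F[j] = L[order[j]]
    z = [0] * n
    for j, i in enumerate(order):
        z[i] = j
    chars = []
    for _ in range(n):
        chars.append(last[row])      # L[row] precedes what we have so far
        row = z[row]
    return "".join(reversed(chars))
```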
The FM-index of Ferragina and Manzini uses BWT and “opportunistic data structures” (i.e. data structures working directly on compressed data) to build a compressed index of a genome
Storage requirements for T = t[1]...t[N] are O(H_k(T)) + o(1) bits/character, where H_k(T) is the k-th order entropy of T
Searching for the k occurrences of a pattern (length m) can be implemented in O(m + k log^ε N) time, for any ε > 0
Genome Biology 2009, 10:R25
Open Access — Software: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Ben Langmead, Cole Trapnell, Mihai Pop and Steven L Salzberg
Address: Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
Correspondence: Ben Langmead. Email: [email protected]
© 2009 Langmead et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.
Published: 4 March 2009
Genome Biology 2009, 10:R25 (doi:10.1186/gb-2009-10-3-r25)
Received: 21 October 2008; Revised: 19 December 2008; Accepted: 4 March 2009
The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2009/10/3/R25
Opportunistic Data Structures with Applications
Paolo Ferragina∗ Giovanni Manzini†
Abstract
There is an upsurging interest in designing succinct data structures for basic searching problems (see [23] and references therein). The motivation has to be found in the exponential increase of electronic data nowadays available which is even surpassing the significant increase in memory and disk storage capacities of current computers. Space reduction is an attractive issue because it is also intimately related to performance improvements as noted by several authors (e.g. Knuth [15], Bentley [5]). In designing these implicit data structures the goal is to reduce as much as possible the auxiliary information kept together with the input data without introducing a significant slowdown in the final query performance. Yet input data are represented in their entirety thus taking no advantage of possible repetitiveness into them. The importance of those issues is well known to programmers who typically use various tricks to squeeze data as much as possible and still achieve good query performance. Their approaches, though, boil down to heuristics whose effectiveness is witnessed only by experimentation.

In this paper, we address the issue of compressing and indexing data by studying it in a theoretical framework. We devise a novel data structure for indexing and searching whose space occupancy is a function of the entropy of the underlying data set. The novelty resides in the careful combination of a compression algorithm, proposed by Burrows and Wheeler [7], with the structural properties of a well known indexing tool, the Suffix Array [17]. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible at no significant slowdown in the query performance. More precisely, its space occupancy is optimal in an information-content sense because a text T[1, u] is stored using O(H_k(T)) + o(1) bits per input symbol, where H_k(T) is the kth order entropy of T (the bound holds for any fixed k). Given an arbitrary string P[1, p], the opportunistic data structure allows to search for the occ occurrences of P in T requiring O(p + occ log^ε u) time complexity (for any fixed ε > 0). If data are uncompressible we achieve the best space bound currently known [11]; on compressible data our solution improves the succinct suffix array of [11] and the classical suffix tree and suffix array data structures either in space or in query time complexity or both.

It is a belief [27] that some space overhead should be paid to use full-text indices (like suffix trees or suffix arrays) with respect to word-based indices (like inverted lists). The results in this paper show that a full-text index may achieve sublinear space overhead on compressible texts. As an application we devise a variant of the well-known Glimpse tool [18] which achieves sublinear space and sublinear query time complexity. Conversely, inverted lists achieve only the second goal [27], and classical Glimpse achieves both goals but under some restrictive conditions [4].

Finally, we investigate the modifiability of our opportunistic data structure by studying how to choreograph its basic ideas with a dynamic setting thus achieving effective searching and updating time bounds.
∗Dipartimento di Informatica, Universita di Pisa, Italy. E-mail: [email protected].
†Dipartimento di Scienze e Tecnologie Avanzate, Universita del Piemonte Orientale, Alessandria, Italy and IMC-CNR,
Pisa, Italy. E-mail: [email protected].
Tuesday, May 4, 2010
SERGEI L KOSAKOVSKY POND [[email protected]]CSE/BIMM/BENG 181, SPRING 2010
HASHING VS BWT AND OPPORTUNISTIC DATA STRUCTURES
http://genomebiology.com/2009/10/3/R25  Genome Biology 2009, Volume 10, Issue 3, Article R25  Langmead et al. R25.2
Table 1
Bowtie alignment performance versus SOAP and Maq
Program | Platform | CPU time | Wall clock time | Reads mapped per hour (millions) | Peak virtual memory footprint (MB) | Bowtie speed-up | Reads aligned (%)
Bowtie -v 2 | Server | 15 m 7 s | 15 m 41 s | 33.8 | 1,149 | - | 67.4
SOAP | Server | 91 h 57 m 35 s | 91 h 47 m 46 s | 0.10 | 13,619 | 351× | 67.3
Bowtie | PC | 16 m 41 s | 17 m 57 s | 29.5 | 1,353 | - | 71.9
Maq | PC | 17 h 46 m 35 s | 17 h 53 m 7 s | 0.49 | 804 | 59.8× | 74.7
Bowtie | Server | 17 m 58 s | 18 m 26 s | 28.8 | 1,353 | - | 71.9
Maq | Server | 32 h 56 m 53 s | 32 h 58 m 39 s | 0.27 | 804 | 107× | 74.7
The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genomes project (National Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values). For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at one time. CPU, central processing unit.
Table 2
Bowtie alignment performance versus Maq with filtered read set
Program | Platform | CPU time | Wall clock time | Reads mapped per hour (millions) | Peak virtual memory footprint (MB) | Bowtie speed-up | Reads aligned (%)
Bowtie | PC | 16 m 39 s | 17 m 47 s | 29.8 | 1,353 | - | 74.9
Maq | PC | 11 h 15 m 58 s | 11 h 22 m 2 s | 0.78 | 804 | 38.4× | 78.0
Bowtie | Server | 18 m 20 s | 18 m 46 s | 28.3 | 1,352 | - | 74.9
Maq | Server | 18 h 49 m 7 s | 18 h 50 m 16 s | 0.47 | 804 | 60.2× | 78.0
Performance and sensitivity of Bowtie v0.9.6 and Maq v0.6.6 when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit.
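Bowtie's index is built on the Burrows-Wheeler transform. A minimal sketch of the transform itself, via sorted rotations (real aligners derive it from a suffix array instead of materializing every rotation; `bwt` is an illustrative name, not Bowtie's implementation):

```python
def bwt(text):
    # Burrows-Wheeler transform via sorted rotations; '$' is an
    # end-of-text sentinel assumed to sort before every other character.
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)
```

The transform is reversible and groups identical characters with similar contexts together, which is what makes the FM-index both compressible and searchable.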
INEXACT PATTERN MATCHING
Homologous biological sequences are unlikely to match exactly; evolution, for example, drives them apart through mutation.
Exact algorithms (e.g. local alignments) are quadratic in time and are too slow for comparing/searching large genomic sequences.
Pattern matching with errors is a fundamental problem in bioinformatics – finding homologs in a database.
Well-performing heuristics are frequently used.
EXAMPLE: LONGEST COMMON SUBSTRING (LCS) IN INFLUENZA A VIRUS (IAV) H5N1 HEMAGGLUTININ
(N=957 FROM 2005+)
Suffix trees can also be adapted to efficiently find the LCS shared by a given proportion of a set of sequences.
The longest fully conserved nucleotide substring in viruses sampled in 2005 or later is merely 8 nucleotides long
This poses significant challenges for even straightforward tasks, such as diagnostic probe design
[Figure: length of the LCS (y-axis, 0-80 nt) versus the proportion of sequences sharing it (x-axis, 1.0 down to 0.7).]
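The quantity in question, the longest substring shared by at least a proportion p of the sequences, can be illustrated with a brute-force sketch; `lcs_shared` is a hypothetical helper, and a suffix tree computes the same thing far more efficiently than this cubic enumeration:

```python
def lcs_shared(seqs, p=1.0):
    # Brute force: longest substring occurring in at least a fraction p
    # of the sequences. Only for intuition; the suffix-tree method
    # scales to hundreds of hemagglutinin sequences, this does not.
    need = max(1, int(round(p * len(seqs))))
    candidates = {s[i:j] for s in seqs
                  for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    shared = (c for c in candidates if sum(c in s for s in seqs) >= need)
    return max(shared, key=len, default="")
```

With p = 1 this is the classic LCS of all sequences; lowering p traces out the curve in the figure.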
K-DIFFERENCES MATCHING
The k-mismatch problem: given a text T (length m), a pattern P (length n), and a maximum tolerable number of mismatches k, output all locations i in T where P and T[i:i+n-1] differ in at most k positions.
The k-differences problem: a generalization that also allows characters to be matched to indels (each at cost 1).
Both can be easily solved in O(nm) time, by either brute force or dynamic programming
Landau and Vishkin (1985) propose an O(m+nk) time algorithm for the k-differences problem by combining dynamic programming with text and pattern preprocessing using the suffix tree of the concatenation T%P$ (% and $ are separator characters not in the alphabet).
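A minimal sketch of the O(nm) brute-force approach to the k-mismatch problem (illustrative code, not from the slides; positions are 0-based):

```python
def k_mismatch(text, pattern, k):
    # Brute-force k-mismatch: return all start positions i where
    # pattern aligns to text[i:i+len(pattern)] with <= k mismatches.
    n, m = len(pattern), len(text)
    hits = []
    for i in range(m - n + 1):
        mismatches = 0
        for a, b in zip(pattern, text[i:i + n]):
            mismatches += a != b
            if mismatches > k:  # early exit keeps the inner loop short
                break
        if mismatches <= k:
            hits.append(i)
    return hits
```

Each of the O(m) alignments costs O(n) in the worst case, giving the O(nm) bound from the slide.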
QUERY MATCHING
If the pattern is long (e.g. a new gene sequence), it may be beneficial to look for substrings of the pattern that approximately match the reference (e.g. all genes in GenBank).
QUERY MATCHING
Approximately matching strings share some perfectly matching substrings (L-mers).
Instead of searching for approximately matching strings (difficult, quadratic) search for perfectly matching substrings (easy, linear).
Extend obtained perfect matches to obtain longer approximate matches that are locally optimal.
This is the idea behind probably the most important bioinformatics tool: the Basic Local Alignment Search Tool, BLAST (Altschul, Gish, Miller, Myers & Lipman, 1990).
Three primary questions:
How to select L?
How to extend the seed?
How to confirm that the match is biologically relevant?
Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
+++DN +G + IR L G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words and scores (score threshold T = 13): GVK 18, GAK 16, GIK 16, GGK 14, GLK 13 pass; GNK 12, GRK 11, GEK 11, GDK 11 fall below the threshold and are discarded.
Neighborhood word hits in the database are extended into a High-scoring Pair (HSP).
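Generating the neighborhood of a keyword can be sketched as exhaustive enumeration over same-length words. This toy uses the nucleotide scoring from a later slide (+5 match, -4 mismatch) rather than a protein matrix; `neighborhood` is an illustrative helper, not BLAST's implementation:

```python
from itertools import product

def neighborhood(word, alphabet, score, T):
    # Enumerate every word of the same length whose total score against
    # `word` meets the neighborhood threshold T; `score` is a
    # pair-scoring function (a stand-in for a BLOSUM/PAM lookup).
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, cand))
        if s >= T:
            hits.append(("".join(cand), s))
    return sorted(hits, key=lambda h: -h[1])
```

Raising T shrinks the neighborhood (faster, less sensitive); lowering it grows the neighborhood (slower, more sensitive), exactly the trade-off the figure illustrates.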
SELECTING SEED SIZE L
If strings X and Y (each of length n) match with k < n mismatches, then the longest perfect match between them has at least ceil(n/(k+1)) characters.
Easy to show by the pigeonhole principle: if there are k+1 bins and k objects, then at least one of the bins will be empty.
Partition the strings into k+1 equal-length substrings: at least one of them will contain no mismatches.
In fact, the longest perfect match is expected to be quite a bit longer, at least if the mismatches are randomly distributed: about 40 for n = 100, k = 5, versus the guaranteed minimum of ceil(100/6) = 17.
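Both the guaranteed bound and the "about 40" expectation are easy to check numerically; `longest_exact_run` simulates k uniformly placed mismatch positions (the same randomness assumption the slide makes):

```python
import math
import random

def guaranteed_seed(n, k):
    # Pigeonhole bound: with k mismatches in length n, some window of
    # ceil(n/(k+1)) characters is guaranteed to be mismatch-free.
    return math.ceil(n / (k + 1))

def longest_exact_run(n, k, trials=2000, seed=0):
    # Empirical mean of the longest mismatch-free run when the k
    # mismatch positions are placed uniformly at random.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        cuts = sorted(rng.sample(range(n), k))
        # Gap lengths between consecutive mismatch positions.
        runs = [b - a - 1 for a, b in zip([-1] + cuts, cuts + [n])]
        total += max(runs)
    return total / trials
```

For n = 100, k = 5 the guarantee is 17 while the simulated mean sits in the high 30s, consistent with the slide's rough figure.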
SELECTING SEED SIZE L
Smaller L: seeds are easier to find, but performance and, importantly, specificity decrease – two random sequences are more likely to share a short common substring.
Larger L: could miss out many potential matches, leading to decreased sensitivity.
By default BLAST uses L (w, word size) of 3 for protein sequences and 11 for nucleotide sequences.
MEGABLAST (a faster version of BLAST for similar sequences) uses longer seeds.
HOW TO EXTEND THE MATCH?
Gapped local alignment (blastn)
Simple (gapless) extension (original BLAST)
Greedy X-drop alignment (MEGABLAST)
...
A tradeoff between speed and accuracy
HOW TO SCORE MATCHES?
Biological sequences are not random:
some letters are more frequent than others (e.g. in HIV-1, 40% of the genome is A)
some mismatches are more common than others in homologous sequences (e.g. due to selection, chemical properties of the residues, etc.), and should be weighted differently.
BLAST introduces a weighting function on residues: δ(i,j) which assigns a score to a pair of residues.
For nucleotides it is 5 for i=j and -4 otherwise.
For proteins it is based on a large training dataset of homologous sequences (Point Accepted Mutation, PAM, matrices). PAM120 is roughly equivalent to the substitutions accumulated over 120 million years of evolution in an average protein.
[Figure: HIV-WITHIN amino acid substitution score matrix; rows and columns are the 20 residues A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V.]
HOW TO COMPUTE SIGNIFICANCE?
Before a search is done we need to decide what a good cutoff value H for a match is.
It is determined by computing the probability that two random sequences will have at least one match scoring H or greater.
Uses Altschul-Dembo-Karlin statistics (1990-1991)
STATISTICS OF SCORES
Given a segment pair H between two sequences, comprised of r-character substrings T1 and T2, we compute the score of H as:
We are interested in finding out how likely the maximal score for any segment pair of two random sequences is to exceed some threshold X
Dembo and Karlin (1990) showed that
The mean value of the maximum score between segment pairs of two random sequences (lengths n and m), assuming a few regularity conditions on δ(i,j), is approximately
s(H) = Σ_{i=1..r} δ(T1[i], T2[i])

M = log(nm)/λ*, where λ* is the unique positive solution of Σ_{i,j} p_i q_j exp(λ·δ(i,j)) = 1
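The Karlin-Altschul parameter λ* (the positive root of Σ_{i,j} p_i q_j exp(λ·δ(i,j)) = 1) can be found numerically; a bisection sketch, assuming the expected per-position score Σ p_i q_j δ(i,j) is negative, as the theory requires:

```python
import math

def solve_lambda(p, q, delta, lo=1e-6, hi=10.0, iters=100):
    # Bisection for lambda*: the unique positive root of
    # sum_ij p_i q_j exp(lambda * delta(i, j)) = 1, assuming the
    # expected score sum_ij p_i q_j delta(i, j) is negative.
    def f(lam):
        return sum(p[i] * q[j] * math.exp(lam * delta(i, j))
                   for i in range(len(p)) for j in range(len(q))) - 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

With uniform nucleotide frequencies and the slides' +5/-4 scoring, this yields λ* of about 0.19.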
STATISTICS OF SCORES (CONT’D)
For biological sequences, high-scoring real matches should greatly exceed the random expectation; the probability of exceeding the expected maximum by x (the difference between the observed score and the mean) is
K and λ are expressions that depend on the scoring matrix and letter frequencies, and the distribution is similar to other extreme value distributions.
One can show that the expected number of HSPs (high-scoring segment pairs) exceeding the threshold S′ is
Prob{s(H) > mean + x} ≤ K·exp(−λ*·x)

E′ = K·m·n·exp(−λ·S′)
Tuesday, May 4, 2010
CSE/BIMM/BENG 181, SPRING 2010 SERGEI L KOSAKOVSKY POND [[email protected]]
[Figure: mean HSP score versus log(mn) for random and for mutated sequence pairs (log(mn) from 10 to 25); the mean grows linearly in log(mn).]
2. Secondly, the number of scores exceeding the mean is supposed to follow a Poisson distribution, i.e. decay exponentially as a function of x = score − M_expected. Consider the simulation based on sequences of length 2^17: as you move away from the mean, the number of replicates scoring x points above the mean drops exponentially.
[Figure: histogram of replicate counts (0-400) versus score (75-100) from this simulation.]
E-VALUES
Because thresholds are determined by the algorithm internally, it is better to 'normalize' the result as follows:
BIT SCORE: S = (λ·S′ − ln K) / ln 2
E-VALUE: E = n·m·2^(−S)
The number k of HSPs with scores ≥ S follows a Poisson distribution: Prob{k} = exp(−E)·E^k / k!
Probability of finding at least one: 1 − exp(−E)
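These conversions translate directly into a few lines of code; `bit_score`, `e_value`, and `p_at_least_one` are illustrative names, and λ and K must come from the scoring system in use:

```python
import math

def bit_score(raw_score, lam, K):
    # Normalize a raw HSP score S' into bits using the
    # Karlin-Altschul parameters lambda and K.
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(bits, n, m):
    # Expected number of HSPs at least this strong when comparing
    # a length-n query against a length-m database.
    return n * m * 2.0 ** (-bits)

def p_at_least_one(E):
    # Poisson model: probability of observing >= 1 such HSP.
    return 1.0 - math.exp(-E)
```

Because the score is expressed in bits, the same E-value formula applies regardless of which scoring matrix produced the raw score.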
TIMELINE
1970: Needleman-Wunsch global alignment algorithm
1981: Smith-Waterman local alignment algorithm
1985: FASTA
1990: BLAST (basic local alignment search tool)
2000s: BLAST has become too slow in “genome vs. genome” comparisons - new faster algorithms evolve!
BLAT
Pattern Hunter
BLAT VS. BLAST
BLAT (BLAST-Like Alignment Tool): same idea as BLAST - locate short sequence hits and extend (developed by J Kent at UCSC)
BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database
The index is stored in RAM, resulting in faster searches.
Longer k-mers and greedier extensions are specifically designed for highly similar sequences (e.g. >95% nucleotide identity, >85% protein identity).
BLAT INDEXING
Here is an example with k = 3:
Genome: cacaattatcacgaccgc
Non-overlapping 3-mers: cac aat tat cac gac cgc
Index (3-mer → genome position): aat 3; cac 0,9; cgc 15; gac 12; tat 6

cDNA (query sequence): aattctcac
Overlapping 3-mers (query positions 0-6): aat att ttc tct ctc tca cac

Hits: aat 4; cac 1,10 (position of 3-mer in query, genome)
Clump: cacAATtatCACgaccgc
Note that multiple instances of cac map to a single index entry.
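The indexing scheme above can be sketched directly, using 0-based positions throughout; `build_index` and `find_hits` are illustrative names, not BLAT's API:

```python
def build_index(genome, k=3):
    # BLAT-style index of the NON-overlapping k-mers of the database
    # (genome): k-mer -> list of 0-based start positions.
    index = {}
    for pos in range(0, len(genome) - k + 1, k):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index

def find_hits(index, query, k=3):
    # Scan every OVERLAPPING k-mer of the query against the index,
    # yielding (query position, genome position) hit pairs; nearby
    # hits would then be clumped and extended into alignments.
    return [(qpos, gpos)
            for qpos in range(len(query) - k + 1)
            for gpos in index.get(query[qpos:qpos + k], [])]
```

Indexing only non-overlapping k-mers is what keeps BLAT's whole-genome index small enough to hold in RAM.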