TRANSCRIPT
SERGEI L KOSAKOVSKY POND [[email protected]]CSE/BIMM/BENG 181, SPRING 2010
COMBINATORIAL PATTERN MATCHING
Tuesday, May 4, 2010
OUTLINE: EXACT MATCHING
Tabulating patterns in long texts
  Short patterns (direct indexing)
  Longer patterns (hash tables)
Finding exact patterns in a text
  Brute force (run time)
  Efficient algorithms (pattern preprocessing)
    Single pattern: Knuth-Morris-Pratt
    Multiple patterns: Aho-Corasick algorithm
  Efficient algorithms (text preprocessing)
    Suffix trees
    Burrows-Wheeler Transform-based
OUTLINE: APPROXIMATE MATCHING
Algorithms for approximate pattern matching
Heuristics behind BLAST
Statistics behind BLAST
Alternatives to BLAST: BLAT, PatternHunter etc.
STRING ENCODING
It is often necessary to index strings; a convenient way to do this is to first convert strings to integers.
Given a string s of length n on alphabet A with c = |A| characters (coded 0..c−1), we can define a map code(s) to the nonnegative integers as
A = 0, C = 1, G = 2, T = 3

code(s) = s[1]·c^(n−1) + s[2]·c^(n−2) + ... + s[n−1]·c + s[n]

3-mer   Computation        Code
AGT     0·16 + 2·4 + 3     11
ATA     0·16 + 3·4 + 0     12
TGG     3·16 + 2·4 + 2     58
There are c^L different L-mers, but at most n−L+1 different L-mers in a text of length n
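As a quick sketch (in Python, with the digit values from the table above), code(s) can be evaluated with Horner's rule:

```python
# Digit values for the DNA alphabet, as in the table above
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def code(s, c=4):
    """Map a string to its integer code: s[1]*c^(n-1) + ... + s[n],
    evaluated left to right by Horner's rule."""
    value = 0
    for ch in s:
        value = value * c + DIGIT[ch]
    return value
```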
TABULATING SHORT PATTERNS
If L is small (e.g. 3 or 4), i.e. the total number of patterns is not too large and many of them are likely to be found in the input text, then we can use direct indexing to tabulate/locate strings efficiently
The distribution of short strings in genetic sequences is biologically informative, e.g.
Synonymous codons (triplets of nucleotides, 64 patterns) are often used preferentially in organisms (transcriptional selection, secondary structure, etc)
The distribution of short nucleotide k-mers (e.g. L=4, 256 patterns) can be useful for detecting horizontal (from species to species) gene transfer and gene finding
The location of short amino-acid strings (e.g. L=3, 8000 patterns) is useful for finding seeds for BLAST
SHORT PATTERN SCAN
Cost of computing the code at each position:
O(L): naive
O(1): if using the previous code to compute the current one
Data: Alphabet A, Text T, pattern length p
Result: Frequency of each pattern in text

R ← array(|A|^p)
n ← len(T)
for i := 1 to n−p+1 do
    R[code(T[i : i+p−1])] += 1
end
return R
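The scan above can be made concrete as follows (a sketch for the DNA alphabet); the code of each successive p-mer is derived from the previous one in O(1), as noted on the slide:

```python
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def tabulate(text, p, c=4):
    """Count every p-mer of `text` by direct indexing into an array of
    size c^p, updating the integer code in O(1) per position."""
    counts = [0] * (c ** p)
    if len(text) < p:
        return counts
    value = 0
    for ch in text[:p]:              # code of the first p-mer: O(p)
        value = value * c + DIGIT[ch]
    counts[value] += 1
    high = c ** (p - 1)              # weight of the outgoing character
    for i in range(p, len(text)):    # each subsequent code: O(1)
        value = (value - DIGIT[text[i - p]] * high) * c + DIGIT[text[i]]
        counts[value] += 1
    return counts
```

For example, tabulate('ACGTA', 2) records one occurrence each of AC, CG, GT and TA.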
TABULATING/LOCATING LONGER PATTERNS
Finding repeats/motifs: ATGGTCTAGGTCCTAGTGGTC
Flanking sequences in genomic rearrangements
Motifs: promoter regions, functional sites, immune targets
Cellular immunity targets in pathogens (e.g. protein 9 mers)
There are too many patterns to store in an array, and even if we could, the array would be very sparse
E.g. there are ~512,000,000,000 amino-acid 9-mers, but in an average HIV-1 proteome (~3,000 amino acids long) there are at most ~3,000 unique 9-mers
HASH TABLES
Hash tables allow us to store and retrieve (in O(1) time on average) a small subset of a large universe of records. They implement associative arrays (dictionaries) in a variety of languages (Python, Perl, etc.)
The universe (records): e.g. 512,000,000,000 amino-acid 9-mers
The storage: a hash table (array) much smaller than the size of the universe
The hash function: record → hash key
Note: because there are more possible records than array indices, this function is NOT one-to-one
A SIMPLE HASH FUNCTION
A reasonable hash function (on integer records i) is:

    i → i mod P

P is a prime number and also the natural size of the hash table
Hash keys range from 0 to P−1
If the records are uniformly distributed, so will be their hash keys
P = 101

4-mer (256 possible)   Integer code   Hash key
ACGT                   27             27  ← collision
CCCA                   148            47
TGCC                   229            27  ← collision
COLLISIONS
Collisions are frequent even for lightly loaded hash tables
load level α = (number of entries in hash table)/(table size)
The birthday paradox: what is the probability that two people out of a random group of n (<365) people share a birthday (in hash table terms, what is the probability of a collision if people=records and hash keys=birthdays)?
P(n) = 1 − (1 − 1/365)(1 − 2/365) · · · (1 − (n−1)/365)

n    α      P(n)
10   0.027  0.117
23   0.063  0.507
50   0.137  0.97
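The birthday-paradox probability above is easy to evaluate directly (a small sketch):

```python
def collision_probability(n, keys=365):
    """P(at least one collision) after hashing n records uniformly
    into `keys` slots: 1 - (1 - 1/keys)(1 - 2/keys)...(1 - (n-1)/keys)."""
    p_none = 1.0
    for i in range(n):
        p_none *= (keys - i) / keys
    return 1.0 - p_none
```

This reproduces the table: n = 23 already gives a collision probability above 1/2.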
DEALING WITH COLLISIONS
There are several strategies for dealing with collisions; the simplest one is chaining
Each hash key is associated with a linked list of all records sharing the hash key
4-mer (256 possible)   Integer code   Hash key
AAAA                   0              0
AAAC                   1              1
CGCC                   101            0

Chained buckets:
Hash key 0 → CGCC → AAAA
Hash key 1 → AAAC
Hash key 2 → ∅
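Chaining can be sketched as an array of buckets, each a list of (key, value) pairs (using P = 101 as on the slide; `hash` here is Python's built-in, which is the identity for small integers):

```python
class ChainedHashTable:
    """Hash table with chaining: colliding records share a bucket."""

    def __init__(self, size=101):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.slots[hash(key) % self.size]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default
```

Inserting the integer codes 0 ('AAAA') and 101 ('CGCC') puts both records into bucket 0, chained together, exactly as in the example above.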
HASH TABLE PERFORMANCE
Retrieving/storing a record in a hash table of size m with load factor α:
Worst case (all records have the same key): O(m)
Expected run time is O(1), assuming uniformly distributed records and hash keys:
  Record is not in the table: E_N = e^(−α) + α + O(1/m)
  Record is in the table: E_S = 1 + α/2 + O(1/m)
This is because the probability of having many collisions with the same key is quite low (even though the probability of SOME collision is high)
EXACT PATTERN MATCHING
Motivation: Searching a database for a known pattern
Goal: Find all occurrences of a pattern in a text
Input: Pattern P = p[1]…p[n] and text T = t[1]…t[m] (n ≤ m)
Output: All positions 1 ≤ i ≤ m − n + 1 such that the n-letter substring T[i : i+n−1] starting at i matches the pattern P
Desired performance: O(n+m)
BRUTE FORCE PATTERN MATCHING
Text: GGCATC; Pattern: GCAT
Data: Pattern P, Text T
Result: The list of positions in T where P occurs

n ← len(P)
m ← len(T)
for i := 1 to m−n+1 do
    if T[i : i+n−1] = P then
        output i
    end
end
A substring comparison can take from 1 to n (left-to-right) character comparisons
Text GGCATC, pattern GCAT:
i=1: GGCATC vs GCAT — mismatch after 2 comparisons
i=2: GCAT matches (4 comparisons)
i=3: CATC vs GCAT — mismatch after 1 comparison
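The brute-force scan, written out (positions are 1-based, as on the slide):

```python
def brute_force_match(pattern, text):
    """Report every 1-based position where pattern occurs in text,
    comparing the pattern left to right at each offset."""
    n, m = len(pattern), len(text)
    return [i + 1 for i in range(m - n + 1) if text[i:i + n] == pattern]
```

For the example above, brute_force_match('GCAT', 'GGCATC') reports position 2.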
BRUTE FORCE RUN TIME
Worst case: O(nm). This can be achieved, for example, by searching for P = AA...AC in text T = AA...A, because each substring comparison takes exactly n steps
Expected on random text: O(m) overall, because each substring comparison takes on average

    (1 − q^n)/(1 − q)

character comparisons, where q = 1/(alphabet size)
For n = 20 and q = 1/4 (nucleotides), a substring comparison takes on average ≈ 4/3 operations
Genetic texts are not random, so the performance may degrade
IMPROVING THE RUN TIME
The search pattern can be preprocessed in O(n) time to eliminate backtracking in the text and hence guarantee O(n+m) run time
A variety of procedures, starting with the Knuth-Morris-Pratt algorithm in 1977, take this approach. They use the observation that if a string comparison fails at pattern position i, we can shift the pattern by i − b(i) positions, where b(i) depends only on the pattern, and continue comparing at the same or the next position in the text, thus avoiding backtracking
These algorithms are popular for text editors/mutable texts, because they do not require preprocessing of the (large) text
[Figure: the pattern aligned against the text at a failed comparison, and again after the SHIFT — the position in the text never moves backward]
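A standard sketch of Knuth-Morris-Pratt: b(i) is the length of the longest proper border (prefix that is also a suffix) of the first i pattern characters, and the position in the text only ever moves forward:

```python
def kmp_search(pattern, text):
    """Knuth-Morris-Pratt: preprocess the pattern in O(n), then scan the
    text in O(m) with no backtracking. Returns 1-based match positions."""
    n = len(pattern)
    # b[i] = length of the longest proper border of pattern[:i]
    b = [0] * (n + 1)
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = b[k]
        if pattern[i] == pattern[k]:
            k += 1
        b[i + 1] = k
    hits, k = [], 0
    for j, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = b[k]                 # shift the pattern, not the text
        if ch == pattern[k]:
            k += 1
        if k == n:                   # full match ending at text index j
            hits.append(j - n + 2)   # 1-based start position
            k = b[k]
    return hits
```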
EXACT MULTIPLE PATTERN MATCHING
The problem: given a dictionary of D patterns P1, P2, ..., PD (of total length n) and a text T, report all occurrences of every pattern in the text
Arises, for instance, when one is comparing multiple patterns against a database
Assuming an efficient implementation of individual pattern matching, this problem can be solved in O(Dm+n) time by scanning the text D times
Aho and Corasick (1975) showed how this can be done in O(m+n) time
Uses the idea of a trie (from the word retrieval), or prefix tree
Intuitively, we can reduce the amount of work by exploiting repetitions in the patterns.
PREFIX TRIE
Patterns: ‘ape’, ‘as’, ‘ease’. Constructed in O(n) time, one word at a time.
Properties of a trie
Stores a set of words in a tree
Each edge is labeled with a letter
Each node labeled with a state (order of creation)
Any two edges sharing a parent node have distinct labels
Each word can be spelled by tracing a path from the root to a leaf
[Figure: the trie after each insertion]
'ape':  Root −a→ 1 −p→ 2 −e→ 3
'as':   add 1 −s→ 4
'ease': add Root −e→ 5 −a→ 6 −s→ 7 −e→ 8
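A trie can be sketched with nested dictionaries, inserted one word at a time in O(total pattern length); here a '$' entry marks a node where a pattern ends:

```python
def build_trie(patterns):
    """Build a prefix trie as nested dicts; a '$' entry marks the end
    of a stored pattern."""
    root = {}
    for word in patterns:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root
```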
SEARCHING TEXT FOR MULTIPLE PATTERNS USING A TRIE: THREADING
Suppose we want to search the text ‘appease’ for the occurrences of patterns ‘ape’, ‘as’ and ‘ease’, given their trie.
The naive way to do it is to thread (i.e. spell the word using tree edges from the root) the text starting at position i, until either:
A leaf (or specially marked terminal node) is reached (a match has been found)
Spelling cannot be completed (no match)
[Figure: threading 'appease' through the trie]
i=1: spells a−p, then fails at the second 'p' — no match
i=4: spells e−a−s−e to leaf 8 — 'ease' matches
i=5: spells a−s to node 4 — 'as' matches
But we already knew this, because ‘as’ is a part of ‘ease’! If we take advantage of this, there is no need to backtrack in the text, and the algorithm runs in O(n+m). The Aho-Corasick algorithm implements exactly this idea, using a finite state automaton that starts with the trie and adds shortcut (failure) links
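A compact sketch of the Aho-Corasick automaton: the trie plus failure links (the "shortcuts" above), computed by breadth-first search, let the text be scanned once with no backtracking:

```python
from collections import deque

def aho_corasick(patterns, text):
    """Build a trie of the patterns, add failure links by BFS, then scan
    the text once. Returns (1-based end position, pattern) pairs."""
    goto, fail, out = [{}], [0], [[]]        # state 0 is the root
    for word in patterns:                    # phase 1: the trie
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())          # phase 2: failure links (BFS)
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]   # inherit matches via the link
    hits, state = [], 0                      # phase 3: scan, no backtracking
    for j, ch in enumerate(text, 1):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((j, word))
    return hits
```

Searching 'appease' for 'ape', 'as' and 'ease' reports 'as' ending at position 6 and 'ease' ending at position 7, in a single pass.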
SUFFIX TREES
A trie built on every suffix of a text T (length m), with all single-child interior nodes collapsed, is called a suffix tree.
A very powerful data structure: e.g. given a suffix tree and a pattern P (length n), all k occurrences of P in T can be found in O(n+k) time, i.e. independently of the size of the text (the text size enters only through the cost of constructing the tree)
A suffix tree can be built in linear time O (m)
BUILDING A SUFFIX TREE
Example ‘bananas#’. It is convenient to terminate the text with a special character, so that no suffix is a prefix of another suffix (e.g. as in banana). This guarantees that spelling any suffix from the root will end at a leaf.
Construct the suffix tree by inserting suffixes from the longest to the shortest, each in two phases:
Phase 1: Spell as much of the suffix from the root as possible
Phase 2: If spelling stopped in the middle of an edge, break the edge and add a new branch; spell the rest of the suffix along that branch. Label the new leaf with the starting position of the suffix.
[Figure: the suffix tree of 'bananas#' after each suffix is inserted]
bananas# : Root −bananas#→ leaf 1
ananas#  : add Root −ananas#→ leaf 2
nanas#   : add Root −nanas#→ leaf 3
anas#    : split 'ananas#' after 'ana' (new node N1); N1 −nas#→ 2, N1 −s#→ 4
nas#     : split 'nanas#' after 'na' (new node N2); N2 −nas#→ 3, N2 −s#→ 5
as#      : split 'ana' after 'a' (new node N3); N3 −na→ N1, N3 −s#→ 6
s# and # : add Root −s#→ 7 and Root −#→ 8
SUFFIX TREE PROPERTIES
Exactly m leaves for text of size m (counting the terminator)
Each interior node has at least two children (except possibly the root); edges with the same parent spell substrings starting with different letters.
The size of the tree is O(m)
Can be constructed in O(m) time
This uses the observation that during construction, not every suffix has to be spelled all the way from the root (which would lead to quadratic time); suffix links can short-circuit the process
It is also memory efficient (about 5m · sizeof(long) bytes for a text of length m, without too much difficulty)
[Figure: the complete suffix tree of 'bananas#']
MATCHING PATTERNS USING SUFFIX TREES
Consider the problem of finding pattern ‘an’ in the text ‘bananas#’
Two matches: positions 2 and 4
Thread the pattern onto the tree
Completely spelled: report the index of every leaf below the point where spelling stopped. This is because the pattern is a prefix of every suffix spelled by traversing the rest of the subtree.
Incompletely spelled: no match
Runs in O(n+k) time, where n is the length of the pattern, and k is the number of matches.
[Figure: threading 'an' into the suffix tree of 'bananas#'; spelling stops partway along the 'na' edge below node N3, and the leaves below are labeled 2 and 4]
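As an illustration, here is an uncompressed suffix trie (quadratic size — the real suffix tree collapses single-child paths to reach linear size) together with the threading search described above:

```python
def suffix_trie(text):
    """Uncompressed suffix trie of a terminated text; '$' entries store
    the 1-based starting position of each suffix."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["$"] = i + 1
    return root

def find_pattern(trie, pattern):
    """Thread the pattern from the root; if it is fully spelled, report
    the label of every leaf below the stopping point."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []                # spelling failed: no occurrence
        node = node[ch]
    hits, stack = [], [node]
    while stack:                     # collect all leaves in the subtree
        n = stack.pop()
        for key, child in n.items():
            if key == "$":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)
```

For 'bananas#', threading 'an' and collecting the leaves underneath reports positions 2 and 4.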
FINDING LONGEST COMMON SUBSTRINGS USING SUFFIX TREES
Given two texts T and U, find the longest continuous substring that is common to both texts
Can be done in O(len(T) + len(U)) time:
Construct a suffix tree on T%U$ (with two distinct terminators)
Find the deepest internal node whose children include suffixes starting in T and in U
E.g. T = ‘ACGT’, U = ‘TCGA’
[Figure: the generalized suffix tree of 'ACGT%TCGA$'. The deepest internal node with leaves from both texts spells 'CG' — the longest common substring of T and U]
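For comparison, a simple quadratic dynamic-programming solution (not the linear-time suffix-tree method above, but handy for checking small examples):

```python
def longest_common_substring(t, u):
    """Longest common contiguous substring by dynamic programming,
    in O(len(t) * len(u)) time and O(len(u)) space."""
    best, best_end = 0, 0
    # prev[j] = length of the common suffix of t[:i-1] and u[:j]
    prev = [0] * (len(u) + 1)
    for i in range(1, len(t) + 1):
        cur = [0] * (len(u) + 1)
        for j in range(1, len(u) + 1):
            if t[i - 1] == u[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return t[best_end - best:best_end]
```

For T = 'ACGT' and U = 'TCGA' it returns 'CG', matching the suffix-tree answer.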
SHORT READ MAPPING
Next generation sequencing (NGS) technologies (454, Solexa, SOLiD) generate gigabases of short (32-500 bp) reads per run
A fundamental bioinformatics task in NGS analysis is to map all the reads to a reference genome: i.e. find all the coordinates in the known genome where a given read is located
Can take a LONG time to map 15,000,000 reads to a 3 gigabase genome!
BURROWS-WHEELER TRANSFORM BASED MAPPERS
In 1994, Burrows and Wheeler described a lossless text transformation (block sorter), which makes the text easily compressible and is the algorithmic basis of BZIP2
Surprisingly, this transform is also very useful for finding all instances of a given (short) string in a large text, while using very little memory
A number of NGS read mappers now use BWT-transformed reference genomes to accelerate mapping by several orders of magnitude.
BWT
Given an input text T = t[1]...t[N], we construct the N left-shift rotations of the input text, sort them lexicographically, and map the input text to the last column of the sorted rotations:
E.g. input ABRACA is mapped to CARAAB
Note: sorted rotations make it very easy to find all instances of text in a string (also the idea behind suffix arrays)
Rotations   Sorted
ABRACA      AABRAC
BRACAA      ABRACA
RACAAB      ACAABR
ACAABR      BRACAA
CAABRA      CAABRA
AABRAC      RACAAB
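The transform itself is a few lines over all rotations (a direct O(N² log N) sketch; production implementations build a suffix array instead):

```python
def bwt(text):
    """Burrows-Wheeler transform: sort all left-shift rotations of the
    text and read off the last column."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rot[-1] for rot in rotations)
```

bwt('ABRACA') returns 'CARAAB', as on the slide.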
WHY BOTHER?
The text output by BWT tends to contain runs of the same character and to be easily compressible by arithmetic, run-length or Huffman coders, e.g.
[Figure 1 of Burrows & Wheeler (1994): twenty consecutive rotations from the sorted list of rotations of a version of their paper, together with the final character of each rotation — the final characters form long runs of repeated letters]
INVERSE BWT
The beauty of BWT is that, knowing only the output and the position of the sorted row that contained the original string, the input can be reconstructed in no worse than O(N log N) time.
Step 1: reconstruct the first column of rotations (F) from the last column (L). To do so, we simply sort the characters in L.
Step 2: determine the mapping of predecessor characters and recover the input character by character from the last one
Rotations, sorted (M)    Predecessor rotations: right shifts of M (M'),
                         i.e. sorted starting with the 2nd character
A A B R A C              C A A B R A
A B R A C A              A A B R A C
A C A A B R              R A C A A B
B R A C A A              A B R A C A
C A A B R A              A C A A B R
R A C A A B              B R A C A A
Both M and M’ contain every rotation of input text T, i.e. permutations of the same set of strings.
For each row i in M, the last character (L[i]) is the cyclic predecessor of the first character (F[i]) in the original text
We wish to define a transformation, Z(i), that maps the i-th row of M’ to the corresponding row in M (i.e. its cyclic predecessor), using the following observations
M is sorted lexicographically, which implies that all rows of M’ beginning with the same character are also sorted lexicographically, for example rows 1,3,4 (all begin with A).
The row of the i-th occurrence of character ‘X’ in the last column of M corresponds to the row of the i-th occurrence of character ‘X’ in the first column of M’
Z: [0,1,2,3,4,5] → [4,0,5,1,2,3]
     M (first column F ... last column L)     M' (L ... F)
0    A A B R A C                              C A A B R A
1    A B R A C A                              A A B R A C
2    A C A A B R                              R A C A A B
3    B R A C A A                              A B R A C A
4    C A A B R A                              A C A A B R
5    R A C A A B                              B R A C A A

Z maps each row of M' (a predecessor rotation) to the row of M that contains the same rotation.
In the original string T, the character that preceded the i-th character of the last column L (BWT output) is L[Z[i]]
Z: [0,1,2,3,4,5] → [4,0,5,1,2,3]
Input T = ABRACA;  BWT(T) = L = CARAAB
For example, for R (i=2), the predecessor in T is L[Z[2]] = L[5] = B
For B (i=5), it is L[Z[5]] = L[3] = A
If we know the position of the last character of T in L, we can “unwind” the input by repeated application of Z.
Can use an inverse of Z to generate the input string forward
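The unwinding can be sketched directly. The map z below is the slide's Z (for L = CARAAB it comes out as [4, 0, 5, 1, 2, 3]), and row 1 is the row of the sorted rotations that holds the original text ABRACA:

```python
def inverse_bwt(last, row):
    """Invert the BWT given the last column L and the row of the sorted
    rotation matrix that contains the original text. A stable sort of L
    pairs the i-th occurrence of each character in L with its i-th
    occurrence in the sorted first column F, yielding the map Z."""
    n = len(last)
    order = sorted(range(n), key=lambda i: last[i])  # F[j] = L[order[j]]
    z = [0] * n
    for j, i in enumerate(order):
        z[i] = j
    chars = []
    for _ in range(n):
        chars.append(last[row])      # L[row] precedes what we have so far
        row = z[row]
    return "".join(reversed(chars))
```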
The FM-index of Ferragina and Manzini uses BWT and “opportunistic data structures” (i.e. data structures working directly on compressed data) to build a compressed index of a genome
Storage requirements for T = t[1]...t[N] are O(H_k(T)) + o(1) bits/character, where H_k(T) is the k-th order entropy of T
Searching for the k occurrences of a pattern (length m) can be implemented in O(m + k log^ε N) time, for any ε > 0
Genome Biology 2009, 10:R25
Open Access — Software: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Ben Langmead, Cole Trapnell, Mihai Pop and Steven L Salzberg
Address: Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, USA
Correspondence: Ben Langmead. Email: [email protected]
© 2009 Langmead et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.
Published: 4 March 2009
Genome Biology 2009, 10:R25 (doi:10.1186/gb-2009-10-3-r25)
Received: 21 October 2008; Revised: 19 December 2008; Accepted: 4 March 2009
The electronic version of this article is the complete one and can be found online at http://genomebiology.com/2009/10/3/R25
Opportunistic Data Structures with Applications
Paolo Ferragina∗ Giovanni Manzini†
Abstract
There is an upsurging interest in designing succinct data structures for basic searching problems (see [23] and references therein). The motivation has to be found in the exponential increase of electronic data nowadays available which is even surpassing the significant increase in memory and disk storage capacities of current computers. Space reduction is an attractive issue because it is also intimately related to performance improvements as noted by several authors (e.g. Knuth [15], Bentley [5]). In designing these implicit data structures the goal is to reduce as much as possible the auxiliary information kept together with the input data without introducing a significant slowdown in the final query performance. Yet input data are represented in their entirety thus taking no advantage of possible repetitiveness into them. The importance of those issues is well known to programmers who typically use various tricks to squeeze data as much as possible and still achieve good query performance. Their approaches, though, boil down to heuristics whose effectiveness is witnessed only by experimentation.

In this paper, we address the issue of compressing and indexing data by studying it in a theoretical framework. We devise a novel data structure for indexing and searching whose space occupancy is a function of the entropy of the underlying data set. The novelty resides in the careful combination of a compression algorithm, proposed by Burrows and Wheeler [7], with the structural properties of a well known indexing tool, the Suffix Array [17]. We call the data structure opportunistic since its space occupancy is decreased when the input is compressible at no significant slowdown in the query performance. More precisely, its space occupancy is optimal in an information-content sense because a text T[1, u] is stored using O(H_k(T)) + o(1) bits per input symbol, where H_k(T) is the kth order entropy of T (the bound holds for any fixed k). Given an arbitrary string P[1, p], the opportunistic data structure allows to search for the occ occurrences of P in T requiring O(p + occ log^ε u) time complexity (for any fixed ε > 0). If data are uncompressible we achieve the best space bound currently known [11]; on compressible data our solution improves the succinct suffix array of [11] and the classical suffix tree and suffix array data structures either in space or in query time complexity or both.

It is a belief [27] that some space overhead should be paid to use full-text indices (like suffix trees or suffix arrays) with respect to word-based indices (like inverted lists). The results in this paper show that a full-text index may achieve sublinear space overhead on compressible texts. As an application we devise a variant of the well-known Glimpse tool [18] which achieves sublinear space and sublinear query time complexity. Conversely, inverted lists achieve only the second goal [27], and classical Glimpse achieves both goals but under some restrictive conditions [4].

Finally, we investigate the modifiability of our opportunistic data structure by studying how to choreograph its basic ideas with a dynamic setting thus achieving effective searching and updating time bounds.
∗Dipartimento di Informatica, Universita di Pisa, Italy. E-mail: [email protected].
†Dipartimento di Scienze e Tecnologie Avanzate, Universita del Piemonte Orientale, Alessandria, Italy and IMC-CNR,
Pisa, Italy. E-mail: [email protected].
Tuesday, May 4, 2010
SERGEI L KOSAKOVSKY POND [[email protected]]CSE/BIMM/BENG 181, SPRING 2010
HASHING VS BWT AND OPPORTUNISTIC DATA STRUCTURES
http://genomebiology.com/2009/10/3/R25  Genome Biology 2009, Volume 10, Issue 3, Article R25  Langmead et al. R25.2
Table 1
Bowtie alignment performance versus SOAP and Maq
Program | Platform | CPU time | Wall clock time | Reads mapped per hour (millions) | Peak virtual memory footprint (MB) | Bowtie speed-up | Reads aligned (%)
Bowtie -v 2 | Server | 15 m 7 s | 15 m 41 s | 33.8 | 1,149 | - | 67.4
SOAP | Server | 91 h 57 m 35 s | 91 h 47 m 46 s | 0.10 | 13,619 | 351× | 67.3
Bowtie | PC | 16 m 41 s | 17 m 57 s | 29.5 | 1,353 | - | 71.9
Maq | PC | 17 h 46 m 35 s | 17 h 53 m 7 s | 0.49 | 804 | 59.8× | 74.7
Bowtie | Server | 17 m 58 s | 18 m 26 s | 28.8 | 1,353 | - | 71.9
Maq | Server | 32 h 56 m 53 s | 32 h 58 m 39 s | 0.27 | 804 | 107× | 74.7
The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genomes project (National Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values). For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at one time. CPU, central processing unit.
Table 2
Bowtie alignment performance versus Maq with filtered read set
Program | Platform | CPU time | Wall clock time | Reads mapped per hour (millions) | Peak virtual memory footprint (MB) | Bowtie speed-up | Reads aligned (%)
Bowtie | PC | 16 m 39 s | 17 m 47 s | 29.8 | 1,353 | - | 74.9
Maq | PC | 11 h 15 m 58 s | 11 h 22 m 2 s | 0.78 | 804 | 38.4× | 78.0
Bowtie | Server | 18 m 20 s | 18 m 46 s | 28.3 | 1,352 | - | 74.9
Maq | Server | 18 h 49 m 7 s | 18 h 50 m 16 s | 0.47 | 804 | 60.2× | 78.0
Performance and sensitivity of Bowtie v0.9.6 and Maq v0.6.6 when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit.
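Bowtie's index is built on the Burrows-Wheeler transform. A minimal sketch of the transform itself, via sorted rotations (real aligners derive it from a suffix array instead of materializing every rotation; `bwt` is an illustrative name, not Bowtie's implementation):

```python
def bwt(text):
    # Burrows-Wheeler transform via sorted rotations; '$' is an
    # end-of-text sentinel assumed to sort before every other character.
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)
```

The transform is reversible and groups identical characters with similar contexts together, which is what makes the FM-index both compressible and searchable.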
INEXACT PATTERN MATCHING
Homologous biological sequences are unlikely to match exactly; evolution, for example, drives them apart through mutation.
Exact algorithms (e.g. local alignments) are quadratic in time and are too slow for comparing/searching large genomic sequences.
Pattern matching with errors is a fundamental problem in bioinformatics – finding homologs in a database.
Well-performing heuristics are frequently used.
EXAMPLE: LONGEST COMMON SUBSTRING (LCS) IN INFLUENZA A VIRUS (IAV) H5N1 HEMAGGLUTININ
(N=957 FROM 2005+)
Suffix trees can also be adapted to efficiently find the LCS shared by a given proportion of a set of sequences.
The longest fully conserved nucleotide substring in viruses sampled in 2005 or later is merely 8 nucleotides long
This poses significant challenges for even straightforward tasks, such as diagnostic probe design
[Figure: length of the LCS (y-axis, 0-80 nt) versus the proportion of sequences sharing it (x-axis, 1.0 down to 0.7).]
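The quantity in question, the longest substring shared by at least a proportion p of the sequences, can be illustrated with a brute-force sketch; `lcs_shared` is a hypothetical helper, and a suffix tree computes the same thing far more efficiently than this cubic enumeration:

```python
def lcs_shared(seqs, p=1.0):
    # Brute force: longest substring occurring in at least a fraction p
    # of the sequences. Only for intuition; the suffix-tree method
    # scales to hundreds of hemagglutinin sequences, this does not.
    need = max(1, int(round(p * len(seqs))))
    candidates = {s[i:j] for s in seqs
                  for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    shared = (c for c in candidates if sum(c in s for s in seqs) >= need)
    return max(shared, key=len, default="")
```

With p = 1 this is the classic LCS of all sequences; lowering p traces out the curve in the figure.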
K-DIFFERENCES MATCHING
The k-mismatch problem: given a text T (length m), a pattern P (length n), and a maximum tolerable number of mismatches k, output all locations i in T where P and T[i:i+n-1] differ in at most k positions.
The k-differences problem: a generalization that also allows characters to be matched to indels (each at cost 1).
Both can be easily solved in O(nm) time, by either brute force or dynamic programming
Landau and Vishkin (1985) propose an O(m+nk) time algorithm for the k-differences problem by combining dynamic programming with text and pattern preprocessing using the suffix tree of the concatenation T%P$ (% and $ are separator characters not in the alphabet).
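A minimal sketch of the O(nm) brute-force approach to the k-mismatch problem (illustrative code, not from the slides; positions are 0-based):

```python
def k_mismatch(text, pattern, k):
    # Brute-force k-mismatch: return all start positions i where
    # pattern aligns to text[i:i+len(pattern)] with <= k mismatches.
    n, m = len(pattern), len(text)
    hits = []
    for i in range(m - n + 1):
        mismatches = 0
        for a, b in zip(pattern, text[i:i + n]):
            mismatches += a != b
            if mismatches > k:  # early exit keeps the inner loop short
                break
        if mismatches <= k:
            hits.append(i)
    return hits
```

Each of the O(m) alignments costs O(n) in the worst case, giving the O(nm) bound from the slide.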
QUERY MATCHING
If the pattern is long (e.g. a new gene sequence), it may be beneficial to look for substrings of the pattern that approximately match the reference (e.g. all genes in GenBank).
QUERY MATCHING
Approximately matching strings share some perfectly matching substrings (L-mers).
Instead of searching for approximately matching strings (difficult, quadratic) search for perfectly matching substrings (easy, linear).
Extend obtained perfect matches to obtain longer approximate matches that are locally optimal.
This is the idea behind probably the most important bioinformatics tool: the Basic Local Alignment Search Tool, BLAST (Altschul, Gish, Miller, Myers & Lipman, 1990).
Three primary questions:
How to select L?
How to extend the seed?
How to confirm that the match is biologically relevant?
Query: 22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
+++DN +G + IR L G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words and scores (score threshold T = 13): GVK 18, GAK 16, GIK 16, GGK 14, GLK 13 pass; GNK 12, GRK 11, GEK 11, GDK 11 fall below the threshold and are discarded.
Neighborhood word hits in the database are extended into a High-scoring Pair (HSP).
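Generating the neighborhood of a keyword can be sketched as exhaustive enumeration over same-length words. This toy uses the nucleotide scoring from a later slide (+5 match, -4 mismatch) rather than a protein matrix; `neighborhood` is an illustrative helper, not BLAST's implementation:

```python
from itertools import product

def neighborhood(word, alphabet, score, T):
    # Enumerate every word of the same length whose total score against
    # `word` meets the neighborhood threshold T; `score` is a
    # pair-scoring function (a stand-in for a BLOSUM/PAM lookup).
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        s = sum(score(a, b) for a, b in zip(word, cand))
        if s >= T:
            hits.append(("".join(cand), s))
    return sorted(hits, key=lambda h: -h[1])
```

Raising T shrinks the neighborhood (faster, less sensitive); lowering it grows the neighborhood (slower, more sensitive), exactly the trade-off the figure illustrates.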
SELECTING SEED SIZE L
If strings X and Y (each of length n) match with k < n mismatches, then the longest perfect match between them has at least ceil(n/(k+1)) characters.
Easy to show by the pigeonhole principle: if there are k+1 bins and k objects, then at least one of the bins will be empty.
Partition the strings into k+1 equal-length substrings: at least one of them will contain no mismatches.
In fact, the longest perfect match is expected to be quite a bit longer, at least if the mismatches are randomly distributed: about 40 for n = 100, k = 5, versus the guaranteed minimum of ceil(100/6) = 17.
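Both the guaranteed bound and the "about 40" expectation are easy to check numerically; `longest_exact_run` simulates k uniformly placed mismatch positions (the same randomness assumption the slide makes):

```python
import math
import random

def guaranteed_seed(n, k):
    # Pigeonhole bound: with k mismatches in length n, some window of
    # ceil(n/(k+1)) characters is guaranteed to be mismatch-free.
    return math.ceil(n / (k + 1))

def longest_exact_run(n, k, trials=2000, seed=0):
    # Empirical mean of the longest mismatch-free run when the k
    # mismatch positions are placed uniformly at random.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        cuts = sorted(rng.sample(range(n), k))
        # Gap lengths between consecutive mismatch positions.
        runs = [b - a - 1 for a, b in zip([-1] + cuts, cuts + [n])]
        total += max(runs)
    return total / trials
```

For n = 100, k = 5 the guarantee is 17 while the simulated mean sits in the high 30s, consistent with the slide's rough figure.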
SELECTING SEED SIZE L
Smaller L: seeds are easier to find, but performance and, importantly, specificity decrease – two random sequences are more likely to share a short common substring.
Larger L: could miss out many potential matches, leading to decreased sensitivity.
By default BLAST uses L (w, word size) of 3 for protein sequences and 11 for nucleotide sequences.
MEGABLAST (a faster version of BLAST for similar sequences) uses longer seeds.
HOW TO EXTEND THE MATCH?
Gapped local alignment (blastn)
Simple (gapless) extension (original BLAST)
Greedy X-drop alignment (MEGABLAST)
...
A tradeoff between speed and accuracy
HOW TO SCORE MATCHES?
Biological sequences are not random:
some letters are more frequent than others (e.g. in HIV-1, 40% of the genome is A)
some mismatches are more common than others in homologous sequences (e.g. due to selection, chemical properties of the residues, etc.), and should be weighted differently.
BLAST introduces a weighting function on residues: δ(i,j) which assigns a score to a pair of residues.
For nucleotides it is 5 for i=j and -4 otherwise.
For proteins it is based on a large training dataset of homologous sequences (Point Accepted Mutation, PAM, matrices). PAM120 is roughly equivalent to the substitutions accumulated over 120 million years of evolution in an average protein.
[Figure: HIV-WITHIN amino acid substitution score matrix; rows and columns are the 20 residues A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V.]
HOW TO COMPUTE SIGNIFICANCE?
Before a search is done we need to decide what a good cutoff value H for a match is.
It is determined by computing the probability that two random sequences will have at least one match scoring H or greater.
Uses Altschul-Dembo-Karlin statistics (1990-1991)
STATISTICS OF SCORES
Given a segment pair H between two sequences, comprised of r-character substrings T1 and T2, we compute the score of H as:
We are interested in finding out how likely the maximal score for any segment pair of two random sequences is to exceed some threshold X
Dembo and Karlin (1990) showed that
The mean value of the maximum score between segment pairs of two random sequences (lengths n and m), assuming a few regularity conditions on δ(i,j), is approximately
s(H) = Σ_{i=1..r} δ(T1[i], T2[i])

M = log(nm)/λ*, where λ* is the unique positive solution of Σ_{i,j} p_i q_j exp(λ·δ(i,j)) = 1
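The Karlin-Altschul parameter λ* (the positive root of Σ_{i,j} p_i q_j exp(λ·δ(i,j)) = 1) can be found numerically; a bisection sketch, assuming the expected per-position score Σ p_i q_j δ(i,j) is negative, as the theory requires:

```python
import math

def solve_lambda(p, q, delta, lo=1e-6, hi=10.0, iters=100):
    # Bisection for lambda*: the unique positive root of
    # sum_ij p_i q_j exp(lambda * delta(i, j)) = 1, assuming the
    # expected score sum_ij p_i q_j delta(i, j) is negative.
    def f(lam):
        return sum(p[i] * q[j] * math.exp(lam * delta(i, j))
                   for i in range(len(p)) for j in range(len(q))) - 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

With uniform nucleotide frequencies and the slides' +5/-4 scoring, this yields λ* of about 0.19.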
STATISTICS OF SCORES (CONT’D)
For biological sequences, high-scoring real matches should greatly exceed the random expectation; the probability of exceeding the expected maximum by x (the difference between the observed score and the mean) is
K and λ are expressions that depend on the scoring matrix and letter frequencies, and the distribution is similar to other extreme value distributions.
One can show that the expected number of HSPs (high-scoring segment pairs) exceeding the threshold S′ is
Prob{s(H) > mean + x} ≤ K·exp(−λ*·x)

E′ = K·m·n·exp(−λ·S′)
Tuesday, May 4, 2010
CSE/BIMM/BENG 181, SPRING 2010 SERGEI L KOSAKOVSKY POND [[email protected]]
[Figure: mean HSP score versus log(mn) for random and for mutated sequence pairs (log(mn) from 10 to 25); the mean grows linearly in log(mn).]
2. Secondly, the number of scores exceeding the mean is supposed to follow a Poisson distribution, i.e. decay exponentially as a function of x = score − M_expected. Consider the simulation based on sequences of length 2^17: as you move away from the mean, the number of replicates scoring x points above the mean drops exponentially.
[Figure: histogram of replicate counts (0-400) versus score (75-100) from this simulation.]
E-VALUES
Because thresholds are determined by the algorithm internally, it is better to 'normalize' the result as follows:
BIT SCORE: S = (λ·S′ − ln K) / ln 2
E-VALUE: E = n·m·2^(−S)
The number k of HSPs with scores ≥ S follows a Poisson distribution: Prob{k} = exp(−E)·E^k / k!
Probability of finding at least one: 1 − exp(−E)
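These conversions translate directly into a few lines of code; `bit_score`, `e_value`, and `p_at_least_one` are illustrative names, and λ and K must come from the scoring system in use:

```python
import math

def bit_score(raw_score, lam, K):
    # Normalize a raw HSP score S' into bits using the
    # Karlin-Altschul parameters lambda and K.
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(bits, n, m):
    # Expected number of HSPs at least this strong when comparing
    # a length-n query against a length-m database.
    return n * m * 2.0 ** (-bits)

def p_at_least_one(E):
    # Poisson model: probability of observing >= 1 such HSP.
    return 1.0 - math.exp(-E)
```

Because the score is expressed in bits, the same E-value formula applies regardless of which scoring matrix produced the raw score.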
TIMELINE
1970: Needleman-Wunsch global alignment algorithm
1981: Smith-Waterman local alignment algorithm
1985: FASTA
1990: BLAST (basic local alignment search tool)
2000s: BLAST has become too slow in “genome vs. genome” comparisons - new faster algorithms evolve!
BLAT
Pattern Hunter
BLAT VS. BLAST
BLAT (BLAST-Like Alignment Tool): same idea as BLAST - locate short sequence hits and extend (developed by J Kent at UCSC)
BLAT builds an index of the database and scans linearly through the query sequence, whereas BLAST builds an index of the query sequence and then scans linearly through the database
The index is stored in RAM, resulting in faster searches.
Longer k-mers and greedier extensions are specifically designed for highly similar sequences (e.g. >95% nucleotide identity, >85% protein identity).
BLAT INDEXING
Here is an example with k = 3:
Genome: cacaattatcacgaccgc
Non-overlapping 3-mers: cac aat tat cac gac cgc
Index (3-mer → genome position): aat 3; cac 0,9; cgc 15; gac 12; tat 6

cDNA (query sequence): aattctcac
Overlapping 3-mers (query positions 0-6): aat att ttc tct ctc tca cac

Hits: aat 4; cac 1,10 (position of 3-mer in query, genome)
Clump: cacAATtatCACgaccgc
Note that multiple instances of cac map to a single index entry.
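The indexing scheme above can be sketched directly, using 0-based positions throughout; `build_index` and `find_hits` are illustrative names, not BLAT's API:

```python
def build_index(genome, k=3):
    # BLAT-style index of the NON-overlapping k-mers of the database
    # (genome): k-mer -> list of 0-based start positions.
    index = {}
    for pos in range(0, len(genome) - k + 1, k):
        index.setdefault(genome[pos:pos + k], []).append(pos)
    return index

def find_hits(index, query, k=3):
    # Scan every OVERLAPPING k-mer of the query against the index,
    # yielding (query position, genome position) hit pairs; nearby
    # hits would then be clumped and extended into alignments.
    return [(qpos, gpos)
            for qpos in range(len(query) - k + 1)
            for gpos in index.get(query[qpos:qpos + k], [])]
```

Indexing only non-overlapping k-mers is what keeps BLAT's whole-genome index small enough to hold in RAM.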