Pattern and string matching tools
Biology 162 Computational Genetics
Todd Vision, 9 Sep 2004
Some more pattern and string matching tools
• Simple signatures
  – Logos
  – Position-specific Scoring Matrices
  – PSI-BLAST
• Regular expressions
• Suffix trees
Sequence logos
• Entropy of column j denoted Hj
• Information content denoted Ij
• How to draw a logo
  – Height of column given by Ij
  – Height of each symbol = fij x Ij, where fij is the frequency of symbol i in column j

$$H_j = -\sum_i f_{ij} \log_2 f_{ij}$$

$$I_j = \log_2 20 - H_j$$
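The two formulas can be tried out on a single alignment column. The sketch below is only an illustration: the function name `column_logo_heights` and the example column are made up for this purpose, and no pseudocounts or small-sample corrections are applied.

```python
import math
from collections import Counter

def column_logo_heights(column, alphabet_size=20):
    """Return (H_j, I_j, {symbol: height}) for one alignment column."""
    counts = Counter(column)
    n = len(column)
    # f_ij: observed frequency of each symbol in this column
    freqs = {sym: c / n for sym, c in counts.items()}
    # H_j = -sum_i f_ij * log2(f_ij); unobserved symbols contribute nothing
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # I_j = log2(20) - H_j for proteins
    info = math.log2(alphabet_size) - entropy
    # Each symbol is drawn with height f_ij * I_j
    heights = {sym: f * info for sym, f in freqs.items()}
    return entropy, info, heights

# Hypothetical column from an 8-sequence alignment
h, i, heights = column_logo_heights("LLLLIIVV")
print(round(h, 2))   # 1.5
print(round(i, 2))   # 2.82
```

The symbol heights add up to Ij, so a fully conserved column reaches the maximum of log2 20 bits.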
Information content
• Information/uncertainty is expressed in bits
  – There is a natural relationship to log base 2
• Imagine 64 shells, under one of which is a ball
  – 6 yes/no guesses, each halving the remaining shells, are required to find the ball
  – In this case, maximal uncertainty is log2 64 = 6 bits
• In the case of 20 amino acids, maximal uncertainty is log2 20 = 4.32 bits
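The shell example can be checked with a few lines of arithmetic. This is a small sketch, assuming each guess is a yes/no question that splits the remaining shells in half; `questions_needed` is a name chosen here for illustration.

```python
import math

def questions_needed(n_shells):
    """Count yes/no questions needed when each answer halves the candidates."""
    questions = 0
    remaining = n_shells
    while remaining > 1:
        # Worst case: the ball is in the larger half
        remaining = math.ceil(remaining / 2)
        questions += 1
    return questions

print(questions_needed(64))       # 6
print(round(math.log2(20), 2))    # 4.32
```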
Position-Specific Scoring Matrix
• Constructed from conserved columns of a MSA
• Log odds scores for each residue in each column, based on
  – Frequency of residue within column
  – Background frequency of residues
• Takes advantage of the fact that columns differ in
  – Composition
  – Levels of conservation
Position-Specific Scoring Matrix
pos con |  A   R   N   D   C  … |  A   R   N   D   C  … |  Inf  Pseu
  1  M  | -1  -3  -3  -4  -1  … |  0   0   0   0   0  … | 0.50  0.16
  2  W  | -3  -3  -4  -5  -3  … |  0   0   0   0   0  … | 2.32  0.26
  3  I  | -1  -3  -2  -3   7  … |  0   0   0  36   0  … | 0.71  0.26
  4  L  | -2  -3  -2  -3  -3  … |  0   0   0   0   0  … | 0.47  0.35
  5  A  |  4  -2  -2  -2  -2  … | 56   0   0   0   0  … | 0.52  0.35
PSI-BLAST PSSM for DSCAM
Pseudocounts
• If a residue is never seen in a particular column of an MSA
  – What is the probability of ever seeing it there?
  – Not really zero…
• Pseudocounts are added to actual counts to account for uncertainty in column frequencies
• Many methods
  – Laplace’s Rule
    • Add one to every count
    • Pseudocounts grow less important as sample size gets large
  – Methods related to Bayesian priors - we will see later
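Laplace's Rule is easy to try on a toy column. The sketch below assumes a gap-free column over the 20 standard amino acids; the function name and the example columns are invented for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def laplace_frequencies(column, alphabet=AMINO_ACIDS):
    """Column frequencies after adding one pseudocount to every residue."""
    counts = Counter(column)
    # Each of the len(alphabet) residues receives one extra count
    total = len(column) + len(alphabet)
    return {a: (counts[a] + 1) / total for a in alphabet}

small = laplace_frequencies("AAAA")
large = laplace_frequencies("A" * 1000)
print(round(small["A"], 3), round(small["W"], 3))   # 0.208 0.042
print(round(large["A"], 3), round(large["W"], 3))   # 0.981 0.001
```

With 4 sequences the pseudocounts dominate; with 1000 they barely change the observed frequencies.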
Calculating scores in a PSSM
• Sij is score for residue i at position j
• xij is position-specific count of residue i
• fi is background frequency of residue i
• bij are pseudocounts
• N sequences in alignment
$$S_{ij} = \log_2 \left[ \frac{(x_{ij} + b_{ij}) \left( N + \sum_i b_{ij} \right)^{-1}}{f_i} \right]$$
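A worked instance of the score may help. The sketch below reads the formula as S_ij = log2( ((x_ij + b_ij) / (N + sum_i b_ij)) / f_i ). The numbers are hypothetical: 10 sequences, Laplace pseudocounts (b_ij = 1 for each of 20 residues) and a uniform background of 0.05.

```python
import math

def pssm_score(x_ij, b_ij, n_seqs, total_pseudo, f_i):
    """Log-odds score: pseudocount-adjusted column frequency over background."""
    # Estimated frequency of residue i in column j
    q_ij = (x_ij + b_ij) / (n_seqs + total_pseudo)
    # Compare with the background frequency f_i on a log2 scale
    return math.log2(q_ij / f_i)

# Residue seen in 8 of 10 sequences: strongly favoured
print(round(pssm_score(8, 1, 10, 20, 0.05), 2))   # 2.58
# Residue never seen: penalised, but the score stays finite
print(round(pssm_score(0, 1, 10, 20, 0.05), 2))   # -0.58
```

Without pseudocounts the second case would be log2 of zero, which is the problem the pseudocounts are there to avoid.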
PSI-BLAST
• Can identify more distant homologs than possible via pairwise BLAST
• Iterative BLAST
  – After 1st iteration, multiple alignment is computed for query and top matches
  – PSSM generated from alignment
  – PSSM used for subsequent iterations
  – PSSM refined each iteration
PSI-BLAST
• Once high-scoring words are generated from PSSM, algorithm proceeds as before
  – Still very fast
• λ and K must be recalculated for each iteration
Regular Expressions (regex)
• Can be thought of as a non-probabilistic rule for generating (or matching) a pattern
• Used for
  – DNA/Protein signatures (e.g. Prosite)
  – Text parsing (e.g. in Perl)
Prosite regexes
ID   CBD_FUNGAL; PATTERN.
AC   PS00562;
DT   DEC-1991 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Cellulose-binding domain, fungal type.
PA   C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C

In Perl regex syntax:
CGG\w{4,7}G\w{3}C\w{5}C\w{3,5}[NHG]\w[FYWM]\w{2}QC

In words:
C followed by G followed by G followed by any 4 to 7 letters followed by G followed by any 3 letters followed by C followed by any 5 letters followed by C followed by any 3 to 5 letters followed by one of N, H or G, followed by any letter followed by one of F, Y, W, or M followed by any two letters followed by Q followed by C
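The translation from Prosite notation to regex syntax is mechanical, so it can be sketched in a few lines. The version below uses Python's `re` module, whose syntax for these constructs matches Perl's. It is a partial sketch: it only handles the elements that appear in this pattern (single residues, `x`, `[...]` classes and `(n)` / `(n,m)` repeats), and the test sequence is made up, not a real cellulose-binding domain.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple Prosite PA pattern into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # symbol is x, one residue, or a [class]; the repeat count is optional
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        symbol, lo, hi = m.groups()
        out = r"\w" if symbol == "x" else symbol
        if lo is not None:
            out += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(out)
    return "".join(parts)

pa = "C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C"
regex = prosite_to_regex(pa)
print(regex)

# Hypothetical sequence built to satisfy the pattern
seq = "CGG" + "AAAA" + "G" + "AAA" + "C" + "AAAAA" + "C" + "AAA" + "N" + "A" + "F" + "AA" + "QC"
print(re.search(regex, seq) is not None)   # True
```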
Perl regex metacharacters
• [ ] - character class (e.g. [abc] = a, b or c)
• {min, max} - quantifiers
• {exactly}
• * - repetition, zero or more
• + - repetition, one or more
• ? - optional, zero or one
• . - wildcard (any character)
• ( ) - capture or delimit substrings
• | - alternation (e.g. (a|b) = either a or b)
Regular expressions
Pattern     Matches
a[bc]d      abd, acd
ab{2,5}c    abbc, abbbc, …, abbbbbc
ab*c        ac, abc, abbc, …
ab+c        abc, abbc, …
ab?c        ac, abc
a(bc|de)    abc, ade
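The patterns in the table can be checked directly. The sketch below uses Python's `re.fullmatch`, which behaves like the Perl syntax for these constructs. The helper `check` and the `EXAMPLES` dictionary are names chosen for this illustration. Note that `ab{2,5}c` needs at least two b's, so `abc` is not a match for it.

```python
import re

EXAMPLES = {
    r"a[bc]d": ["abd", "acd"],
    r"ab{2,5}c": ["abbc", "abbbbbc"],
    r"ab*c": ["ac", "abc", "abbc"],
    r"ab+c": ["abc", "abbc"],
    r"ab?c": ["ac", "abc"],
    r"a(bc|de)": ["abc", "ade"],
}

def check(pattern, text):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

for pattern, texts in EXAMPLES.items():
    print(pattern, all(check(pattern, t) for t in texts))

print(check(r"ab{2,5}c", "abc"))   # False: only one b
print(check(r"ab+c", "ac"))        # False: + needs at least one b
```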
Regular expressions: limitations
• Non-probabilistic: all matches match equally well
  – Hidden Markov models improve upon this
• Cannot model dependencies among different positions
  – Neither can HMMs
  – For RNA matches, where dependencies matter, we need to allow more complex rules
Chomsky hierarchy of transformational grammars: a preview
• General theory for modelling strings of symbols used in linguistics
  – Regular grammars
  – Context-free grammars
  – Context-sensitive grammars
  – Unrestricted grammars
• Regular grammars (like regexes) are easy to parse, but are structurally limited
• We will see context-sensitive grammars for modelling RNA sequences
Suffix Trees
• Data structure used for fast matching of sequence patterns
• Helps to explain how BLAST can find word matches so fast
• Commonly used for
  – Exact matching
  – Identifying repeated sequences
Suffix Trees
• Rooted, directed tree for string S
• |S| = m leaves, labeled 1..m
• Edges labeled with substrings of S
• Internal node has at most one edge for each symbol in alphabet
• Concatenation of edge labels on path from root to leaf i equals suffix S[i..m]
Suffix Trees: An Example
S = ‘gatgac’
root
├─ tgac ─ leaf 3
├─ c ─ leaf 6
├─ a
│  ├─ c ─ leaf 5
│  └─ tgac ─ leaf 2
└─ ga
   ├─ c ─ leaf 4
   └─ tgac ─ leaf 1
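The tree for S = ‘gatgac’ can be reproduced with a naive construction that inserts one suffix at a time and splits an edge wherever a new suffix diverges from it. This is a sketch, not the linear-time algorithm: it takes O(m^2) time, and it assumes the last symbol of S occurs nowhere else in S (true here for ‘c’; otherwise append a terminator such as ‘$’). The class and function names are invented for illustration.

```python
class Node:
    def __init__(self):
        self.edges = {}    # first symbol of edge label -> (edge label, child Node)
        self.leaf = None   # 1-based start of the suffix, for leaves only

def build_suffix_tree(s):
    """Naive O(m^2) construction: insert each suffix, splitting edges as needed."""
    if not s or s[-1] in s[:-1]:
        raise ValueError("end the string with a unique terminator such as '$'")
    root = Node()
    for start in range(len(s)):
        node, rest = root, s[start:]
        while True:
            first = rest[0]
            if first not in node.edges:
                # No edge starts with this symbol: hang a new leaf here
                leaf = Node()
                leaf.leaf = start + 1
                node.edges[first] = (rest, leaf)
                break
            label, child = node.edges[first]
            # Length of the common prefix of the edge label and the remaining suffix
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):
                # Mismatch inside the edge: split it at position k
                mid = Node()
                mid.edges[label[k]] = (label[k:], child)
                node.edges[first] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root

def paths(node, prefix=""):
    """Yield (leaf number, concatenated edge labels from the root) pairs."""
    if node.leaf is not None:
        yield node.leaf, prefix
    for label, child in node.edges.values():
        yield from paths(child, prefix + label)

tree = build_suffix_tree("gatgac")
for leaf, suffix in sorted(paths(tree)):
    print(leaf, suffix)
```

Each root-to-leaf path spells out the suffix starting at that leaf's number, and the edge leaving the root for ‘g’ carries the label ‘ga’, shared by leaves 1 and 4.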
Least common ancestor
• LCA of two leaves corresponds to the longest shared prefix of their suffixes (e.g. path labeled ‘ga’ for leaves 1 and 4)
• LCA can be retrieved in constant time, after linear-time preprocessing of the tree
[Figure: suffix tree for S = ‘gatgac’, as in the example above]
If suffix trees are the answer, what is the question?
• Rapid word matching
• Find all occurrences of ‘ga’ in S = ‘gatgac’
[Figure: suffix tree for S = ‘gatgac’; the path ‘ga’ from the root leads to leaves 4 and 1]
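Word matching amounts to walking the word down from the root and reporting the suffixes below the point reached. The sketch below substitutes an uncompressed suffix trie (one symbol per edge) for the suffix tree to keep the code short, and stores the suffix numbers at every node rather than collecting leaves. Both choices are simplifications made here and cost O(m^2) space.

```python
def build_suffix_trie(s):
    """Uncompressed suffix trie; each node records the suffixes passing through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(s)):
        node = root
        for symbol in s[start:]:
            node = node["children"].setdefault(symbol, {"children": {}, "starts": []})
            node["starts"].append(start + 1)   # 1-based suffix number
    return root

def find_all(trie, word):
    """Return the 1-based start positions of every occurrence of word."""
    node = trie
    for symbol in word:
        node = node["children"].get(symbol)
        if node is None:
            return []          # fell off the trie: no occurrence
    return sorted(node["starts"])

trie = build_suffix_trie("gatgac")
print(find_all(trie, "ga"))   # [1, 4]
print(find_all(trie, "ac"))   # [5]
print(find_all(trie, "gg"))   # []
```

The lookup time depends on the length of the word, not on the length of S.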
If suffix trees are the answer, what is the question?
• Longest common substring problem
• Find the starting positions, length and identity of the longest substring that occurs in both S1 and S2
S1 = ‘gatgac’
S2 = ‘gatcac’
[Figure: joint suffix tree for S1 and S2]
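The longest common substring is the deepest point in the joint structure that is reached by suffixes of both strings. The sketch below uses a joint suffix trie (uncompressed, O(m^2) space) in place of a suffix tree, which is a simplification chosen here for brevity. The function names are invented, and when several substrings tie for longest only one of them is reported.

```python
def new_node():
    # first_start maps string id (1 or 2) to the first 1-based start seen
    return {"children": {}, "first_start": {}}

def longest_common_substring(s1, s2):
    """Return (substring, start in s1, start in s2) using a joint suffix trie."""
    root = new_node()
    for string_id, s in enumerate((s1, s2), start=1):
        for start in range(len(s)):
            node = root
            for symbol in s[start:]:
                node = node["children"].setdefault(symbol, new_node())
                node["first_start"].setdefault(string_id, start + 1)
    best = ("", None, None)
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        starts = node["first_start"]
        if len(starts) == 2 and len(prefix) > len(best[0]):
            best = (prefix, starts[1], starts[2])
        for symbol, child in node["children"].items():
            if len(child["first_start"]) == 2:   # follow only paths shared by both strings
                stack.append((child, prefix + symbol))
    return best

print(longest_common_substring("gatgac", "gatcac"))   # ('gat', 1, 1)
```

For the two example strings the answer is ‘gat’, of length 3, starting at position 1 in both.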
If suffix trees are the answer, what is the question?
• Find all direct palindromes (a substring concatenated with its reverse) in S = ‘agattagct’
• Observation
  – Let Sr = ‘tcgattaga’ (S reversed), with m = |S|
  – If a palindrome is centered between q and q+1 of S, then it is also centered between m-q and m-q+1 of Sr.
• Solution
  – Construct joint suffix tree for S and Sr, find least common ancestor for all pairs q+1, m-q+1 (suffix q+1 of S and suffix m-q+1 of Sr)
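The observation can be tested without building the tree. The sketch below replaces each constant-time LCA query with a direct character-by-character comparison of suffix q+1 of S with suffix m-q+1 of Sr. This is a swapped-in, slower stand-in for the suffix tree method, and the function names are invented for illustration.

```python
def common_prefix_length(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def maximal_direct_palindromes(s):
    """Return (q, palindrome) for each centre between positions q and q+1 (1-based)."""
    m = len(s)
    s_rev = s[::-1]
    found = []
    for q in range(1, m):
        # Suffix q+1 of S and suffix m-q+1 of Sr; the 0-based slices start at q and m-q
        k = common_prefix_length(s[q:], s_rev[m - q:])
        if k > 0:
            # The palindrome extends k symbols to each side of the centre
            found.append((q, s[q - k:q + k]))
    return found

print(maximal_direct_palindromes("agattagct"))   # [(4, 'gattag')]
```

For S = ‘agattagct’ the only centre with a match is q = 4, where the shared prefix ‘tag’ gives the palindrome ‘gattag’ (which contains ‘atta’ and ‘tt’).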
Myriad uses for suffix trees
• Direct and inverted repeats
  – Microsatellites
  – Transposons
• Inverted palindromes
  – Restriction enzyme recognition sites
• Imperfect matches
• Algorithmic efficiency
  – Many efficient algorithms for traversing suffix trees
  – The trees themselves can be constructed in O(m) time
Reading assignment (for Tuesday and Thursday)
• Durbin et al. (1998) pgs. 46-79 in Biological Sequence Analysis.
  – Markov chains
  – Hidden Markov models