pattern and string matching tools

Pattern and string matching tools Biology 162 Computational Genetics Todd Vision 9 Sep 2004


Page 1: Pattern and string matching tools

Pattern and string matching tools

Biology 162 Computational Genetics

Todd Vision

9 Sep 2004

Page 2: Pattern and string matching tools

Some more pattern and string matching tools

• Simple signatures
  – Logos
  – Position-specific Scoring Matrices
  – PSI-BLAST
• Regular expressions
• Suffix trees

Page 3: Pattern and string matching tools
Page 4: Pattern and string matching tools

Sequence logos

• Entropy of column j is denoted Hj, where fij is the frequency of residue i in column j

$$H_j = -\sum_i f_{ij} \log_2 f_{ij}$$

• Information content of column j is denoted Ij

$$I_j = \log_2 20 - H_j$$

• How to draw a logo
  – Height of column j is given by Ij
  – Height of each symbol i = fij × Ij
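The two formulas can be checked on a single column. The following is a minimal Python sketch, assuming the column is given as a plain string of residues with no gaps; the function name `column_information` and the example column are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Return (H_j, I_j, symbol heights) for one alignment column."""
    counts = Counter(column)
    n = len(column)
    freqs = {sym: c / n for sym, c in counts.items()}
    # Entropy: H_j = -sum_i f_ij * log2(f_ij); unseen residues contribute 0.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # Information content: I_j = log2(alphabet size) - H_j.
    info = math.log2(alphabet_size) - entropy
    # Logo height of each symbol: f_ij * I_j.
    heights = {sym: f * info for sym, f in freqs.items()}
    return entropy, info, heights

# Hypothetical column: half L, a quarter I, a quarter V.
h, i, heights = column_information("LLLLIIVV")
print(round(h, 2), round(i, 2))  # 1.5 2.82
```

A fully conserved column has entropy 0 and therefore reaches the maximum height of log2 20 bits.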

Page 5: Pattern and string matching tools

Information content

• Information/uncertainty is expressed in bits
  – There is a natural relationship to log base 2
• Imagine 64 shells, under one of which is a ball.
  – 6 yes/no questions, each halving the remaining shells, are required to find the ball
  – In this case, maximal uncertainty is log2 64 = 6 bits
• In the case of 20 amino acids, maximal uncertainty is log2 20 = 4.32 bits.
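The shell example can be simulated directly. This is a small Python sketch, assuming each guess is a yes/no question of the form "is the ball in the lower half?"; the function name `questions_needed` is made up for illustration.

```python
import math

def questions_needed(n_shells, ball):
    """Count yes/no halving questions needed to locate the ball."""
    lo, hi = 0, n_shells  # the ball is somewhere in [lo, hi)
    questions = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        questions += 1
        if ball < mid:
            hi = mid
        else:
            lo = mid
    return questions

print(questions_needed(64, ball=37))  # 6
print(round(math.log2(20), 2))        # 4.32
```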

Page 6: Pattern and string matching tools

Position-Specific Scoring Matrix

• Constructed from conserved columns of a multiple sequence alignment (MSA)
• Log-odds scores for each residue in each column, based on
  – Frequency of residue within column
  – Background frequency of residues
• Takes advantage of the fact that columns differ in
  – Composition
  – Levels of conservation

Page 7: Pattern and string matching tools

Position Specific Scoring Matrix

pos  con    A   R   N   D   C  …    A   R   N   D   C  …   Inf   Pseu
1    M     -1  -3  -3  -4  -1  …    0   0   0   0   0  …   0.50  0.16
2    W     -3  -3  -4  -5  -3  …    0   0   0   0   0  …   2.32  0.26
3    I     -1  -3  -2  -3   7  …    0   0   0  36   0  …   0.71  0.26
4    L     -2  -3  -2  -3  -3  …    0   0   0   0   0  …   0.47  0.35
5    A      4  -2  -2  -2  -2  …   56   0   0   0   0  …   0.52  0.35

PSI-BLAST PSSM for DSCAM

Page 8: Pattern and string matching tools

Pseudocounts

• If a residue is never seen in a particular column of an MSA
  – What is the probability of ever seeing it there?
  – Not really zero…
• Pseudocounts are added to actual counts to account for uncertainty in column frequencies
• Many methods
  – Laplace’s Rule
    • Add one to every count
    • Pseudocounts grow less important as sample size gets large
  – Methods related to Bayesian priors - we will see later
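Laplace's Rule is easy to illustrate. The sketch below, in Python, assumes a gap-free protein column and the standard 20-letter amino acid alphabet; the function name `laplace_frequencies` and the two test columns are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def laplace_frequencies(column, alphabet=AMINO_ACIDS, pseudocount=1):
    """Column frequencies after adding one pseudocount to every residue count."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

small = laplace_frequencies("LLLL")      # 4 sequences
large = laplace_frequencies("L" * 400)   # 400 sequences

# The unseen residue W keeps a nonzero probability in both cases,
# but the pseudocount matters far less in the larger sample.
print(round(small["W"], 4), round(large["W"], 4))  # 0.0417 0.0024
```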

Page 9: Pattern and string matching tools

Calculating scores in a PSSM

• Sij is score for residue i at position j

• xij is position-specific count of residue i

• fi is background frequency of residue i

• bij are pseudocounts

• N sequences in alignment

$$S_{ij} = \log_2\left[\frac{(x_{ij} + b_{ij})\left(N + \sum_i b_{ij}\right)^{-1}}{f_i}\right]$$
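As a worked example of the score, here is a minimal Python sketch. It assumes Laplace pseudocounts (bij = 1 for each of 20 residues, so the pseudocounts sum to 20) and a uniform background frequency of 0.05; these numbers and the function name `pssm_score` are illustrative assumptions. It omits the sequence weighting and score scaling that PSI-BLAST itself applies.

```python
import math

def pssm_score(x_ij, b_ij, n_seqs, total_pseudo, f_i):
    """S_ij = log2( ((x_ij + b_ij) / (N + sum_i b_ij)) / f_i )."""
    q_ij = (x_ij + b_ij) / (n_seqs + total_pseudo)  # pseudocount-adjusted frequency
    return math.log2(q_ij / f_i)                    # log odds against background

# Hypothetical column from N = 10 sequences.
seen = pssm_score(x_ij=7, b_ij=1, n_seqs=10, total_pseudo=20, f_i=0.05)
unseen = pssm_score(x_ij=0, b_ij=1, n_seqs=10, total_pseudo=20, f_i=0.05)
print(round(seen, 2), round(unseen, 2))  # 2.42 -0.58
```

A residue seen more often than the background predicts gets a positive score; one never seen gets a negative score rather than minus infinity, thanks to the pseudocount.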

Page 10: Pattern and string matching tools

PSI-BLAST

• Can identify more distant homologs than possible via pairwise BLAST

• Iterative BLAST
  – After 1st iteration, multiple alignment is computed for query and top matches
  – PSSM generated from alignment
  – PSSM used for subsequent iterations
  – PSSM refined each iteration

Page 11: Pattern and string matching tools

PSI-BLAST

• Once high-scoring words are generated from PSSM, algorithm proceeds as before
  – Still very fast
• The statistical parameters λ and K must be recalculated for each iteration

Page 12: Pattern and string matching tools

Regular Expressions (regex)

• Can be thought of as a non-probabilistic rule for generating (or matching) a pattern

• Used for
  – DNA/Protein signatures (e.g. Prosite)
  – Text parsing (e.g. in Perl)

Page 13: Pattern and string matching tools

Prosite regexes

ID CBD_FUNGAL; PATTERN.
AC PS00562;
DT DEC-1991 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE Cellulose-binding domain, fungal type.
PA C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C

In Perl regex syntax:
CGG\w{4,7}G\w{3}C\w{5}C\w{3,5}[NHG]\w[FYWM]\w{2}QC

In words:
C followed by G followed by G followed by any 4 to 7 letters followed by G followed by any 3 letters followed by C followed by any 5 letters followed by C followed by any 3 to 5 letters followed by one of N, H or G, followed by any letter followed by one of F, Y, W, or M followed by any two letters followed by Q followed by C
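The translation from Prosite notation to regex syntax is mechanical. The Python sketch below handles only the elements used in this pattern (single residues, `x`, bracketed classes, and `(n)` or `(n,m)` repeats); the function name `prosite_to_regex` and the test sequence are made up for illustration, and other Prosite features such as exclusions in `{}` or anchors are deliberately not supported.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into regex syntax."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        symbol, lo, hi = m.groups()
        regex = r"\w" if symbol == "x" else symbol  # x = any letter, as on the slide
        if lo is not None:
            regex += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(regex)
    return "".join(parts)

regex = prosite_to_regex("C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C")
print(regex)  # CGG\w{4,7}G\w{3}C\w{5}C\w{3,5}[NHG]\w[FYWM]\w{2}QC

# Made-up sequence constructed to satisfy the pattern (not a real protein).
made_up = "CGG" + "AAAA" + "G" + "AAA" + "C" + "AAAAA" + "C" + "AAA" + "N" + "A" + "F" + "AA" + "QC"
print(re.search(regex, made_up) is not None)  # True
```

Note that `\w` also matches digits and the underscore, so a stricter translation might use `[A-Z]` for `x`.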

Page 14: Pattern and string matching tools

Perl regex metacharacters

• [ ] - character class (e.g. [abc] = a, b or c)
• {min,max} - quantifier, between min and max repetitions
• {exactly} - quantifier, exact number of repetitions
• * - repetition, zero or more
• + - repetition, one or more
• ? - optional, zero or one
• . - wildcard (any character)
• ( ) - capture or delimit substrings
• | - alternation (e.g. (a|b) = either a or b)

Page 15: Pattern and string matching tools

Regular expressions

Pattern     Matches
a[bc]d      abd, acd
ab{2,5}c    abbc, abbbc, …, abbbbbc
ab*c        ac, abc, abbc, …
ab+c        abc, abbc, …
ab?c        ac, abc
a(bc|de)    abc, ade
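These patterns can be tried out directly. The sketch below uses Python's `re` module, whose syntax for these particular constructs is the same as Perl's; the helper name `matches` is illustrative, and `re.fullmatch` is used so that the whole string must match the pattern.

```python
import re

EXAMPLES = {
    "a[bc]d": ["abd", "acd"],
    "ab{2,5}c": ["abbc", "abbbbbc"],
    "ab*c": ["ac", "abc", "abbc"],
    "ab+c": ["abc", "abbc"],
    "ab?c": ["ac", "abc"],
    "a(bc|de)": ["abc", "ade"],
}

def matches(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

for pattern, texts in EXAMPLES.items():
    print(pattern, [matches(pattern, t) for t in texts])

# Near misses: too few b's for {2,5}, none for +, too many for ?.
print(matches("ab{2,5}c", "abc"), matches("ab+c", "ac"), matches("ab?c", "abbc"))
```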

Page 16: Pattern and string matching tools

Regular expressions: limitations

• Non-probabilistic: all matches match equally well
  – Hidden Markov models improve upon this
• Cannot model dependencies among different positions
  – Neither can HMMs
  – For RNA matches, where dependencies matter, we need to allow more complex rules

Page 17: Pattern and string matching tools

Chomsky hierarchy of transformational grammars: a preview

• General theory for modelling strings of symbols used in linguistics
  – Regular grammars
  – Context-free grammars
  – Context-sensitive grammars
  – Unrestricted grammars
• Regular grammars (like regexes) are easy to parse, but are structurally limited
• We will see context-sensitive grammars for modelling RNA sequences

Page 18: Pattern and string matching tools

Suffix Trees

• Data structure used for fast matching of sequence patterns

• Helps to explain how BLAST can find word matches so fast

• Commonly used for
  – Exact matching
  – Identifying repeated sequences

Page 19: Pattern and string matching tools

Suffix Trees

• Rooted, directed tree for string S
• Exactly m = |S| leaves, labeled 1..m
• Edges labelled with substrings of S
• Internal node has at most one outgoing edge beginning with each symbol in alphabet
• Concatenation of edge labels on path from root to leaf i equals suffix S[i..m]

Page 20: Pattern and string matching tools

Suffix Trees: An Example

S = ‘gatgac’

root
  tgac → leaf 3
  c → leaf 6
  a
    c → leaf 5
    tgac → leaf 2
  ga
    c → leaf 4
    tgac → leaf 1

Page 21: Pattern and string matching tools

Least common ancestor

• LCA of two leaves corresponds to the shared prefix of their suffixes (e.g. path labeled ‘ga’ for leaves 1 and 4)
• After linear-time preprocessing of the tree, the LCA can be retrieved in constant time

[Figure: suffix tree for S = ‘gatgac’]

Page 22: Pattern and string matching tools

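If suffix trees are the answer, what is the question?

• Rapid word matching
• Find all occurrences of ‘ga’ in S = ‘gatgac’
  – Follow the path labeled ‘ga’ from the root; the leaves below it (4 and 1) are the starting positions of the matches

[Figure: suffix tree for S = ‘gatgac’]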
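The word-matching idea can be sketched in Python. For brevity this uses an uncompressed suffix trie (one character per edge) built naively in O(m²) time, rather than a true suffix tree built in O(m); the function names are illustrative, and the code assumes the input does not contain the terminator character `$`.

```python
def build_suffix_trie(s):
    """Naive suffix trie: nested dicts, one character per edge."""
    root = {}
    text = s + "$"  # terminator so every suffix ends at its own leaf
    for start in range(len(s)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based start position of this suffix
    return root

def leaves_below(node):
    """Collect the leaf labels in the subtree under node."""
    labels = []
    for key, child in node.items():
        if key == "leaf":
            labels.append(child)
        else:
            labels.extend(leaves_below(child))
    return labels

def find_occurrences(s, pattern):
    """1-based start positions of pattern in s, via the suffix trie."""
    node = build_suffix_trie(s)
    for ch in pattern:
        if ch not in node:
            return []  # the pattern falls off the trie: no match
        node = node[ch]
    return sorted(leaves_below(node))

print(find_occurrences("gatgac", "ga"))  # [1, 4]
```

Once the trie exists, the walk down the pattern takes time proportional to the pattern length, independent of the length of S.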

Page 23: Pattern and string matching tools

If suffix trees are the answer, what is the question?

• Longest common substring problem
• Find the starting positions, length and identity of the longest substring that occurs in both S1 and S2

S1 = ‘gatgac’

S2 = ‘gatcac’

[Figure: joint suffix tree for S1 and S2]
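A small Python sketch of the longest common substring problem follows. As a simplification it builds a naive suffix trie of S1 only and then walks each suffix of S2 down it, which takes quadratic time; a joint suffix tree would solve the same problem in linear time. The function name and the returned tuple layout are illustrative choices.

```python
def longest_common_substring(s1, s2):
    """Return (substring, 1-based start in s1, 1-based start in s2)."""
    # Trie of all suffixes of s1; each node remembers one start position in s1.
    root = {}
    for start in range(len(s1)):
        node = root
        for ch in s1[start:]:
            node = node.setdefault(ch, {"start": start + 1})

    best = ("", None, None)
    for start2 in range(len(s2)):
        node, depth = root, 0
        # Follow this suffix of s2 down the trie as far as it matches.
        while start2 + depth < len(s2) and s2[start2 + depth] in node:
            node = node[s2[start2 + depth]]
            depth += 1
        if depth > len(best[0]):
            best = (s2[start2:start2 + depth], node["start"], start2 + 1)
    return best

print(longest_common_substring("gatgac", "gatcac"))  # ('gat', 1, 1)
```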

Page 24: Pattern and string matching tools

If suffix trees are the answer, what is the question?

• Find all direct palindromes (a substring concatenated with its reverse) in S = ‘agattagct’
• Observation
  – Let Sr = ‘tcgattaga’ (the reverse of S) and m = |S|
  – If a palindrome is centered between q and q+1 of S, then it is also centered between m-q and m-q+1 of Sr.
• Solution
  – Construct joint suffix tree for S and Sr, find least common ancestor for all pairs (suffix q+1 of S, suffix m-q+1 of Sr)
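The observation can be checked with a short Python sketch. It compares suffix q+1 of S with suffix m-q+1 of Sr character by character, which is a naive stand-in for the constant-time least-common-ancestor query on a joint suffix tree; the function name `even_palindromes` is illustrative.

```python
def even_palindromes(s):
    """Maximal palindromes (substring + its reverse) as (q, palindrome) pairs.

    q is the 1-based position such that the center lies between q and q+1.
    """
    m = len(s)
    s_rev = s[::-1]
    results = []
    for q in range(1, m):
        right = s[q:]          # suffix q+1 of S (reads rightward from the center)
        left = s_rev[m - q:]   # suffix m-q+1 of Sr (reads leftward from the center)
        radius = 0
        # Length of the common prefix = what the LCA depth would report.
        while radius < min(len(left), len(right)) and left[radius] == right[radius]:
            radius += 1
        if radius > 0:
            results.append((q, s[q - radius:q + radius]))
    return results

print(even_palindromes("agattagct"))  # [(4, 'gattag')]
```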

Page 25: Pattern and string matching tools

Myriad uses for suffix trees

• Direct and inverted repeats
  – Microsatellites
  – Transposons
• Inverted palindromes
  – Restriction enzyme recognition sites
• Imperfect matches
• Algorithmic efficiency
  – Many efficient algorithms for traversing suffix trees
  – The trees themselves can be constructed in O(m) time

Page 26: Pattern and string matching tools


Page 27: Pattern and string matching tools

Reading assignment (for Tuesday and Thursday)

• Durbin et al. (1998) pgs. 46-79 in Biological Sequence Analysis.
  – Markov chains
  – Hidden Markov models