Pattern and string matching tools

Biology 162 Computational Genetics

Todd Vision, 9 Sep 2004

Some more pattern and string matching tools

• Simple signatures
 – Logos
 – Position-specific Scoring Matrices
 – PSI-BLAST
• Regular expressions
• Suffix trees

Sequence logos

• Entropy of column j denoted Hj (fij = frequency of residue i in column j)

$$H_j = -\sum_i f_{ij} \log_2 f_{ij}$$

• Information content denoted Ij

$$I_j = \log_2 20 - H_j$$

• How to draw a logo
 – Height of column given by Ij
 – Height of each symbol = fij × Ij
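The two formulas and the drawing rule can be sketched in a few lines of Python. This is a minimal illustration, not the code behind any particular logo program; the function name `logo_column` and the two example columns are made up for the purpose.

```python
import math
from collections import Counter

def logo_column(column, alphabet_size=20):
    """Return (H_j, I_j, symbol heights) for one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    freqs = {res: n / total for res, n in counts.items()}
    # Residues absent from the column have f = 0 and contribute nothing to the sum.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    info = math.log2(alphabet_size) - entropy
    heights = {res: f * info for res, f in freqs.items()}
    return entropy, info, heights

# Hypothetical columns: one fully conserved, one split evenly between L and I.
print(logo_column("WWWWWWWW"))
print(logo_column("LLLLIIII"))
```

The conserved column has zero entropy, so its single symbol is drawn at the full log2 20 bits; the split column loses one bit and its two symbols share the remaining height equally.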

Information content

• Information/uncertainty is expressed in bits
 – There is a natural relationship to log base 2
• Imagine 64 shells, under one of which is a ball.
 – 6 yes/no questions, each halving the remaining shells, are required to find the ball
 – In this case, maximal uncertainty is log2 64 = 6 bits
• In the case of 20 amino acids, maximal uncertainty is log2 20 ≈ 4.32 bits.
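The shell example can be checked with a short sketch that counts halving questions. The helper name `yes_no_questions` is hypothetical, and the worst case (always keeping the larger half) is assumed.

```python
import math

def yes_no_questions(n_items):
    """Count halving questions needed to isolate one item among n_items."""
    questions = 0
    remaining = n_items
    while remaining > 1:
        remaining = math.ceil(remaining / 2)  # worst case: keep the larger half
        questions += 1
    return questions

print(yes_no_questions(64), math.log2(64))
print(yes_no_questions(20), round(math.log2(20), 2))
```

For 20 items the count of whole questions is the ceiling of log2 20, which is why the uncertainty in bits need not be an integer.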

Position-Specific Scoring Matrix

• Constructed from conserved columns of a MSA

• Log odds scores for each residue in each column, based on
 – Frequency of residue within column
 – Background frequency of residues
• Takes advantage of the fact that columns differ in
 – Composition
 – Levels of conservation

Position Specific Scoring Matrix

pos con |  A   R   N   D   C  … |  A   R   N   D   C  … | Inf   Pseu
  1  M  | -1  -3  -3  -4  -1  … |  0   0   0   0   0  … | 0.50  0.16
  2  W  | -3  -3  -4  -5  -3  … |  0   0   0   0   0  … | 2.32  0.26
  3  I  | -1  -3  -2  -3   7  … |  0   0   0  36   0  … | 0.71  0.26
  4  L  | -2  -3  -2  -3  -3  … |  0   0   0   0   0  … | 0.47  0.35
  5  A  |  4  -2  -2  -2  -2  … | 56   0   0   0   0  … | 0.52  0.35

PSI-BLAST PSSM for DSCAM

Pseudocounts

• If a residue is never seen in a particular column of an MSA
 – What is the probability of ever seeing it there?
 – Not really zero…
• Pseudocounts are added to actual counts to account for uncertainty in column frequencies
• Many methods
 – Laplace’s Rule
  • Add one to every count
  • Pseudocounts grow less important as sample size gets large
 – Methods related to Bayesian priors - we will see later
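Laplace's Rule is simple enough to sketch directly. The function name `laplace_frequencies` and the example counts are assumptions made for illustration; the two calls compare a small and a large sample to show the pseudocounts fading in importance.

```python
def laplace_frequencies(counts, alphabet):
    """Estimate column frequencies after adding one pseudocount per symbol."""
    total = sum(counts.get(sym, 0) for sym in alphabet) + len(alphabet)
    return {sym: (counts.get(sym, 0) + 1) / total for sym in alphabet}

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Hypothetical columns in which only W is ever observed.
small = laplace_frequencies({"W": 5}, ALPHABET)
large = laplace_frequencies({"W": 500}, ALPHABET)
print(small["W"], small["A"])
print(large["W"], large["A"])
```

With 5 sequences the unseen residue A still receives 1/25 of the probability; with 500 sequences it receives only 1/520.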

Calculating scores in a PSSM

• Sij is score for residue i at position j

• xij is position-specific count of residue i

• fi is background frequency of residue i

• bij are pseudocounts

• N sequences in alignment

$$S_{ij} = \log_2\left[\frac{\left(x_{ij} + b_{ij}\right)\left(N + \sum_i b_{ij}\right)^{-1}}{f_i}\right]$$
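The score formula can be turned into a short sketch for a single column. Several things here are assumptions rather than part of the slides: the function name `pssm_column_scores`, a single pseudocount value shared by every residue, a gap-free column (so N equals the sum of the counts), and a toy four-letter alphabet with a uniform background.

```python
import math

def pssm_column_scores(counts, background, pseudocount=1.0):
    """Log-odds scores S_ij (in bits) for one alignment column j.

    counts      -- observed counts x_ij, keyed by residue
    background  -- background frequencies f_i, keyed by residue
    pseudocount -- b_ij, here the same hypothetical value for every residue
    """
    n = sum(counts.values())                      # N sequences, no gaps assumed
    total_pseudo = pseudocount * len(background)  # sum over i of b_ij
    scores = {}
    for residue, f_i in background.items():
        q_ij = (counts.get(residue, 0) + pseudocount) / (n + total_pseudo)
        scores[residue] = math.log2(q_ij / f_i)
    return scores

# Toy 4-letter alphabet with uniform background, chosen only to keep the numbers small.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
scores = pssm_column_scores({"A": 8}, background)
print({r: round(s, 2) for r, s in scores.items()})
```

The conserved residue scores log2 3 bits and each unseen residue scores log2 (1/3) bits; without pseudocounts the unseen residues would score negative infinity.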

PSI-BLAST

• Can identify more distant homologs than possible via pairwise BLAST

• Iterative BLAST
 – After 1st iteration, multiple alignment is computed for query and top matches
 – PSSM generated from alignment
 – PSSM used for subsequent iterations
 – PSSM refined each iteration

PSI-BLAST

• Once high-scoring words are generated from PSSM, algorithm proceeds as before
 – Still very fast
• The statistical parameters λ and K must be recalculated for each iteration

Regular Expressions (regex)

• Can be thought of as a non-probabilistic rule for generating (or matching) a pattern

• Used for
 – DNA/Protein signatures (e.g. Prosite)
 – Text parsing (e.g. in Perl)

Prosite regexes

ID CBD_FUNGAL; PATTERN.
AC PS00562;
DT DEC-1991 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE Cellulose-binding domain, fungal type.
PA C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C

In Perl regex syntax:
CGG\w{4,7}G\w{3}C\w{5}C\w{3,5}[NHG]\w[FYWM]\w{2}QC

In words:
C followed by G followed by G followed by any 4 to 7 letters followed by G followed by any 3 letters followed by C followed by any 5 letters followed by C followed by any 3 to 5 letters followed by one of N, H or G, followed by any letter followed by one of F, Y, W, or M followed by any two letters followed by Q followed by C
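The translation from the Prosite PA line to a regex is mechanical, and a small sketch in Python (whose `re` syntax agrees with Perl for these constructs) makes the rules explicit. The function `prosite_to_regex` is hypothetical and covers only the elements that appear in this pattern: single residues, `x`, bracketed classes and `(n)` / `(n,m)` repeats. The test peptide is made up.

```python
import re

# One Prosite element: x, a residue, or [class], with an optional (n) or (n,m) repeat.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a Prosite PA line (subset of the syntax) into a regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported Prosite element: {element!r}")
        residue, low, high = match.groups()
        token = r"\w" if residue == "x" else residue  # x = any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(token + repeat)
    return "".join(parts)

PA = "C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C"
regex = prosite_to_regex(PA)
print(regex)

# Made-up peptide that satisfies the pattern; not a real cellulose-binding domain.
toy = "MK" + "CGG" + "AAAA" + "G" + "AAA" + "C" + "AAAAA" + "C" + "AAA" + "N" + "A" + "F" + "AA" + "QC"
print(bool(re.search(regex, toy)))
```

Note that `\w` also accepts digits and underscores, so a stricter translation might use a class of the 20 amino acid letters instead.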

Perl regex metacharacters

• [ ] - character class (e.g. [abc] = a, b or c)
• {min, max} - quantifiers
• {exactly}
• * - repetition, zero or more
• + - repetition, one or more
• ? - optional, zero or one
• . - wildcard (any character)
• ( ) - capture or delimit substrings
• | - alternation (e.g. (a|b) = either a or b)

Regular expressions

Pattern     Matches
a[bc]d      abd, acd
ab{2,5}c    abbc, abbbc, …, abbbbbc
ab*c        ac, abc, abbc, …
ab+c        abc, abbc, …
ab?c        ac, abc
a(bc|de)    abc, ade
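The pattern table can be checked mechanically. The sketch below uses Python's `re` module, which behaves like Perl for these metacharacters; the dictionary `TABLE` and the helper `check` are names invented for the example.

```python
import re

TABLE = {
    "a[bc]d": ["abd", "acd"],
    "ab{2,5}c": ["abbc", "abbbc", "abbbbbc"],
    "ab*c": ["ac", "abc", "abbc"],
    "ab+c": ["abc", "abbc"],
    "ab?c": ["ac", "abc"],
    "a(bc|de)": ["abc", "ade"],
}

def check(table):
    """Map each pattern to True if every listed string matches it in full."""
    return {pat: all(re.fullmatch(pat, s) for s in strings)
            for pat, strings in table.items()}

print(check(TABLE))
# Quantifier bounds are strict: {2,5} needs at least two and at most five b's.
print(re.fullmatch("ab{2,5}c", "abc"), re.fullmatch("ab{2,5}c", "abbbbbbc"))
```

`re.fullmatch` is used rather than `re.search` so that a string counts as a match only when the pattern accounts for all of it.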

Regular expressions: limitations

• Non-probabilistic: all matches match equally well
 – Hidden Markov models improve upon this
• Cannot model dependencies among different positions
 – Neither can HMMs
 – For RNA matches, where dependencies matter, we need to allow more complex rules

Chomsky hierarchy of transformational grammars: a preview

• General theory for modelling strings of symbols used in linguistics
 – Regular grammars
 – Context-free grammars
 – Context-sensitive grammars
 – Unrestricted grammars
• Regular grammars (like regexes) are easy to parse, but are structurally limited
• We will see context-sensitive grammars for modelling RNA sequences

Suffix Trees

• Data structure used for fast matching of sequence patterns

• Helps to explain how BLAST can find word matches so fast

• Commonly used for
 – Exact matching
 – Identifying repeated sequences

Suffix Trees

• Rooted, directed tree for string S
• |S| = m leaves, labeled 1..m
• Edges labelled with substrings of S
• Internal node has at most one edge for each symbol in alphabet
• Concatenation of edge labels on path from root to leaf i equals suffix S[i..m]

Suffix Trees: An Example

S = ‘gatgac’

root
├─ tgac → leaf 3
├─ c → leaf 6
├─ a
│  ├─ c → leaf 5
│  └─ tgac → leaf 2
└─ ga
   ├─ c → leaf 4
   └─ tgac → leaf 1
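The example tree can be reproduced with a naive construction that groups suffixes by first symbol and factors out their common prefix. This is only a sketch under stated assumptions: it is far slower than the linear-time algorithms, the nested-dictionary representation and the name `build_suffix_tree` are invented here, and a suffix that is a prefix of another ends up as a leaf under an empty edge label.

```python
import os

def build_suffix_tree(s):
    """Naive suffix tree as nested dicts: edge label -> subtree, or -> leaf number."""
    def build(suffixes):
        # Group the remaining suffix texts by their first symbol.
        groups = {}
        for text, leaf in suffixes:
            groups.setdefault(text[:1], []).append((text, leaf))
        node = {}
        for group in groups.values():
            if len(group) == 1:
                text, leaf = group[0]
                node[text] = leaf  # unique continuation: a leaf edge
            else:
                # Shared start becomes one edge; the rest goes into a subtree.
                label = os.path.commonprefix([text for text, _ in group])
                node[label] = build([(text[len(label):], leaf) for text, leaf in group])
        return node
    return build([(s[i:], i + 1) for i in range(len(s))])

tree = build_suffix_tree("gatgac")
print(tree)
```

Reading the result from the root, the edges ‘ga’ and ‘a’ lead to internal nodes and the edges ‘tgac’ and ‘c’ lead straight to leaves 3 and 6.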

Least common ancestor

• LCA of two leaves corresponds to the longest shared prefix of their suffixes (e.g. path labeled ‘ga’ for leaves 1 and 4)
• LCA can be retrieved in constant time after linear-time preprocessing of the tree

[Figure: suffix tree for S = ‘gatgac’]
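The correspondence between the LCA and a shared prefix can be illustrated without building the tree at all. The sketch below simply compares the two suffixes directly, which takes time proportional to the prefix length rather than constant time; the function name `lca_path_label` is hypothetical and leaves are numbered from 1 as on the slides.

```python
import os

def lca_path_label(s, i, j):
    """Label of the path from the root to the LCA of leaves i and j (1-based, i != j)."""
    return os.path.commonprefix([s[i - 1:], s[j - 1:]])

print(lca_path_label("gatgac", 1, 4))
print(lca_path_label("gatgac", 2, 5))
print(repr(lca_path_label("gatgac", 3, 6)))
```

Leaves 3 and 6 share no prefix, so their LCA is the root and the label is empty.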

If suffix trees are the answer, what is the question?

• Rapid word matching
• Find all occurrences of ‘ga’ in S = ‘gatgac’

[Figure: suffix tree for S = ‘gatgac’]
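Word matching amounts to walking down from the root along the query and then reading off every leaf below the point reached. The sketch below swaps in an uncompressed suffix trie (one symbol per edge) for the suffix tree, because it is shorter to write; it needs quadratic space and is meant only as an illustration. The names `build_suffix_trie` and `find_all`, and the use of `$` as a leaf marker assumed absent from the sequence alphabet, are choices made here.

```python
def build_suffix_trie(s):
    """Uncompressed suffix trie as nested dicts; '$' stores the 1-based start."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["$"] = start + 1
    return root

def find_all(trie, pattern):
    """Return sorted 1-based start positions of every occurrence of pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below this node is a suffix that begins with the pattern.
    hits, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, value in current.items():
            if key == "$":
                hits.append(value)
            else:
                stack.append(value)
    return sorted(hits)

trie = build_suffix_trie("gatgac")
print(find_all(trie, "ga"))
```

For ‘ga’ the walk ends at the node above leaves 1 and 4, which are exactly the start positions of the two occurrences.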


If suffix trees are the answer, what is the question?

• Longest common substring problem
• Find the starting positions, length and identity of the longest substring that occurs in both S1 and S2

S1 = ‘gatgac’

S2 = ‘gatcac’

[Figure: joint suffix tree for S1 and S2]
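The idea behind the joint tree is that the longest common substring is the deepest path whose node lies above suffixes from both strings. A minimal sketch follows, again using an uncompressed trie in place of a true suffix tree (quadratic space, illustration only). The function name, the node layout and the convention of returning the first occurrence in each string are all assumptions.

```python
def longest_common_substring(s1, s2):
    """Return (substring, 1-based start in s1, 1-based start in s2)."""
    def new_node():
        return {"children": {}, "sources": set()}

    # Insert every suffix of both strings, marking which string passes through each node.
    root = new_node()
    for source, text in ((1, s1), (2, s2)):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, new_node())
                node["sources"].add(source)

    # Depth-first search for the deepest node reached by suffixes of both strings.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            if child["sources"] == {1, 2}:
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    if not best:
        return "", 0, 0
    return best, s1.find(best) + 1, s2.find(best) + 1

print(longest_common_substring("gatgac", "gatcac"))
```

For the two example strings the deepest shared path spells ‘gat’, which starts at position 1 in both.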


If suffix trees are the answer, what is the question?

• Find all direct palindromes (a substring concatenated with its reverse) in S = ‘agattagct’
• Observation
 – Let Sr = ‘tcgattaga’ (the reverse of S) and m = |S|
 – If a palindrome is centered between q and q+1 of S, then it is also centered between m-q and m-q+1 of Sr.
• Solution
 – Construct joint suffix tree for S and Sr, find least common ancestor for all pairs of suffix q+1 of S and suffix m-q+1 of Sr
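The observation can be tried out directly: for each center q, the palindrome's half-length is the length of the prefix shared by suffix q+1 of S and suffix m-q+1 of Sr. The sketch below computes that shared prefix by plain character comparison instead of a joint suffix tree with LCA queries, so it is slower but follows the same logic. The function name `direct_palindromes` is hypothetical, and only the longest palindrome at each center is reported.

```python
def direct_palindromes(s):
    """Return (q, palindrome) for each center between positions q and q+1 (1-based)."""
    m = len(s)
    sr = s[::-1]
    found = []
    for q in range(1, m):
        right = s[q:]       # suffix q+1 of S, reading rightward from the center
        left = sr[m - q:]   # suffix m-q+1 of Sr, reading leftward from the center
        k = 0
        while k < len(right) and k < len(left) and right[k] == left[k]:
            k += 1
        if k > 0:
            found.append((q, s[q - k:q + k]))
    return found

print(direct_palindromes("agattagct"))
```

In ‘agattagct’ the only center with a non-empty match lies between positions 4 and 5, where the shared prefix ‘tag’ gives the palindrome ‘gattag’.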

Myriad uses for suffix trees

• Direct and inverted repeats
 – Microsatellites
 – Transposons
• Inverted palindromes
 – Restriction enzyme recognition sites
• Imperfect matches
• Algorithmic efficiency
 – Many efficient algorithms for traversing suffix trees
 – The trees themselves can be constructed in O(m) time


Reading assignment (for Tuesday and Thursday)

• Durbin et al. (1998) pgs. 46-79 in Biological Sequence Analysis.
 – Markov chains
 – Hidden Markov models
