Introduction to Probabilistic Sequence Models: Theory and Applications

David H. Ardell, Forskarassistent


Page 1: Introduction to Probabilistic Sequence Models: Theory and Applications

Introduction to Probabilistic Sequence Models:

Theory and Applications

David H. Ardell, Forskarassistent

Page 2: Introduction to Probabilistic Sequence Models: Theory and Applications

Lecture Outline: Intro. to Probabilistic Sequence Models

Motif Representations: Consensus Sequences, Motifs and Blocks, Regular Expressions

Probabilistic Sequence Models: profiles, HMMs, SCFG

Page 3: Introduction to Probabilistic Sequence Models: Theory and Applications

Consensus sequences revisited

Consensus sequences make poor summaries: they keep only the most frequent character at each position and discard how often the alternative bases (A, T, C, G) actually occur.

Page 4: Introduction to Probabilistic Sequence Models: Theory and Applications

A motif is a short stretch of protein sequence associated with a particular function (R. Doolittle, 1981)

The first-described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins:

[GA]x(4)GK[ST]

A variety of databases of such motifs exist, such as BLOCKS, PROSITE, and PRINTS, and there are many tools to search proteins for matches to blocks.
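A PROSITE-style pattern like this maps directly onto an ordinary regular expression, since x(4) just means "any four residues". A minimal Python sketch; the test sequence here is made up for illustration:

```python
import re

# PROSITE pattern [GA]-x(4)-G-K-[ST] as a Python regular expression:
# x(4) ("any four residues") becomes .{4}
P_LOOP = re.compile(r"[GA].{4}GK[ST]")

# Illustrative sequence containing a P-loop-like stretch (not a real protein)
seq = "MTDKLVGPSGSGKSTLLR"
match = P_LOOP.search(seq)
if match:
    print(match.group(), match.start())   # GPSGSGKS 6
```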

Page 5: Introduction to Probabilistic Sequence Models: Theory and Applications

Introduction to Regular Expressions (Regexes)

Regular Expressions specify sets of sequences that match a pattern.

Ex: a[bc]a matches "aba" and "aca"

In addition to literals like a and b in the last example, regular expressions provide quantifiers like * (0 or more), + (1 or more), ? (0 or 1) and {N,M} (between N and M):

Ex: a[bc]*a matches "aa", "aba", "acca", "acbcba", etc.

They also provide grouping constructs: character classes like [xy], groups like (this)+, and alternation, where | means "or", as in (this|that)

Anchors match the beginning ^ and end $ of strings
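All of these constructs can be checked directly with Python's re module:

```python
import re

# Quantifiers: * (0 or more), + (1 or more), ? (0 or 1), {N,M} (between N and M)
assert re.fullmatch(r"a[bc]*a", "aa")
assert re.fullmatch(r"a[bc]*a", "acbcba")
assert re.fullmatch(r"a[bc]+a", "aba")
assert not re.fullmatch(r"a[bc]+a", "aa")     # + requires at least one b or c
assert re.fullmatch(r"ab?a", "aa")

# Grouping, alternation, and anchors
assert re.fullmatch(r"(this|that)", "that")
assert re.search(r"^start", "start here")
assert re.search(r"end$", "the end")
print("all patterns behave as described")
```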

Page 6: Introduction to Probabilistic Sequence Models: Theory and Applications

IUPAC DNA ambiguity codes as reg-ex classes

Pyrimidines Y = [CT]
PuRines R = [AG]
Strong S = [CG]
Weak W = [AT]
Keto K = [GT]
aMino M = [AC]
not-A B = [CGT] (B is the next letter after A)
not-C D = [AGT] (D is the next letter after C)
not-G H = [ACT] (H is the next letter after G)
not-T V = [ACG] (V is the next letter after U/T)
Any base N = [ACGT]
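These codes drop straight into a lookup table for expanding an IUPAC consensus into a plain regular expression. A sketch; `iupac_to_regex` is a hypothetical helper name:

```python
import re

# IUPAC DNA ambiguity codes as regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Expand an IUPAC consensus like 'TTRW' into a plain regex."""
    return "".join(IUPAC[base] for base in pattern)

print(iupac_to_regex("TTRW"))   # TT[AG][AT]
```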

Pages 7-18: Introduction to Probabilistic Sequence Models: Theory and Applications

Regular Expressions are like machines that eat sequences one letter at a time

Ex: a[bc]+a matching "ghstuacbah"

[These slides animate a finite-state machine consuming the string one character at a time: the machine loops at the start on anything but a; reading an a moves it to a state that loops on [bc]; any character outside [bc] sends it back to the start; and a closing a after at least one b or c takes it to the End state: MATCH!]
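The machine on these slides can be sketched as an explicit state loop. `match_abc_a` is a hypothetical helper written for this one pattern; a real regex engine compiles a[bc]+a to an equivalent automaton:

```python
# A hand-rolled version of the slides' machine: explicit states for
# "waiting for a", "just saw a", and "inside [bc]+", reading one letter
# at a time.
def match_abc_a(text):
    START, SAW_A, IN_BC = range(3)
    state = START
    for ch in text:
        if state == START:
            if ch == "a":
                state = SAW_A
        elif state == SAW_A:
            if ch in "bc":
                state = IN_BC
            elif ch != "a":          # another 'a' keeps us in SAW_A
                state = START
        else:                        # IN_BC
            if ch == "a":
                return True          # a [bc]+ a completed: MATCH!
            if ch not in "bc":
                state = START
    return False

print(match_abc_a("ghstuacbah"))   # True
```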

Page 19: Introduction to Probabilistic Sequence Models: Theory and Applications

Motifs are almost always either insufficiently sensitive or insufficiently specific

The first-described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins:

[GA]x(4)GK[ST]

Prob. of this motif ≈ (1/10)(1/20)(1/20)(1/10) = 0.000025

Expected number of matches in a database with 3.2 × 10^8 residues: about 8000!

About half of the proteins that match this motif are not NTPases of the P-loop class. (Lack of specificity)
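The arithmetic can be checked in a couple of lines, assuming uniform residue frequencies of 1/20:

```python
# Back-of-the-envelope chance-match probability of [GA]-x(4)-G-K-[ST]:
# [GA] = 2/20, x(4) matches anything, G = 1/20, K = 1/20, [ST] = 2/20
p_motif = (2 / 20) * 1 * (1 / 20) * (1 / 20) * (2 / 20)
database_size = 3.2e8                     # residues

print(p_motif)                            # 2.5e-05
print(p_motif * database_size)            # about 8000 expected chance matches
```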

Page 20: Introduction to Probabilistic Sequence Models: Theory and Applications

Motifs are almost always either insufficiently sensitive or insufficiently specific

[GA]x(4)GK[ST]

Larger and larger alignments of true members of the class give more and more exceptions to the rule (lack of sensitivity)

Extending the rule ([GAT]x(4)[GAF][KTL][STG]) leads to loss of specificity

Page 21: Introduction to Probabilistic Sequence Models: Theory and Applications

A better way to model motifs

REGULAR EXPRESSIONS, e.g. "(TTR[ATC]WT) N{15,22} (TRWWAT)"
- Can find alternative members of a class
- Treat alternative character states as equally likely
- Treat all spacer lengths as equally likely

PROFILES (Position-Specific Score Matrices)

Page 22: Introduction to Probabilistic Sequence Models: Theory and Applications

Profiles turn alignments into probabilistic models

Page 23: Introduction to Probabilistic Sequence Models: Theory and Applications

A graphical view of the same profile:

[Figure: the alignment CCGTL…, CGHSV…, GCGSL…, CGGTL…, CCGSS… redrawn as a profile, with the residues observed at each column stacked by frequency.]

Page 24: Introduction to Probabilistic Sequence Models: Theory and Applications

You can also allow for unobserved residues or bases in a profile by giving them small probabilities:

[Figure: the same style of profile over DNA, with small probabilities assigned to bases that were never observed at a position.]

Page 25: Introduction to Probabilistic Sequence Models: Theory and Applications

The probability that a sequence matches a profile P is the product of its parts:

Profile P, one distribution per position:
pos 1: A 0.8, G 0.1, T 0.1
pos 2: A 0.7, G 0.2, T 0.1
pos 3: G 0.8, C 0.2
pos 4: C 0.7, T 0.2, A 0.1
pos 5: T 0.6, G 0.2, C 0.2

Ex: p(AAGCT | P) = p(A) × p(A) × p(G) × p(C) × p(T) = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.19
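The product rule is a one-liner over per-position distributions. A sketch using the probabilities on this slide (residues not listed at a position get probability 0):

```python
# The profile P from this slide as one distribution per position
profile = [
    {"A": 0.8, "G": 0.1, "T": 0.1},
    {"A": 0.7, "G": 0.2, "T": 0.1},
    {"G": 0.8, "C": 0.2},
    {"C": 0.7, "T": 0.2, "A": 0.1},
    {"T": 0.6, "G": 0.2, "C": 0.2},
]

def prob(seq, profile):
    """p(seq | P): the product of the per-position match probabilities."""
    p = 1.0
    for base, column in zip(seq, profile):
        p *= column.get(base, 0.0)   # unobserved residues contribute 0
    return p

print(round(prob("AAGCT", profile), 2))   # 0.19
```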

Page 26: Introduction to Probabilistic Sequence Models: Theory and Applications

In practice, we compare this probability to that of matching a null model

[Figure: the profile from the previous slides shown above a null model in which every position has the same background distribution over G, A, C, and T.]

Page 27: Introduction to Probabilistic Sequence Models: Theory and Applications

The null model is usually based on a composition.

Null model composition: G 0.25, A 0.25, C 0.25, T 0.25 at every position.

No positional information need be taken into account.

Page 28: Introduction to Probabilistic Sequence Models: Theory and Applications

Example: probabilities of AAGCT with the two models

Profile model: p(AAGCT | P) = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.19

Null model: p(AAGCT | null) = 0.25^5 ≈ 0.00098

Page 29: Introduction to Probabilistic Sequence Models: Theory and Applications

Example: odds ratio of AAGCT with the two models

p(AAGCT | P) ≈ 0.19

p(AAGCT | null) = 0.25^5 ≈ 0.00098

The odds ratio is 0.19 / 0.00098 ≈ 192: it is about 190 times more likely that AAGCT matches the profile than the null model!

Page 30: Introduction to Probabilistic Sequence Models: Theory and Applications

As with substitution scoring matrices, we prefer the log-odds as a profile score:

log2( Pr(AAGCT | P) / Pr(AAGCT | null) ) = log2( 0.19 / 0.00098 ) ≈ log2(192) ≈ 7.6

A positive log-odds (score) indicates a match.
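The whole scoring chain fits in a few lines:

```python
import math

# Log-odds score in bits: log2( p(seq | profile) / p(seq | null) )
p_profile = 0.8 * 0.7 * 0.8 * 0.7 * 0.6   # p(AAGCT | P), from the profile
p_null = 0.25 ** 5                         # uniform-composition null model

score = math.log2(p_profile / p_null)
print(round(score, 1))   # 7.6 bits; a positive score indicates a match
```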

Page 31: Introduction to Probabilistic Sequence Models: Theory and Applications

Digression: interpreting BLAST results

The bit score is a scaled log-odds of homology versus chance

Page 32: Introduction to Probabilistic Sequence Models: Theory and Applications

Digression: interpreting BLAST results

The E-value is the expected number of chance hits with score at least S

Page 33: Introduction to Probabilistic Sequence Models: Theory and Applications

A better way to model motifs

REGULAR EXPRESSIONS, e.g. "(TTR[ATC]WT) N{15,22} (TRWWAT)"
- Can find alternative members of a class
- Treat alternative character states as equally likely
- Treat all spacer lengths as equally likely

PROFILES (Position-Specific Score Matrices)
- Turn a multiple sequence alignment into a multidimensional (by position) multinomial distribution
- Explicit accounting of observed character states
- Cannot handle gaps (separate models must be made for different spacer lengths -- O'Neill and Chiafari 1989)
- Can't be used to make alignments

Page 34: Introduction to Probabilistic Sequence Models: Theory and Applications

Hidden Markov Models

A Hidden Markov Model is a machine that can either parse or emit a family of sequences according to a Markov model

The same symbols can be emitted from different states (A, C, T, G can occur in a promoter, a codon, a terminator, etc.), so we say the states are "hidden"

Example: The Dice Factory

FAIR die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

BIASED die: P(1) = 1/2; P(2) = P(3) = P(4) = P(5) = P(6) = 1/10

Transitions: FAIR→FAIR 0.99, FAIR→BIASED 0.01; BIASED→BIASED 0.70, BIASED→FAIR 0.30

GENERATED: ...11452161621233453261432152211121611112211...

PREDICTED: the decoded hidden state (FAIR or BIASED) behind each roll
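Decoding which die most probably produced each roll is the classic Viterbi computation. A compact sketch for this exact model; the uniform start distribution is an assumption, since the slide does not give one:

```python
import math

# Viterbi decoding for the dice-factory HMM: hidden states FAIR and BIASED,
# emission and transition probabilities as on the slide.
states = ("FAIR", "BIASED")
emit = {
    "FAIR":   {r: 1 / 6 for r in "123456"},
    "BIASED": {"1": 1 / 2, **{r: 1 / 10 for r in "23456"}},
}
trans = {
    "FAIR":   {"FAIR": 0.99, "BIASED": 0.01},
    "BIASED": {"BIASED": 0.70, "FAIR": 0.30},
}
start = {"FAIR": 0.5, "BIASED": 0.5}   # assumption: uniform start

def viterbi(rolls):
    """Most probable hidden-state path, computed in log space."""
    v = {s: math.log(start[s]) + math.log(emit[s][rolls[0]]) for s in states}
    path = {s: [s] for s in states}
    for roll in rolls[1:]:
        v_new, path_new = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            v_new[s] = (v[best_prev] + math.log(trans[best_prev][s])
                        + math.log(emit[s][roll]))
            path_new[s] = path[best_prev] + [s]
        v, path = v_new, path_new
    return path[max(states, key=lambda s: v[s])]

# A long run of 1s decodes as BIASED; a mixed run decodes as FAIR.
print(viterbi("1111111111")[0])   # BIASED
print(viterbi("2356142635")[0])   # FAIR
```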

Page 35: Introduction to Probabilistic Sequence Models: Theory and Applications

Pages 35-38: Introduction to Probabilistic Sequence Models: Theory and Applications

A Profile HMM is a profile with gaps

[These slides build up the profile-HMM diagram: the match states of the profile, then added insertion states (which emit extra residues between match positions), then deletion states (silent states that skip match positions), and finally both together.]

Page 39: Introduction to Probabilistic Sequence Models: Theory and Applications

The HMMer Null Model (the composition of insertions may be set by the user, e.g. to match a genome)

Null composition: G 0.25, A 0.25, C 0.25, T 0.25

Page 40: Introduction to Probabilistic Sequence Models: Theory and Applications

The Plan 7 architecture in HMMer

Permit local matches to sequence

Permit repeated matches to sequence

Permit local matches to model

Page 41: Introduction to Probabilistic Sequence Models: Theory and Applications

HMMer2 (pronounced 'hammer', as in, “Why BLAST if you can hammer?”)

Page 42: Introduction to Probabilistic Sequence Models: Theory and Applications

The HMMer2 design separates models from algorithms

With the same alignment or model design, you can easily change the search algorithm (encoded in the HMM) to do:

Multihit Global alignments of model to sequence

Multihit Smith-Waterman (local with respect to both model and sequence, multiple non-overlapping hits to sequence allowed)

Single (best) hit variants of both of the above.

Page 43: Introduction to Probabilistic Sequence Models: Theory and Applications

This separation of model from algorithm provides a ready framework for sequence analysis (programs provided in HMMer):

hmmalign Align sequences to an existing model.

hmmbuild Build a model from a multiple sequence alignment.

hmmcalibrate Calibrate an HMM: empirically determine parameters that make searches more sensitive by calculating more accurate expectation value scores (E-values).

hmmconvert Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles.

hmmemit Emit sequences probabilistically from a profile HMM.

hmmfetch Get a single model from an HMM database.

hmmindex Index an HMM database.

hmmpfam Search an HMM database for matches to a query sequence.

hmmsearch Search a sequence database for matches to an HMM.
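A typical HMMer2 session strings these programs together; the alignment and database filenames below are hypothetical:

```shell
# Build a profile HMM from a multiple sequence alignment (hypothetical files)
hmmbuild globin.hmm globins.aln

# Calibrate the model so reported E-values are accurate
hmmcalibrate globin.hmm

# Search a sequence database with the model
hmmsearch globin.hmm swissprot.fasta > globin.hits

# Align new sequences to the calibrated model
hmmalign globin.hmm new_globins.fasta
```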

Page 44: Introduction to Probabilistic Sequence Models: Theory and Applications

HMMer2 model files can be automatically converted for use with SAM.