Introduction to Probabilistic Sequence Models: Theory and Applications

David H. Ardell, Forskarassistent


Page 1: Introduction to Probabilistic Sequence Models: Theory and Applications

Introduction to Probabilistic Sequence Models:

Theory and Applications

David H. Ardell, Forskarassistent

Page 2: Introduction to Probabilistic Sequence Models: Theory and Applications

Lecture Outline: Intro. to Probabilistic Sequence Models

Motif Representations: Consensus Sequences, Motifs and Blocks, Regular Expressions

Probabilistic Sequence Models: profiles, HMMs, SCFG

Page 3: Introduction to Probabilistic Sequence Models: Theory and Applications

Consensus sequences revisited

Consensus sequences make poor summaries: they keep only the most frequent character at each position and discard how often the alternative bases (A, T, C, G) actually occur.

Page 4: Introduction to Probabilistic Sequence Models: Theory and Applications

A motif is a short stretch of protein sequence associated with a particular function (R. Doolittle, 1981)

The first-described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins:

[GA]x(4)GK[ST]

A variety of databases of such motifs exist, such as BLOCKS, PROSITE, and PRINTS, and there are many tools to search proteins for matches to blocks.
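A PROSITE-style pattern like this maps directly onto an ordinary regular expression, since x(4) just means "any four residues". A minimal Python sketch; the test sequence here is made up for illustration:

```python
import re

# PROSITE pattern [GA]-x(4)-G-K-[ST] as a Python regular expression:
# x(4) ("any four residues") becomes .{4}
P_LOOP = re.compile(r"[GA].{4}GK[ST]")

# Illustrative sequence containing a P-loop-like stretch (not a real protein)
seq = "MTDKLVGPSGSGKSTLLR"
match = P_LOOP.search(seq)
if match:
    print(match.group(), match.start())   # GPSGSGKS 6
```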

Page 5: Introduction to Probabilistic Sequence Models: Theory and Applications

Introduction to Regular Expressions (Regexes)

Regular Expressions specify sets of sequences that match a pattern.

Ex: a[bc]a matches "aba" and "aca"

In addition to literals like a and b in the last example, regular expressions provide quantifiers like * (0 or more), + (1 or more), ? (0 or 1) and {N,M} (between N and M):

Ex: a[bc]*a matches "aa", "aba", "acca", "acbcba", etc.

They also provide grouping constructs: character classes like [xy], groups like (this)+, and alternation, where | means "or", as in (this|that)

Anchors match the beginning ^ and end $ of strings
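All of these constructs can be checked directly with Python's re module:

```python
import re

# Quantifiers: * (0 or more), + (1 or more), ? (0 or 1), {N,M} (between N and M)
assert re.fullmatch(r"a[bc]*a", "aa")
assert re.fullmatch(r"a[bc]*a", "acbcba")
assert re.fullmatch(r"a[bc]+a", "aba")
assert not re.fullmatch(r"a[bc]+a", "aa")     # + requires at least one b or c
assert re.fullmatch(r"ab?a", "aa")

# Grouping, alternation, and anchors
assert re.fullmatch(r"(this|that)", "that")
assert re.search(r"^start", "start here")
assert re.search(r"end$", "the end")
print("all patterns behave as described")
```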

Page 6: Introduction to Probabilistic Sequence Models: Theory and Applications

IUPAC DNA ambiguity codes as reg-ex classes

Pyrimidines Y = [CT]
PuRines R = [AG]
Strong S = [CG]
Weak W = [AT]
Keto K = [GT]
aMino M = [AC]
not-A B = [CGT] (B is the next letter after A)
not-C D = [AGT] (D is the next letter after C)
not-G H = [ACT] (H is the next letter after G)
not-T V = [ACG] (V is the next letter after U/T)
Any base N = [ACGT]
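These codes drop straight into a lookup table for expanding an IUPAC consensus into a plain regular expression. A sketch; `iupac_to_regex` is a hypothetical helper name:

```python
import re

# IUPAC DNA ambiguity codes as regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(pattern):
    """Expand an IUPAC consensus like 'TTRW' into a plain regex."""
    return "".join(IUPAC[base] for base in pattern)

print(iupac_to_regex("TTRW"))   # TT[AG][AT]
```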

Pages 7-18: Introduction to Probabilistic Sequence Models: Theory and Applications

Regular Expressions are like machines that eat sequences one letter at a time

Ex: a[bc]+a matching "ghstuacbah"

[These slides animate a finite-state machine consuming the string one character at a time: the machine loops at the start on anything but a; reading an a moves it to a state that loops on [bc]; any character outside [bc] sends it back to the start; and a closing a after at least one b or c takes it to the End state: MATCH!]
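The machine on these slides can be sketched as an explicit state loop. `match_abc_a` is a hypothetical helper written for this one pattern; a real regex engine compiles a[bc]+a to an equivalent automaton:

```python
# A hand-rolled version of the slides' machine: explicit states for
# "waiting for a", "just saw a", and "inside [bc]+", reading one letter
# at a time.
def match_abc_a(text):
    START, SAW_A, IN_BC = range(3)
    state = START
    for ch in text:
        if state == START:
            if ch == "a":
                state = SAW_A
        elif state == SAW_A:
            if ch in "bc":
                state = IN_BC
            elif ch != "a":          # another 'a' keeps us in SAW_A
                state = START
        else:                        # IN_BC
            if ch == "a":
                return True          # a [bc]+ a completed: MATCH!
            if ch not in "bc":
                state = START
    return False

print(match_abc_a("ghstuacbah"))   # True
```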

Page 19: Introduction to Probabilistic Sequence Models: Theory and Applications

Motifs are almost always either insufficiently sensitive or insufficiently specific

The first-described and most prominent example is the P-loop, which binds phosphate in ATP/GTP-binding proteins:

[GA]x(4)GK[ST]

Prob. of this motif ≈ (1/10)(1/20)(1/20)(1/10) = 0.000025

Expected number of matches in a database with 3.2 × 10^8 residues: about 8000!

About half of the proteins that match this motif are not NTPases of the P-loop class. (Lack of specificity)
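The arithmetic can be checked in a couple of lines, assuming uniform residue frequencies of 1/20:

```python
# Back-of-the-envelope chance-match probability of [GA]-x(4)-G-K-[ST]:
# [GA] = 2/20, x(4) matches anything, G = 1/20, K = 1/20, [ST] = 2/20
p_motif = (2 / 20) * 1 * (1 / 20) * (1 / 20) * (2 / 20)
database_size = 3.2e8                     # residues

print(p_motif)                            # 2.5e-05
print(p_motif * database_size)            # about 8000 expected chance matches
```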

Page 20: Introduction to Probabilistic Sequence Models: Theory and Applications

Motifs are almost always either insufficiently sensitive or insufficiently specific

[GA]x(4)GK[ST]

Larger and larger alignments of true members of the class give more and more exceptions to the rule (lack of sensitivity)

Extending the rule ([GAT]x(4)[GAF][KTL][STG]) leads to loss of specificity

Page 21: Introduction to Probabilistic Sequence Models: Theory and Applications

A better way to model motifs

REGULAR EXPRESSIONS, e.g. "(TTR[ATC]WT) N{15,22} (TRWWAT)"
- Can find alternative members of a class
- Treat alternative character states as equally likely
- Treat all spacer lengths as equally likely

PROFILES (Position-Specific Score Matrices)

Page 22: Introduction to Probabilistic Sequence Models: Theory and Applications

Profiles turn alignments into probabilistic models

Page 23: Introduction to Probabilistic Sequence Models: Theory and Applications

A graphical view of the same profile:

[Figure: the alignment CCGTL…, CGHSV…, GCGSL…, CGGTL…, CCGSS… redrawn as a profile, with the residues observed at each column stacked by frequency.]

Page 24: Introduction to Probabilistic Sequence Models: Theory and Applications

You can also allow for unobserved residues or bases in a profile by giving them small probabilities:

[Figure: the same style of profile over DNA, with small probabilities assigned to bases that were never observed at a position.]

Page 25: Introduction to Probabilistic Sequence Models: Theory and Applications

The probability that a sequence matches a profile P is the product of its parts:

Profile P, one distribution per position:
pos 1: A 0.8, G 0.1, T 0.1
pos 2: A 0.7, G 0.2, T 0.1
pos 3: G 0.8, C 0.2
pos 4: C 0.7, T 0.2, A 0.1
pos 5: T 0.6, G 0.2, C 0.2

Ex: p(AAGCT | P) = p(A) × p(A) × p(G) × p(C) × p(T) = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.19
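The product rule is a one-liner over per-position distributions. A sketch using the probabilities on this slide (residues not listed at a position get probability 0):

```python
# The profile P from this slide as one distribution per position
profile = [
    {"A": 0.8, "G": 0.1, "T": 0.1},
    {"A": 0.7, "G": 0.2, "T": 0.1},
    {"G": 0.8, "C": 0.2},
    {"C": 0.7, "T": 0.2, "A": 0.1},
    {"T": 0.6, "G": 0.2, "C": 0.2},
]

def prob(seq, profile):
    """p(seq | P): the product of the per-position match probabilities."""
    p = 1.0
    for base, column in zip(seq, profile):
        p *= column.get(base, 0.0)   # unobserved residues contribute 0
    return p

print(round(prob("AAGCT", profile), 2))   # 0.19
```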

Page 26: Introduction to Probabilistic Sequence Models: Theory and Applications

In practice, we compare this probability to that of matching a null model

[Figure: the profile from the previous slides shown above a null model in which every position has the same background distribution over G, A, C, and T.]

Page 27: Introduction to Probabilistic Sequence Models: Theory and Applications

The null model is usually based on a composition.

Null model composition: G 0.25, A 0.25, C 0.25, T 0.25 at every position.

No positional information need be taken into account.

Page 28: Introduction to Probabilistic Sequence Models: Theory and Applications

Example: probabilities of AAGCT with the two models

Profile model: p(AAGCT | P) = 0.8 × 0.7 × 0.8 × 0.7 × 0.6 ≈ 0.19

Null model: p(AAGCT | null) = 0.25^5 ≈ 0.00098

Page 29: Introduction to Probabilistic Sequence Models: Theory and Applications

Example: odds ratio of AAGCT with the two models

p(AAGCT | P) ≈ 0.19

p(AAGCT | null) = 0.25^5 ≈ 0.00098

The odds ratio is 0.19 / 0.00098 ≈ 192: it is about 190 times more likely that AAGCT matches the profile than the null model!

Page 30: Introduction to Probabilistic Sequence Models: Theory and Applications

As with substitution scoring matrices, we prefer the log-odds as a profile score:

log2( Pr(AAGCT | P) / Pr(AAGCT | null) ) = log2( 0.19 / 0.00098 ) ≈ log2(192) ≈ 7.6

A positive log-odds (score) indicates a match.
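The whole scoring chain fits in a few lines:

```python
import math

# Log-odds score in bits: log2( p(seq | profile) / p(seq | null) )
p_profile = 0.8 * 0.7 * 0.8 * 0.7 * 0.6   # p(AAGCT | P), from the profile
p_null = 0.25 ** 5                         # uniform-composition null model

score = math.log2(p_profile / p_null)
print(round(score, 1))   # 7.6 bits; a positive score indicates a match
```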

Page 31: Introduction to Probabilistic Sequence Models: Theory and Applications

Digression: interpreting BLAST results

The bit score is a scaled log-odds of homology versus chance

Page 32: Introduction to Probabilistic Sequence Models: Theory and Applications

Digression: interpreting BLAST results

The E-value is the expected number of chance hits with score at least S

Page 33: Introduction to Probabilistic Sequence Models: Theory and Applications

A better way to model motifs

REGULAR EXPRESSIONS, e.g. "(TTR[ATC]WT) N{15,22} (TRWWAT)"
- Can find alternative members of a class
- Treat alternative character states as equally likely
- Treat all spacer lengths as equally likely

PROFILES (Position-Specific Score Matrices)
- Turn a multiple sequence alignment into a multidimensional (by position) multinomial distribution
- Explicit accounting of observed character states
- Cannot handle gaps (separate models must be made for different spacer lengths -- O'Neill and Chiafari 1989)
- Can't be used to make alignments

Page 34: Introduction to Probabilistic Sequence Models: Theory and Applications

Hidden Markov Models

A Hidden Markov Model is a machine that can either parse or emit a family of sequences according to a Markov model

The same symbols can be emitted from different states (A, C, T, G can occur in a promoter, a codon, a terminator, etc.), so we say the states are "hidden"

Example: The Dice Factory

FAIR die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

BIASED die: P(1) = 1/2; P(2) = P(3) = P(4) = P(5) = P(6) = 1/10

Transitions: FAIR→FAIR 0.99, FAIR→BIASED 0.01; BIASED→BIASED 0.70, BIASED→FAIR 0.30

GENERATED: ...11452161621233453261432152211121611112211...

PREDICTED: the decoded hidden state (FAIR or BIASED) behind each roll
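Decoding which die most probably produced each roll is the classic Viterbi computation. A compact sketch for this exact model; the uniform start distribution is an assumption, since the slide does not give one:

```python
import math

# Viterbi decoding for the dice-factory HMM: hidden states FAIR and BIASED,
# emission and transition probabilities as on the slide.
states = ("FAIR", "BIASED")
emit = {
    "FAIR":   {r: 1 / 6 for r in "123456"},
    "BIASED": {"1": 1 / 2, **{r: 1 / 10 for r in "23456"}},
}
trans = {
    "FAIR":   {"FAIR": 0.99, "BIASED": 0.01},
    "BIASED": {"BIASED": 0.70, "FAIR": 0.30},
}
start = {"FAIR": 0.5, "BIASED": 0.5}   # assumption: uniform start

def viterbi(rolls):
    """Most probable hidden-state path, computed in log space."""
    v = {s: math.log(start[s]) + math.log(emit[s][rolls[0]]) for s in states}
    path = {s: [s] for s in states}
    for roll in rolls[1:]:
        v_new, path_new = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: v[p] + math.log(trans[p][s]))
            v_new[s] = (v[best_prev] + math.log(trans[best_prev][s])
                        + math.log(emit[s][roll]))
            path_new[s] = path[best_prev] + [s]
        v, path = v_new, path_new
    return path[max(states, key=lambda s: v[s])]

# A long run of 1s decodes as BIASED; a mixed run decodes as FAIR.
print(viterbi("1111111111")[0])   # BIASED
print(viterbi("2356142635")[0])   # FAIR
```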

Page 35: Introduction to Probabilistic Sequence Models: Theory and Applications

Pages 35-38: Introduction to Probabilistic Sequence Models: Theory and Applications

A Profile HMM is a profile with gaps

[These slides build up the profile-HMM diagram: the match states of the profile, then added insertion states (which emit extra residues between match positions), then deletion states (silent states that skip match positions), and finally both together.]

Page 39: Introduction to Probabilistic Sequence Models: Theory and Applications

The HMMer Null Model (the composition of insertions may be set by the user, e.g. to match a genome)

Null composition: G 0.25, A 0.25, C 0.25, T 0.25

Page 40: Introduction to Probabilistic Sequence Models: Theory and Applications

The Plan 7 architecture in HMMer

Permit local matches to sequence

Permit repeated matches to sequence

Permit local matches to model

Page 41: Introduction to Probabilistic Sequence Models: Theory and Applications

HMMer2 (pronounced 'hammer', as in, “Why BLAST if you can hammer?”)

Page 42: Introduction to Probabilistic Sequence Models: Theory and Applications

The HMMer2 design separates models from algorithms

With the same alignment or model design, you can easily change the search algorithm (encoded in the HMM) to do:

Multihit Global alignments of model to sequence

Multihit Smith-Waterman (local with respect to both model and sequence, multiple non-overlapping hits to sequence allowed)

Single (best) hit variants of both of the above.

Page 43: Introduction to Probabilistic Sequence Models: Theory and Applications

This separation of model from algorithm provides a ready framework for sequence analysis (programs provided in HMMer):

hmmalign Align sequences to an existing model.

hmmbuild Build a model from a multiple sequence alignment.

hmmcalibrate Calibrate an HMM: empirically determine parameters that make searches more sensitive by calculating more accurate expectation value scores (E-values).

hmmconvert Convert a model file into different formats, including a compact HMMER 2 binary format, and “best effort” emulation of GCG profiles.

hmmemit Emit sequences probabilistically from a profile HMM.

hmmfetch Get a single model from an HMM database.

hmmindex Index an HMM database.

hmmpfam Search an HMM database for matches to a query sequence.

hmmsearch Search a sequence database for matches to an HMM.
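A typical HMMer2 session strings these programs together; the alignment and database filenames below are hypothetical:

```shell
# Build a profile HMM from a multiple sequence alignment (hypothetical files)
hmmbuild globin.hmm globins.aln

# Calibrate the model so reported E-values are accurate
hmmcalibrate globin.hmm

# Search a sequence database with the model
hmmsearch globin.hmm swissprot.fasta > globin.hits

# Align new sequences to the calibrated model
hmmalign globin.hmm new_globins.fasta
```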

Page 44: Introduction to Probabilistic Sequence Models: Theory and Applications

HMMer2 model files can be automatically converted for use with SAM.