Welcome to Introduction to Computational Genomics for Infectious Disease

Course Instructors

• Instructor

James Galagan

• Teaching Assistants

• Lab Instructors

Brian Weiner Desmond Lun

Antonis Rokas Mark Borowsky Jeremy Zucker

Reinhard Engels Aaron Brandes Caroline Colijn

Other members of Broad Microbial Analysis Group

Schedule and Logistics

• Lectures
– Tues/Thurs 11-12:30
– Harvard School of Public Health: FXB-301
– The François-Xavier Bagnoud Center, Room 301

• Labs
– Wed/Fri 1-3
– Broad Institute: Olympus Room
– First floor of Broad Main Lobby
– See front desk attendant near entrance
– Individual computers and software provided
– No programming experience required

Website

• Contact information
• Directions to Broad
• Lecture slides
• Lab handouts
• Resources

www.broad.mit.edu/annotation/winter_course_2006/

Goals of Course

• Introduction to concepts behind commonly used computational tools

• Recognize connection between different concepts and applications

• Hands on experience with computational analysis

Concepts and Applications

• Lectures will cover concepts
– Computationally oriented

• Labs will provide opportunity for hands-on application of tools
– Nuts and bolts of running tools
– Application of tools not covered in lectures

Computational Genomics Overview

Slide Credit: Manolis Kellis

Topics

1. Probabilistic Sequence Modeling

2. Clustering and Classification

3. Motifs

4. Steady State Metabolic Modeling

Topics Not Covered

• Sequence Alignment
• Phylogeny (maybe in labs)
• Molecular Evolution
• Population Genetics

• Advanced Machine Learning
– Bayesian Networks
– Conditional Random Fields

Applications to Infectious Disease

• Examples and labs will focus on the analysis of microbial genomics data
– Pathogenicity islands
– TB expression analysis
– Antigen prediction
– Mycolic acid metabolism

• But approaches are applicable to any organism and to many different questions

Probabilistic Modeling of Biological Sequences

Concepts
• Statistical Modeling of Sequences
• Hidden Markov Models

Applications
• Predicting pathogenicity islands
• Modeling protein families

Lab Practical
• Basic sequence annotation

Probabilistic Sequence Modeling

• Treat objects of interest as random variables
– nucleotides, amino acids, genes, etc.

• Model probability distributions for these variables

• Use probability calculus to make inferences

Why Probabilistic Sequence Modeling?

• Biological data is noisy

• Probability provides a calculus for manipulating models

• Not limited to yes/no answers – can provide “degrees of belief”

• Many common computational tools based on probabilistic models

Sequence Annotation

[Slide build: a raw genomic DNA sequence (beginning GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCC...) is shown first without annotation, then with a Gene highlighted, then with a Gene, a Promoter Motif, and a Kinase Domain highlighted.]

Probabilistic Sequence Modeling

• Hidden Markov Models (HMM)
– A general framework for sequences of symbols (e.g. nucleotides, amino acids)
– Widely used in computational genomics

1. Hmmer – HMMs for protein families

2. Pathogenicity Islands

Neisseria meningitidis, 52% G+C

(from Tettelin et al. 2000. Science)

GC Content

Pathogenicity Islands

• Clusters of genes acquired by horizontal transfer
– Present in pathogenic species but not others

• Frequently encode virulence factors
– Toxins, secondary metabolites, adhesins

• (Flanked by repeats, gene content, phylogeny, regulation, codon usage)

• Different GC content than rest of genome

Application: Bacillus subtilis

Modeling Sequence Composition

• Calculate sequence distribution from known islands
– Count occurrences of A,T,G,C

• Model islands as nucleotides drawn independently from this distribution

P(Si|MP):
A: 0.15
T: 0.13
G: 0.30
C: 0.42

[Figure: every position of the sequence ... C C T A A G T T A G A G G A T T G A G A ... is drawn independently from the same distribution P(Si|MP).]
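The counting step can be sketched in a few lines of Python. This is a minimal illustration rather than the course's lab code: the function name `nucleotide_frequencies` and the two example sequences are hypothetical stand-ins for real known-island sequences.

```python
from collections import Counter

def nucleotide_frequencies(sequences):
    """Estimate P(A), P(T), P(G), P(C) by counting bases across all sequences."""
    counts = Counter()
    for seq in sequences:
        # Ignore anything that is not one of the four nucleotides (e.g. N).
        counts.update(base for base in seq.upper() if base in "ATGC")
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ATGC"}

# Hypothetical "known island" sequences, for illustration only.
known_islands = ["GCCGCGGCAT", "CCGGTCGCCA"]
island_model = nucleotide_frequencies(known_islands)
print(island_model)
```

With real data, `known_islands` would hold the annotated island sequences, and the resulting dictionary plays the role of the distribution P(Si|MP).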

The Probability of a Sequence

• Can calculate the probability of a particular sequence (S) according to the pathogenicity island model (MP)

$$P(S \mid MP) = P(S_1, S_2, \ldots, S_N \mid MP) = \prod_{i=1}^{N} P(S_i \mid MP)$$

Example

S = AAATGCGCATTTCGAA

$$P(S \mid MP) = P(A)^6 \, P(T)^4 \, P(G)^3 \, P(C)^3 = (0.15)^6 (0.13)^4 (0.30)^3 (0.42)^3 \approx 6.5 \times 10^{-12}$$

A: 0.15
T: 0.13
G: 0.30
C: 0.42
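The product over positions is easy to compute directly. The sketch below assumes the island frequencies from the slide; the function names are illustrative, and the log-space variant is an addition that is useful because products of many small probabilities underflow on long sequences.

```python
import math

# Island model (MP) nucleotide frequencies, as given on the slide.
ISLAND_MODEL = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}

def sequence_probability(seq, model):
    """P(S | model) under the independence assumption: a product over positions."""
    return math.prod(model[base] for base in seq)

def sequence_log_probability(seq, model):
    """The same quantity in log space, which avoids underflow on long sequences."""
    return sum(math.log(model[base]) for base in seq)

S = "AAATGCGCATTTCGAA"
p = sequence_probability(S, ISLAND_MODEL)
print(f"P(S|MP) = {p:.3g}")
```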

Sequence Classification

PROBLEM: Given a sequence, is it an island?
– We can calculate P(S|MP), but what is a sufficient P value?

SOLUTION: compare to a null model and calculate log-likelihood ratio
– e.g. background DNA distribution model, B

Pathogenicity Islands (MP): A: 0.15, T: 0.13, G: 0.30, C: 0.42
Background DNA (B): A: 0.25, T: 0.25, G: 0.25, C: 0.25

$$\text{Score} = \log \frac{P(S \mid MP)}{P(S \mid B)} = \log \prod_{i=1}^{N} \frac{P(S_i \mid MP)}{P(S_i \mid B)} = \sum_{i=1}^{N} \log \frac{P(S_i \mid MP)}{P(S_i \mid B)}$$

Score Matrix (log base 2 of the island/background ratio for each nucleotide)
A: -0.73
T: -0.94
G: 0.26
C: 0.74
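Because the score is a sum of per-nucleotide terms, it can be computed by a table lookup. The following is a minimal sketch under the two distributions above; the names `SCORE` and `log_odds_score` and the two test strings are illustrative.

```python
import math

ISLAND = {"A": 0.15, "T": 0.13, "G": 0.30, "C": 0.42}
BACKGROUND = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

# Per-nucleotide score matrix: log2 of the island/background probability ratio.
SCORE = {base: math.log2(ISLAND[base] / BACKGROUND[base]) for base in "ATGC"}

def log_odds_score(seq):
    """Log-likelihood ratio of seq under the island model versus background."""
    return sum(SCORE[base] for base in seq)

# Positive scores favor the island model, negative scores favor background.
for seq in ("GCCGCGGC", "ATTATAAT"):
    print(seq, round(log_odds_score(seq), 2))
```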

Finding Islands in Sequences

• Could use the log-likelihood ratio on windows of fixed size
– What if islands have variable length?

• We prefer a model for entire sequence

TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCAGACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC

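A More Complex Model

States: Background (B) and Island (P)

Transition probabilities
– Background → Background: 0.85
– Background → Island: 0.15
– Island → Island: 0.75
– Island → Background: 0.25

Emission probabilities
– Background: A: 0.25, T: 0.25, G: 0.25, C: 0.25
– Island: A: 0.15, T: 0.13, G: 0.30, C: 0.42

[Figure: every nucleotide of the sequence is assigned one of the two hidden states, B or P.]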

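A Generative Model

[Figure: a path through the hidden states B and P emits the sequence S: C A A A T G C G, one nucleotide per state.]

Emission probabilities
P(S|P): A: 0.42, T: 0.30, G: 0.13, C: 0.15
P(S|B): A: 0.25, T: 0.25, G: 0.25, C: 0.25

Transition probabilities P(Li+1|Li)

        Bi+1    Pi+1
Bi      0.85    0.15
Pi      0.25    0.75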
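"Generative" means the model can be run forward to produce labeled sequences. The sketch below samples from the two-state model using the tables on this slide. Starting in the background state with probability 1 is an assumption (the slide does not give initial probabilities), and the function name `generate` is illustrative.

```python
import random

STATES = ("B", "P")
START = {"B": 1.0, "P": 0.0}  # assumption: always start in background
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def generate(n, states, start, trans, emit, rng):
    """Sample n hidden labels and the nucleotides they emit."""
    labels, seq = [], []
    state = rng.choices(states, weights=[start[s] for s in states])[0]
    for _ in range(n):
        labels.append(state)
        bases = list(emit[state])
        seq.append(rng.choices(bases, weights=[emit[state][b] for b in bases])[0])
        # Pick the next hidden state according to the transition table.
        state = rng.choices(states, weights=[trans[state][s] for s in states])[0]
    return "".join(labels), "".join(seq)

labels, seq = generate(20, STATES, START, TRANS, EMIT, random.Random(0))
print("L:", labels)
print("S:", seq)
```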

A Hidden Markov Model

Hidden States: L = { 1, ..., K }

Transition probabilities: aij = transition probability from state i to state j

Emission probabilities: ei(b) = P(emitting b | state = i)

Initial state probabilities: π(i) = P(first state = i)

[Figure: State i and State j connected by transition probabilities, each emitting symbols according to its emission probabilities ei(b) and ej(b).]

What can we do with this model?

The model defines a joint probability over labels and sequences, P(L,S)

Implicit in model is what labels “tend to go” with what sequences (and vice versa)

Rules of probability allow us to use this model to analyze existing sequences

Fundamental HMM Operations

Each operation is listed as a computation, followed by its use in biology.

Decoding
• Given an HMM and sequence S
• Find a corresponding sequence of labels, L
• Biology: Annotate pathogenicity islands on a new sequence

Evaluation
• Given an HMM and sequence S
• Find P(S|HMM)
• Biology: Score a particular sequence (not as useful for this model – will come back to this later)

Training
• Given an HMM w/o parameters and set of sequences S
• Find transition and emission probabilities that maximize P(S | params, HMM)
• Biology: Learn a model for sequence composed of background DNA and pathogenicity islands

The Hidden in HMM

• DNA does not come conveniently labeled (i.e. Island, Gene, Promoter)

• We observe nucleotide sequences

• The hidden in HMM refers to the fact that state labels, L, are not observed
– Only observe emissions (e.g. nucleotide sequence in our example)

State i State j

…A A G T T A G A G…

“Decoding” With HMM

Pathogenicity Island Example

Given a nucleotide sequence, we want a labeling of each nucleotide as either “pathogenicity island” or “background DNA”

Given observables, we would like to predict a sequence of hidden states that is most likely to have generated that sequence

The Most Likely Path

• Given a sequence, one reasonable choice for a labeling is:

$$L^* = \arg\max_{\text{labels}} P(\text{Labels}, \text{Sequence} \mid \text{Model})$$

The sequence of labels, L*, (or path) that makes the labels and sequence most likely given the model

Probability of a Path, Seq

L: B B B B B B B B
S: G C A A A T G C

Every transition is B → B (0.85) and every emission comes from the background state (0.25):

$$P = P(G \mid B)\,P(B_1 \mid B_0)\,P(C \mid B)\,P(B_2 \mid B_1)\,P(A \mid B)\,P(B_3 \mid B_2) \cdots P(C \mid B_7) = (0.85)^7 (0.25)^8 = 4.9 \times 10^{-6}$$

Probability of a Path, Seq

L: B B B P P P B B
S: G C A A A T G C

Transitions: 0.85, 0.85, 0.15, 0.75, 0.75, 0.25, 0.85
Emissions: 0.25, 0.25, 0.25, 0.42, 0.42, 0.30, 0.25, 0.25

$$P = P(G \mid B)\,P(B_1 \mid B_0)\,P(C \mid B)\,P(B_2 \mid B_1)\,P(A \mid B)\,P(P_3 \mid B_2) \cdots P(C \mid B_7) = (0.85)^3 (0.25)^6 (0.75)^2 (0.42)^2 (0.30)(0.15) = 6.7 \times 10^{-7}$$

We could try to calculate the probability of every path, but with 2 states and N positions there are 2^N possible paths.
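The joint probability of one path and the sequence is a product that alternates emission and transition terms. A minimal sketch, assuming the transition and emission tables of the generative model slide and, as in the worked examples, no separate factor for the first state; `joint_probability` is an illustrative name.

```python
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def joint_probability(labels, seq, trans, emit):
    """P(labels, seq | model) for one specific path of hidden states."""
    # Assumption: the initial-state factor is taken as 1, as in the worked examples.
    p = emit[labels[0]][seq[0]]
    for prev, cur, base in zip(labels, labels[1:], seq[1:]):
        p *= trans[prev][cur] * emit[cur][base]
    return p

S = "GCAAATGC"
p_background = joint_probability("BBBBBBBB", S, TRANS, EMIT)
p_island = joint_probability("BBBPPPBB", S, TRANS, EMIT)
print(f"all background: {p_background:.2g}")
print(f"with island:    {p_island:.2g}")
```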

Decoding

• Viterbi Algorithm
– Finds most likely sequence of labels, L*, given sequence and model
– Uses dynamic programming (same technique used in sequence alignment)
– Much more efficient than searching every path

$$L^* = \arg\max_{\text{labels}} P(\text{Labels}, \text{Sequence} \mid \text{Model})$$
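The Viterbi recursion can be sketched compactly for the two-state island model. This is an illustrative implementation, not the one used in any particular tool: the uniform initial probabilities in `START` are an assumption, and plain probabilities are used for readability (practical implementations work in log space to avoid underflow).

```python
STATES = ("B", "P")
START = {"B": 0.5, "P": 0.5}  # assumption: uniform initial state probabilities
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def viterbi(seq, states, start, trans, emit):
    """Return the most likely label path and its joint probability with seq."""
    # v[s]: probability of the best path that ends in state s at the current position.
    v = {s: start[s] * emit[s][seq[0]] for s in states}
    backpointers = []
    for base in seq[1:]:
        prev_v, v, ptr = v, {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev_v[r] * trans[r][s])
            v[s] = prev_v[best_prev] * trans[best_prev][s] * emit[s][base]
            ptr[s] = best_prev
        backpointers.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: v[s])
    best_prob = v[last]
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return "".join(path), best_prob

path, prob = viterbi("GCAAATGC", STATES, START, TRANS, EMIT)
print(path, prob)
```

The work grows linearly with sequence length (and quadratically with the number of states), instead of exponentially as it would when enumerating every path.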


Probability of a Single Label

• Calculate most probable label, L*i, at each position i

• Doing this for all N positions gives us {L*1, L*2, L*3, ..., L*N}

S: G C A A A T G C

• P(Label5 = B | S): sum over all paths that have label B at position 5

• Forward algorithm (dynamic programming)

Two Decoding Options

• Viterbi Algorithm
– Finds most likely sequence of labels, L*, given sequence and model

• Posterior Decoding
– Finds most likely label at each position for all positions, given sequence and model: {L*1, L*2, L*3, ..., L*N}
– Forward and Backward equations

$$L_i^* = \arg\max_{\text{label}} P(\text{Label}_i \mid \text{Sequence}, \text{Model})$$
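Posterior decoding combines a forward pass and a backward pass to get P(Label_i | S) at every position. The sketch below is a plain-probability illustration for short sequences. It assumes uniform initial probabilities and no explicit end state, neither of which is specified on the slides, and the function names are illustrative.

```python
STATES = ("B", "P")
START = {"B": 0.5, "P": 0.5}  # assumption: uniform initial state probabilities
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def forward(seq, states, start, trans, emit):
    """f[i][s] = P(x_1..x_i, state_i = s)."""
    f = [{s: start[s] * emit[s][seq[0]] for s in states}]
    for base in seq[1:]:
        prev = f[-1]
        f.append({s: emit[s][base] * sum(prev[r] * trans[r][s] for r in states)
                  for s in states})
    return f

def backward(seq, states, trans, emit):
    """b[i][s] = P(x_{i+1}..x_N | state_i = s)."""
    b = [{s: 1.0 for s in states}]
    for base in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {r: sum(trans[r][s] * emit[s][base] * nxt[s] for s in states)
                     for r in states})
    return b

def posterior_decode(seq, states, start, trans, emit):
    """Return the most probable label at each position and the posteriors."""
    f = forward(seq, states, start, trans, emit)
    b = backward(seq, states, trans, emit)
    total = sum(f[-1].values())  # P(seq | model)
    posteriors = [{s: f[i][s] * b[i][s] / total for s in states}
                  for i in range(len(seq))]
    labels = "".join(max(states, key=p.get) for p in posteriors)
    return labels, posteriors

labels, posteriors = posterior_decode("GCAAATGC", STATES, START, TRANS, EMIT)
print(labels)
print([round(p["P"], 2) for p in posteriors])
```

Unlike the Viterbi path, the labels returned here are chosen position by position, so they need not form the single most likely path.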


Application: Bacillus subtilis

Method (Nicolas et al (2002) NAR)

• Three State Model: Gene+, Gene-, AT Rich
• Second Order Emissions: P(Si) = P(Si | State, Si-1, Si-2) (capturing trinucleotide frequencies)
• Train using EM
• Predict w/ Posterior Decoding

Results (Nicolas et al (2002) NAR)

• Each line is P(label|S,model), color coded by label
– Gene on positive strand
– Gene on negative strand
– A/T Rich: intergenic regions, islands

Fundamental HMM Operations

Each operation is listed as a computation, followed by its use in biology.

Decoding
• Given an HMM and sequence S
• Find a corresponding sequence of labels, L
• Biology: Annotate pathogenicity islands on a new sequence

Evaluation
• Given an HMM and sequence S
• Find P(S|HMM)
• Biology: Score a particular sequence (not as useful for this model – will come back to this later)

Training
• Given an HMM w/o parameters and set of sequences S
• Find transition and emission probabilities that maximize P(S | params, HMM)
• Biology: Learn a model for sequence composed of background DNA and pathogenicity islands

Training an HMM

Transition probabilities, P(Li+1|Li)
– e.g. P(Pi+1|Bi): the probability of entering a pathogenicity island from background DNA

Emission probabilities, P(S|P) and P(S|B)
– i.e. the nucleotide frequencies for background DNA and pathogenicity islands

Learning From Labelled Data

L: Start B B B P P P B B End
S:       G C A A A T G C

If we have a sequence that has islands marked, we can simply count

Emission probabilities
P(S|B): A: 1/5, T: 0, G: 2/5, C: 2/5
P(S|P): A: 2/3, T: 1/3, G: 0, C: 0

Transition probabilities P(Li+1|Li)

        Bi+1    Pi+1    End
Bi      3/5     1/5     1/5
Pi      1/3     2/3     0
Start   1       0       0

ETC..

Maximum Likelihood Estimation!
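Counting from labelled data is short enough to write out. The sketch below assumes the labeling B B B P P P B B for the sequence G C A A A T G C, which is the path that reproduces the fractions in the tables; `estimate_parameters` is an illustrative name.

```python
from collections import Counter, defaultdict

def estimate_parameters(seq, labels):
    """Maximum likelihood estimates obtained by counting and normalizing."""
    emit_counts = defaultdict(Counter)
    trans_counts = defaultdict(Counter)
    for state, base in zip(labels, seq):
        emit_counts[state][base] += 1
    # Add explicit Start and End states around the labeled path.
    path = ["Start", *labels, "End"]
    for prev, cur in zip(path, path[1:]):
        trans_counts[prev][cur] += 1

    def normalize(counts):
        total = sum(counts.values())
        return {key: value / total for key, value in counts.items()}

    # Events that were never observed are simply absent (probability 0).
    trans = {s: normalize(c) for s, c in trans_counts.items()}
    emit = {s: normalize(c) for s, c in emit_counts.items()}
    return trans, emit

trans, emit = estimate_parameters("GCAAATGC", "BBBPPPBB")
print(trans)
print(emit)
```

With so little data several probabilities come out as exactly 0; in practice pseudocounts are often added before normalizing, which is a design choice beyond what the slide shows.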

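Unlabelled Data

L: ?
S: G C A A A T G C

How do we know how to count?

[Figure: the same state lattice as before, but with no path marked; the emission tables (A:, T:, G:, C: for each state) and the transition table (rows Bi, Pi, Start; columns Bi+1, Pi+1, End) are empty.]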

Unlabeled Data

An idea:

1. Imagine we start with some parameters

2. We could calculate the most likely path, P*, given those parameters and S

3. We could then use P* to update our parameters by maximum likelihood

4. And iterate (to convergence)

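[Figure: starting from initial parameters P(S|P)^0, P(S|B)^0, P(Li+1|Li)^0, the sequence S: G C A A A T G C is relabeled with its most likely path and the parameters are re-estimated; this repeats until iteration K yields P(S|P)^K, P(S|B)^K, P(Li+1|Li)^K.]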

Expectation Maximization (EM)

1. Initialize parameters

2. E Step: Estimate probability of hidden labels, Q, given parameters and sequence

$$Q^{t} = P(\text{Labels} \mid S, \text{params}^{t-1})$$

3. M Step: Choose new parameters to maximize the expected log-likelihood given Q

$$\text{params}^{t} = \arg\max_{\text{params}} E_{Q^{t}}\left[\log P(S, \text{labels} \mid \text{params})\right]$$

4. Iterate

P(S|Model) is guaranteed not to decrease at each iteration
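For HMMs, EM takes the form of the Baum-Welch algorithm: the E step computes expected label and transition counts with the forward and backward recursions, and the M step renormalizes them. The following is a minimal single-sequence sketch in plain probabilities. The training sequence, the initial parameter guesses, and all function names are hypothetical, and there is no end state or pseudocount handling.

```python
def forward(seq, states, start, trans, emit):
    f = [{s: start[s] * emit[s][seq[0]] for s in states}]
    for x in seq[1:]:
        prev = f[-1]
        f.append({s: emit[s][x] * sum(prev[r] * trans[r][s] for r in states)
                  for s in states})
    return f

def backward(seq, states, trans, emit):
    b = [{s: 1.0 for s in states}]
    for x in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {r: sum(trans[r][s] * emit[s][x] * nxt[s] for s in states)
                     for r in states})
    return b

def em_step(seq, states, start, trans, emit):
    """One EM iteration; also returns P(seq) under the parameters passed in."""
    f = forward(seq, states, start, trans, emit)
    b = backward(seq, states, trans, emit)
    likelihood = sum(f[-1].values())
    n = len(seq)
    # E step: posterior label probabilities and expected transition counts.
    gamma = [{s: f[i][s] * b[i][s] / likelihood for s in states} for i in range(n)]
    xi = {r: {s: sum(f[i][r] * trans[r][s] * emit[s][seq[i + 1]] * b[i + 1][s]
                     for i in range(n - 1)) / likelihood
              for s in states} for r in states}
    # M step: renormalize the expected counts into new parameters.
    new_start = dict(gamma[0])
    new_trans = {r: {s: xi[r][s] / sum(xi[r].values()) for s in states}
                 for r in states}
    new_emit = {s: {x: sum(g[s] for g, y in zip(gamma, seq) if y == x)
                       / sum(g[s] for g in gamma)
                    for x in "ATGC"} for s in states}
    return new_start, new_trans, new_emit, likelihood

STATES = ("B", "P")
start = {"B": 0.5, "P": 0.5}
trans = {"B": {"B": 0.8, "P": 0.2}, "P": {"B": 0.3, "P": 0.7}}
emit = {"B": {"A": 0.2, "T": 0.2, "G": 0.3, "C": 0.3},
        "P": {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}}
seq = "GCGCGGCATATTAATATAGCGCCGGC"  # hypothetical training sequence

likelihoods = []
for _ in range(5):
    start, trans, emit, likelihood = em_step(seq, STATES, start, trans, emit)
    likelihoods.append(likelihood)
print(likelihoods)
```

Printing the likelihood at each iteration is a simple way to watch the non-decreasing behavior stated above.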

Expectation Maximization (EM)

EM frequently used in motif discovery

Lecture 3

Remember the basic idea!

1. Use model to estimate (distribution of) missing data
2. Use estimate to update model
3. Repeat until convergence

EM is a general approach for learning models (ML estimation) when there is “missing data”

Widely used in computational biology

A More Sophisticated Application

• Given amino acid sequences from a protein family, how can we find other members?
– Can search databases with each known member – not sensitive
– More information is contained in full set

• The HMM Profile Approach
– Learn the statistical features of protein family
– Model these features with an HMM
– Search for new members by scoring with HMM

Modeling Protein Families

We will learn features from multiple alignments

Human Ubiquitin Conjugating Enzymes

UBE2D2   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697 FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1   FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1   FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9    FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N    LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016 IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10   FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34    FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156 FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1   FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B    FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I    FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5   LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1   FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6   FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H    LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12    VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS

Profile HMM

[Figure: profile HMM architecture. A chain of match states runs Start → M1 → ... → Mj → ... → MN → End, one match state per alignment column. Each position also has an insert state (I, I1, ..., Ij, ..., IN) and a delete state (D1, ..., Dj, ..., DN). Match and insert states emit amino acids according to their own emission distributions. The architecture is illustrated on the alignment rows E2EPF5, UBE2L1, UBE2L6 and UBE2H, where gap columns correspond to insert and delete states.]

Using Profile HMMs

Each operation is listed as a computation, followed by its use in biology.

Decoding
• Find sequence of labels, L, that maximizes P(L|S, HMM)
• Biology: Align a new sequence to a protein family

Evaluation
• Find P(S|HMM)
• Biology: Score a sequence for membership in family

Training
• Find transition and emission probabilities
• Biology: Discover and model family structure

Example: Modeling Globins

• Profile HMM from 300 randomly selected globin genes

• Score database of 60,000 proteins

PFAM Collection of Profile HMMs

http://www.sanger.ac.uk/Software/Pfam/

PFAM Resources

• 8957 curated protein families and domains
• Each with HMM profile(s)
• Coverage
– 73% of proteins in Swissprot and SP-TrEMBL
– 53% of “typical” genome sequence

Example PFAM Entry

• Literature Links
• Protein Structure
• Domain Architectures
• GO Functional Categories

Lab 1

HMMER

• Implementation of Profile HMM methods

• Given a multiple alignment, HMMER can build a Profile HMM

• Given a Profile HMM (e.g. from PFAM), HMMER can score sequences for membership in the family or domain

HMMs in Context

• HMMs
– Sequence alignment
– Gene Prediction

• Generalized HMMs
– Variable length states
– Complex emissions models
– e.g. Genscan

• Bayesian Networks
– General graphical model
– Arbitrary graph structure
– e.g. Regulatory network analysis

References

• Sean R Eddy, “Hidden Markov models,” Current Opinion in Structural Biology, 6:361-365, 1996.

• Sean R Eddy, “Profile hidden Markov models,” Bioinformatics, 14(9):755-763, 1998.

• Anders Krogh, “An introduction to hidden Markov models for biological sequences,” in Computational Methods in Molecular Biology, edited by S. L. Salzberg, D. B. Searls and S. Kasif, pp. 45-63, Elsevier, 1998.

• HMMER: profile HMMs for protein sequence analysis. http://hmmer.wustl.edu/

• Erik L. L. Sonnhammer et al, “Pfam: multiple sequence alignments and HMM-profiles of protein domains,” Nucleic Acids Research, 26(1):320-322, 1998.

• R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis, Cambridge University Press, 1998.

Tomorrow’s Lab

• Basic Sequence Analysis Tools
– Argo Genome Browser
– Blast
– Gene prediction using Glimmer
– Protein families with Hmmer and PFAM
– Comparative synteny analysis

• Identify virulence factors by annotating and comparing virulent and avirulent bacterial sequences

The Hidden in HMM

• DNA does not come conveniently labeled (i.e. Pathogenicity Island, Gene, Promoter)

• All we observe are the nucleotide sequences

• The hidden in HMM refers to the fact that the state labels, L, are not observed
– Only observe emissions (e.g. nucleotide sequence in our example)

Relation between Viterbi and Forward

VITERBI

$$V_j(i) = P(\text{most probable path ending in state } j \text{ with observation } i)$$

Initialization:
$$V_0(0) = 1, \qquad V_k(0) = 0 \text{ for all } k > 0$$

Iteration:
$$V_j(i) = e_j(x_i) \max_k V_k(i-1)\, a_{kj}$$

Termination:
$$P(x, \pi^*) = \max_k V_k(N)$$

FORWARD

$$f_l(i) = P(x_1 \ldots x_i, \text{state}_i = l)$$

Initialization:
$$f_0(0) = 1, \qquad f_k(0) = 0 \text{ for all } k > 0$$

Iteration:
$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$

Termination:
$$P(x) = \sum_k f_k(N)\, a_{k0}$$

Slide Credit: Serafim Batzoglou
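The two recursions differ only in whether predecessors are combined with max or with a sum, which a single function can make explicit. This is an illustrative sketch for the two-state island model: the uniform initial probabilities are an assumption, the end-state transition a_k0 is omitted, and `dp_score` is a made-up name.

```python
STATES = ("B", "P")
START = {"B": 0.5, "P": 0.5}  # assumption: uniform initial state probabilities
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
}

def dp_score(seq, states, start, trans, emit, combine):
    """Shared recursion: combine=max gives Viterbi, combine=sum gives Forward."""
    cur = {s: start[s] * emit[s][seq[0]] for s in states}
    for base in seq[1:]:
        cur = {s: emit[s][base] * combine(cur[r] * trans[r][s] for r in states)
               for s in states}
    # Termination without an explicit end state (a_k0 omitted in this sketch).
    return combine(cur.values())

seq = "GCAAATGC"
best_path_prob = dp_score(seq, STATES, START, TRANS, EMIT, max)  # P(x, best path)
total_prob = dp_score(seq, STATES, START, TRANS, EMIT, sum)      # P(x)
print(best_path_prob, total_prob)
```

Since P(x) sums over every path, it is always at least as large as the probability of the single best path.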
