Statistical Machine Learning and Computational Biology

Michael I. Jordan
University of California, Berkeley

November 5, 2007

TRANSCRIPT

Page 1: Statistical Machine Learning and Computational Biology

Statistical Machine Learning and Computational Biology

Michael I. Jordan, University of California, Berkeley

November 5, 2007

Page 2: Statistical Machine Learning and Computational Biology

Statistical Modeling in Biology

• Motivated by underlying stochastic phenomena: thermodynamics, recombination, mutation, environment

• Motivated by our ignorance: evolution of molecular function, protein folding, molecular concentrations, incomplete fossil record

• Motivated by the need to fuse disparate sources of data

Page 3: Statistical Machine Learning and Computational Biology

Outline

• Graphical models: phylogenomics

• Nonparametric Bayesian models: protein backbone modeling, multi-population haplotypes

• Sparse regression: protein folding

Page 4: Statistical Machine Learning and Computational Biology

Part 1: Graphical Models

Page 5: Statistical Machine Learning and Computational Biology

Probabilistic Graphical Models

• Given a graph G = (V, E), where each node v ∈ V is associated with a random variable Xv

• The joint distribution on (X1, X2, …, XN) factors according to the “parent-of” relation defined by the edges E:

p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2 | x1) p(x3 | x2) p(x4 | x1) p(x5 | x4) p(x6 | x2, x5)

[Figure: a directed graph on nodes X1–X6, with each conditional factor attached to its node]
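The factorization above can be made concrete with a small sketch. The conditional probability tables (CPTs) below are invented for illustration; only the conditioning structure follows the graph on the slide.

```python
from itertools import product

# Hypothetical CPTs for six binary variables; the numbers are made up,
# but the parent structure matches the graph on the slide.
p_x1 = {0: 0.6, 1: 0.4}
p_x2 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # key: (x2, x1)
p_x3 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}   # key: (x3, x2)
p_x4 = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.1, (1, 1): 0.9}   # key: (x4, x1)
p_x5 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}   # key: (x5, x4)
p_x6 = {(x6, x2, x5): p                                       # key: (x6, x2, x5)
        for (x2, x5), q in {(0, 0): 0.1, (0, 1): 0.5,
                            (1, 0): 0.6, (1, 1): 0.95}.items()
        for x6, p in ((1, q), (0, 1.0 - q))}

def joint(x1, x2, x3, x4, x5, x6):
    """p(x1,...,x6) via the parent-of factorization on the slide."""
    return (p_x1[x1] * p_x2[(x2, x1)] * p_x3[(x3, x2)]
            * p_x4[(x4, x1)] * p_x5[(x5, x4)] * p_x6[(x6, x2, x5)])

# The factorized joint is a proper distribution: it sums to 1 over all 2^6 states.
total = sum(joint(*xs) for xs in product((0, 1), repeat=6))
```

Because each CPT is locally normalized, the product sums to 1 over all 64 configurations without any global normalization step.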

Page 6: Statistical Machine Learning and Computational Biology

Inference

• Conditioning

• Marginalization

• Posterior probabilities

Page 7: Statistical Machine Learning and Computational Biology

Inference Algorithms

• Exact algorithms: sum-product, junction tree

• Sampling algorithms: Metropolis-Hastings, Gibbs sampling

• Variational algorithms: mean-field, Bethe, Kikuchi, convex relaxations

Page 8: Statistical Machine Learning and Computational Biology

Hidden Markov Models

• Widely used in computational biology to parse strings of various kinds (nucleotides, markers, amino acids)

• The sum-product algorithm yields posterior probabilities over the hidden states
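As a sketch of the sum-product recursion on an HMM, here is the forward pass for a hypothetical two-state chain. The transition, emission, and initial distributions are all invented; the function computes the likelihood of an observation sequence by summing over hidden state paths.

```python
import numpy as np

# Hypothetical two-state HMM with binary observations.
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])      # A[i, j] = p(next state j | state i)
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])      # B[i, k] = p(observation k | state i)
pi = np.array([0.5, 0.5])       # initial state distribution

def likelihood(obs):
    """Forward pass: p(obs), summed over all hidden state paths."""
    alpha = pi * B[:, obs[0]]          # alpha_t(i) = p(obs_1..t, state_t = i)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # sum-product recursion
    return float(alpha.sum())
```

Summing `likelihood` over every possible observation sequence of a fixed length returns 1, which is a quick sanity check on the recursion.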

Page 9: Statistical Machine Learning and Computational Biology


Hidden Markov Model Variations

Page 10: Statistical Machine Learning and Computational Biology

Phylogenies

• The shaded nodes represent the observed nucleotides at a given site for a set of organisms

• The unshaded nodes represent putative ancestral nucleotides

• Site independence model (note the plate)

• Computing the likelihood involves summing over the unshaded nodes

Page 11: Statistical Machine Learning and Computational Biology

Hidden Markov Phylogeny

• This yields a gene finder that exploits evolutionary constraints

• Evolutionary rate is state-dependent (edges from state to nodes in phylogeny are omitted for simplicity)

• Based on sequence data from 12-15 primate species, we obtain a nucleotide sensitivity of 100%, with a specificity of 89%; GENSCAN yields a sensitivity of 45%, with a specificity of 34%

Page 12: Statistical Machine Learning and Computational Biology

Annotating new genomes

>Q8X1T6 (hypothetical protein) MCPPNTPYQSQWHAFLHSLPKCEHHVHLEGCLEPPLIFSMARKNNVSLPSPSSNPAYTSVETLSKRYGHFSSLDDFLSFYFIGMTVLKTQSDFAELAWTYFKRAHAEGVHHTEVFFDPQVHMERGLEYRVIVDGYVDGCKRAEKELGISTRLIMCFLKHLPLESAQRLYDTALNEGDLGLDGRNPVIHGLGASSSEVGPPKDLFRPIYLGAKEKSINLTAHAGEEGDASYIAAALDMGATRIDHGIRLGEDPELMERVAREEVLLTVCPVSNLQLKCVKSVAEVPIRKFLDAGVRFSINSDDPAYFGAYILECYCAVQEAFNLSVADWRLIAENGVKGSWIGEERKNELLWRIDECVKRF

What molecular function does protein Q8X1T6 have?

Species: Aspergillus nidulans (Fungal organism)

Images courtesy of Broad Institute, MIT

Page 13: Statistical Machine Learning and Computational Biology

Annotation Transfer

Species Name / Molecular Function / Score / E-value
• Schizosaccharomyces pombe / adenosine deaminase / 390 / e-107
• Gibberella zeae / hypothetical protein FG01567.1 / 345 / 7e-94
• Saccharomyces cerevisiae / adenine deaminase / 308 / 1e-82
• Wolinella succinogenes / putative adenosine deaminase / 268 / 1e-70
• Rhodospirillum rubrum / adenosine deaminase / 266 / 6e-70
• Azotobacter vinelandii / adenosine deaminase / 260 / 4e-68
• Streptomyces coelicolor / probable adenosine deaminase / 254 / 2e-68
• Caulobacter crescentus CB1 / adenosine deaminase / 253 / 5e-66
• Streptomyces avermitilis / putative adenosine deaminase / 251 / 2e-65
• Ralstonia solanacearum / adenosine deaminase / 251 / 2e-65
• environmental sequence / unknown / 246 / 5e-64
• Pseudomonas aeruginosa / probable adenosine deaminase / 245 / 1e-63
• Pseudomonas aeruginosa / adenosine deaminase / 245 / 1e-63
• environmental sequence / unknown / 244 / 3e-63
• Pseudomonas fluorescens / adenosine deaminase / 243 / 7e-63
• Pseudomonas putida KT2440 / adenosine deaminase / 243 / 7e-63

BLAST Search: Q8X1T6 (Aspergillus nidulans)

Page 14: Statistical Machine Learning and Computational Biology

Methodology to System: SIFTER

[Figure: the BLAST hit table from the previous slide feeds into SIFTER, together with a gene tree, a species tree, a set of homologous proteins (Pfam), and Gene Ontology function labels (adenosine vs. adenine deaminase)]

Page 15: Statistical Machine Learning and Computational Biology

Functional diversity problem

• 1887 Pfam-A families with more than two experimentally characterized functions

[Figure: breakdown of families by number of functions: 2-5, 6-10, 11-20, 21-50, and >51 different functions]

Page 16: Statistical Machine Learning and Computational Biology

Available methods for comparison

• Sequence similarity methods
  BLAST [Altschul 1990]: sequence similarity search; transfer annotation from the sequence with the most significant similarity; runs against the largest curated protein database in the world
  GOtcha [Martin 2004]: BLAST search on seven genomes with GO functional annotations; GOtcha runs use all available annotations; GOtcha-exp runs use only available experimental annotations

• Sequence similarity plus bootstrap orthology
  Orthostrapper [Storm 2002]: transfer annotation when the query protein is in a statistically supported orthologous cluster with an annotated protein

Page 17: Statistical Machine Learning and Computational Biology

AMP/adenosine deaminase

• 251 member proteins in Pfam v. 18.0
• 13 proteins with experimental evidence from GOA
• 20 proteins with experimental annotations from manual literature search
• 129 proteins with electronic annotations from GOA
• Molecular function: remove amine group from base of substrate
• Alignment from Pfam family seed alignment
• Phylogeny built with PAUP* parsimony, BLOSUM50 matrix

Mouse adenosine deaminase, courtesy PDB

Page 18: Statistical Machine Learning and Computational Biology

AMP/adenosine deaminase

SIFTER Errors

Leave-one-out cross-validation: 93.9% accuracy (31 of 33); BLAST: 66.7% accuracy (22 of 33)

Page 19: Statistical Machine Learning and Computational Biology

AMP/adenosine deaminase

Multifunction families: one can choose a numerical cutoff on the posterior probability for prediction using this type of plot

Note: x-axis is on log scale

Page 20: Statistical Machine Learning and Computational Biology

Sulfotransferases: ROC curve

• SIFTER (no truncation): 70.0% accuracy (21 of 30)• BLAST: 50.0% accuracy (15 of 30)

Note: x-axis is on log scale

Page 21: Statistical Machine Learning and Computational Biology

Nudix Protein Family

• 3703 proteins in the family

• 97 proteins with molecular functions characterized

• 66 different candidate molecular functions

Page 22: Statistical Machine Learning and Computational Biology

Nudix: SIFTER vs BLAST

• SIFTER truncation level 1: 47.4% accuracy (46 of 97)
• BLAST: 34.0% accuracy (33 of 97); only 23.3% of correct terms appeared at all in the search results

Page 23: Statistical Machine Learning and Computational Biology

Trade specificity for accuracy

• Leave-one-out cross-validation, truncation at 1: 47.4% accuracy (66 candidate functions)

• Leave-one-out cross-validation, truncation at 1,2: 78.4% accuracy (15 candidate functions)

Page 24: Statistical Machine Learning and Computational Biology

Fungal genomes

[Figure: fungal phylogeny spanning Archiascomycota, Basidiomycota, Hemiascomycota, Euascomycota, and Zygomycota]

Work with Jason Stajich; Images courtesy of Broad Institute

Page 25: Statistical Machine Learning and Computational Biology

Fungal Genomes Methods

• Gene finding in all 46 genomes
• hmmsearch for all 427,324 genes
• Aligned hits with hmmalign to 2,883 Pfam v. 20 families
• Built trees using PAUP* maximum parsimony for 2,883 Pfam v. 20 families; reconciled with Forester
• BLASTed each protein against Swiss-Prot/TrEMBL for an exact match; used the ID to search for GOA annotations
• Ran SIFTER with (a) experimental annotations only and (b) experimental and electronic annotations

Page 26: Statistical Machine Learning and Computational Biology

SIFTER Predictions by Species

Page 27: Statistical Machine Learning and Computational Biology

Part 2: Nonparametric Bayesian Models

Page 28: Statistical Machine Learning and Computational Biology

Clustering

• There are many, many methodologies for clustering

• Heuristic methods: hierarchical clustering

• M-estimation: K-means, spectral clustering

• Model-based methods: finite mixture models, Dirichlet process mixture models

Page 29: Statistical Machine Learning and Computational Biology

Nonparametric Bayesian Clustering

• Dirichlet process mixture models are a nonparametric Bayesian approach to clustering

• They have the major advantage that we don’t have to assume that we know the number of clusters a priori

Page 30: Statistical Machine Learning and Computational Biology

Chinese Restaurant Process (CRP)

• Customers sit down in a Chinese restaurant with an infinite number of tables
  the first customer sits at the first table
  each subsequent customer sits at a table drawn from the following distribution:

    p(occupied table k) ∝ n_k        p(new table) ∝ α

  where n_k is the number of occupants of table k
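This seating process can be simulated directly. The sketch below uses an arbitrary α and customer count; it returns the occupancy counts of the tables.

```python
import random

def crp_tables(num_customers, alpha, seed=0):
    """Simulate CRP seating; returns the occupancy count of each table."""
    rng = random.Random(seed)
    counts = []                              # counts[k] = occupants of table k
    for n in range(num_customers):
        r = rng.uniform(0.0, n + alpha)      # normalizer: n occupants + alpha
        acc = 0.0
        for k in range(len(counts)):
            acc += counts[k]
            if r < acc:                      # join table k w.p. counts[k] / (n + alpha)
                counts[k] += 1
                break
        else:
            counts.append(1)                 # open a new table w.p. alpha / (n + alpha)
    return counts

tables = crp_tables(100, alpha=2.0)
```

The rich-get-richer dynamics are visible in the counts: a few large tables and a tail of small ones, with the number of occupied tables growing only logarithmically in the number of customers.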

Page 31: Statistical Machine Learning and Computational Biology

The CRP and Mixture Models

• The customers around a table form a cluster
  associate a mixture component with each table
  the first customer at a table chooses that component’s parameters from the prior (e.g., for Gaussian mixtures, choose a mean and covariance)

• It turns out that the (marginal) distribution that this induces on the thetas is exchangeable

Page 32: Statistical Machine Learning and Computational Biology

Example: Mixture of Gaussians

Page 33: Statistical Machine Learning and Computational Biology

Dirichlet Process

• Exchangeability implies an underlying stochastic process; that process is known as a Dirichlet process
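One way to make the underlying process concrete is the stick-breaking construction of a draw G ~ DP(α, G0). The sketch below truncates the infinite sum and uses a standard normal base measure G0; both of those choices are ours, for illustration only.

```python
import numpy as np

def stick_breaking(alpha, num_atoms, seed=0):
    """Truncated stick-breaking draw of the weights/atoms of G ~ DP(alpha, G0)."""
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=num_atoms)              # stick proportions
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * stick_left            # pi_k = beta_k * prod_{j<k} (1 - beta_j)
    atoms = rng.standard_normal(num_atoms)  # atom locations drawn from G0 = N(0, 1)
    return weights, atoms

w, atoms = stick_breaking(alpha=1.0, num_atoms=500)
```

The weights are nonnegative and sum to (just under) 1, so G is a discrete probability measure; discreteness is what makes the clustering behavior of the previous slides possible.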

Page 34: Statistical Machine Learning and Computational Biology

Dirichlet Process Mixture Models

• Given observations x_1, …, x_n, we model each with a latent factor θ_i:

    x_i | θ_i ~ F(θ_i)

• We put a Dirichlet process prior on the distribution G of the factors:

    θ_i | G ~ G,    G ~ DP(α, G_0)

Page 35: Statistical Machine Learning and Computational Biology

Connection to the Chinese Restaurant

• The marginal distribution on the thetas obtained by marginalizing out the Dirichlet process is the Chinese restaurant process

• Let’s now consider how to build on these ideas and solve the multiple clustering problem

Page 36: Statistical Machine Learning and Computational Biology

Multiple Clustering Problems

• In many statistical estimation problems, we have not one data analysis problem, but rather we have groups of related problems

• Naive approaches either treat the problems separately, lump them together, or merge them in some ad hoc way; in statistics we have a better sense of how to proceed: shrinkage, empirical Bayes, hierarchical Bayes

• Does this multiple group problem arise in clustering? I’ll argue “yes!”

• If so, how do we “shrink” in clustering?

Page 37: Statistical Machine Learning and Computational Biology

Multiple Data Analysis Problems

• Consider a set of data which is subdivided into groups, where each group i is characterized by a Gaussian distribution with unknown mean: y_ij ~ N(μ_i, σ²)

• Maximum likelihood estimates of the μ_i are obtained independently (the group sample means)

• This often isn’t what we want (on theoretical and practical grounds)

Page 38: Statistical Machine Learning and Computational Biology

Hierarchical Bayesian Models

• Multiple Gaussian distributions linked by a shared hyperparameter

• Yields shrinkage estimators for the group means μ_i
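A minimal numerical sketch of such shrinkage, assuming the variances are known (the hyperparameters mu0 and tau2 and the data below are invented):

```python
import numpy as np

def shrunken_means(groups, mu0, tau2, sigma2):
    """Posterior means of the group means in  mu_i ~ N(mu0, tau2),  y_ij ~ N(mu_i, sigma2)."""
    estimates = []
    for y in groups:
        n, ybar = len(y), float(np.mean(y))
        w = (n / sigma2) / (n / sigma2 + 1.0 / tau2)   # weight on the data
        estimates.append(w * ybar + (1.0 - w) * mu0)   # shrink toward the hyperparameter
    return estimates

groups = [[1.0, 2.0, 3.0], [9.0, 11.0]]                # two invented groups
est = shrunken_means(groups, mu0=5.0, tau2=4.0, sigma2=1.0)
```

Each estimate falls between its group’s sample mean and the shared mu0, with small groups pulled harder toward the hyperparameter; that interpolation is the shrinkage on the slide.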

Page 39: Statistical Machine Learning and Computational Biology

Protein Backbone Modeling

• An important contribution to the energy of a protein structure is the set of angles linking neighboring amino acids

• For each amino acid, it turns out that two angles suffice; traditionally called φ and ψ

• A plot of φ and ψ angles across some ensemble of amino acids is called a Ramachandran plot

Page 40: Statistical Machine Learning and Computational Biology

A Ramachandran Plot

• This can be (usefully) approached as a mixture modeling problem

• Doing so is much better than the “state-of-the-art,” in which the plot is binned into a three-by-three grid

Page 41: Statistical Machine Learning and Computational Biology

Ramachandran Plots

• But that plot is an overlay of 400 different plots, one for each combination of 20 amino acids on the left and 20 amino acids on the right

• Shouldn’t we be treating this as a multiple clustering problem?

Page 42: Statistical Machine Learning and Computational Biology

Haplotype Modeling

• A haplotype is the pattern of alleles along a single chromosome

• Data comes in the form of genotypes, which lose the information as to which allele is associated with which member of a pair of homologous chromosomes

• Need to restore haplotypes from genotypes

• A genotype is well modeled as a mixture model, where a mixture component is a pair of haplotypes (the real difficulty is that we don’t know how many mixture components there are)

Page 43: Statistical Machine Learning and Computational Biology

Multiple Population Haplotype Modeling

• When we have multiple populations (e.g., ethnic groups) we have multiple mixture models

• How should we analyze these data (which are now available, e.g., from the HapMap project)?

• Analyze them separately? Lump them together?

Page 44: Statistical Machine Learning and Computational Biology

Scenes, Objects, Parts and Features

Page 45: Statistical Machine Learning and Computational Biology

Shared Parts

Page 46: Statistical Machine Learning and Computational Biology

Hidden Markov Models

• An HMM is a discrete state space model
• The discrete state can be viewed as a cluster indicator
• We thus have a set of clustering problems, one for each value of the previous state (i.e., one for each row of the transition matrix)

Page 47: Statistical Machine Learning and Computational Biology

Solving the Multiple Clustering Problem

• It’s natural to take a hierarchical Bayesian approach

• It’s natural to take a nonparametric Bayesian approach in which the number of clusters is not known a priori

• How do we do this?

Page 48: Statistical Machine Learning and Computational Biology

Hierarchical Bayesian Models

• Multiple Gaussian distributions linked by a shared hyperparameter

• Yields shrinkage estimators for the group means μ_i

Page 49: Statistical Machine Learning and Computational Biology

Hierarchical DP Mixture Model?

• Let us try to model each group of data with a Dirichlet process mixture model
  let the groups share an underlying hyperparameter

• But each group is generated independently
  different groups cannot share the same components if the base measure G_0 is continuous (the atoms, or “spikes,” of the group-level draws do not match up)

Page 50: Statistical Machine Learning and Computational Biology

Hierarchical Dirichlet Process Mixtures

Page 51: Statistical Machine Learning and Computational Biology

The Chinese Restaurant Franchise

Page 52: Statistical Machine Learning and Computational Biology

HDP Model of Ramachandran Plots

• We would like to solve 400 different related clustering problems, one for each combination of 20 amino acids on the left and 20 amino acids on the right

Page 53: Statistical Machine Learning and Computational Biology


HDP Model of Ramachandran Plots

Page 54: Statistical Machine Learning and Computational Biology

Some HDP Success Stories

• New backbone model for Rosetta• New method for multi-population haplotype phasing• Solution to problem of choosing number of states in HMMs• State-of-the-art method for statistical parsing• Competitive method for image denoising• Competitive method for scene categorization• State-of-the-art method for object recognition

Page 55: Statistical Machine Learning and Computational Biology

Part 3: Sparse Regression

Page 56: Statistical Machine Learning and Computational Biology

Rosetta Ab Initio Search

• Very successful method from David Baker’s lab (UW)
  consistent top performer at CASP
  already used in real-world problems (e.g., HIV vaccine design)

• Monte Carlo procedure
  treat the energy (actually a Boltzmann-style transform of it) as a probability density, and sample from it

• Primary move set: fragment insertion
  fragments come from a library of solved structures with similar residue subsequences
  energetically plausible local solutions
  like a coordinate descent move: jump to a new local minimum in a few coordinates

Page 57: Statistical Machine Learning and Computational Biology

Rosetta Ab Initio Search

• Start from a fully extended chain
• Repeat:
  propose a fragment insertion (or other) move
  do local gradient descent to evaluate the proposal
  accept or reject by a Metropolis criterion
• Throw away all but the lowest energy sample from the previous round
• Switch to a high resolution (full-atom) energy function, perform further search (relaxation)
• Return the single lowest energy conformation seen in sampling
• Run many, many times
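The Metropolis accept/reject step above can be sketched on a toy one-dimensional energy function. The energy landscape, proposal scale, and step count below are all invented; real Rosetta moves are fragment insertions over torsion angles, not Gaussian perturbations.

```python
import math
import random

def energy(x):
    """Toy energy landscape with its global minimum at x = 1."""
    return (x - 1.0) ** 2

def metropolis_search(steps=5000, seed=0):
    """Metropolis sampling of exp(-energy); track the lowest-energy point seen."""
    rng = random.Random(seed)
    x, best = 0.0, 0.0
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, 0.5)           # local perturbation move
        log_ratio = energy(x) - energy(proposal)     # log of the Metropolis ratio
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal                              # accept the move
        if energy(x) < energy(best):
            best = x                                  # keep the lowest-energy sample
    return best

best = metropolis_search()
```

Downhill moves are always accepted and uphill moves occasionally, which is what lets the sampler escape local basins instead of getting stuck like pure descent.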

Page 58: Statistical Machine Learning and Computational Biology

Our idea: resampling

• No more blind repetition: learn where to search from previous searches

• An initial round of sampling gives us lots of information about the search space
  areas of conformation space that always have very poor energy
  structural elements that are predicted very consistently

• How can we use this information to guide further sampling (without getting too greedy)?

• General approach:
  define a reduced search space
  learn a smoothed energy function from the decoys
  minimize the smoothed energy to find new decoys
  repeat

Page 59: Statistical Machine Learning and Computational Biology

Step 2: Response surface minimization

• Restrict attention to local minima (decoys)
• Fit a smoothed response surface to the decoys
• Minimize the response surface to find a new candidate
• Full-atom relax the candidate
• Add the candidate to the decoy pool
• Re-fit the surface, repeat

Page 60: Statistical Machine Learning and Computational Biology

Features

• Torsion angle features
  e.g., from the HDP model of the Ramachandran plot

• Secondary structure features (sheet, loop, helix)

[Figure: a Ramachandran plot over (φ, ψ) from (-180, -180) to (180, 180), with torsion-angle bins labeled A, B, E, G]

Page 61: Statistical Machine Learning and Computational Biology

More Features

• Sidechain rotamer features

• Burial features (buried vs. exposed)

• Register shift features

Page 62: Statistical Machine Learning and Computational Biology

Sparse Regression Models

• Lasso regression: penalize large weights with an L1 penalty,

    min_w ||y − Xw||² + λ ||w||₁

• L1 regularization leads to sparse solutions

• LARS (Efron et al. 2004): finds the estimates for the entire regularization path simultaneously, as efficiently as least-squares
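The LARS path algorithm is involved; as a sketch of the same L1-penalized objective, here is a proximal-gradient (ISTA) solver on synthetic data. This is not the LARS algorithm from the slide, and the data, λ, and iteration count are all invented for illustration.

```python
import numpy as np

# Synthetic regression problem with a sparse ground-truth weight vector.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]                      # only 3 nonzero coefficients
y = X @ w_true + 0.01 * rng.standard_normal(50)

def lasso_ista(X, y, lam, iters=2000):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 by proximal gradient descent."""
    w = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2         # 1 / Lipschitz constant of the gradient
    for _ in range(iters):
        z = w - step * (X.T @ (X @ w - y))         # gradient step on the smooth part
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold (L1 prox)
    return w

w_hat = lasso_ista(X, y, lam=5.0)
```

The soft-thresholding step sets small coefficients exactly to zero, which is where the sparsity of the lasso solution comes from.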

Page 63: Statistical Machine Learning and Computational Biology

Results: 1ogw

Page 64: Statistical Machine Learning and Computational Biology

Results: 1n0u

Page 65: Statistical Machine Learning and Computational Biology

Results: 1di2

Page 66: Statistical Machine Learning and Computational Biology

Collaborators

• David Baker
• Ben Blum
• Steven Brenner
• Roland Dunbrack
• Barbara Engelhardt
• Guillaume Obozinski
• Yee Whye Teh
• Daniel Ting
• Eric Xing

http://www.cs.berkeley.edu/~jordan

Page 67: Statistical Machine Learning and Computational Biology

Finis

• For more information (papers, slides, tutorials, software):

http://www.cs.berkeley.edu/~jordan