Inferring Protein Structure with Discriminative Learning and Network Diffusion

Rui Kuang
Department of Computer Science
Center for Computational Learning Systems
Columbia University

Thesis defense, August 16th 2006

Agenda

• Biological Background
• Protein Classification with String Kernels
• Domain Identification and Boundary Detection
• Protein Ranking with Network Diffusion
• Conserved Motifs between Remote Homologs
• Conclusion & Future Research

What are proteins?

• Proteins are encoded by genes.
• A protein (polypeptide chain) is a sequence of amino acid residues.
• The name is derived from the Greek word proteios, meaning “of the first rank”, by Jöns J. Berzelius in 1838.

(Picture courtesy of Branden & Tooze )

Amino Acid Polypeptide Chain


Why study proteins?

Proteins play crucial functional roles in all biological processes: enzymatic catalysis, signaling messengers, structural elements…

• Function depends on a protein's unique 3-D structure.
• Protein sequences are easy to obtain; structures are difficult to determine.

[Figure: a sequence folds into a structure, and the structure determines function. Picture courtesy of Branden & Tooze.]

Sequence:

NLAFALSELDRITAQLKLPRHVEEEAARLYREAVRKGLIRGRSIESVMAACVYAACRLLKVPRTLDEIADIARVDKKEIGRSYRFIARNLNLTPKKLF…

Sequence Space and Structure Space

Structure (38,086 known in PDB as of 8/14/2006): discrete groups of folds with unclear boundaries

Sequence (>2,500,000)

• Homologous proteins share >30% sequence identity, which suggests strong structural similarity.
• Remote homologous proteins share similar structure but low sequence similarity.

Remote Homology Detection

Remote homology: a remote evolutionary relationship with conserved structure/function but low sequence similarity.

It is often not possible to detect a statistically significant sequence alignment between remote homologs.

ADTIVAVELDTYPNTDIGDPSYPHIGIDIKSVRSKKTAKWNMQNGKVGTAHIIYNSVDKRLSAVVSYPNADSATVSYDVDLDNVLPEWVRVGLSASTGLYKETNTILSWSFTSKLKSNSTHETNALHFMFNQFSKDQKDLILQGDATTGTDGNLELTRVSSNGPQGSSVGRALFYAPVHIWESSAVVASFEATFTFLIKSPDSHPADGIAFFISNIDSSIPSGSTGRLLGLFPDAN

MSLLPVPYTEAASLSTGSTVTIKGRPLVCFLNEPYLQVDFHTEMKEESDIVFHFQVCFGRRVVMNSREYGAWKQQVESKNMPFQDGQEFLSISVLPDKYQVMVNGQSSYTFDHRIKPEAVKMVQVWRDISLTKFNVSYLKR

<10% sequence identity

DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA

D + WEI+ +++ + ++ SGS+G +++G + +VA+K LK E + F EV

DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF

Protein Domains

Proteins often consist of several independent domains, which
• fold autonomously
• often function differently
• represent fundamental structural, functional and evolutionary units

Example: a two-domain protein
• a 3-layer (aba) sandwich at the N-terminal
• a mainly alpha domain in an orthogonal bundle at the C-terminal

Inferring protein structure/function from sequence similarity

For newly sequenced genomes, homology detection can often identify fewer than half of the genes.

Remote homology detection and domain segmentation are crucial steps for studying genes with no close homology.

[Figure: a query sequence is aligned against a domain database of domain families 1 … n. Boundary identification splits the query into Domain 1, Domain 2 and Domain 3; remote homology detection associates each domain with a family.]

We want to correctly segment protein sequences into domains (domain boundary identification) and associate them with their corresponding structural/functional class (remote homology detection).

Protein Structural Classification

• Protein classification is the prediction of the structural or functional class of a protein from its primary sequence.
• SCOP: Structural Classification of Proteins
• Known domain structures are organized in a hierarchy: family, superfamily and fold.

SCOP

Family : Sequence identity > 30% or functions and structures are very similar

Superfamily : low sequence similarity but functional features suggest probable common evolutionary origin


Remote Homology Detection in Protein Classification

Remote homologs: sequences that belong to the same super-family but not the same family.

[Figure: SCOP hierarchy (fold, super-family, family) marking the positive training set, positive test set, negative training set and negative test set.]

SVM: Large margin-based discriminative learning approach.

Find a hyperplane to separate positive data from negative data and also maximize the margin.

Training is a quadratic programming problem that depends only on the inner products between data points.

Support Vector Machine (SVM) Classifiers

Kernel trick: to train an SVM, we can use a kernel rather than an explicit feature map.

We can define kernels for sequences, graphs and other discrete objects:

Φ: { sequences } → R^N

The kernel value is the inner product in feature space:

K(x, y) = ⟨Φ(x), Φ(y)⟩

Original string kernels (Watkins, Haussler, Lodhi et al.) require quadratic time in sequence length, O(|x| |y|), to compute each kernel value K(x, y).

We introduce fast novel string kernels with linear time complexity.

Kernels for Discrete Objects

Profile Kernel and its Family Tree

Three generations: spectrum kernels, mismatch kernels, profile kernels

Faster computation, i.e., linear computation time in sequence length.

Profile-based string kernels take advantage of abundant unlabeled data to capture homologous/evolutionary information for remote homology detection.

Leslie, Eskin and Noble, PSB 2002

Spectrum Kernel

Feature map indexed by all possible k-length subsequences (“k-mers”) from the alphabet Σ of amino acids, |Σ| = 20

Q1: AKQDYYYYE
3-mers: AKQ KQD QDY DYY YYY YYY YYE

Q2: DYYEIAKQE
3-mers: DYY YYE YEI EIA IAK AKQ KQE

Feature space (AAA-YYY):

k-mer   Q1   Q2
AKQ     1    1
DYY     1    1
EIA     0    1
IAK     0    1
KQD     1    0
KQE     0    1
QDY     1    0
YEI     0    1
YYE     1    1
YYY     2    0

K(Q1, Q2) = ⟨(…1…1…0…0…1…0…1…0…1…2), (…1…1…1…1…0…1…0…1…1…0)⟩ = 3

K-mers capture some position-independent local similarity, but they don’t effectively model evolutionary divergence.
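The spectrum kernel on this slide can be sketched in a few lines. This is a minimal illustration using sparse k-mer counts, not the linear-time implementation referred to in the talk; the function names are hypothetical.

```python
from collections import Counter

def spectrum_features(seq, k):
    """Count every contiguous k-mer in seq (a sparse spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    # A Counter returns 0 for k-mers that never occur in y.
    return sum(count * fy[kmer] for kmer, count in fx.items())

# The two example sequences from the slide share AKQ, DYY and YYE.
print(spectrum_kernel("AKQDYYYYE", "DYYEIAKQE", 3))  # 3
```

Only k-mers that occur in both sequences contribute, which is why the value equals the three shared features in the table.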

Leslie, Eskin, Weston and Noble, NIPS 2002

Mismatch Kernel

For a k-mer s, the mismatch neighborhood N(k,m)(s) is the set of all k-mers t within m mismatches of s.

The size of the mismatch neighborhood is O(|Σ|^m k^m).

[Figure: the 3-mer AKQ and some of its neighbors (AAQ, AKY, CKQ, DKQ, …); each neighbor sets a 1 in the feature vector ( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 ).]

Arbitrary mismatch does not model the mutation probability between amino acids.
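To make the neighborhood definition concrete, here is a small brute-force sketch that enumerates N(k,m)(s). The 20-letter alphabet string and the function name are assumptions for illustration; a practical implementation would traverse a trie rather than enumerate neighbors.

```python
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed one-letter amino acid alphabet

def mismatch_neighborhood(kmer, m, alphabet=AMINO_ACIDS):
    """Return all k-mers that differ from kmer in at most m positions."""
    neighbors = {kmer}
    for num_changes in range(1, m + 1):
        for positions in combinations(range(len(kmer)), num_changes):
            # Try every choice of letters at the chosen positions.
            for letters in product(alphabet, repeat=num_changes):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                neighbors.add("".join(candidate))
    return neighbors

hood = mismatch_neighborhood("AKQ", 1)
print(len(hood))  # 58 = 1 + 3 * 19
```

Every substitution is treated alike here, which is exactly the limitation the slide points out: the neighborhood ignores how likely each amino acid mutation is.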

Profile Kernel

• Profile kernel: specialized to protein sequences; uses probabilistic profiles to capture homology information.
• Semi-supervised approach: profiles are estimated using unlabeled data (about 2.5 million proteins).
• E.g. PSI-BLAST profiles: estimated by iteratively aligning database homologs to the query sequence.
• Profiles are built from multiple sequence alignments to model the positional mutation probability.

     L    K    L   …
A    3   -2    1   …
C   -1    0    2   …
D   -1    0    0   …
…    …    …    …   …
Y    2   -3   -3   …

QUERY LKLLRFLGSGAFGEVYEGQLKTE....DSEEPQRVAIKSLRK.......

HOMOLOG1 IIMHNKLGGGQYGDVYEGYWK........RHDCTIAVKALK........

HOMOLOG2 LTLGKPLGEGCFGQVVMAEAVGIDK.DKPKEAVTVAVKMLKDD......

HOMOLOG3 IVLKWELGEGAFGKVFLAECHNLL...PEQDKMLVAVKALK........

[Kuang & Leslie, JBCB 2005 and CSB 2004]

Profile-based k-mer Map

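Use the profile to define position-dependent mutation neighborhoods:

$P(x) = \{ p_j(b),\ b \in \Sigma \}_{j=1}^{|x|}$

$M_{(k,\sigma)}(P(x[j+1:j+k])) = \{ \beta = b_1 b_2 \ldots b_k : -\sum_{i=1}^{k} \log p_{j+i}(b_i) < \sigma \}$

E.g. k=3, σ=5 and a profile of negative log probabilities:

     A    K    Q   …
A    1    3    4   …
C    5    4    1   …
D    4    4    4   …
…    …    …    …   …
K    4    1    4   …
…    …    …    …   …
Q    3    4    1   …
…    …    …    …   …
Y    2    4    3   …

[Figure: the window AKQ maps to the feature vector ( 0 , … , 1 , … , 1 , … , 1 , … , 1 , … , 0 ) with ones at AKC, AKQ, YKC and YKQ, the k-mers whose summed negative log probabilities (for example 1+1+1 for AKQ and 2+1+1 for YKQ) are below σ.]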
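The profile-based neighborhood can be checked by brute force on the slide's example. The sketch below assumes the alphabet is limited to the six residues whose rows are visible in the table (the elided rows are left out), and the function name is hypothetical.

```python
from itertools import product

# Negative log probabilities for the three profile positions of the window
# "AKQ", copied from the slide's table. Residues elided there ("…") are
# omitted, so this toy alphabet has only six letters.
PROFILE = {
    "A": (1, 3, 4),
    "C": (5, 4, 1),
    "D": (4, 4, 4),
    "K": (4, 1, 4),
    "Q": (3, 4, 1),
    "Y": (2, 4, 3),
}

def profile_neighborhood(profile, k, sigma):
    """All k-mers whose summed negative log probability is below sigma."""
    hood = set()
    for letters in product(sorted(profile), repeat=k):
        cost = sum(profile[b][i] for i, b in enumerate(letters))
        if cost < sigma:
            hood.add("".join(letters))
    return hood

print(sorted(profile_neighborhood(PROFILE, k=3, sigma=5)))
# ['AKC', 'AKQ', 'YKC', 'YKQ']
```

With σ=5 the result is the same four k-mers that carry a 1 in the slide's feature vector.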

Efficient Computing with Trie

Use trie data structure to organize lexical traversal of all instances of k-mers in the training profiles.

Scales linearly with length, O(k^{m_max+1} |Σ|^{m_max} (|x|+|y|)), where m_max is the maximum number of mismatches that occur in any mutation neighborhood.

E.g. k=3, σ=5

Profile of sequence x (window A Q K):

     A    Q    K   …
A    1    3    2   …
C    3    2    1   …
D    3    2    1   …
…    …    …    …   …
Q    3    1    2   …
…    …    …    …   …
Y    2    1    3   …

Profile of sequence y (window A Q Y):

     A    Q    Y   …
A   .5    2    1   …
C    2    1    2   …
D    2    1    4   …
…    …    …    …   …
Q    2   .6    2   …
…    …    …    …   …
Y    3    3    3   …

[Figure: trie path A → Q with children C and D.]

AQC: x: 1+1+1 < σ, y: .5+.6+2 < σ
AQD: x: 1+1+1 < σ, y: .5+.6+4 > σ

Update K(x, y) by adding the contribution for feature AQC but not AQD.
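The trie traversal can be sketched as a depth-first walk that carries, for each k-mer prefix, the windows whose accumulated cost is still below σ. This is an illustrative sketch under stated assumptions, not the thesis implementation: the trie is implicit in the recursion, the alphabet is restricted to the five residues shown in the two tables, and all names are hypothetical.

```python
def profile_kernel(profiles, alphabet, k, sigma):
    """Depth-first walk over the implicit k-mer trie for two profiles.

    profiles: [profile_x, profile_y]; each profile is a list with one dict
    per sequence position mapping a residue to its negative log probability.
    Returns the kernel value and the k-mer features shared by x and y.
    """
    # A live instance is (sequence index, window start, accumulated cost).
    root = [(s, start, 0.0)
            for s, prof in enumerate(profiles)
            for start in range(len(prof) - k + 1)]
    kernel, shared = 0, []

    def descend(prefix, live):
        nonlocal kernel
        depth = len(prefix)
        if depth == k:
            count_x = sum(1 for s, _, _ in live if s == 0)
            count_y = len(live) - count_x
            if count_x and count_y:
                kernel += count_x * count_y
                shared.append(prefix)
            return
        for letter in alphabet:
            # Keep only windows whose cost stays below sigma after this letter.
            survivors = []
            for s, start, cost in live:
                new_cost = cost + profiles[s][start + depth][letter]
                if new_cost < sigma:
                    survivors.append((s, start, new_cost))
            if survivors:
                descend(prefix + letter, survivors)

    descend("", root)
    return kernel, shared

def to_profile(rows):
    """Turn {residue: (cost at pos 1, pos 2, ...)} into a per-position list."""
    width = len(next(iter(rows.values())))
    return [{b: rows[b][i] for b in rows} for i in range(width)]

# Visible rows of the two tables on the slide (elided residues omitted).
X_ROWS = {"A": (1, 3, 2), "C": (3, 2, 1), "D": (3, 2, 1), "Q": (3, 1, 2), "Y": (2, 1, 3)}
Y_ROWS = {"A": (0.5, 2, 1), "C": (2, 1, 2), "D": (2, 1, 4), "Q": (2, 0.6, 2), "Y": (3, 3, 3)}

value, shared = profile_kernel([to_profile(X_ROWS), to_profile(Y_ROWS)],
                               "ACDQY", k=3, sigma=5)
print("AQC" in shared, "AQD" in shared)  # True False
```

Pruning on partial sums is valid because negative log probabilities are non-negative, so a prefix that already exceeds σ cannot recover further down the trie.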

Inexact Matching Kernels [Leslie & Kuang, JMLR 2004, KMCB 2004 & COLT 2003]

• Gappy kernels: for a g-mer s, g > k, the gapped match set G(g,k)(s) consists of all k-mers t that occur in s with up to (g − k) gaps.
• Wildcard kernels: introduce a wildcard character “*” and define a feature space indexed by k-mers from Σ ∪ {*}, allowing up to m wildcards.
• Substitution kernels: use substitution matrices to obtain P(a|b), the substitution probabilities for residues a, b. The mutation neighborhood M(k,σ)(s) is the set of all k-mers t such that

  −Σ_{i=1…k} log P(s_i|t_i) < σ
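As a small illustration of the first of these kernels, the gapped match set G(g,k)(s) can be enumerated by choosing which k letters of the g-mer to keep. This is a sketch with a hypothetical function name, not the fast algorithm from the cited papers.

```python
from itertools import combinations

def gapped_match_set(gmer, k):
    """All k-mers obtained from gmer by keeping k letters in their original order."""
    return {"".join(gmer[i] for i in idx)
            for idx in combinations(range(len(gmer)), k)}

print(sorted(gapped_match_set("AKQD", 3)))  # ['AKD', 'AKQ', 'AQD', 'KQD']
```

Here g=4 and k=3, so each member of the set skips at most g − k = 1 position of the g-mer.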

Experiments

• SCOP 1.59 benchmark with 54 experiments
• Train PSI-BLAST profiles on the NR database
• Comparison against PSI-BLAST and recent SVM-based methods:
  - PSI-BLAST rank: use a training sequence as the query and rank test sequences by PSI-BLAST e-value
  - eMotif Kernel (Ben-Hur et al., 2003): features are known protein motifs, stored using a trie
  - SVM-pairwise (Liao & Noble, 2002): feature vectors of pairwise alignment scores (e.g. PSI-BLAST scores)
  - Cluster Kernel (Weston et al., 2003): implicitly averages the feature vectors for sequences in the PSI-BLAST neighborhood of the input sequence

Results

Results (Cont.)

Kernels ROC ROC50

PSI-BLAST 0.743 0.293

eMotif 0.711 0.247

Mismatch(5,1) 0.875 0.416

Gappy(6,4) 0.851 0.387

Substitution(4,6.0) 0.876 0.441

Wildcard(5,2,1.0) 0.881 0.447

SVM-Pairwise 0.866 0.533

Cluster 0.923 0.699

Profile(5,7.5)-5 Iteration 0.984 0.874

Identify Protein Domains and Domain Boundaries

• SVM-based remote homology detection methods do not rely on sequence alignment.
• To learn the domain segmentation, we use our SVM classifiers as domain recognizers and find the optimal segmentation giving the maximum sum of the classification scores.

[Figure: domain recognizers SVM1 to SVM4, one per SCOP super-family, are applied to the query sequence to locate domain boundaries.]

Algorithms for finding optimal segmentation

Assuming we know the number of domains in a protein, we use dynamic programming to look for the segmentation with the maximum sum of classification scores.

No gaps:

$T_{(1,j)} = F(S_{1,j})$, for $1 \le j \le N$
$T_{(c,j)} = -\infty$ when $j < c$; otherwise $T_{(c,j)} = \max_{k} \left( T_{(c-1,k)} + F(S_{k+1,j}) \right)$

Allowing gaps (can also be solved as an LP problem):

$T_{(1,j)} = \max_{k \le l \le j} F(S_{k,l})$, for all $1 \le j \le N$
$T_{(c,j)} = -\infty$ when $j < c$; otherwise $T_{(c,j)} = \max_{k < l \le j} \left( T_{(c-1,k)} + F(S_{k+1,l}) \right)$

$S_{i,j}$: segment between position i and position j of sequence S
$F(s)$: the best classification score of segment s
$T_{(c,j)}$: the maximum sum of classification scores from c segments on $S_{1,j}$

Toy Example of Dynamic Programming

c=1      1      4      2      2      1      1
c=2   -INF      2      5      8      6      7
c=3   -INF   -INF      3      6      9     13
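The no-gap recurrence can be sketched as follows. Because the SVM scores behind the toy table are not given, this example swaps in a hypothetical stand-in for F (a label-purity score on a string) purely to show how the table T and the traceback work; all function names are assumptions.

```python
from collections import Counter

NEG_INF = float("-inf")

def purity_score(seq, i, j):
    """Toy stand-in for F(S_{i,j}); positions are 1-based and inclusive."""
    counts = Counter(seq[i - 1:j])
    top = max(counts.values())
    return top - (j - i + 1 - top)

def segment_no_gaps(seq, num_domains, score=purity_score):
    """Best split of seq into num_domains contiguous segments (no gaps)."""
    n = len(seq)
    if not 1 <= num_domains <= n:
        raise ValueError("need 1 <= num_domains <= len(seq)")
    # T[c][j]: best total score of c segments covering S_{1,j}.
    T = [[NEG_INF] * (n + 1) for _ in range(num_domains + 1)]
    back = [[0] * (n + 1) for _ in range(num_domains + 1)]
    for j in range(1, n + 1):
        T[1][j] = score(seq, 1, j)
    for c in range(2, num_domains + 1):
        for j in range(c, n + 1):          # T[c][j] stays -inf when j < c
            for k in range(c - 1, j):      # last segment is S_{k+1,j}
                candidate = T[c - 1][k] + score(seq, k + 1, j)
                if candidate > T[c][j]:
                    T[c][j], back[c][j] = candidate, k
    # Trace the segment boundaries back from T[num_domains][n].
    segments, j = [], n
    for c in range(num_domains, 0, -1):
        k = back[c][j]
        segments.append((k + 1, j))
        j = k
    return T[num_domains][n], segments[::-1]

print(segment_no_gaps("AAABBB", 2))  # (6, [(1, 3), (4, 6)])
```

The triple loop costs O(c N^2) score evaluations; in the setting of the talk each evaluation would be the best SVM classification score for that segment.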

Algorithms for finding optimal segmentation (unconstrained number of domains)

The regions across boundaries are less classifiable than other regions within one domain.

Use dynamic programming to alternate between domain regions and boundary regions.

$P_0 = 0,\ G_0 = 0$
$P_j = \max_{i<j} \left( l_{seg}(S_{i+1,j}) + G_i \right)$
$G_j = \max_{i<j} \left( l_{gap}(S_{i+1,j}) + P_i \right)$

$S_{i,j}$: segment between position i and position j of sequence S
$l_{seg}(s)$: confidence score of assigning s as domain region
$l_{gap}(s)$: confidence score of assigning s as boundary region

[Figure: a sequence alternating between domain regions and boundary regions.]

Experiments

Datasets:
• 25 SCOP folds: 2678 training domains and 471 test chains (189 multi-domain proteins).
• 40 SCOP super-families: 1917 training domains and 375 test chains (131 multi-domain proteins).

Baseline approach:
• Align test proteins against PSI-BLAST profiles of the training domains, and use the best aligned regions as domain regions.

Evaluation:
• Domain label: significant overlap between the true domain and the predicted domain.
• Boundaries: both the predicted start and end positions should be close to the true ones.

Experiments (Cont.)

Dataset             Fold Dataset (414 domains)    Super-family Dataset (288 domains)
Boundary distance   25            50              25            50
PSI-BLAST           8.7%(36)      11.4%(47)       5.6%(16)      15.3%(44)
DP_nogap            16.4%(68)     26.1%(108)      21.5%(62)     26.7%(77)
DP_gap              30.7%(127)    35.8%(148)      25.0%(72)     30.2%(87)
DP_alter            18.8%(78)     25.6%(106)      17.7%(51)     26.4%(76)

At least 75% positional overlap between the true domain and the prediction.

Protein Ranking

• Protein ranking: search a protein database for sequences that share an evolutionary or functional relationship with a given query sequence.
• Standard protein ranking algorithm: the pairwise alignment-based algorithm PSI-BLAST can easily detect close homologs.
• Pairwise alignment-based algorithms are not effective for remote homolog detection.

Query

Homologous protein

Remote homolog

Other labeled protein


From Local Similarity to Global Structure (RankProp)

[Figure: protein similarity network around the query, with nodes 1 to 8 marked as homologous protein, remote homolog, other labeled protein or unlabeled protein. Ranking based on local similarity: 12367845. Correct ranking: 123 678 45.]

Cluster assumption: proteins with a structural or functional relation tend to be in the same cluster in the network.

Diffusion on the protein similarity network captures the cluster structure.

Weston, Elisseeff, Zhou, Leslie, Noble, PNAS 2004

Noble, Kuang, Leslie, Weston, FEBS 2005

Weston, Kuang, Leslie, Noble, BMC Bioinformatics 2005

RankProp (Cont.)

Capture global structure with diffusion.

Protein similarity network:
• Graph nodes: protein sequences in the database
• Directed edges: weighted by PSI-BLAST e-value
• Initial ranking score at each node: the similarity to the query sequence

Iterative diffusion operation:

$Y_{t+1} = Y_0 + \alpha K Y_t$

• $Y_0$: initial ranking score
• $Y_t$: ranking score at step t
• K: normalized connectivity matrix
• α: a parameter for balancing the initial ranking score and propagation
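The diffusion step can be sketched directly from the update rule. The two-node matrix K below is a made-up example, and the fixed iteration count stands in for a convergence test; both are assumptions for illustration.

```python
def rankprop(K, y0, alpha, num_iters=100):
    """Iterate Y_{t+1} = Y_0 + alpha * K * Y_t for a fixed number of steps."""
    y = list(y0)
    n = len(y0)
    for _ in range(num_iters):
        y = [y0[i] + alpha * sum(K[i][j] * y[j] for j in range(n))
             for i in range(n)]
    return y

# Hypothetical network: node 0 is similar to the query, node 1 is only
# reachable through node 0.
K = [[0.0, 0.5],
     [0.5, 0.0]]
scores = rankprop(K, y0=[1.0, 0.0], alpha=0.5)
print(scores)  # approaches [16/15, 4/15], the fixed point of the update
```

Node 1 starts with no similarity to the query but ends with a positive score, which is the effect the diffusion is meant to capture for remote homologs.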

MotifProp

• Motivated by the HITS algorithm for page ranking and by NLP algorithms
• Protein-motif network
  - Nodes: proteins and motifs
  - Edges: whether a motif is contained in a protein
• Motifs: patterns/models built on protein segments conserved during evolution.
  - Often characterize structural/functional properties of a protein.
  - Examples: eMOTIF, PROSITE, k-mers, BLOCKS…

[Kuang, Weston, Noble, Leslie, Bioinformatics 2005]

…FYPGKGHTEDNIVVWLPQYNILVGGCLVKSTSAKDLGNVADAYVNEWSTSIENVLKRYRNINAVVPGHGEVG…

Motif Database

Query

MotifProp (Cont.)

MotifProp can identify motif-rich regions derived from the motif ranking, which helps interpret the diffusion algorithm.

Low computational cost: protein-motif network is fast to build.

Motifs serve as bridges connecting homologous/remote homologous proteins.

Motif vertices

Query

Protein vertices

Diffusion in Protein-motif Network

In MotifProp, protein nodes and motif nodes enforce their similarity to the query sequence through propagation.

MotifProp:
  Normalize affinity matrix H to $\tilde{H}$
  Initialize P and F with the initial activation values $P^0$ and $F^0$
  Iterate until convergence, with $\alpha \in (0,1)$:
    For all i: $P_i = (1-\alpha) P_i^0 + \alpha \sum_j \tilde{H}_{ij} F_j$
    For all i: $F_i = (1-\alpha) F_i^0 + \alpha \sum_j \tilde{H}'_{ij} P_j$

[Figure: bipartite network linking the query and other protein vertices to motif vertices.]
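A propagation of this kind on a protein-motif network can be sketched as below. Several details are assumptions rather than facts from the slides: the symmetric degree normalization of H, the convention that α weights the propagated term and (1 − α) the initial activation, the fixed iteration count, and the tiny made-up network.

```python
import math

def normalize(H):
    """Symmetric degree normalization of H (assumed; the slide does not name one)."""
    row = [sum(r) for r in H]
    col = [sum(H[i][j] for i in range(len(H))) for j in range(len(H[0]))]
    return [[H[i][j] / math.sqrt(row[i] * col[j]) if H[i][j] else 0.0
             for j in range(len(col))] for i in range(len(row))]

def motifprop(H, p0, f0, alpha, num_iters=200):
    """Alternate protein and motif updates on the bipartite network H."""
    Hn = normalize(H)
    n, m = len(p0), len(f0)
    P, F = list(p0), list(f0)
    for _ in range(num_iters):
        # Proteins collect activation from the motifs they contain.
        P = [(1 - alpha) * p0[i] + alpha * sum(Hn[i][j] * F[j] for j in range(m))
             for i in range(n)]
        # Motifs collect activation from the proteins that contain them.
        F = [(1 - alpha) * f0[j] + alpha * sum(Hn[i][j] * P[i] for i in range(n))
             for j in range(m)]
    return P, F

# Rows: query, protein A, protein B. Columns: motif m0, motif m1.
H = [[1, 0],
     [1, 1],
     [0, 1]]
P, F = motifprop(H, p0=[1.0, 0.0, 0.0], f0=[0.0, 0.0], alpha=0.5)
print(P[1] > P[2] > 0)  # True
```

Protein B shares no motif with the query, yet it receives activation through motif m1 and protein A, which is how motifs act as bridges between remote homologs.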

Experiments

7329 sequences (4246 for training and 3083 for testing) of <95% identity from SCOP 1.59 plus 100,000 proteins from Swiss-Prot.

Motif sets: 4-mers, PROSITE and eMOTIF.

Experiments (Cont.)

Algorithm ROC1 ROC10 ROC50

Sequential MotifProp 0.640 0.663 0.688

k-mer MotifProp 0.621 0.648 0.679

RankProp 0.592 0.667 0.725

PROSITE MotifProp 0.600 0.643 0.664

eMOTIF MotifProp 0.527 0.612 0.666

PSI-BLAST 0.594 0.616 0.641

Conserved Motifs between Remote Homologs

• We can derive weights of k-mer features from SVM classifiers trained with the profile kernel.
• MotifProp provides activation values on the k-mer features after propagation.
• Both the SVM weights and the motif activation values can be mapped back to protein sequences to identify conserved structural/functional motifs.

Positional contribution to classification score:

$S(x[j+1:j+k]) = \langle \Phi(P(x[j+1:j+k])), \Delta \rangle$

where Δ is the SVM weights or MotifProp activation values on k-mer features.

Mapping Discriminative Regions to Structure (Profile Kernel)

In examined examples, discriminative motif regions correspond to conserved structural features of the protein superfamily

Example: Homeodomain-like protein superfamily.

E. coli MarA protein (1bl0)

Motif Rich Regions (MotifProp)

• Motif-rich regions on chain B of the arsenite oxidase protein from the ISP protein super-family.
• The PDB annotation and motif-rich regions are given.
• The 3D protein structure is shown with motif-rich regions in yellow.

Conclusions & Contributions

Profile-based string kernels exploit compact representation of homology information for better detection of remote homologs.

Dynamic Programming-based approach improves multi-label domain classification and domain boundary detection over PSI-BLAST alignment-based approach.

MotifProp improves protein ranking over PSI-BLAST by network diffusion on protein-motif network.

Interpretation of profile-SVM classifier and MotifProp by motif regions: conserved structural components.

Fast kernels for inexact string matching.

Classifiers for protein backbone angle prediction (not presented).


SVM-FOLD Web Server

Protein function inference by structural genomics and proteomics
• Identify functional properties of protein structures with kernel methods, e.g. prediction of protein functional sites and structure-based identification of protein-protein interaction sites.
• Protein function inference from proteomics, e.g. protein function prediction based on protein-protein interaction patterns and protein structures.

Protein structure prediction
• Unified prediction of protein backbone and side chain positions (Phi-Psi angles and rotamers) with an energy-based cost function.

Future Research

Acknowledgements: Committee

Tony Jebara, Dept. of Computer Science, Columbia University

Christina Leslie (advisor), Center for Computational Learning Systems & C2B2, Columbia University

Kathleen McKeown (chair), Dept. of Computer Science, Columbia University

William Stafford Noble, Dept. of Genome Science & Dept. of Computer Science, University of Washington

Rocco Servedio, Dept. of Computer Science, Columbia University

Jason Weston, Machine Learning Group, NEC Labs (USA)

Acknowledgements: Collaborators

An-Suei Yang (Genome Research Center, Academia Sinica of Taiwan)

Dengyong Zhou (Machine Learning Group, Microsoft)

Yoav Freund (Computer Science Department, UCSD)

Eugene Ie (Computer Science Department, UCSD)

Ke Wang (Computer Science Department, Columbia University)

Wei Chu (Center For Computational Learning Systems, Columbia University)

Kai Wang (Biomedical Informatics Department, Columbia University)

Iain Melvin (Machine Learning Group, NEC)

Girish Yao (Computer Science Department, Columbia University)

Lan Xu (Molecular Biology Department, The Scripps Research Institute)

Publications

Structural classification:
• Profile kernels (JBCB 2005 and CSB 2004)
• Inexact matching kernels (JMLR 2004, COLT 2003 & KMCB 2004)

Protein ranking:
• RankProp (FEBS 2005 and BMC Bioinformatics 2005)
• MotifProp (Bioinformatics 2005)

Protein local structure prediction:
• Kernel methods based on sliding-window (Bioinformatics 2004)
• Structured output learning (ongoing research)

Protein domain segmentation (in preparation)

Protein backbone angle prediction

[Figure: Phi-Psi angles (Φ1,Ψ1) … (Φ8,Ψ8) along the backbone of a 3-D structure; discretization of the Phi-Psi angles assigns each residue a conformational state, e.g. A A A G B B B B B …]

Sliding-window SVM approach [Kuang, Leslie & Yang 2004]

Encode each position independently with sequence information within a length-k window.

[Figure: per-position profile rows inside the window (e.g. A: -3 -4 -4 -4 -3 -4 …, B: 0 -1 2 1 -3 4 0 -1 …) are passed to the SVM, which predicts conformational states such as A A A B B B B G G E B B B B B.]

Smoothing: use the predictions to train a second set of SVMs.

Experiments

Datasets: 697 sequences of 97,365 amino acids with sequence identity < 25 % from PDB (PDB_SELECT25).

Comparison against: LSBSP1: query against local structure-based sequence profile database.

HMMSTR: Hidden Markov Model based on local structural motifs.

RankProp in Genome Browser

Regularization Framework

n

iii

n

jijiij YYYYWYQ

1

0*2

1,

*** ||||1

||||2

1)(

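Closed form solution of MotifProp

Related to the regularization framework in Zhou et al., NIPS 2003:

$Y^* = (1-\alpha)(I - \alpha W)^{-1} Y^0$, where

• Initial ranking: $Y^0 = \begin{bmatrix} P^0 \\ F^0 \end{bmatrix}$; final ranking: $Y^* = \begin{bmatrix} P^* \\ F^* \end{bmatrix}$
• Normalized affinity matrix: $W = \begin{bmatrix} 0 & \tilde{H} \\ \tilde{H}' & 0 \end{bmatrix}$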

Discussion: Alignment vs. K-mers

Rangwala and Karypis (2005) achieved further improvement on a previous benchmark dataset on SCOP 1.53 with kernels defined on profile-profile alignment.

Proteins are documents with no punctuation, and there is no dictionary!

Optimal local alignment detects the most conserved paragraph.

Sensitive for detecting homologous proteins.

Good for remote homologs with one relatively long conserved region.

Alignment is easily interpretable.

Length k subsequences summarize local matches

Can detect discontinuous and disordered conserved regions between remote homologs

Can achieve fast computation

Well defined k-mer feature space for applying learning algorithms

Extracting Discriminative Motif Regions

SVM training determines support vector sequence profiles and their weights: $(P(x_i), \alpha_i)$

SVM decision hyperplane normal vector:

$w = \sum_i y_i \alpha_i \Phi(P(x_i))$

Positional contribution to classification score:

$S(x[j+1:j+k]) = \langle \Phi(P(x[j+1:j+k])), w \rangle$

Averaged positional score for positive sequences:

$S_{avg}(x[j]) = \sum_{q=1}^{k} S(x[j-k+q : j-1+q])$
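Mapping window scores back to positions amounts to summing, for each position, the scores of the k-mer windows that cover it. The sketch below assumes the window scores are already computed; the function name and the example numbers are hypothetical.

```python
def positional_scores(window_scores, k):
    """Sum, for every position, the scores of the k-mer windows covering it.

    window_scores[s] is the score of the window that starts at 0-based
    position s, so the sequence has len(window_scores) + k - 1 positions.
    """
    totals = [0.0] * (len(window_scores) + k - 1)
    for start, score in enumerate(window_scores):
        for pos in range(start, start + k):
            totals[pos] += score
    return totals

print(positional_scores([1.0, 0.0, 2.0], k=2))  # [1.0, 1.0, 2.0, 2.0]
```

Positions near the ends of the sequence are covered by fewer than k windows, so their totals are sums over the windows that exist.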

Map Motif Rich Regions

Map final motif activation values to query sequence to find conserved structural components between remote homologs.

Determination of Protein Structures

X-ray crystallography: the interaction of X-rays with electrons arranged in a crystal produces an electron-density map, which can be interpreted as an atomic model. Crystals are very hard to grow.

Nuclear magnetic resonance (NMR): some atomic nuclei have a magnetic spin. The molecule is probed with radio frequencies to obtain the distances between atoms. Only applicable to small molecules.

Protein structure prediction

Comparative modeling: where there is a clear sequence relationship between the target structure and one or more known structures.

Fold recognition ('threading'): no sequence homology with known structures. Find consistent folds (remote homology detection).

Ab initio structure prediction ('de novo'): deriving structures, approximate or otherwise, from sequence.

RankProp

Protein similarity network:
• Graph nodes: protein sequences in the database
• Directed edges: exp(-Sij/σ), where Sij is the PSI-BLAST e-value between the ith protein and the jth protein.
• Initial ranking score at each node: the similarity to the query sequence

Iterative diffusion operation: Yt+1 = Y0 + αKYt
• Y0: initial ranking score
• Yt: ranking score at step t
• K: normalized connectivity matrix
• α ∈ (0,1): a parameter for balancing the initial ranking score and propagation

Weston, Elisseeff, Zhou, Leslie, Noble, PNAS 2004

Noble, Kuang, Leslie, Weston, FEBS 2005

Weston, Kuang, Leslie, Noble, BMC Bioinformatics 2005

Diffusion in protein similarity network

Remote homology can be detected by diffusion from common neighbors in the cluster.

• The protein network is expensive to build.
• The ranking is hard to interpret.

Sort positional scores: about 40%-50% of positions in positive training sequences contribute 90% of classification score

Peaky positional plots → discriminative motifs

Extracting Discriminative Motif Regions

Inferring Protein Structure with Machine Learning

Machine learning builds statistical models from data to learn underlying principles automatically.

Machine Learning techniques are promising for inferring protein structure: large amount of data but little theory.

Solved structures in protein databank (PDB) provide valuable knowledge about protein folding patterns.

[Figure: total number of structures in PDB and new structures added per year.]

Sequence Alignment for Matching Proteins

The Smith-Waterman algorithm finds the optimal local alignment with maximum substitution scores between two sequences by dynamic programming.
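The dynamic program behind Smith-Waterman can be sketched as follows. The scoring scheme here (fixed match and mismatch scores with a linear gap penalty) is a simplifying assumption; real protein alignments use a substitution matrix and usually affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (toy scoring scheme)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("GGACDEGG", "TTACDETT"))  # 8: the shared ACDE block
```

The table has |a| × |b| cells, which is the quadratic cost that BLAST-style heuristics are designed to avoid.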

Alignment-based Algorithms

Smith-Waterman algorithm: find optimal local alignments by dynamic programming

BLAST & PSI-BLAST [Altschul 1997]: fast approximations of the Smith-Waterman algorithm. They only extend alignments from short identical stretches potentially contained in true matches. Profiles are built to search the database iteratively.

SAM-98 [Karplus 1999]: HMM based approach. Build HMM for the query and target sequences. Rank sequences by likelihood computed from HMMs.

Kleinberg, 1998

HITS Algorithm for Page Ranking

• Good hubs: web pages with many pointers to related pages.
• Good authorities: web pages pointed to by hubs.
• Recursive updating enforces good hubs and good authorities.

[Figure: hubs pointing to authorities.]

Let N be the set of edges
Initialize Hub[A] and Aut[A] to 1
Iterate until convergence:
  For all A, Aut[A] = ∑_{(B,A)∈N} Hub[B]
  For all A, Hub[A] = ∑_{(A,B)∈N} Aut[B]
  Normalize Aut and Hub
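The pseudocode above translates almost line for line into Python. The choice of Euclidean normalization, the fixed iteration count in place of a convergence test, and the three-edge example graph are assumptions made for this sketch.

```python
import math

def hits(edges, num_iters=50):
    """Hub and authority scores for a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    aut = {node: 1.0 for node in nodes}
    for _ in range(num_iters):
        # Authority: sum of hub scores of pages pointing to A.
        aut = {a: sum(hub[b] for b, target in edges if target == a) for a in nodes}
        # Hub: sum of authority scores of pages that A points to.
        hub = {a: sum(aut[b] for source, b in edges if source == a) for a in nodes}
        for scores in (aut, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for node in scores:
                scores[node] /= norm
    return hub, aut

edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1")]
hub, aut = hits(edges)
print(aut["a1"] > aut["a2"], hub["h1"] > hub["h2"])  # True True
```

MotifProp applies the same mutual-update idea to a bipartite graph, with proteins and motifs playing the two roles.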
