Remote homology detection

Remote homologs: low sequence similarity, conserved structure/function

A number of databases and tools are available

BLAST, FASTA, PDB, HOMSTRAD, SCOP

Efficient methods are still needed for detecting proteins with similar function and structure

SCOP Database

SCOP: Structural Classification of Proteins

Class Level

Fold Level

Superfamily Level

Family Level

Class Level

Based on arrangement of secondary structures

all-alpha

all-beta

alpha-and-beta (interspersed)

alpha+beta (segregated)

multidomain

Fold Level

Same secondary structures, arrangements, topology

Superfamily Level

Structure and function suggest common evolutionary origin

Family Level

> 30% sequence identity or similar structure/function

SCOP Database

Another representation

[Figure: diagram labeled protein, family, superfamily]

Classification problem

Given a query protein, identify functionally similar proteins from a database of known proteins

State-of-the-art methods employ Support Vector Machines (SVM)

Input: set of labeled data points (positive or negative)

Output: model that correctly classifies both the original input data and new, unseen data points

Support vector machines (SVM)

SVM finds a hyperplane that separates the input data

New points are classified with respect to the hyperplane
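The idea of finding a separating hyperplane and classifying new points against it can be sketched in a few lines. This is a minimal illustration only, assuming a linear SVM trained by plain subgradient descent on the regularized hinge loss; the function names, the toy 2-D data, and the parameters `lam`, `lr`, and `epochs` are hypothetical and not taken from the slides.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=500):
    """Minimize lam/2 * |w|^2 + mean hinge loss by batch subgradient descent.

    labels must be +1 (positive) or -1 (negative).
    Returns the hyperplane parameters (w, b).
    """
    dim = len(points[0])
    n = len(points)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(points, labels):
            # Only points inside the margin contribute to the hinge subgradient.
            if y * (dot(w, x) + b) < 1.0:
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Classify x by which side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical toy data: three positive and three negative points in 2-D.
train_x = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(train_x, train_y)
predictions = [predict(w, b, x) for x in train_x]
```

In practice a library SVM implementation with kernel support would be used; the sketch only shows the two roles of the hyperplane: it is fitted to the labeled input data and then used to classify unseen points.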

SVM and data representation

Each data point has to be represented as an n-dimensional vector

This is called the feature vector representation of the data; it encodes information about properties of the data

Domain knowledge can/should be used to choose an appropriate feature representation

Building an SVM-based classifier

[Figure: input data -> feature representation -> SVM training -> SVM-based classifier, which is then applied to unseen data]

Outline

Related work: article classification; protein classification using sequence information

Proposed method: protein classification using structure information

Common thread: a vocabulary (a set of possible features) and a feature vector (counts the number of times each feature occurs)

Article classification

Categorizing Reuters articles (Joachims, 98)

Feature representation of articles: the vocabulary is the set of all English words; the feature vector represents the count of each word in the article

Example article:

Fat doses of red wine extract help obese mice stay healthy

A daily glass of red wine was linked to beneficial health effects a decade ago. Long suspected of playing a role in the "French paradox", a high-fat diet with no ill effects on longevity, resveratrol is found in red wine, sadly in doses about 300 times lower than in the mouse study.

Feature vector of the example article (excerpt):

count  word
0      computer
2      dose
1      diet
0      felony
...    ...
2      health
0      insurance
0      liquor
2      mouse
...    ...
1      obese
1      paradox
3      red
3      wine
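The word-count representation can be sketched as follows. This is a minimal illustration, assuming exact lowercase word matching over a small hypothetical vocabulary; the function name `bag_of_words` is an assumption, and the original work may normalize word forms differently.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in text (exact, case-insensitive match)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical small vocabulary; the real one is the set of all English words.
vocabulary = ["computer", "diet", "felony", "paradox", "red", "wine"]

article = (
    "Fat doses of red wine extract help obese mice stay healthy. "
    "A daily glass of red wine was linked to beneficial health effects "
    "a decade ago. Long suspected of playing a role in the \"French paradox\", "
    "a high-fat diet with no ill effects on longevity, resveratrol is found "
    "in red wine, sadly in doses about 300 times lower than in the mouse study."
)

vector = bag_of_words(article, vocabulary)
```

Because this sketch matches exact words only, "doses" would not be counted under "dose" and "mice" would not be counted under "mouse"; the counts on the slide suggest that related word forms were merged, which would need an extra normalization step.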

Protein classification (sequence)

Categorizing proteins using sequence information (Leslie et al., 04)

Feature representation of proteins: the vocabulary is all k-letter words from the amino acid alphabet; the feature vector represents the count of each "word" in the protein

Example sequence: LVLHSEGWAKVQLVLHVWAKVE . . . . .

Feature vector of the example sequence (excerpt, k = 4):

count  word
0      AAAA
0      AAAC
0      AAAD
0      AAAE
...    ...
2      LVLH
0      LVLI
0      LVLK
...    ...
0      WAKS
0      WAKT
2      WAKV
...    ...
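Counting k-letter words amounts to sliding a window of length k along the sequence. The sketch below uses the example sequence from the slide with k = 4; the function name `kmer_counts` is hypothetical, and only the words that actually occur are stored, since the full vocabulary has 20^k entries.

```python
from collections import Counter

def kmer_counts(sequence, k):
    """Count every overlapping k-letter word in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

sequence = "LVLHSEGWAKVQLVLHVWAKVE"
counts = kmer_counts(sequence, 4)

# Words absent from the sequence have count 0 (Counter returns 0 for missing keys).
lvlh = counts["LVLH"]
wakv = counts["WAKV"]
aaaa = counts["AAAA"]
```

Keeping the counts in a sparse mapping rather than a dense vector of length 20^k is a design choice for the sketch; a dense vector would be almost entirely zeros for a single protein.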

D =

0    3.8  6.5  4.1  4.6  2.7  5.3  2.9  4.1
3.8  0    3.4  2.8  6.4  2.9  3.3  1.9  3.7
6.5  3.4  0    3.7  5.8  2.8  5.7  1.8  2.6
4.1  2.8  3.7  0    3.1  2.2  7.0  4.2  5.3
4.6  6.4  5.8  3.1  0    3.8  6.5  4.1  3.0
2.7  2.9  2.8  2.2  3.8  0    3.4  2.8  4.8
5.3  3.3  5.7  7.0  6.5  3.4  0    3.7  2.1
2.9  1.9  1.8  4.2  4.1  2.8  3.7  0    3.5
4.1  3.7  2.6  5.3  3.0  4.8  2.1  3.5  0

D(i, j) = distance between amino acids i and j

Protein classification (structure)

Categorizing proteins using structure information (Ilinkin, Ye, in progress)

Feature representation of proteins: the vocabulary is all pairwise distances of k consecutive amino acids; the feature vector represents the count of each "word" in the protein

Example words:

(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)

(3.4, 2.8, 6.4, 3.7, 5.8, 3.1)

(3.6, 4.9, 4.8, 3.5, 2.1, 3.5)

(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)

(3.1, 2.2, 7.0, 3.7, 4.3, 3.6)

(3.7, 5.8, 2.8, 3.1, 2.2, 3.7)

(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)

(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
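A structure "word" can be read off the distance matrix by taking the pairwise distances inside each window of k consecutive amino acids. The sketch below uses the 9 x 9 example matrix D from the slides and assumes k = 4, since the example words have six entries (the number of pairs among four residues); the function name `distance_words` and the ordering of the pairs within a word are assumptions.

```python
# Example distance matrix D from the slides: D[i][j] is the distance
# between amino acids i and j (0-based indices here).
D = [
    [0, 3.8, 6.5, 4.1, 4.6, 2.7, 5.3, 2.9, 4.1],
    [3.8, 0, 3.4, 2.8, 6.4, 2.9, 3.3, 1.9, 3.7],
    [6.5, 3.4, 0, 3.7, 5.8, 2.8, 5.7, 1.8, 2.6],
    [4.1, 2.8, 3.7, 0, 3.1, 2.2, 7.0, 4.2, 5.3],
    [4.6, 6.4, 5.8, 3.1, 0, 3.8, 6.5, 4.1, 3.0],
    [2.7, 2.9, 2.8, 2.2, 3.8, 0, 3.4, 2.8, 4.8],
    [5.3, 3.3, 5.7, 7.0, 6.5, 3.4, 0, 3.7, 2.1],
    [2.9, 1.9, 1.8, 4.2, 4.1, 2.8, 3.7, 0, 3.5],
    [4.1, 3.7, 2.6, 5.3, 3.0, 4.8, 2.1, 3.5, 0],
]

def distance_words(dist, k):
    """Return one word per window of k consecutive residues.

    Each word lists the distances for all pairs (i, j), i < j, inside the
    window, i.e. the upper triangle of the k x k block on the diagonal.
    """
    n = len(dist)
    words = []
    for start in range(n - k + 1):
        word = tuple(
            dist[i][j]
            for i in range(start, start + k)
            for j in range(i + 1, start + k)
        )
        words.append(word)
    return words

words = distance_words(D, 4)
```

Turning these words into counts requires deciding when two real-valued tuples are the same "word" (for example by binning the distances); the slides do not specify that step, so it is left out of the sketch.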

Experimental setup

Given a query protein, can we predict its superfamily (in or out)?

Split the data into positive (in) and negative (out) examples

Reserve some of the data for testing; the rest is for training the SVM

[Figure: positive (+) and negative (-) examples are split into a train set and a test set; the train set goes through feature vectors and SVM training to produce a classifier, which is applied to the test set]
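Reserving part of the data for testing can be sketched as a per-class random split. This is an illustration only: the slides do not say how the held-out examples were chosen, so the random selection, the `test_fraction` value, the seed, and the protein names below are all assumptions.

```python
import random

def split_train_test(examples, labels, test_fraction=0.25, seed=0):
    """Split examples into train and test sets, keeping both classes in each.

    Returns two lists of (example, label) pairs.
    """
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(set(labels)):
        group = [ex for ex, lab in zip(examples, labels) if lab == label]
        rng.shuffle(group)
        # Hold out at least one example of every class.
        n_test = max(1, round(len(group) * test_fraction))
        test += [(ex, label) for ex in group[:n_test]]
        train += [(ex, label) for ex in group[n_test:]]
    return train, test

# Hypothetical data: 4 proteins inside the superfamily (+1), 8 outside (-1).
proteins = [f"protein_{i}" for i in range(12)]
labels = [1] * 4 + [-1] * 8
train, test = split_train_test(proteins, labels)
```

Splitting each class separately keeps the few positive examples from ending up entirely in one of the two sets, which matters here because negatives heavily outnumber positives.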

Results

ROC curve plots true positive rate vs false positive rate

Area under the ROC curve (ROC score) is a measure of the quality of classification; the area is between 0 and 1, and closer to 1 is better

[Figure: Sample ROC Curve; x-axis: false positive; y-axis: true positive; annotation: area under ROC]

[Figure: Experimental Results; x-axis: ROC score; y-axis: number of families; series: structure, sequence]
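The ROC score can be computed from classifier scores by sweeping a threshold from high to low and integrating the resulting curve. The sketch below is a minimal version under stated assumptions: labels are +1 or -1, both classes are present, and the scores and labels in the usage lines are made-up illustration values, not experimental results.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, y in ranked if y == 1)
    n_neg = len(ranked) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Examples with equal scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points

def roc_score(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Made-up scores: higher means "more likely in the superfamily".
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, -1, 1, -1, -1]
score = roc_score(roc_points(scores, labels))
```

The resulting area equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, which is why a score near 1 indicates good separation and 0.5 corresponds to random ranking.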
