Remote homology detection


Upload: jariah

Post on 17-Jan-2016


TRANSCRIPT

Page 1: Remote homology detection

Remote homology detection

Remote homologs: low sequence similarity, conserved structure/function

A number of databases and tools are available

BLAST, FASTA PDB HOMSTRAD SCOP

Efficient methods are still needed for detecting proteins with similar function and structure

Page 2: Remote homology detection

SCOP Database

SCOP: Structural Classification of Proteins

Class Level

Fold Level

Superfamily Level

Family Level

Page 3: Remote homology detection

SCOP Database

SCOP: Structural Classification of Proteins

Class Level

Based on arrangement of secondary structures

all-alpha

all-beta

alpha-and-beta (interspersed)

alpha+beta (segregated)

multidomain

Page 4: Remote homology detection

SCOP Database

SCOP: Structural Classification of Proteins

Class Level

Fold Level

Same secondary structures, arrangements, topology

Page 5: Remote homology detection

SCOP Database

SCOP: Structural Classification of Proteins

Class Level

Fold Level

Superfamily Level

Structure and function suggest common evolutionary origin

Page 6: Remote homology detection

SCOP Database

SCOP: Structural Classification of Proteins

Class Level

Fold Level

Superfamily Level

Family Level

> 30% sequence identity or similar structure/function
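The four SCOP levels above form a simple hierarchy. As a minimal sketch (the `ScopLabel` type, the `deepest_shared_level` helper, and the example label strings are hypothetical names chosen for illustration, not part of SCOP itself), two proteins can be compared by the deepest level at which their classifications agree:

```python
from dataclasses import dataclass

# Levels ordered from most general to most specific, as on the slides.
LEVELS = ("class", "fold", "superfamily", "family")

@dataclass(frozen=True)
class ScopLabel:
    """Hypothetical container for one protein's SCOP classification."""
    scop_class: str
    fold: str
    superfamily: str
    family: str

    def path(self, level):
        """Labels from the class level down to and including `level`."""
        depth = LEVELS.index(level) + 1
        return (self.scop_class, self.fold, self.superfamily, self.family)[:depth]

def deepest_shared_level(a, b):
    """Return the most specific level at which a and b agree, or None."""
    shared = None
    for level in LEVELS:
        if a.path(level) != b.path(level):
            break
        shared = level
    return shared

# Made-up labels: same superfamily, different families.
p = ScopLabel("all-alpha", "fold-1", "superfamily-1", "family-1")
q = ScopLabel("all-alpha", "fold-1", "superfamily-1", "family-2")
print(deepest_shared_level(p, q))
```

Proteins that agree at the superfamily level but not at the family level are the kind of pair where sequence similarity may be low while structure and function are conserved.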

Page 7: Remote homology detection

SCOP Database

Another representation

protein

family

superfamily

Page 8: Remote homology detection

Classification problem

Given a query protein, identify functionally similar proteins from a database of known proteins

Page 9: Remote homology detection

Classification problem

Given a query protein, identify functionally similar proteins from a database of known proteins

State-of-the-art methods employ Support Vector Machines (SVM)

Input: set of labeled data points (positive or negative)

Output: model that correctly classifies both the original input data and new, unseen data points

SVM finds a hyperplane that separates the positive and negative input data; new points are classified by which side of the hyperplane they fall on
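The hyperplane idea can be illustrated with a small, self-contained sketch. This is not the SVM solver used in the work described here; it is a simplified linear SVM trained by stochastic subgradient descent on the regularized hinge loss, and the function names, the parameters `lam` and `eta`, and the toy 2-D points are all assumptions made for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(points, labels, lam=0.01, eta=0.01, epochs=200, seed=0):
    """Stochastic subgradient descent on the L2-regularized hinge loss.

    points: list of equal-length numeric tuples; labels: +1 or -1.
    Returns (w, b) describing the hyperplane w . x + b = 0.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    b = 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = points[i], labels[i]
            margin = y * (dot(w, x) + b)
            # Regularization shrinks w a little on every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Points inside the margin (or misclassified) pull the hyperplane.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    """Classify x by the side of the hyperplane it falls on."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy, linearly separable data (illustrative only).
points = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print(predict(w, b, (4.0, 4.0)), predict(w, b, (-4.0, -4.0)))
```

In practice a dedicated SVM package with kernel support would normally be used; the sketch only shows how labeled points produce a separating hyperplane that is then applied to unseen points.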

Page 10: Remote homology detection

Support vector machines (SVM)


Page 11: Remote homology detection

SVM and Data representation

Each data point has to be represented as an n-dimensional vector

this is called the feature vector representation of the data

it encodes information about properties of the data

Domain knowledge can/should be used to choose an appropriate feature representation

[Diagram: Building SVM-based classifier. Input Data -> Feature Representation -> SVM Training -> SVM-based Classifier, which is then applied to Unseen Data]

Page 12: Remote homology detection

Outline

Related work: article classification; protein classification using sequence information

Proposed method: protein classification using structure information

Common thread: vocabulary (a set of possible features); feature vector (counts the number of times each feature occurs)

Page 13: Remote homology detection

Article classification

Categorizing Reuters articles (Joachims, 98)

Feature representation of articles: the vocabulary is the set of all English words; the feature vector represents the count of each word in the article

Fat doses of red wine extract help obese mice stay healthy

A daily glass of red wine was linked to beneficial health effects a decade ago. Long suspected of playing a role in the "French paradox" (a high-fat diet with no ill effects on longevity), resveratrol is found in red wine, sadly in doses about 300 times lower than in the mouse study.

0 computer

2 dose

1 diet

0 felony

. . . . . . . .

2 health

0 insurance

0 liquor

2 mouse

. . . . . . . .

1 obese

1 paradox

3 red

3 wine
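The word-count representation above can be sketched in a few lines. The function names and the short vocabulary are assumptions for illustration. The slide appears to merge word forms (for example "doses" under "dose" and "mice" under "mouse"); this sketch does not do that and counts exact lowercase words only.

```python
import re
from collections import Counter

def word_counts(text):
    """Count occurrences of each lowercase alphabetic word in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def feature_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = word_counts(text)
    return [counts[word] for word in vocabulary]

# First part of the example article from the slide.
article = (
    "Fat doses of red wine extract help obese mice stay healthy. "
    "A daily glass of red wine was linked to beneficial health effects a decade ago."
)
# A tiny illustrative vocabulary; the real one is the set of all English words.
vocabulary = ["computer", "health", "obese", "red", "wine"]
print(feature_vector(article, vocabulary))
```

Words absent from the article simply get a count of 0, which is why most entries of a realistic feature vector are zero.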

Page 14: Remote homology detection

Protein classification (sequence)

Categorizing proteins using sequence information (Leslie et al., 04)

Feature representation of proteins: the vocabulary is all k-letter words from the amino acid alphabet; the feature vector represents the count of each “word” in the protein

LVLHSEGWAKVQLVLHVWAKVE . . . . .

0 AAAA

0 AAAC

0 AAAD

0 AAAE

. . . . . . . .

2 LVLH

0 LVLI

0 LVLK

. . . . . . . .

0 WAKS

0 WAKT

2 WAKV

. . . . . . . .
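The k-letter word counts listed above (with k = 4) can be reproduced with a short sketch. The function names are assumptions for illustration, and only the visible prefix of the example sequence is used, since the slide truncates the rest.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def kmer_counts(sequence, k):
    """Count every overlapping k-letter word in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def spectrum_vector(sequence, k, alphabet=AMINO_ACIDS):
    """One count per possible k-letter word, in alphabetical order."""
    counts = kmer_counts(sequence, k)
    return [counts["".join(word)] for word in product(alphabet, repeat=k)]

sequence = "LVLHSEGWAKVQLVLHVWAKVE"  # visible prefix from the slide
counts = kmer_counts(sequence, 4)
print(counts["LVLH"], counts["WAKV"], counts["AAAA"])  # prints: 2 2 0
print(len(spectrum_vector(sequence, 4)))               # prints: 160000
```

With 20 amino acids and k = 4 the vocabulary has 20^4 = 160000 words, so in practice the vector is usually kept sparse (as the `Counter` does) rather than written out in full.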

Page 15: Remote homology detection

Protein classification (structure)

Categorizing proteins using structure information (Ilinkin, Ye, in progress)

Feature representation of proteins: the vocabulary is all pairwise distances of k consecutive amino acids; the feature vector represents the count of each “word” in the protein

D(i, j) = distance between amino acids i and j

D =

0   3.8 6.5 4.1 4.6 2.7 5.3 2.9 4.1
3.8 0   3.4 2.8 6.4 2.9 3.3 1.9 3.7
6.5 3.4 0   3.7 5.8 2.8 5.7 1.8 2.6
4.1 2.8 3.7 0   3.1 2.2 7.0 4.2 5.3
4.6 6.4 5.8 3.1 0   3.8 6.5 4.1 3.0
2.7 2.9 2.8 2.2 3.8 0   3.4 2.8 4.8
5.3 3.3 5.7 7.0 6.5 3.4 0   3.7 2.1
2.9 1.9 1.8 4.2 4.1 2.8 3.7 0   3.5
4.1 3.7 2.6 5.3 3.0 4.8 2.1 3.5 0

(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)

(3.4, 2.8, 6.4, 3.7, 5.8, 3.1)

(3.6, 4.9, 4.8, 3.5, 2.1, 3.5)

(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)

(3.1, 2.2, 7.0, 3.7, 4.3, 3.6)

(3.7, 5.8, 2.8, 3.1, 2.2, 3.7)

(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)

(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
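The first two tuples above are consistent with reading the k(k-1)/2 pairwise distances of k = 4 consecutive amino acids off the matrix D. A minimal sketch of that reading follows; the function names are assumptions, indices are 0-based in the code, and how real-valued tuples are grouped into countable "words" (for example by rounding or binning) is not stated in the slides, so the sketch counts exact tuples only.

```python
from collections import Counter
from itertools import combinations

# Distance matrix D from the slide (9 amino acids).
D = [
    [0.0, 3.8, 6.5, 4.1, 4.6, 2.7, 5.3, 2.9, 4.1],
    [3.8, 0.0, 3.4, 2.8, 6.4, 2.9, 3.3, 1.9, 3.7],
    [6.5, 3.4, 0.0, 3.7, 5.8, 2.8, 5.7, 1.8, 2.6],
    [4.1, 2.8, 3.7, 0.0, 3.1, 2.2, 7.0, 4.2, 5.3],
    [4.6, 6.4, 5.8, 3.1, 0.0, 3.8, 6.5, 4.1, 3.0],
    [2.7, 2.9, 2.8, 2.2, 3.8, 0.0, 3.4, 2.8, 4.8],
    [5.3, 3.3, 5.7, 7.0, 6.5, 3.4, 0.0, 3.7, 2.1],
    [2.9, 1.9, 1.8, 4.2, 4.1, 2.8, 3.7, 0.0, 3.5],
    [4.1, 3.7, 2.6, 5.3, 3.0, 4.8, 2.1, 3.5, 0.0],
]

def distance_word(dist, start, k):
    """Pairwise distances among residues start .. start + k - 1."""
    window = range(start, start + k)
    return tuple(dist[i][j] for i, j in combinations(window, 2))

def distance_words(dist, k):
    """One distance tuple per window of k consecutive residues."""
    return [distance_word(dist, s, k) for s in range(len(dist) - k + 1)]

words = distance_words(D, 4)
print(words[0])  # prints: (3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
print(words[1])  # prints: (3.4, 2.8, 6.4, 3.7, 5.8, 3.1)

# Feature vector in sparse form: how often each exact tuple occurs.
feature_counts = Counter(words)
```

Because measured distances rarely repeat exactly, some form of discretization would presumably be needed before counts become informative; that step is left out here rather than guessed.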

Page 16: Remote homology detection

Experimental setup

Given a query protein, can we predict its superfamily (in or out)?

Split the data into positive (in) and negative (out) examples

Reserve some of the data for testing; the rest is for training the SVM

[Diagram: the positive (+) and negative (-) examples are divided into a train set and a test set; Feature Vectors and SVM Training on the train set yield a Classifier, which is evaluated on the test set]
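The split into training and test data can be sketched as follows. The slides do not say how the held-out examples were chosen, so the random, per-class split, the `test_fraction` parameter, and the placeholder protein names below are all assumptions for illustration.

```python
import random

def split_train_test(examples, labels, test_fraction=0.25, seed=0):
    """Hold out a fraction of the positives and of the negatives for testing.

    labels are +1 (in the superfamily) or -1 (out). Returns (train, test),
    each a list of (example, label) pairs.
    """
    rng = random.Random(seed)
    train, test = [], []
    for target in (1, -1):
        group = [(x, y) for x, y in zip(examples, labels) if y == target]
        rng.shuffle(group)
        # Keep at least one example of each class in the test set.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Placeholder data: 4 positive and 12 negative examples.
examples = [f"protein_{i}" for i in range(16)]
labels = [1] * 4 + [-1] * 12
train, test = split_train_test(examples, labels)
print(len(train), len(test))  # prints: 12 4
```

Splitting each class separately keeps positives in both sets even when, as in the diagram, negatives greatly outnumber positives.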

Page 17: Remote homology detection

Results

ROC curve plots true positive rate vs false positive rate

Area under ROC curve (ROC score) is a measure of the quality of classification

area is between 0 and 1; closer to 1 is better

[Figure: Sample ROC Curve. Axes: false positive (x), true positive (y); label: Area under ROC]

[Figure: Experimental Results. Number of families (y) vs ROC score (x) for the structure and sequence methods]
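The ROC score can be computed directly from classifier scores and true labels. The sketch below builds the curve by sweeping a threshold from the highest score downward and integrates it with the trapezoid rule; the function names and the toy scores are assumptions for illustration.

```python
def roc_points(scores, labels):
    """ROC curve as (false positive rate, true positive rate) points.

    labels are +1 or -1; both classes must be present.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Examples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points

def roc_score(scores, labels):
    """Area under the ROC curve (trapezoid rule)."""
    points = roc_points(scores, labels)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: one negative is ranked above one positive.
print(roc_score([0.9, 0.8, 0.7, 0.6], [1, -1, 1, -1]))  # prints: 0.75
```

A score of 1 means every positive is ranked above every negative, and 0.5 corresponds to a random ranking.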