Remote homology detection
Remote homologs: low sequence similarity, conserved structure/function
A number of databases and tools are available
BLAST, FASTA, PDB, HOMSTRAD, SCOP
Efficient methods are still needed for detecting proteins with similar function and structure
SCOP Database
SCOP: Structural Classification of Proteins
Class Level: based on arrangement of secondary structures
all-alpha
all-beta
alpha-and-beta (interspersed)
alpha+beta (segregated)
multidomain
Fold Level: same secondary structures, arrangements, topology
Superfamily Level: structure and function suggest common evolutionary origin
Family Level: > 30% sequence identity or similar structure/function
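The four levels can be read off a SCOP concise classification string (sccs) such as a.1.1.2, whose dot-separated fields are class, fold, superfamily and family. The sketch below shows one way to split such a string and to test whether two domains share a superfamily. The names `ScopId`, `parse_sccs` and `same_superfamily` are hypothetical helpers introduced here, not part of the slides.

```python
from typing import NamedTuple

# Class letters used in sccs strings, restricted to the classes listed above.
SCOP_CLASSES = {
    "a": "all-alpha",
    "b": "all-beta",
    "c": "alpha-and-beta (interspersed)",
    "d": "alpha+beta (segregated)",
    "e": "multidomain",
}


class ScopId(NamedTuple):
    cls: str          # e.g. "a"
    fold: str         # e.g. "a.1"
    superfamily: str  # e.g. "a.1.1"
    family: str       # e.g. "a.1.1.2"


def parse_sccs(sccs: str) -> ScopId:
    """Split 'class.fold.superfamily.family' into cumulative level ids."""
    parts = sccs.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected class.fold.superfamily.family, got {sccs!r}")
    return ScopId(
        cls=parts[0],
        fold=".".join(parts[:2]),
        superfamily=".".join(parts[:3]),
        family=sccs,
    )


def same_superfamily(a: str, b: str) -> bool:
    """True when two sccs strings agree down to the superfamily level."""
    return parse_sccs(a).superfamily == parse_sccs(b).superfamily
```

Two domains in different families of the same superfamily, for example a.1.1.2 and a.1.1.3, are the kind of pair that remote homology detection is meant to recover.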
SCOP Database
Another representation
protein
family
superfamily
Classification problem
Given a query protein, identify functionally similar proteins from a database of known proteins
State-of-the-art methods employ Support Vector Machines (SVM)
Support vector machines (SVM)
Input: set of labeled data points (positive or negative)
Output: model that correctly classifies both the original input data and new, unseen data points
SVM finds a hyperplane that separates the input data; new points are classified with respect to the hyperplane
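As a rough illustration of the hyperplane idea, the sketch below fits a linear SVM to four toy 2-D points by stochastic subgradient descent on the regularized hinge loss, then classifies new points by the sign of w·x + b. The solver, the parameters `lam`, `lr` and `epochs`, and the toy data are all assumptions made for this example. The slides do not say which SVM implementation or kernel was used, and real work would normally rely on an established SVM library.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on lam/2*|w|^2 + hinge loss.

    labels must be +1 (positive) or -1 (negative).
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            # A point violates the margin when y * (w.x + b) < 1.
            violated = y * (dot(w, x) + b) < 1.0
            # Regularization step: shrink w slightly on every update.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if violated:
                # Hinge-loss step: push the hyperplane away from the point.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """Classify x by which side of the hyperplane w.x + b = 0 it lies on."""
    return 1 if dot(w, x) + b >= 0 else -1


# Toy, linearly separable training data (made up for illustration).
train_points = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
train_labels = [1, 1, -1, -1]
w, b = train_linear_svm(train_points, train_labels)
```

After training, `classify(w, b, (3.0, 2.0))` labels a new point on the positive side and `classify(w, b, (-1.0, -2.0))` labels one on the negative side.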
SVM and Data representation
Each data point has to be represented as an n-dimensional vector
this is called the feature vector representation of the data
it encodes information about properties of the data
Domain knowledge can/should be used to choose an appropriate feature representation
[Figure: Building SVM-based classifier: Input Data -> Feature Representation -> SVM Training -> SVM-based Classifier, which is applied to Unseen Data]
Outline
Related work
article classification
protein classification using sequence information
Proposed method
protein classification using structure information
Common thread
vocabulary: a set of possible features
feature vector: counts the number of times each feature occurs
Article classification
Categorizing Reuters articles (Joachims, 98)
Feature representation of articles
vocabulary is the set of all English words
feature vector represents the count of each word in the article
Fat doses of red wine extract
help obese mice stay healthy
A daily glass of red wine was
linked to beneficial health
effects a decade ago. Long
suspected of playing a role in
the "French paradox" — a high-
fat diet with no ill effects on
longevity — resveratrol is found
in red wine, sadly in doses
about 300 times lower than in
the mouse study.
0 computer
2 dose
1 diet
0 felony
. . . . . . . .
2 health
0 insurance
0 liquor
2 mouse
. . . . . . . .
1 obese
1 paradox
3 red
3 wine
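The counts above can be reproduced with a small bag-of-words sketch. The vocabulary here is only the handful of words shown on the slide, not all English words. The `NORMALIZE` map is a hypothetical stand-in for whatever stemming the slide applied when it counted "doses" under "dose", "mice" under "mouse" and "healthy" under "health".

```python
import re
from collections import Counter

ARTICLE = """Fat doses of red wine extract help obese mice stay healthy
A daily glass of red wine was linked to beneficial health effects a decade ago.
Long suspected of playing a role in the "French paradox" - a high-fat diet with
no ill effects on longevity - resveratrol is found in red wine, sadly in doses
about 300 times lower than in the mouse study."""

# Only the words listed on the slide; a real vocabulary would be far larger.
VOCABULARY = ["computer", "dose", "diet", "felony", "health", "insurance",
              "liquor", "mouse", "obese", "paradox", "red", "wine"]

# Assumption: a tiny hand-written replacement for stemming/lemmatization.
NORMALIZE = {"doses": "dose", "mice": "mouse", "healthy": "health"}


def count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    tokens = [NORMALIZE.get(t, t) for t in re.findall(r"[a-z]+", text.lower())]
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vector = count_vector(ARTICLE, VOCABULARY)
# vector == [0, 2, 1, 0, 2, 0, 0, 2, 1, 1, 3, 3]
```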
Protein classification (sequence)
Categorizing proteins using sequence information (Leslie et al., 04)
Feature representation of proteins
vocabulary is all k-letter words from the amino acid alphabet
feature vector represents the count of each “word” in the protein
LVLHSEGWAKVQLVLHVWAKVE . . . . .
0 AAAA
0 AAAC
0 AAAD
0 AAAE
. . . . . . . .
2 LVLH
0 LVLI
0 LVLK
. . . . . . . .
0 WAKS
0 WAKT
2 WAKV
. . . . . . . .
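The same counting idea applied to the sequence fragment above, with k = 4, can be sketched as follows. The sketch counts exact k-letter matches only. The function names are hypothetical, and the 20-letter alphabet ordering is assumed to follow the alphabetical one-letter codes suggested by the AAAA, AAAC, AAAD, AAAE listing.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard one-letter codes


def spectrum_counts(sequence, k):
    """Count every length-k window of the sequence (sparse form)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_vector(sequence, k):
    """Dense feature vector with one entry per possible k-letter word."""
    counts = spectrum_counts(sequence, k)
    return [counts["".join(word)] for word in product(AMINO_ACIDS, repeat=k)]


fragment = "LVLHSEGWAKVQLVLHVWAKVE"
counts = spectrum_counts(fragment, 4)
# counts["LVLH"] == 2 and counts["WAKV"] == 2, as in the listing above
```

For k = 4 the dense vector has 20^4 = 160,000 entries, almost all zero, so the sparse `Counter` form is usually the more practical one to store.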
Protein classification (structure)
Categorizing proteins using structure information (Ilinkin, Ye, in progress)
Feature representation of proteins
vocabulary is all pairwise distances of k consecutive amino acids
feature vector represents the count of each “word” in the protein
D =
0 3.8 6.5 4.1 4.6 2.7 5.3 2.9 4.1
3.8 0 3.4 2.8 6.4 2.9 3.3 1.9 3.7
6.5 3.4 0 3.7 5.8 2.8 5.7 1.8 2.6
4.1 2.8 3.7 0 3.1 2.2 7.0 4.2 5.3
4.6 6.4 5.8 3.1 0 3.8 6.5 4.1 3.0
2.7 2.9 2.8 2.2 3.8 0 3.4 2.8 4.8
5.3 3.3 5.7 7.0 6.5 3.4 0 3.7 2.1
2.9 1.9 1.8 4.2 4.1 2.8 3.7 0 3.5
4.1 3.7 2.6 5.3 3.0 4.8 2.1 3.5 0
D(i, j) = distance between amino acids i and j
(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
(3.4, 2.8, 6.4, 3.7, 5.8, 3.1)
(3.6, 4.9, 4.8, 3.5, 2.1, 3.5)
(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
(3.1, 2.2, 7.0, 3.7, 4.3, 3.6)
(3.7, 5.8, 2.8, 3.1, 2.2, 3.7)
(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
(3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
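A structural "word" for k = 4 consecutive amino acids is the tuple of their 6 pairwise distances, read from the upper triangle of the corresponding k-by-k block of D. The sketch below extracts these words from the 9-by-9 matrix shown earlier; the first two windows reproduce the tuples (3.8, 6.5, 4.1, 3.4, 2.8, 3.7) and (3.4, 2.8, 6.4, 3.7, 5.8, 3.1). The slides do not say how real-valued distances are turned into countable words, so the `quantize` step and its `bin_width` are purely an assumption for illustration.

```python
from collections import Counter

# Distance matrix D from the slide: D[i][j] = distance between residues i, j.
D = [
    [0.0, 3.8, 6.5, 4.1, 4.6, 2.7, 5.3, 2.9, 4.1],
    [3.8, 0.0, 3.4, 2.8, 6.4, 2.9, 3.3, 1.9, 3.7],
    [6.5, 3.4, 0.0, 3.7, 5.8, 2.8, 5.7, 1.8, 2.6],
    [4.1, 2.8, 3.7, 0.0, 3.1, 2.2, 7.0, 4.2, 5.3],
    [4.6, 6.4, 5.8, 3.1, 0.0, 3.8, 6.5, 4.1, 3.0],
    [2.7, 2.9, 2.8, 2.2, 3.8, 0.0, 3.4, 2.8, 4.8],
    [5.3, 3.3, 5.7, 7.0, 6.5, 3.4, 0.0, 3.7, 2.1],
    [2.9, 1.9, 1.8, 4.2, 4.1, 2.8, 3.7, 0.0, 3.5],
    [4.1, 3.7, 2.6, 5.3, 3.0, 4.8, 2.1, 3.5, 0.0],
]


def distance_words(dist, k):
    """Pairwise distances of every k consecutive residues, row-major order."""
    words = []
    for start in range(len(dist) - k + 1):
        idx = range(start, start + k)
        words.append(tuple(dist[i][j] for i in idx for j in idx if i < j))
    return words


def quantize(word, bin_width=1.0):
    """Assumed discretization: map each distance to its nearest bin index."""
    return tuple(round(d / bin_width) for d in word)


words = distance_words(D, 4)
word_counts = Counter(quantize(w) for w in words)
```

Coarser bins make more windows collapse onto the same word, and finer bins keep them apart, so the bin width plays a role similar to vocabulary size in the text and sequence examples.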
Experimental setup
Given a query protein, can we predict its superfamily (in or out)?
Split the data into positive (in) and negative (out) examples
Reserve some of the data for testing; the rest is for training the SVM
[Figure: positive (+) and negative (-) examples divided into train and test sets; the train set goes through Feature Vectors and SVM Training to produce a Classifier, which is applied to the test set]
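The split can be sketched as below. The slides do not state how the test examples were chosen, so the random per-class split, the `test_fraction` value and the protein ids are assumptions made only for illustration.

```python
import random


def split_train_test(examples, labels, test_fraction=0.25, seed=0):
    """Reserve a fraction of each class for testing; the rest is for training.

    labels are +1 (in the superfamily) or -1 (out).
    Returns two lists of (example, label) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in (1, -1):
        group = [ex for ex, lab in zip(examples, labels) if lab == label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test += [(ex, label) for ex in group[:n_test]]
        train += [(ex, label) for ex in group[n_test:]]
    return train, test


# Hypothetical protein ids: 4 inside the superfamily, 8 outside.
proteins = [f"p{i}" for i in range(12)]
in_or_out = [1] * 4 + [-1] * 8
train_set, test_set = split_train_test(proteins, in_or_out)
```

Splitting each class separately keeps both positive and negative examples in the test set, which matters here because positives are much rarer than negatives.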
Results
ROC curve plots true positive rate vs false positive rate
Area under ROC curve (ROC score) is a measure of the quality of classification
area is between 0 and 1; closer to 1 is better
[Figure: Sample ROC Curve; x-axis: false positive, y-axis: true positive; shaded region: Area under ROC]
[Figure: Experimental Results; x-axis: ROC score (0.5 to 1), y-axis: number of families (0 to 140); series: structure, sequence]
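A minimal sketch of how a ROC score can be computed from classifier outputs: sort the test examples by decreasing score, sweep the threshold to trace (false positive rate, true positive rate) points, and integrate with the trapezoid rule. The scores and labels below are made up for illustration and are not results from the slides.

```python
def roc_points(scores, labels):
    """Trace the ROC curve; labels are +1 (positive) or -1 (negative)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Examples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def roc_score(scores, labels):
    """Area under the ROC curve by the trapezoid rule."""
    pts = roc_points(scores, labels)
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Made-up classifier outputs: one negative is ranked above one positive.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_labels = [1, 1, -1, 1, -1, -1]
auc = roc_score(example_scores, example_labels)  # 8/9, about 0.889
```

The value 8/9 is also the fraction of (positive, negative) pairs in which the positive example receives the higher score: 8 of the 9 pairs are ordered correctly.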