Protein Homology Detection Using String Alignment Kernels
Jean-Philippe Vert, Tatsuya Akutsu
Problem: classification of protein sequence data into families and superfamilies
Motivation: many proteins have been sequenced, but their structure and function often remain unknown
Motivation: infer structure and function from sequence-based classification
Learning Sequence Based Protein Classification
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
Sequences for four chains of human hemoglobin
Tertiary Structure
Function: oxygen transport
Sequence Data Versus Structure and Function
SCOP: Structural Classification of Proteins
Interested in superfamily-level homology – remote evolutionary relationships. Difficult!
Structural Hierarchy
Reduce to a binary classification problem: positive (+) if an example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise
Focus on remote homology detection: use a supervised learning approach to train a classifier
Labeled Training Sequences → Learning Algorithm → Classification Rule
Learning Problem
Generative model approach: build a generative model for a single protein family; classify each candidate sequence based on its fit to the model. Uses only positive training sequences.
Discriminative approach: the learning algorithm tries to learn a decision boundary between positive and negative examples. Uses both positive and negative training sequences.
Two supervised learning approaches to classification
Figure: the structural hierarchy (Class > Fold > Superfamily > Family) annotated with the methods targeting each level – HMM, PSI-BLAST, SVM; SW, BLAST, FASTA; threading; secondary structure prediction.
Targets of the current methods
Discriminative approach: train on both positive and negative examples to learn a classifier.
Modern computational learning theory – goal: learn a classifier that generalizes well to new examples; do not use the training data to estimate the parameters of a probability distribution ("curse of dimensionality").
Discriminative Learning
Want to define a feature map from the space of protein sequences to a vector space.
Goals: computational efficiency; performance competitive with known methods; no reliance on a generative model – a general method for sequence-based classification problems.
SVM for protein classification
Feature vector from HMM: Fisher kernel (Jaakkola et al., 2000); marginalized kernel (Tsuda et al., 2002)
Feature vector from sequence: spectrum kernel (Leslie et al., 2002); mismatch kernel (Leslie et al., 2003)
Feature vector from other scores: SVM-pairwise (Liao & Noble, 2002)
Summary of the current kernel methods
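To make the feature-vector idea concrete, here is a minimal sketch of a spectrum-style kernel: each sequence is mapped to its vector of k-mer counts and the kernel is the inner product of those vectors. The function names and the choice k = 3 are illustrative assumptions, not taken from the papers cited above.

```python
from collections import Counter

def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Feature map: a sequence becomes its vector of k-mer counts."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x: str, y: str, k: int = 3) -> int:
    """Kernel value = inner product of the two k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[w] * cy[w] for w in cx if w in cy)

# Two fragments that share the k-mers 'WGK' and 'GKV', so the kernel value is 2.
print(spectrum_kernel("KAAWGKVGAH", "TALWGKVNVD"))
```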
Observation: the SW alignment score provides a measure of similarity that incorporates biological knowledge of protein evolution.
However, it cannot be used directly as a kernel because it lacks positive definiteness.
A family of local alignment (LA) kernels that mimic the SW score is presented.
String Alignment Kernels
Two routes to a kernel:
Other kernels: choose a feature vector representation, then obtain the kernel as the inner product of the vectors.
LA kernel: measure similarity directly, then check that it yields a valid kernel.
LA Kernels
Pair score kernel $K_a^\beta(x,y)$ and gap kernel $K_g^\beta(x,y)$ for the gap penalty model:

$$K_a^\beta(x,y) = \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp\big(\beta\, s(x,y)\big) & \text{otherwise,} \end{cases}$$

$$K_g^\beta(x,y) = \exp\Big(\beta\,\big(g(|x|) + g(|y|)\big)\Big), \qquad g(0) = 0,\quad g(n) = d + e(n-1) \text{ for } n \geq 1,$$

where d is the gap opening cost, e is the gap extension cost, β ≥ 0, and s is a symmetric similarity (substitution) score.
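A minimal Python sketch of these two building blocks, following the formulas above; the parameter values, the toy substitution score, and the choice of negative d and e (so that gaps reduce the kernel value) are illustrative assumptions, not the paper's settings:

```python
import math

def k_a(x: str, y: str, s, beta: float) -> float:
    """Pair score kernel: zero unless both strings are single residues, else exp(beta * s(x, y))."""
    if len(x) != 1 or len(y) != 1:
        return 0.0
    return math.exp(beta * s(x, y))

def g(n: int, d: float, e: float) -> float:
    """Affine gap model: g(0) = 0, g(n) = d + e * (n - 1) for n >= 1."""
    return 0.0 if n == 0 else d + e * (n - 1)

def k_g(x: str, y: str, beta: float, d: float, e: float) -> float:
    """Gap kernel: exp(beta * (g(|x|) + g(|y|))), exactly as written above."""
    return math.exp(beta * (g(len(x), d, e) + g(len(y), d, e)))

# Toy substitution score (not BLOSUM): +2 for identical residues, -1 otherwise.
toy_s = lambda a, b: 2.0 if a == b else -1.0

print(k_a("A", "A", toy_s, beta=0.5))            # exp(0.5 * 2) ~ 2.72
print(k_g("GG", "", beta=0.5, d=-3.0, e=-1.0))   # gaps of length 2 and 0
```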
LA Kernels
Kernel convolution:

$$(K_1 \star K_2)(x,y) = \sum_{x_1 x_2 = x,\; y_1 y_2 = y} K_1(x_1, y_1)\, K_2(x_2, y_2)$$

For n ≥ 1, the string kernel can be expressed as

$$K_{(n)}^\beta(x,y) = K_0 \star \big(K_a^\beta \star K_g^\beta\big)^{\star (n-1)} \star K_a^\beta \star K_0, \qquad K_0 \equiv 1,$$

i.e. an initial part $K_0$, a succession of n aligned residues ($K_a^\beta$) separated by n-1 possible gaps ($K_g^\beta$), and a terminal part $K_0$.
LA Kernels
$$K_{LA}^\beta(x,y) = \sum_{i=0}^{\infty} K_{(i)}^\beta(x,y)$$

The sum is convergent for any x and y because only a finite number of terms are non-null; $K_{LA}^\beta$ is a point-wise limit of Mercer kernels.
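The construction can be checked literally on very short strings with a naive, exponential-time sketch that sums over all ways of splitting the sequences; everything below (function names, parameter values, negative gap costs) is an illustrative assumption rather than the paper's implementation:

```python
import math

# Toy parameters (assumptions, not the paper's settings).
BETA, D, E = 0.5, -3.0, -1.0
def s(a, b): return 2.0 if a == b else -1.0              # toy substitution score
def g(n):    return 0.0 if n == 0 else D + E * (n - 1)   # affine gap model

def k0(x, y): return 1.0                                                # constant kernel K_0
def ka(x, y): return math.exp(BETA * s(x, y)) if len(x) == 1 and len(y) == 1 else 0.0
def kg(x, y): return math.exp(BETA * (g(len(x)) + g(len(y))))

def conv(k1, k2):
    """Kernel convolution: sum over all splits x = x1 x2 and y = y1 y2."""
    def k(x, y):
        return sum(k1(x[:i], y[:j]) * k2(x[i:], y[j:])
                   for i in range(len(x) + 1) for j in range(len(y) + 1))
    return k

def k_n(n):
    """K_(n) = K_0 * (K_a * K_g)^(n-1) * K_a * K_0 for n >= 1, and K_(0) = K_0."""
    if n == 0:
        return k0
    k = k0
    for _ in range(n - 1):
        k = conv(conv(k, ka), kg)
    return conv(conv(k, ka), k0)

def k_la(x, y):
    """K_LA = sum of K_(n); terms vanish once n exceeds min(|x|, |y|)."""
    return sum(k_n(n)(x, y) for n in range(min(len(x), len(y)) + 1))

print(k_la("AWGE", "AGE"))
```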
Figure: an example alignment of two short sequences decomposed into its parts – aligned residue pairs contributing $K_a^\beta$ factors and gaps contributing $K_g^\beta$ factors.
LA Kernels
π: a local alignment; p(x,y,π): the score of local alignment π of x and y; Π(x,y): the set of all possible local alignments of x and y.

$$SW(x,y) = \max_{\pi \in \Pi(x,y)} p(x,y,\pi) = \frac{1}{\beta} \ln \max_{\pi \in \Pi(x,y)} \exp\big(\beta\, p(x,y,\pi)\big)$$

$$K_{LA}^\beta(x,y) = \sum_{\pi \in \Pi(x,y)} \exp\big(\beta\, p(x,y,\pi)\big)$$

$$\lim_{\beta \to \infty} \frac{1}{\beta} \ln K_{LA}^\beta(x,y) = SW(x,y)$$
LA with SW score
1. SW keeps only the best alignment instead of summing over all alignments of x and y.
2. The logarithm can destroy the property of being positive definite.
Why SW cannot be a kernel
Sequence x: HAWGEG. Sequence y: AGEHV.
Four local alignments π1–π4 (e.g. AWGE vs A-GE, AWGE vs AG-E) with p(x,y,π1) = 0.003, p(x,y,π2) = 0.001, p(x,y,π3) = 0.0006, p(x,y,π4) = 0.0001.
SW score: keeps only the best alignment. LA kernel: sums the contributions of all alignments.
Example
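A small numeric sketch of the point this example makes: the LA kernel sums exp(β·p) over alignments, and (1/β)·ln of that sum approaches the SW maximum as β grows. The alignment scores below are made-up illustrative values:

```python
import math

# Hypothetical scores p(x, y, pi) for a handful of local alignments (illustrative values).
scores = [2.0, 1.5, 1.0, 0.5]

sw = max(scores)  # the SW score keeps only the best alignment

for beta in (0.1, 1.0, 10.0, 100.0):
    k_la = sum(math.exp(beta * p) for p in scores)   # the LA kernel sums all alignments
    print(f"beta = {beta:6.1f}   (1/beta) ln K_LA = {math.log(k_la) / beta:.4f}   SW = {sw}")
```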
Figure: comparison of SVM-pairwise with the LA kernel. SVM-pairwise represents x and y by vectors of SW scores against a set of reference sequences (e.g. (0.9, 0.05, 0.3, 0.2) and (0.2, 0.3, 0.1, 0.01)) and takes the inner product of these vectors; the LA kernel compares x and y directly, in the manner of a pair HMM summing over alignments.
It is the fact that K(x,x) is easily orders of magnitude larger than K(x,y), even for similar sequences, that biases the performance of the SVM.
Diagonal Dominance Issue
(1) The eigen kernel LA-eig: if the training Gram matrix has negative eigenvalues, subtract the smallest negative eigenvalue from the diagonal. LA-eig is then equal to the original kernel except possibly on the diagonal.
(2) The empirical kernel map LA-ekm: represent each sequence by its vector of kernel values against the training sequences.
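A minimal numpy sketch of the LA-eig correction in (1), assuming a precomputed symmetric training Gram matrix; the function name and the toy matrix are mine:

```python
import numpy as np

def eig_correct(K: np.ndarray) -> np.ndarray:
    """Shift the diagonal of a symmetric Gram matrix to remove negative eigenvalues.

    If the smallest eigenvalue lambda_min is negative, subtract it from the diagonal
    (i.e. add |lambda_min| * I); the matrix is changed only on the diagonal.
    """
    lam_min = np.linalg.eigvalsh(K).min()
    return K - lam_min * np.eye(K.shape[0]) if lam_min < 0 else K

# Toy Gram matrix with a negative eigenvalue (so it is not a valid kernel matrix).
K = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, -0.5],
              [0.9, -0.5, 1.0]])
print(np.linalg.eigvalsh(eig_correct(K)).min() >= -1e-12)  # True: now positive semi-definite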
Implementation: the computation of the kernel $K_{LA}^\beta(x,y)$ (and therefore of LA-eig) has a complexity of O(|x| · |y|), using dynamic programming via a slight modification of the SW algorithm.
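The slides do not spell out the recursion, so the following is only a sketch of one standard way to sum exp(β · score) over local alignments with SW-style dynamic programming. The state names (M, X, Y, X2, Y2), the sign convention (gap opening d and extension e are subtracted from the alignment score), and the toy parameters are all assumptions of mine, not the authors' code:

```python
import math

def la_kernel(x: str, y: str, s, beta: float, d: float, e: float) -> float:
    """Sum of exp(beta * score) over all local alignments of x and y, including the empty one.

    M[i][j]  : alignments whose last aligned pair is exactly (i, j).
    X[i][j]  : alignments currently inside a gap in x (gap cost already charged).
    Y[i][j]  : alignments currently inside a gap in y.
    X2, Y2   : alignments whose last aligned pair lies strictly before the current
               position (trailing residues are free, as in local alignment).
    """
    n, m = len(x), len(y)
    M  = [[0.0] * (m + 1) for _ in range(n + 1)]
    X  = [[0.0] * (m + 1) for _ in range(n + 1)]
    Y  = [[0.0] * (m + 1) for _ in range(n + 1)]
    X2 = [[0.0] * (m + 1) for _ in range(n + 1)]
    Y2 = [[0.0] * (m + 1) for _ in range(n + 1)]
    gap_open, gap_ext = math.exp(-beta * d), math.exp(-beta * e)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = math.exp(beta * s(x[i - 1], y[j - 1])) * (
                1.0 + M[i - 1][j - 1] + X[i - 1][j - 1] + Y[i - 1][j - 1])
            X[i][j] = gap_open * M[i - 1][j] + gap_ext * X[i - 1][j]
            Y[i][j] = gap_open * (M[i][j - 1] + X[i][j - 1]) + gap_ext * Y[i][j - 1]
            X2[i][j] = M[i - 1][j] + X2[i - 1][j]
            Y2[i][j] = M[i][j - 1] + X2[i][j - 1] + Y2[i][j - 1]
    return 1.0 + X2[n][m] + Y2[n][m] + M[n][m]

# Toy usage (illustrative substitution score and gap costs, not BLOSUM62 or the paper's settings).
toy_s = lambda a, b: 2.0 if a == b else -1.0
print(la_kernel("HAWGEG", "AGEHV", toy_s, beta=0.5, d=3.0, e=1.0))
```

Replacing the sums by maxima (and the products of exponentials by sums of scores) in the same recursion recovers the usual affine-gap SW dynamic programming, which is the sense in which this is a slight modification of SW.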
Normalization
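The slide does not reproduce the formula; the usual kernel normalization presumably meant here is:

$$\tilde{K}(x,y) = \frac{K(x,y)}{\sqrt{K(x,x)\,K(y,y)}}$$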
Dataset: 4352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies.
Methods
ROC Curve
Summary of the kernels