Protein Homology Detection Using String Alignment Kernels Jean-Philippe Vert, Tatsuya Akutsu

Post on 21-Dec-2015


Page 1

Protein Homology Detection Using String Alignment Kernels

Jean-Philippe Vert, Tatsuya Akutsu

Page 2

Problem: classification of protein sequence data into families and superfamilies

Motivation: Many proteins have been sequenced, but often structure/function remains unknown

Motivation: infer structure/function from sequence-based classification

Learning Sequence Based Protein Classification

Page 3

>1A3N:A HEMOGLOBIN

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA

VHASLDKFLASVSTVLTSKYR

>1A3N:B HEMOGLOBIN

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV

KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK

EFTPPVQAAYQKVVAGVANALAHKYH

>1A3N:C HEMOGLOBIN

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA

VHASLDKFLASVSTVLTSKYR

>1A3N:D HEMOGLOBIN

VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV

KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK

EFTPPVQAAYQKVVAGVANALAHKYH

Sequences for four chains of human hemoglobin

Tertiary Structure

Function: oxygen transport

Sequence Data Versus Structure and function

Page 4

SCOP: Structural Classification of Proteins

Interested in superfamily-level homology – remote evolutionary relationship

Difficult !!

Structural Hierarchy

Page 5

Reduce to binary classification problem: positive (+) if example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise

Focus on remote homology detection

Use a supervised learning approach to train a classifier

Labeled Training Sequences → Learning Algorithm → Classification Rule

Learning Problem

Page 6

Generative model approach
• Build a generative model for a single protein family; classify each candidate sequence based on its fit to the model
• Only uses positive training sequences

Discriminative approach
• Learning algorithm tries to learn a decision boundary between positive and negative examples
• Uses both positive and negative training sequences

Two supervised learning approaches to classification

Page 7

Class: Secondary Structure Prediction

Fold: Threading

Super Family: HMM, PSI-BLAST, SVM

Family: SW, BLAST, FASTA

Targets of the current methods

Page 8

Discriminative approach: train on both positive and negative examples to learn a classifier

Modern computational learning theory
• Goal: learn a classifier that generalizes well to new examples
• Do not use training data to estimate parameters of a probability distribution (“curse of dimensionality”)

Discriminative Learning

Page 9

Want to define a feature map from the space of protein sequences to a vector space

Goals:
• Computational efficiency
• Competitive performance with known methods
• No reliance on a generative model: a general method for sequence-based classification problems

SVM for protein classification

Page 10

Feature vector from HMM
• Fisher kernel (Jaakkola et al., 2000)
• Marginalized kernel (Tsuda et al., 2002)

Feature vector from sequence
• Spectrum kernel (Leslie et al., 2002)
• Mismatch kernel (Leslie et al., 2003)

Feature vector from other score
• SVM pairwise (Liao & Noble, 2002)

Summary of the current kernel methods
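As an illustration of the "feature vector from sequence" family, the spectrum kernel can be sketched as an inner product of k-mer count vectors. This is a minimal sketch only: the function names and the choice of k are assumptions made for illustration, not taken from the slides.

```python
from collections import Counter


def spectrum_features(seq: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x: str, y: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of x and y."""
    fx, fy = spectrum_features(x, k), spectrum_features(y, k)
    # A Counter returns 0 for k-mers it has never seen.
    return sum(count * fy[kmer] for kmer, count in fx.items())


# Hypothetical usage on two short fragments (k chosen arbitrarily).
print(spectrum_kernel("VLSPADKT", "VHLTPEEK", k=2))
```

Because the kernel is an explicit inner product, it is positive definite by construction, which is the property the SW score lacks.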

Page 11

Observation: the Smith-Waterman (SW) alignment score provides a measure of similarity that incorporates biological knowledge about protein evolution.

It cannot be used as a kernel because it is not positive definite.

A family of local alignment (LA) kernels that mimic the SW score is presented.

String Alignment Kernels

Page 12

Other kernels: choose a feature vector representation → get the kernel as the inner product of the vectors

LA kernel: measure similarity → get a valid kernel

LA Kernels

Page 13

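Pair score kernel $K_a^{(\beta)}(x,y)$:

$$K_a^{(\beta)}(x,y) = \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp(\beta\, s(x,y)) & \text{otherwise.} \end{cases}$$

Gap kernel $K_g^{(\beta)}(x,y)$ for the gap penalty model:

$$K_g^{(\beta)}(x,y) = \exp\left[\beta\left(g(|x|) + g(|y|)\right)\right],$$

where

$$g(0) = 0, \qquad g(n) = d + e(n-1) \text{ for } n \geq 1,$$

with $d$ the gap opening cost and $e$ the gap extension cost.

$\beta \geq 0$, and $s$ is a symmetric similarity score.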

LA Kernels
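The two building blocks defined on this slide can be written down directly. The sketch below is illustrative only: `toy_score` is a hypothetical stand-in for a real substitution matrix, and it assumes that d and e are supplied as non-positive numbers so that gaps lower the kernel value, a sign convention the slide does not state.

```python
import math


def toy_score(a: str, b: str) -> float:
    """Hypothetical symmetric similarity score (not a real substitution matrix)."""
    return 1.0 if a == b else -1.0


def pair_kernel(x: str, y: str, beta: float, score=toy_score) -> float:
    """K_a: non-zero only when both strings are single residues."""
    if len(x) != 1 or len(y) != 1:
        return 0.0
    return math.exp(beta * score(x, y))


def gap_score(n: int, d: float, e: float) -> float:
    """g(0) = 0 and g(n) = d + e(n - 1) for n >= 1."""
    return 0.0 if n == 0 else d + e * (n - 1)


def gap_kernel(x: str, y: str, beta: float, d: float, e: float) -> float:
    """K_g: depends only on the lengths of the two gapped segments."""
    return math.exp(beta * (gap_score(len(x), d, e) + gap_score(len(y), d, e)))


# Assumed convention: d = -2 and e = -1 are entered as negative scores.
print(pair_kernel("A", "A", beta=0.5), gap_kernel("AC", "", beta=0.5, d=-2.0, e=-1.0))
```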

Page 14

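Kernel convolution:

$$K_1 \star K_2(x,y) = \sum_{x = x_1 x_2,\; y = y_1 y_2} K_1(x_1,y_1)\, K_2(x_2,y_2)$$

For $n \geq 1$, the string kernel can be expressed as

$$K_{(n)}^{(\beta)}(x,y) = K_0 \star \left(K_a^{(\beta)} \star K_g^{(\beta)}\right)^{(n-1)} \star K_a^{(\beta)} \star K_0, \qquad \text{where } K_0 = 1.$$

It consists of an initial part $K_0$, a succession of $n$ aligned residues $K_a^{(\beta)}$ separated by $n-1$ possible gaps $K_g^{(\beta)}$, and a terminal part $K_0$.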

LA Kernels
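The convolution can be evaluated naively by enumerating every way of splitting x and y into a prefix and a suffix. The following sketch does exactly that and chains the factors to obtain the kernel for n aligned residues. It is exponential in n and meant only for tiny strings; the constants `BETA`, `D`, `E` and the toy score are assumptions chosen for illustration, with gap costs entered as negative numbers.

```python
import math
from functools import reduce

BETA, D, E = 1.0, -2.0, -1.0  # assumed toy parameters, gap costs as negative scores


def k0(x: str, y: str) -> float:
    """K_0 is the constant kernel equal to 1."""
    return 1.0


def ka(x: str, y: str) -> float:
    """Pair score kernel with a hypothetical +1/-1 similarity score."""
    if len(x) != 1 or len(y) != 1:
        return 0.0
    return math.exp(BETA * (1.0 if x == y else -1.0))


def kg(x: str, y: str) -> float:
    """Gap kernel with g(0) = 0 and g(n) = D + E(n - 1)."""
    g = lambda n: 0.0 if n == 0 else D + E * (n - 1)
    return math.exp(BETA * (g(len(x)) + g(len(y))))


def convolve(k1, k2):
    """Sum k1(prefix pair) * k2(suffix pair) over all split points of x and y."""
    def k(x: str, y: str) -> float:
        return sum(k1(x[:i], y[:j]) * k2(x[i:], y[j:])
                   for i in range(len(x) + 1) for j in range(len(y) + 1))
    return k


def k_n(n: int):
    """K_(n) = K_0 * (K_a * K_g)^(n-1) * K_a * K_0, for n >= 1."""
    factors = [k0] + [ka, kg] * (n - 1) + [ka, k0]
    return reduce(convolve, factors)


print(k_n(1)("ACB", "AB"), k_n(2)("ACB", "AB"))
```

With these toy parameters, the n = 2 term for "ACB" and "AB" collects three alignments: two without a gap and one with a single-residue gap in x.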

Page 15

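$$K_{LA}^{(\beta)}(x,y) = \sum_{i=0}^{\infty} K_{(i)}^{(\beta)}(x,y)$$

The sum converges for any x and y because only a finite number of terms are non-null. It is a point-wise limit of Mercer kernels.

Figure: an example alignment with gaps, in which each aligned residue pair contributes a factor $K_a^{(\beta)}$ and each gap a factor $K_g^{(\beta)}$.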

LA Kernels

Page 16

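$\pi$: a local alignment of $x$ and $y$

$p(x,y,\pi)$: score of the local alignment $\pi$ of $x$ and $y$

$\Pi(x,y)$: set of all possible local alignments of $x$ and $y$

$$SW(x,y) = \max_{\pi \in \Pi(x,y)} p(x,y,\pi) = \frac{1}{\beta} \ln\left( \max_{\pi \in \Pi(x,y)} \exp(\beta\, p(x,y,\pi)) \right)$$

$$K_{LA}^{(\beta)}(x,y) = \sum_{\pi \in \Pi(x,y)} \exp(\beta\, p(x,y,\pi))$$

$$\lim_{\beta \to +\infty} \frac{1}{\beta} \ln K_{LA}^{(\beta)}(x,y) = SW(x,y)$$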

LA with SW score
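One way to see why the limit holds is a sandwich argument, sketched here under the assumptions that $\beta > 0$ and that $\Pi(x,y)$ is finite and non-empty. A sum of non-negative terms is at least its largest term and at most the number of terms times the largest term:

```latex
\max_{\pi \in \Pi(x,y)} e^{\beta\, p(x,y,\pi)}
  \;\le\; K_{LA}^{(\beta)}(x,y)
  \;\le\; |\Pi(x,y)| \max_{\pi \in \Pi(x,y)} e^{\beta\, p(x,y,\pi)}
\quad\Longrightarrow\quad
SW(x,y) \;\le\; \frac{1}{\beta} \ln K_{LA}^{(\beta)}(x,y)
  \;\le\; SW(x,y) + \frac{\ln |\Pi(x,y)|}{\beta}
```

The extra term $\ln|\Pi(x,y)|/\beta$ vanishes as $\beta \to +\infty$, so the scaled logarithm of the LA kernel is squeezed onto the SW score.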

Page 17

1. SW keeps only the best alignment instead of summing over all alignments of x and y.

2. The logarithm can destroy positive definiteness.

Why SW cannot be a kernel

Page 18

Sequence x: HAWGEG

Sequence y: AGEHV

SW score: only the best local alignment is used

AWGE
A-GE

LA kernel: all local alignments contribute

π1: AWGE / A-GE, p(x,y,π1) = 0.003
π2: AWGE / AG-E, p(x,y,π2) = 0.001
π3: HAWGE / A-G-E, p(x,y,π3) = 0.0006
π4: HAWGE-G / A-GEHV-, p(x,y,π4) = 0.0001

Example

Page 19

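SVM-pairwise: each of the two sequences x and y is represented by a vector of SW scores, (0.2, 0.3, 0.1, 0.01) and (0.9, 0.05, 0.3, 0.2); the kernel value is their inner product, 0.227.

LA kernel: x and y are compared directly through a pair HMM, giving 0.253.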
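The SVM-pairwise value on this slide is simply the dot product of the two score vectors, which can be checked in a few lines. The variable names are hypothetical; the numbers are the ones shown on the slide.

```python
# SW-score feature vectors of the two sequences, as shown on the slide.
sw_scores_a = [0.2, 0.3, 0.1, 0.01]
sw_scores_b = [0.9, 0.05, 0.3, 0.2]


def inner_product(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


print(round(inner_product(sw_scores_a, sw_scores_b), 3))  # 0.227
```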

Page 20

K(x,x) is easily orders of magnitude larger than K(x,y) for similar sequences x and y, which biases the performance of the SVM.

Diagonal Dominance Issue

Page 21

Diagonal Dominance Issue

(1) The eigen kernel LA-eig:
a. Obtained by subtracting from the diagonal the smallest negative eigenvalue of the training Gram matrix, if there are negative eigenvalues.
b. LA-eig is equal to the uncorrected kernel except possibly on the diagonal.

(2) The empirical kernel map LA-ekm
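Both corrections can be sketched on a small symmetric matrix. The code below is an illustration under assumptions: it uses NumPy, the function names `la_eig` and `la_ekm` are hypothetical, the empirical kernel map is taken to represent each sequence by its row of kernel values against the training set, and `toy_gram` is a made-up matrix rather than real kernel values.

```python
import numpy as np


def la_eig(gram: np.ndarray) -> np.ndarray:
    """Shift the diagonal by the smallest eigenvalue when that eigenvalue is negative."""
    smallest = np.linalg.eigvalsh(gram).min()
    if smallest < 0:
        # Subtracting a negative number adds |smallest| to every diagonal entry.
        return gram - smallest * np.eye(gram.shape[0])
    return gram.copy()


def la_ekm(gram: np.ndarray) -> np.ndarray:
    """Empirical kernel map: inner products between rows of the Gram matrix."""
    return gram @ gram.T


# Made-up symmetric matrix with eigenvalues -1 and 3, so it is not positive semidefinite.
toy_gram = np.array([[1.0, 2.0], [2.0, 1.0]])
print(la_eig(toy_gram))
print(la_ekm(toy_gram))
```

The first correction leaves off-diagonal entries untouched, while the second changes every entry but is positive semidefinite by construction.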

Page 22

Implementation: computation of the kernel [and therefore of ] has complexity O(|x| · |y|), using dynamic programming with a slight modification of the SW algorithm.

Normalization

Dataset: 4352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies.

Methods
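A dynamic program with the stated O(|x| · |y|) cost can be sketched as follows. The slide does not show the recursion, so the five-table scheme below is an assumption about how the sum over all local alignments may be organized: M covers alignments ending in an aligned pair, X and Y cover alignments ending in a gap, and X2 and Y2 accumulate the trailing unaligned part. `toy_score` and the default parameters are hypothetical, and d and e are entered as negative scores.

```python
import math


def toy_score(a: str, b: str) -> float:
    """Hypothetical symmetric similarity score (not a real substitution matrix)."""
    return 1.0 if a == b else -1.0


def la_kernel(x: str, y: str, beta: float = 1.0, d: float = -2.0,
              e: float = -1.0, score=toy_score) -> float:
    """Sum of exp(beta * score) over all local alignments of x and y."""
    n, m = len(x), len(y)
    open_f, ext_f = math.exp(beta * d), math.exp(beta * e)
    table = lambda: [[0.0] * (m + 1) for _ in range(n + 1)]
    M, X, Y, X2, Y2 = table(), table(), table(), table(), table()
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Alignments whose last aligned pair is (x[i-1], y[j-1]).
            M[i][j] = math.exp(beta * score(x[i - 1], y[j - 1])) * (
                1.0 + X[i - 1][j - 1] + Y[i - 1][j - 1] + M[i - 1][j - 1])
            # Open or extend a gap that skips residues of x.
            X[i][j] = open_f * M[i - 1][j] + ext_f * X[i - 1][j]
            # Open or extend a gap that skips residues of y (may follow an x gap).
            Y[i][j] = open_f * (M[i][j - 1] + X[i][j - 1]) + ext_f * Y[i][j - 1]
            # Unpenalized trailing residues after the last aligned pair.
            X2[i][j] = M[i - 1][j] + X2[i - 1][j]
            Y2[i][j] = M[i][j - 1] + X2[i][j - 1] + Y2[i][j - 1]
    # The leading 1 accounts for the empty alignment.
    return 1.0 + X2[n][m] + Y2[n][m] + M[n][m]


print(la_kernel("HAWGEG", "AGEHV"))
```

For realistic sequence lengths the sums grow quickly, so a practical version would likely work in log space; that refinement is left out here to keep the recursion readable.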

Page 23

ROC Curve

Page 24

ROC Curve

Page 25

Summary for the kernels