Protein Homology Detection Using String Alignment Kernels
Jean-Philippe Vert, Tatsuya Akutsu
Problem: classification of protein sequence data into families and superfamilies
Motivation: many proteins have been sequenced, but their structure and function often remain unknown
Motivation: infer structure and function from sequence-based classification
Learning Sequence Based Protein Classification
>1A3N:A HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:B HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>1A3N:C HEMOGLOBIN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>1A3N:D HEMOGLOBIN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
Sequences for four chains of human hemoglobin
Tertiary Structure
Function: oxygen transport
Sequence Data Versus Structure and Function
SCOP: Structural Classification of Proteins
Interested in superfamily-level homology – remote evolutionary relationships. Difficult!
Structural Hierarchy
Reduce to a binary classification problem: positive (+) if an example belongs to a family (e.g. G proteins) or superfamily (e.g. nucleoside triphosphate hydrolases), negative (-) otherwise
Focus on remote homology detection: use a supervised learning approach to train a classifier
Labeled Training Sequences → Learning Algorithm → Classification Rule
Learning Problem
Generative model approach: build a generative model for a single protein family; classify each candidate sequence based on its fit to the model. Uses only positive training sequences.
Discriminative approach: the learning algorithm tries to learn a decision boundary between positive and negative examples. Uses both positive and negative training sequences.
Two supervised learning approaches to classification
Figure: the structural hierarchy (Class > Fold > Superfamily > Family) annotated with the methods targeting each level – HMM, PSI-BLAST, SVM; SW, BLAST, FASTA; threading; secondary structure prediction.
Targets of the current methods
Discriminative approach: train on both positive and negative examples to learn a classifier.
Modern computational learning theory – goal: learn a classifier that generalizes well to new examples; do not use the training data to estimate the parameters of a probability distribution ("curse of dimensionality").
Discriminative Learning
Want to define a feature map from the space of protein sequences to a vector space.
Goals: computational efficiency; performance competitive with known methods; no reliance on a generative model – a general method for sequence-based classification problems.
SVM for protein classification
Feature vector from HMM: Fisher kernel (Jaakkola et al., 2000); marginalized kernel (Tsuda et al., 2002)
Feature vector from sequence: spectrum kernel (Leslie et al., 2002); mismatch kernel (Leslie et al., 2003)
Feature vector from other scores: SVM-pairwise (Liao & Noble, 2002)
Summary of the current kernel methods
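To make the feature-vector idea concrete, here is a minimal sketch of a spectrum-style kernel: each sequence is mapped to its vector of k-mer counts and the kernel is the inner product of those vectors. The function names and the choice k = 3 are illustrative assumptions, not taken from the papers cited above.

```python
from collections import Counter

def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Feature map: a sequence becomes its vector of k-mer counts."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def spectrum_kernel(x: str, y: str, k: int = 3) -> int:
    """Kernel value = inner product of the two k-mer count vectors."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(cx[w] * cy[w] for w in cx if w in cy)

# Two fragments that share the k-mers 'WGK' and 'GKV', so the kernel value is 2.
print(spectrum_kernel("KAAWGKVGAH", "TALWGKVNVD"))
```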
Observation: the SW alignment score provides a measure of similarity that incorporates biological knowledge of protein evolution.
However, it cannot be used directly as a kernel because it lacks positive definiteness.
A family of local alignment (LA) kernels that mimic the SW score is presented.
String Alignment Kernels
Two routes to a kernel:
Other kernels: choose a feature vector representation, then obtain the kernel as the inner product of the vectors.
LA kernel: measure similarity directly, then check that it yields a valid kernel.
LA Kernels
Pair score kernel $K_a^\beta(x,y)$ and gap kernel $K_g^\beta(x,y)$ for the gap penalty model:

$$K_a^\beta(x,y) = \begin{cases} 0 & \text{if } |x| \neq 1 \text{ or } |y| \neq 1, \\ \exp\big(\beta\, s(x,y)\big) & \text{otherwise,} \end{cases}$$

$$K_g^\beta(x,y) = \exp\Big(\beta\,\big(g(|x|) + g(|y|)\big)\Big), \qquad g(0) = 0,\quad g(n) = d + e(n-1) \text{ for } n \geq 1,$$

where d is the gap opening cost, e is the gap extension cost, β ≥ 0, and s is a symmetric similarity (substitution) score.
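A minimal Python sketch of these two building blocks, following the formulas above; the parameter values, the toy substitution score, and the choice of negative d and e (so that gaps reduce the kernel value) are illustrative assumptions, not the paper's settings:

```python
import math

def k_a(x: str, y: str, s, beta: float) -> float:
    """Pair score kernel: zero unless both strings are single residues, else exp(beta * s(x, y))."""
    if len(x) != 1 or len(y) != 1:
        return 0.0
    return math.exp(beta * s(x, y))

def g(n: int, d: float, e: float) -> float:
    """Affine gap model: g(0) = 0, g(n) = d + e * (n - 1) for n >= 1."""
    return 0.0 if n == 0 else d + e * (n - 1)

def k_g(x: str, y: str, beta: float, d: float, e: float) -> float:
    """Gap kernel: exp(beta * (g(|x|) + g(|y|))), exactly as written above."""
    return math.exp(beta * (g(len(x), d, e) + g(len(y), d, e)))

# Toy substitution score (not BLOSUM): +2 for identical residues, -1 otherwise.
toy_s = lambda a, b: 2.0 if a == b else -1.0

print(k_a("A", "A", toy_s, beta=0.5))            # exp(0.5 * 2) ~ 2.72
print(k_g("GG", "", beta=0.5, d=-3.0, e=-1.0))   # gaps of length 2 and 0
```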
LA Kernels
Kernel convolution:

$$(K_1 \star K_2)(x,y) = \sum_{x_1 x_2 = x,\; y_1 y_2 = y} K_1(x_1, y_1)\, K_2(x_2, y_2)$$

For n ≥ 1, the string kernel can be expressed as

$$K_{(n)}^\beta(x,y) = K_0 \star \big(K_a^\beta \star K_g^\beta\big)^{\star (n-1)} \star K_a^\beta \star K_0, \qquad K_0 \equiv 1,$$

i.e. an initial part $K_0$, a succession of n aligned residues ($K_a^\beta$) separated by n-1 possible gaps ($K_g^\beta$), and a terminal part $K_0$.
LA Kernels
$$K_{LA}^\beta(x,y) = \sum_{i=0}^{\infty} K_{(i)}^\beta(x,y)$$

The sum is convergent for any x and y because only a finite number of terms are non-null; $K_{LA}^\beta$ is a point-wise limit of Mercer kernels.
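The construction can be checked literally on very short strings with a naive, exponential-time sketch that sums over all ways of splitting the sequences; everything below (function names, parameter values, negative gap costs) is an illustrative assumption rather than the paper's implementation:

```python
import math

# Toy parameters (assumptions, not the paper's settings).
BETA, D, E = 0.5, -3.0, -1.0
def s(a, b): return 2.0 if a == b else -1.0              # toy substitution score
def g(n):    return 0.0 if n == 0 else D + E * (n - 1)   # affine gap model

def k0(x, y): return 1.0                                                # constant kernel K_0
def ka(x, y): return math.exp(BETA * s(x, y)) if len(x) == 1 and len(y) == 1 else 0.0
def kg(x, y): return math.exp(BETA * (g(len(x)) + g(len(y))))

def conv(k1, k2):
    """Kernel convolution: sum over all splits x = x1 x2 and y = y1 y2."""
    def k(x, y):
        return sum(k1(x[:i], y[:j]) * k2(x[i:], y[j:])
                   for i in range(len(x) + 1) for j in range(len(y) + 1))
    return k

def k_n(n):
    """K_(n) = K_0 * (K_a * K_g)^(n-1) * K_a * K_0 for n >= 1, and K_(0) = K_0."""
    if n == 0:
        return k0
    k = k0
    for _ in range(n - 1):
        k = conv(conv(k, ka), kg)
    return conv(conv(k, ka), k0)

def k_la(x, y):
    """K_LA = sum of K_(n); terms vanish once n exceeds min(|x|, |y|)."""
    return sum(k_n(n)(x, y) for n in range(min(len(x), len(y)) + 1))

print(k_la("AWGE", "AGE"))
```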
Figure: an example alignment of two short sequences decomposed into its parts – aligned residue pairs contributing $K_a^\beta$ factors and gaps contributing $K_g^\beta$ factors.
LA Kernels
π: a local alignment; p(x,y,π): the score of local alignment π of x and y; Π(x,y): the set of all possible local alignments of x and y.

$$SW(x,y) = \max_{\pi \in \Pi(x,y)} p(x,y,\pi) = \frac{1}{\beta} \ln \max_{\pi \in \Pi(x,y)} \exp\big(\beta\, p(x,y,\pi)\big)$$

$$K_{LA}^\beta(x,y) = \sum_{\pi \in \Pi(x,y)} \exp\big(\beta\, p(x,y,\pi)\big)$$

$$\lim_{\beta \to \infty} \frac{1}{\beta} \ln K_{LA}^\beta(x,y) = SW(x,y)$$
LA with SW score
1. SW keeps only the best alignment instead of summing over all alignments of x and y.
2. The logarithm can destroy the property of being positive definite.
Why SW cannot be a kernel
Sequence x: HAWGEG. Sequence y: AGEHV.
Four local alignments π1–π4 (e.g. AWGE vs A-GE, AWGE vs AG-E) with p(x,y,π1) = 0.003, p(x,y,π2) = 0.001, p(x,y,π3) = 0.0006, p(x,y,π4) = 0.0001.
SW score: keeps only the best alignment. LA kernel: sums the contributions of all alignments.
Example
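A small numeric sketch of the point this example makes: the LA kernel sums exp(β·p) over alignments, and (1/β)·ln of that sum approaches the SW maximum as β grows. The alignment scores below are made-up illustrative values:

```python
import math

# Hypothetical scores p(x, y, pi) for a handful of local alignments (illustrative values).
scores = [2.0, 1.5, 1.0, 0.5]

sw = max(scores)  # the SW score keeps only the best alignment

for beta in (0.1, 1.0, 10.0, 100.0):
    k_la = sum(math.exp(beta * p) for p in scores)   # the LA kernel sums all alignments
    print(f"beta = {beta:6.1f}   (1/beta) ln K_LA = {math.log(k_la) / beta:.4f}   SW = {sw}")
```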
Figure: comparison of SVM-pairwise with the LA kernel. SVM-pairwise represents x and y by vectors of SW scores against a set of reference sequences (e.g. (0.9, 0.05, 0.3, 0.2) and (0.2, 0.3, 0.1, 0.01)) and takes the inner product of these vectors; the LA kernel compares x and y directly, in the manner of a pair HMM summing over alignments.
It is the fact that K(x,x) is easily orders of magnitude larger than K(x,y), even for similar sequences, that biases the performance of the SVM.
Diagonal Dominance Issue
(1) The eigen kernel LA-eig: if the training Gram matrix has negative eigenvalues, subtract the smallest negative eigenvalue from the diagonal. LA-eig is then equal to the original kernel except possibly on the diagonal.
(2) The empirical kernel map LA-ekm: represent each sequence by its vector of kernel values against the training sequences.
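A minimal numpy sketch of the LA-eig correction in (1), assuming a precomputed symmetric training Gram matrix; the function name and the toy matrix are mine:

```python
import numpy as np

def eig_correct(K: np.ndarray) -> np.ndarray:
    """Shift the diagonal of a symmetric Gram matrix to remove negative eigenvalues.

    If the smallest eigenvalue lambda_min is negative, subtract it from the diagonal
    (i.e. add |lambda_min| * I); the matrix is changed only on the diagonal.
    """
    lam_min = np.linalg.eigvalsh(K).min()
    return K - lam_min * np.eye(K.shape[0]) if lam_min < 0 else K

# Toy Gram matrix with a negative eigenvalue (so it is not a valid kernel matrix).
K = np.array([[1.0, 0.9, 0.9],
              [0.9, 1.0, -0.5],
              [0.9, -0.5, 1.0]])
print(np.linalg.eigvalsh(eig_correct(K)).min() >= -1e-12)  # True: now positive semi-definite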
Implementation: the computation of the kernel $K_{LA}^\beta(x,y)$ (and therefore of LA-eig) has a complexity of O(|x| · |y|), using dynamic programming via a slight modification of the SW algorithm.
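The slides do not spell out the recursion, so the following is only a sketch of one standard way to sum exp(β · score) over local alignments with SW-style dynamic programming. The state names (M, X, Y, X2, Y2), the sign convention (gap opening d and extension e are subtracted from the alignment score), and the toy parameters are all assumptions of mine, not the authors' code:

```python
import math

def la_kernel(x: str, y: str, s, beta: float, d: float, e: float) -> float:
    """Sum of exp(beta * score) over all local alignments of x and y, including the empty one.

    M[i][j]  : alignments whose last aligned pair is exactly (i, j).
    X[i][j]  : alignments currently inside a gap in x (gap cost already charged).
    Y[i][j]  : alignments currently inside a gap in y.
    X2, Y2   : alignments whose last aligned pair lies strictly before the current
               position (trailing residues are free, as in local alignment).
    """
    n, m = len(x), len(y)
    M  = [[0.0] * (m + 1) for _ in range(n + 1)]
    X  = [[0.0] * (m + 1) for _ in range(n + 1)]
    Y  = [[0.0] * (m + 1) for _ in range(n + 1)]
    X2 = [[0.0] * (m + 1) for _ in range(n + 1)]
    Y2 = [[0.0] * (m + 1) for _ in range(n + 1)]
    gap_open, gap_ext = math.exp(-beta * d), math.exp(-beta * e)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = math.exp(beta * s(x[i - 1], y[j - 1])) * (
                1.0 + M[i - 1][j - 1] + X[i - 1][j - 1] + Y[i - 1][j - 1])
            X[i][j] = gap_open * M[i - 1][j] + gap_ext * X[i - 1][j]
            Y[i][j] = gap_open * (M[i][j - 1] + X[i][j - 1]) + gap_ext * Y[i][j - 1]
            X2[i][j] = M[i - 1][j] + X2[i - 1][j]
            Y2[i][j] = M[i][j - 1] + X2[i][j - 1] + Y2[i][j - 1]
    return 1.0 + X2[n][m] + Y2[n][m] + M[n][m]

# Toy usage (illustrative substitution score and gap costs, not BLOSUM62 or the paper's settings).
toy_s = lambda a, b: 2.0 if a == b else -1.0
print(la_kernel("HAWGEG", "AGEHV", toy_s, beta=0.5, d=3.0, e=1.0))
```

Replacing the sums by maxima (and the products of exponentials by sums of scores) in the same recursion recovers the usual affine-gap SW dynamic programming, which is the sense in which this is a slight modification of SW.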
Normalization
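The slide does not reproduce the formula; the usual kernel normalization presumably meant here is:

$$\tilde{K}(x,y) = \frac{K(x,y)}{\sqrt{K(x,x)\,K(y,y)}}$$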
Dataset: 4352 sequences extracted from the Astral database (www.cs.columbia.edu/compbio/svmpairwise), grouped into families and superfamilies.
Methods
ROC Curve
Summary of the kernels