density-based classification of protein structures using iterative tm-score

Density-Based Classification of Protein Structures Using Iterative TM-score

David Hoksza, Jakub Galgonek

Charles University in PragueDepartment of Software Engineering

Czech Republic

CSBW 2009 2

Presentation Outline

Biological background

Similarity search in protein structure databases

Method feature vectors’ extraction feature mapping scoring

Experimental results

Conclusion

CSBW 2009 3

Biological Background Proteins

molecule translated from mRNA in ribosomes

DNA → RNA → protein sequence of amino acids (20 AAs) AAs coded by codons (triplet of nucleotides)

Function of a protein derived from its three dimensional structure → similar proteins show similar functions

Identifying protein structure → finding similar proteins → getting clue to the function

CSBW 2009 4

Similarity Search in Protein Databases

Similarity between a pair of proteins1. finding mapping (alignment) of atoms2. transformation to minimize mutual distance3. computing distance of the superposed structures

DALI, CE, TM-align, Vorometric, Vorolign, PPM, …

Classification SCOP (Structural Classification of Proteins) manually curated hierarchical classification

family → superfamily → fold → class

need for automated classification

CSBW 2009 5

Feature Extraction and Comparison

Features based on density (inter-residual distances) of atoms

distances among Cα atoms used each AA represents one feature → protein p consists of |

p| features various semantics used

based on clustering Cα atoms into rings

distance between pair of features Euclidean distance weighted Euclidean distance

CSBW 2009 6

Feature Extraction Semantics Features

n-dimensional vectors of real numbers

AA ≈ viewpoint → VPT (viewpoint tag)

sDens density of AAs in rings with

a predefined width sDensSSE

enhanced with SSE information

sRad widths of rings containing

predefined percentage of AAs

sRadSSE enhanced with SSE

information sDir

number of AAs in a ring pointing from the viepoint

sDens enhanced with direction information

CSBW 2009 7

Finding an Alignment

Smith-Waterman dynamic programming solution (local alignment)

originally for sequence alignment finding optimal local alignment

according to given distance (substitution) matrix

L0,j = 0, Li,0 = 0

L … DP matrix s … distance matrix σ … gap penalty

max(Li,j) … value of optimal alignment

0

],[max

1,1

,1

1,

,jiji

ji

ji

ji basL

L

L

L

Modifications Using modified distances of features in

the substitution matrix

μ … mean of s[i,j]

Structure-specific gap costs

OGP … open gap penalty EGP … extend gap

penalty

4

3],[],[

],[],[

])[],[(],[21

jisjis

jis

cjis

jVPTiVPTdjis pp

2

OGPEGP

OGP

CSBW 2009 8

Superposition + Scoring Scoring

Superposition achieved by iterative use of TM-score rotation algorithm

Algorithm tries to find a subset of alignment whose RMSD superposition maximizes the TM-score

Various initial subsets are evaluated lenghts … L, L/2, max(L/25,4)

For each, a superposition is created and a new alignment consisting of spatially close residues is created

Iteration continues until stabilization or max. number of iteration is achieved

Superposition is used for reconstruction of the alignment achieved by dynamic programming (iteratively)

AL

iiQ

ddL

TMscore1

2

0

1

111

2

0

1

1],[

d

djis

ij

Superposition Improvements Prefiltration using non-

iterative FAST* heuristics top kNN filtering range filtering

Belt-based restriction of dynamic programming when reconstructing original alignment Sticking more closely to

the original alignment

CSBW 2009 9

Using only single initial alignment of length L

For similar proteins (which are important for classification) the deviation from original algorithm is small

CSBW 2009 10

Experimental Results ASTRAL compendium

Test set ASTRAL 1.67 only proteins not present in ASTRAL95 1.65 having < 30% sequence identity and <30

identical residues 979 proteins

Database set ASTRAL25 1.65 4357 proteins

Evaluation of

classification accuracy according to SCOP classification of a query is based on the classification of the most similar

protein classification is correct when it is in agreement with SCOP classification

speed

CSBW 2009 11

Comparison of Classification Methods

Speed Evaluation

CSBW 2009 12

CSBW 2009 13

Conclusion

We have proposed Alignment of protein structures

distance and density of Cα atoms alignment based on feature distances

Improved version of superposing algorithm (TM) Accuracy Speed

Experimental results best results among methods according to SCOP superfamily and

fold accuracy 95.8% superfamily classification accuracy 98.1% fold classification accuracy

density-based classification of protein structures using iterative tm-score

Documents

feature protein p

protein databasessimilarity

subset of alignment

new alignment

iterative use of tm

gap penaltycsbw

rmsd superposition

iterative tmscoredavid