11/9/99 ICTAI-99, Chicago 1

Protein Secondary Structure Prediction Using Data Mining Tool C5

Meiliu Lu†, Du Zhang†, Hongjun Xu†, Ken Tse-yau Lau‡, and Li Lu§

† Dept. of Computer Science, California State University

‡ Intel Corporation, Folsom CA

§ Sierra Systems Consultants Inc., Washington DC

Introduction

• Advancement of medical sciences depends critically on understanding of structures of proteins, the fundamental molecules for all living organisms.

• Proteins have different structures based upon their locations (intracellular, extracellular, membrane, cytosolic, nuclear) and functions (structural proteins, enzymes, antibodies, etc.).

• All protein molecules are polymers built up from 20 different amino acid residues linked end to end by peptide bonds.

Protein Structures

• Primary structure is the linear sequence of amino acids.

• Secondary structure is the spatial relationship of amino acid residues that are close to one another in the linear sequence.

• Tertiary structure is the spatial relationship of residues that are far apart in the linear sequence.

• Quaternary structure is the way multiple polypeptide chains (subunits) are packed together to form a protein complex.

The Secondary Structure

• The function of every protein depends on its tertiary (3D) structure.

• Secondary structure plays a pivotal role between the final 3D structure and the linear amino acid sequence of a protein.

• Determining a protein’s secondary structure from its primary one would greatly help us unlock its 3D structure.

Types of Secondary Structure

• α-helix: a rod-like structure.

• β-sheet: several regions of the polypeptide chain.

• turns: part where direction of the polypeptide chain is changed.

• coil: any part of the polypeptide chains not belonging to the above three.

Protein Structure Example 1: p21Ras

Protein Structure Example 2: MHC1

State-of-the-Art in Protein Secondary Structure Prediction

• Physical methods such as X-ray crystallography or nuclear magnetic resonance are slow and expensive.

• There are 3 broad groups of secondary structure prediction methods:

– empirical statistical methods, accuracy around 50%

– stereochemical criteria based methods, accuracy 50%

– machine learning based methods, accuracy up to 70-80%

The Challenge

• The slow experimental determination of 3D structure vs. the fast accumulation of amino acid sequence data.

• Different amino acid sequences may yield similar 3D structure.

• It is very difficult to predict the 3D structure of an unknown protein from its sequence.

Our Research Experiment

• To predict the secondary structure of an unknown protein, Spermidine/Spermine N1-Acetyltransferase (SSAT), a target of cancer chemotherapy.

• A machine learning tool called C5 (by J. Ross Quinlan), which is based on a decision tree learning method, is used for the prediction task.

Comparison of ML Tools

Prediction Considerations

• Use of functional similarity and sequence homology in selecting training proteins.

• Incorporation of amino acid hydrophobicity into the process.

• Choices of training set sizes and sequence attribute sizes.

Selections of Training Proteins

• A set (FS) of 23 known proteins that are functionally similar to SSAT is selected.

• A set (SH) of 32 known proteins that have sequence homology to SSAT is selected.

• A third set (MX) is constructed that consists of proteins from both FS and SH.

Incorporation of Hydrophobicity

• Hydrophobic character of each amino acid residue is incorporated into the prediction process.

• The levels considered in our experiments are: none (NH), residue-level (RH) and atomic-level (AH).

• Two methods are used in calculating the hydrophobicity values.
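
A minimal Python sketch of what a residue-level (RH) hydrophobicity attribute could look like. The slides do not name the two calculation methods, so the Kyte-Doolittle hydropathy scale and the function name `residue_hydrophobicity` are assumptions used purely for illustration; the atomic-level (AH) variant is not shown.

```python
# Kyte-Doolittle hydropathy index per amino acid (one-letter codes).
# Assumption: the slides do not say which scale or method was used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def residue_hydrophobicity(sequence):
    """Return one hydrophobicity value per residue of the sequence."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
```

Each value could then be attached to a case as an extra numeric attribute next to the residue letter itself.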

Decision Tree Based Learning

• Collect a large set of examples.

• Divide it into two disjoint sets: training set (TR) and test set (TT).

• Use the learning algorithm with TR to generate decision trees (if-then rules).

• Measure the percentage of examples in TT that are correctly classified by the trees (rules).

• Repeat the above steps for different sizes of TR and different randomly selected TR of each size.
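
The procedure above can be sketched in Python as follows. C5 itself is a separate tool and is not invoked here; as a clearly labeled stand-in, `train_stump` learns only a one-level decision tree over categorical attributes. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_stump(cases):
    """Stand-in learner (not C5): a one-level decision tree.

    Picks the attribute whose value -> majority-class rule makes the
    fewest errors on the training cases.
    """
    default = Counter(label for _, label in cases).most_common(1)[0][0]
    best = None
    for i in range(len(cases[0][0])):
        by_value = defaultdict(Counter)
        for attrs, label in cases:
            by_value[attrs[i]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[attrs[i]] != label for attrs, label in cases)
        if best is None or errors < best[0]:
            best = (errors, i, rule)
    _, index, rule = best
    # Unseen attribute values fall back to the overall majority class.
    return lambda attrs: rule.get(attrs[index], default)


def holdout_accuracy(cases, train_fraction, seed=0):
    """Split into disjoint TR and TT, train on TR, score on TT (percent).

    train_fraction must be below 1 so that TT is not empty.
    """
    shuffled = cases[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    tr, tt = shuffled[:cut], shuffled[cut:]
    classify = train_stump(tr)
    correct = sum(classify(attrs) == label for attrs, label in tt)
    return 100.0 * correct / len(tt)
```

Calling `holdout_accuracy` for several values of `train_fraction` and several seeds corresponds to the last step: different sizes of TR, and different random selections of each size.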

Training Sets and Test Sets

• Total number of cases for FS, SH and MX are 6288, 7165 and 13453, respectively.

• Selection of training set and test set:

– Category 1: equal sized training/test sets.

– Category 2: 20% of total cases for the test set, with a varying sized training set (25%, 50%, 75% and 100% of the remaining cases).

Training/Test Sets in Category 2

                             MX       FS      SH
Size of the test set         2691     1258    1433
Size of training set one     2701     1258    1433
Size of training set two     5401     2515    2866
Size of training set three   8101     3773    4299
Size of training set four    10762    5030    5732
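
As a hedged check of the Category 2 arithmetic, the Python sketch below derives the sizes from the stated case totals (6288 for FS, 7165 for SH). The round-half-up rule and the function names are assumptions. With them, the FS and SH columns of the table are reproduced; the first three MX training sizes on the slide differ slightly from these nominal values, so the exact rule the authors used remains unknown.

```python
def percent_of(n, percent):
    """Return percent% of n, rounded half up, using integer arithmetic."""
    return (n * percent + 50) // 100


def category2_sizes(total_cases):
    """Return (test_size, [training sizes at 25/50/75/100% of the rest])."""
    test_size = percent_of(total_cases, 20)
    remaining = total_cases - test_size
    return test_size, [percent_of(remaining, p) for p in (25, 50, 75, 100)]
```

For example, `category2_sizes(6288)` returns `(1258, [1258, 2515, 3773, 5030])`, which matches the FS column.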

Sequence Attribute Sizes

• The size of sequence attributes indicates how many neighboring amino acid residues are included in a C5 case.

• Eight different sizes are considered in our experiments: 5, 9, 13, 17, 21, 25, 29, and 33.
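
One plausible reading of a sequence attribute size is a window centered on the residue being classified, which fits the fact that all eight sizes are odd. The Python sketch below builds one case per residue under that assumption. The padding symbol, the label alphabet, and the name `make_cases` are hypothetical; the slides state only how many neighboring residues a case includes.

```python
def make_cases(sequence, structure, window, pad="-"):
    """Build one case per residue.

    Each case holds the residues in a window centered on position i
    (the attributes) and the secondary-structure label at position i.
    Positions beyond either end of the chain are filled with `pad`.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    half = window // 2
    padded = pad * half + sequence + pad * half
    return [
        (tuple(padded[i:i + window]), structure[i])
        for i in range(len(sequence))
    ]
```

With `window=5`, each case carries the central residue plus two neighbors on each side; with `window=33`, sixteen on each side.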

Results

• Six hundred runs are performed, each producing a decision tree as a classifier.

• Those runs are made with regard to the following factors:

– Different data sets (FS, SH, MX).

– Hydrophobicity attributes (NH, RH, AH).

– Hydrophobicity value calculating methods.

– Varying training set sizes and sequence attributes.
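
The slides do not break down the six hundred runs. The Python sketch below enumerates one factor grid that happens to give exactly 600 combinations; the grouping is a guess, not the authors' stated design. It assumes that NH needs no calculation method while RH and AH are each tried with both methods, and that the training settings are the single Category 1 split plus the four Category 2 sizes.

```python
from itertools import product

DATA_SETS = ("FS", "SH", "MX")

# Assumption: NH needs no calculation method; RH and AH each use both.
HYDROPHOBICITY = [("NH", None)] + [
    (level, method) for level in ("RH", "AH") for method in (1, 2)
]

# Assumption: one Category 1 split plus four Category 2 training sizes.
TRAINING_SETTINGS = ("cat1", "cat2-25%", "cat2-50%", "cat2-75%", "cat2-100%")

WINDOW_SIZES = (5, 9, 13, 17, 21, 25, 29, 33)

# Every combination of the four factors: 3 * 5 * 5 * 8 configurations.
runs = list(product(DATA_SETS, HYDROPHOBICITY, TRAINING_SETTINGS, WINDOW_SIZES))
```

Under these assumptions `len(runs)` is 600, one decision tree per configuration.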

Results (continued)

• Results obtained using training cases from SH are consistently better.

• Differences among the three data sets (FS, SH, MX) are significant.

• Hydrophobicity and the choice of its calculation method do not improve predictive accuracy.

• Error rate decreases as training set size increases.

• No significant difference among error rates of different sequence attribute sizes.

Average Error Percentage

                MX      FS      SH
Category one    41.4    60.7    23.8
Category two    42.2    61.4    25.7
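
For reference, the average error percentages in the table convert to accuracies as 100 minus the error. The short Python sketch below does that conversion; the dictionary layout and the function name are illustrative only.

```python
# Average error percentages from the table (data set -> category).
AVERAGE_ERROR = {
    "MX": {"one": 41.4, "two": 42.2},
    "FS": {"one": 60.7, "two": 61.4},
    "SH": {"one": 23.8, "two": 25.7},
}


def accuracy(data_set, category):
    """Accuracy in percent, i.e. 100 minus the average error."""
    return round(100.0 - AVERAGE_ERROR[data_set][category], 1)
```

For SH this gives 76.2% in Category one and 74.3% in Category two, roughly 75% in both.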

Predicted Secondary Structure of SSAT

Conclusions

• C5 can be used to predict protein secondary structure.

• The prediction accuracy depends critically on selection of training data.

• Training data selected by sequence homology give better results than data selected by functional similarity; adding hydrophobicity attributes does not help.

• The SH classifier achieves 75% accuracy.

Future Work

• Improve predictive accuracy by setting new data selection criteria.

• Develop on-line service for protein structure prediction.