Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM) Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas Fall 2009


Page 1: Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)

Protein Local 3D Structure Prediction by Super Granule Support Vector Machines (Super GSVM)

Dr. Bernard Chen, Assistant Professor

Department of Computer Science, University of Central Arkansas

Fall 2009

Page 2

Goal of the Dissertation

The main purpose is to obtain and extract protein sequence motif information that is universally conserved across protein family boundaries.

This information is then used for protein local 3D structure prediction.

Page 3

Research Flow

Part 1: Bioinformatics Knowledge and Dataset Collection

Part 2: Discovering Protein Sequence Motifs

Part 3: Motif Information Extraction

Part 4: Protein Local Tertiary Structure Prediction

Page 4

Data set

Page 5

HSSP matrix: 1b25

Page 6

HSSP matrix: 1b25

Page 7

HSSP matrix: 1b25

Page 8

Representation of Segment

Sliding window size: 9

Each window corresponds to a sequence segment, which is represented by a 9 × 20 matrix plus the nine corresponding secondary structure labels obtained from DSSP.

More than 560,000 segments (413 MB) are generated by this method.

DSSP: provides the secondary structure information
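The segment representation above can be sketched as follows. This is a minimal illustration, not the author's code: it assumes the HSSP profile is already loaded as one row of 20 frequencies per residue and that the DSSP labels are given as a string with one symbol per residue. The function name `make_segments` and the toy inputs are hypothetical.

```python
from typing import List, Tuple

WINDOW = 9  # sliding window size stated on the slide


def make_segments(profile: List[List[float]], secondary: str,
                  window: int = WINDOW) -> List[Tuple[List[List[float]], str]]:
    """Cut a per-residue profile into overlapping windows.

    profile   -- one row of 20 frequencies per residue (an HSSP-style matrix)
    secondary -- one secondary structure symbol per residue (DSSP-style)
    """
    if len(profile) != len(secondary):
        raise ValueError("profile and secondary structure must have equal length")
    segments = []
    for start in range(len(profile) - window + 1):
        rows = profile[start:start + window]       # window x 20 matrix
        labels = secondary[start:start + window]   # window structure symbols
        segments.append((rows, labels))
    return segments


# Toy protein of 12 residues with a hypothetical uniform profile.
toy_profile = [[0.05] * 20 for _ in range(12)]
toy_secondary = "HHHHEEEECCCC"
toy_segments = make_segments(toy_profile, toy_secondary)
print(len(toy_segments))       # 12 - 9 + 1 windows
print(toy_segments[0][1])      # labels of the first window
```

A protein of length n yields n - 8 segments with this window size, which is how a few thousand proteins can produce several hundred thousand segments.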

Page 9

Research Flow

Part 1: Bioinformatics Knowledge and Dataset Collection

Part 2: Discovering Protein Sequence Motifs

Part 3: Motif Information Extraction

Part 4: Protein Local Tertiary Structure Prediction

Page 10

Granular Computing Model

Original dataset → Fuzzy C-Means Clustering → Information Granule 1 ... Information Granule M → New Improved or Greedy K-means Clustering (one run per granule) → Join Information → Final Sequence Motifs Information
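The first stage of the granular model, splitting the dataset into possibly overlapping information granules, can be sketched with the standard fuzzy c-means membership formula. This is a hedged illustration only: the cluster centers are taken as given rather than iterated, Euclidean distance is assumed, and the membership threshold of 0.3, the fuzzifier `m = 2.0`, and the names `memberships` and `split_into_granules` are assumptions that do not come from the slides.

```python
import math
from typing import List, Sequence


def memberships(point: Sequence[float], centers: Sequence[Sequence[float]],
                m: float = 2.0) -> List[float]:
    """Standard fuzzy c-means membership of one point in each center."""
    dists = [math.dist(point, c) for c in centers]
    hits = dists.count(0.0)
    if hits:  # the point coincides with at least one center
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def split_into_granules(points, centers, threshold=0.3, m=2.0):
    """Put each point index into every granule where its membership is high enough."""
    granules = [[] for _ in centers]
    for idx, point in enumerate(points):
        for g, u in enumerate(memberships(point, centers, m)):
            if u >= threshold:
                granules[g].append(idx)
    return granules


# Toy 2-D data: the middle point is equally close to both centers.
centers = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
granules = split_into_granules(points, centers)
print(granules)
```

In this toy run the middle point lands in both granules, which is the overlap that fuzzy clustering allows. Each granule would then be clustered separately by K-means before the results are joined.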

Page 11

Reduce Time-complexity

Wei’s method: 1285968 sec (15 days) × 6 = 7715568 sec (90 days)

Granular Model: 154899 sec (FCM execution time) + 231720 sec (2.7 days) × 6 = 1545219 sec (18 days)
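The arithmetic on this slide can be checked with a few lines. The sketch below uses only the figures reported on the slide; the variable names are hypothetical. Note that 6 × 1285968 is 7715808, which differs by 240 seconds from the reported total of 7715568, so the reported total is used as given.

```python
SECONDS_PER_DAY = 24 * 60 * 60
RUNS = 6  # number of runs stated on the slide

wei_total = 7715568        # total reported on the slide for Wei's method
fcm_time = 154899          # one-off fuzzy c-means execution time
kmeans_per_run = 231720    # K-means time per run in the granular model

granular_total = fcm_time + kmeans_per_run * RUNS


def to_days(seconds: int) -> float:
    """Convert seconds to days."""
    return seconds / SECONDS_PER_DAY


print(granular_total)                     # 1545219, matching the slide
print(round(to_days(granular_total), 1))  # roughly 18 days
print(round(wei_total / granular_total, 2))  # speed-up factor, close to 5
```

The comparison suggests that the granular model needs about one fifth of the time, because the fuzzy c-means split is paid once while each K-means run works on smaller granules.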

Page 12

Comparison of Quality Measures

Method            >60%     S.D.   >70%     S.D.   H-B Measure
Traditional       25.82%   0.93   10.44%   0.61   0.2543
Zhong-60-1020     31.46%   0.26   10.42%   0.59   0.2871
Zhong-61-985      31.71%   0.81   10.84%   0.07   0.2784
Zhong-62-900      31.04%   0.19   10.29%   0.64   0.2768
FCM-K-means       37.14%   1.46   12.99%   0.74   0.3589

FIK Model
FIK Model 0       40.15%   1.09   13.44%   0.49   0.3730
FIK Model 800     40.23%   0.45   13.37%   0.58   0.3717
FIK Model 1000    39.15%   0.39   13.27%   0.29   0.3665
FIK Model 1200    38.90%   0.43   12.89%   0.77   0.3697
FIK Model 1400    37.80%   0.80   12.59%   0.44   0.3655

FGK Model
FGK Model 200     42.45%   0.06   14.14%   0.02   0.3393
FGK Model 250     42.77%   0.07   14.06%   0.07   0.3443
FGK Model 300     41.08%   0.14   13.89%   0.02   0.3311
FGK Model 350     37.47%   0.51   13.49%   0.14   0.3489
FGK Model 400     37.62%   1.56   13.86%   1.29   0.3676

Best Selection    44.18%   0      15.02%   0      0.3664

Page 13

Research Flow

Part 1: Bioinformatics Knowledge and Dataset Collection

Part 2: Discovering Protein Sequence Motifs

Part 3: Motif Information Extraction

Part 4: Protein Local Tertiary Structure Prediction

Page 14

Super GSVM-FE Motivation

First, the information we try to generate is about sequence motifs, but the original input data are derived from whole protein sequences by a sliding window technique.

Second, fuzzy c-means clustering can assign one segment to more than one information granule.

Page 15

Super GSVM-FE

Original dataset → Fuzzy C-Means Clustering → Information Granule 1 ... Information Granule M → Greedy K-means Clustering (one run per granule)

Additional portion (Super GSVM-FE), per granule: For Each Cluster → Ranking SVM Feature Elimination → Collect Survived Segments → Five iterations of traditional K-means → For Each Cluster

Join Information → Final Sequence Motifs Information
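The "collect survived segments" step can be illustrated with a small sketch. A later slide states that the 20% lower-rank members of each cluster are eliminated after the Ranking SVM is trained. The sketch below assumes the ranking scores have already been produced by some model; it does not train an SVM, and the function name `keep_top_ranked` and the example scores are hypothetical.

```python
from typing import Dict, List


def keep_top_ranked(scores: Dict[str, float], drop_fraction: float = 0.2) -> List[str]:
    """Return member ids after dropping the lowest-scoring fraction.

    scores        -- hypothetical ranking score per cluster member (higher is better)
    drop_fraction -- share of members to eliminate (20% per the slides)
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # best first
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[:len(ranked) - n_drop]


# Hypothetical scores for five segments in one cluster.
cluster_scores = {"seg_a": 0.91, "seg_b": 0.15, "seg_c": 0.77,
                  "seg_d": 0.62, "seg_e": 0.48}
survivors = keep_top_ranked(cluster_scores)
print(survivors)
```

The surviving segments from all clusters would then be fed into the five iterations of traditional K-means shown in the diagram.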

Page 16

Extracted Motif Information

Page 17

Research Flow

Part 1: Bioinformatics Knowledge and Dataset Collection

Part 2: Discovering Protein Sequence Motifs

Part 3: Motif Information Extraction

Part 4: Protein Local Tertiary Structure Prediction

Page 18

3D information

3D information is generated from PDB (Protein Data Bank) files; the example shown is the 1a3c PDB file.

Page 19

3D information

3D information is generated from PDB (Protein Data Bank) files; the example shown is the 1a3c PDB file.

Page 20

Testing Data

The latest release of PISCES includes 4345 PDB files. Of these, 2419 PDB files are not part of the dataset used in our experiment.

Therefore, we regard our 2710 protein files as the training dataset and the 2419 protein files as the independent testing dataset.
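The construction of the independent testing dataset amounts to a set difference between the latest PISCES list and the training list. A minimal sketch, assuming both lists are available as collections of PDB identifiers; the identifiers below are made-up placeholders, not entries from the actual datasets.

```python
from typing import Iterable, Set, Tuple


def split_train_test(training_ids: Iterable[str],
                     latest_release_ids: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Keep the training set as is; test on release entries not seen in training."""
    training = set(training_ids)
    testing = set(latest_release_ids) - training
    return training, testing


# Placeholder identifiers for illustration only.
old_list = ["id01", "id02", "id03"]
new_release = ["id02", "id03", "id04", "id05"]
training, testing = split_train_test(old_list, new_release)
print(sorted(testing))
```

Because the two sets are disjoint by construction, no protein file used for training can appear in the testing dataset.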

Page 21

Testing Data

We convert the testing dataset using the segment representation approach introduced earlier; more than 490,000 segments are generated as the testing dataset.

Page 22

Super GSVM

Training: Training dataset → Fuzzy C-Means Clustering → Information Granule 1 ... Information Granule M → Greedy K-means Clustering (one run per granule) → For Each Cluster: train Ranking SVM and then eliminate the 20% lower-rank members → Five iterations of traditional K-means → Collect all extracted clusters and Ranking SVMs (All Sequence clusters, All Ranking SVMs)

Prediction: Independent testing dataset → Find the closest cluster within a given distance threshold → Feed to the belonging SVM → If the rank belongs to the cluster, predict the local 3D structure; if not, find the next closest cluster
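The prediction loop in this diagram can be sketched as below. This is an illustration under stated assumptions, not the author's implementation: segments are treated as flat numeric vectors, Euclidean distance is assumed, each cluster's Ranking SVM is replaced by a hypothetical `accepts` callable, and the class `ClusterModel` and function `predict` are invented names.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class ClusterModel:
    center: Sequence[float]                     # cluster centroid
    accepts: Callable[[Sequence[float]], bool]  # stand-in for the cluster's Ranking SVM
    structure: str                              # stand-in for the cluster's local 3D structure


def predict(segment: Sequence[float], models: List[ClusterModel],
            threshold: float) -> Optional[str]:
    """Try clusters from closest to farthest until one accepts the segment."""
    by_distance = sorted(models, key=lambda mdl: math.dist(segment, mdl.center))
    for mdl in by_distance:
        if math.dist(segment, mdl.center) > threshold:
            break  # every remaining cluster is farther than the threshold
        if mdl.accepts(segment):
            return mdl.structure
    return None  # no cluster within the threshold accepted the segment


# Toy models: the closest cluster rejects, the next closest accepts.
models = [
    ClusterModel((0.0, 0.0), lambda s: False, "structure A"),
    ClusterModel((1.0, 0.0), lambda s: True, "structure B"),
    ClusterModel((50.0, 50.0), lambda s: True, "structure C"),
]
near = predict((0.2, 0.0), models, threshold=5.0)
far = predict((100.0, 100.0), models, threshold=5.0)
print(near, far)
```

Segments for which the function returns `None` receive no prediction, so in this sketch the distance threshold trades off how many segments are covered against how strictly they must match a cluster.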

Page 23

Prediction Accuracy

Page 24

Prediction Coverage

Page 25

Future Works

Incorporate Chou-Fasman parameters into SVM training

Page 26

Future Works

For each cluster, we build a Decision Tree (DT) instead of an SVM model.

Training: Training dataset → Fuzzy C-Means Clustering → Information Granule 1 ... Information Granule M → Greedy K-means Clustering (one run per granule) → For Each Cluster: Build Decision Tree → Five iterations of traditional K-means → Collect all extracted clusters and Ranking-SVMs (All Sequence clusters)

Prediction (Test by DT): Independent testing dataset → Find the closest cluster within a given distance threshold → Feed to the belonging DT → If the rank belongs to the cluster, predict the local 3D structure; if not, find the next closest cluster