Study of Protein Prediction Related Problems
Ph.D. candidate
2013.10.16
Le-Yi WEI
Contents
1. Background
2. Methods
3. Experiments
Background
Definition of protein
>Example
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADELKKSADVRWHAERIINAVDDAVASMDDTEKMSMKLRNLSGKHAKSFQVDPEYFKVLAAVIADTVAAGDAGFEKLMSMI
20 different amino acids: A C D … … V W Y
Protein prediction related problems
• Protein structural class prediction
• Protein fold prediction
• Multi-functional enzyme prediction
• Protein remote homology detection
• Protein subcellular localization prediction
• Other protein-related problems, etc.
Common points
Treat the protein-related problems as classification tasks
The framework of a classification task:
Query protein sequence → Data presentation → Classification algorithms → Predicted results
Two major components
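A minimal sketch of the two components working together, feature extraction followed by a classifier. Plain amino-acid composition and a 1-nearest-neighbour rule are stand-ins chosen for brevity, not the methods used in this talk; the function names, the toy sequences and the labels `class_1` / `class_2` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq):
    """Component 1 (feature extraction): fraction of each amino acid."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def nearest_neighbor(train, x):
    """Component 2 (classification): label of the closest training vector."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(train, key=lambda item: squared_distance(item[0], x))[1]


def predict(query, labelled, extract=composition):
    """Query protein sequence -> feature vector -> predicted class."""
    train = [(extract(seq), label) for seq, label in labelled]
    return nearest_neighbor(train, extract(query))


# Hypothetical toy training set: (sequence, class label).
labelled = [("AAAAALLLLL", "class_1"), ("GGGGGSSSSS", "class_2")]
print(predict("AALLAALL", labelled))
```

Either component can be swapped independently, which is why the rest of the talk treats feature extraction and classification algorithms separately.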
Methods
Feature extraction methods
Primary sequence based
e.g. physicochemical features, n-gram, functional domain, PSSM profile (auto-covariance), etc.
Secondary structure based
e.g. secondary structure sequence based and probability matrix based
Sequence-structure based
e.g. triple sequence-structure features
Primary-sequence based
• n-gram model
Given a query protein sequence:
Compute
Obtain
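The formulas on this slide did not survive in the transcript. As a rough sketch, assuming the usual n-gram definition (the normalized frequency of every length-n substring over the 20-letter alphabet), the features could be computed as follows. The function name `ngram_features` and the normalization by the number of windows are assumptions.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def ngram_features(seq, n=2):
    """Return {n-gram: relative frequency} for all 20**n possible n-grams."""
    windows = len(seq) - n + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than n")
    # Count every overlapping window of length n in the query sequence.
    counts = Counter(seq[i:i + n] for i in range(windows))
    grams = ("".join(g) for g in product(AMINO_ACIDS, repeat=n))
    return {g: counts[g] / windows for g in grams}


features = ngram_features("ACACA", n=2)
print(len(features), features["AC"], features["CA"])
```

The vector has 20 entries for n = 1 and 400 for n = 2, so n is usually kept small.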
Primary-sequence based
• Functional domain
Figure: A query protein sequence → PSI-BLAST against the functional protein database (Database sequence 1, 2, 3, …, n-2, n-1, n) → feature vector with one 0/1 entry per database sequence (e.g. 0 1 0 1 0 0 …)
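A small sketch of how the binary functional-domain vector could be assembled once the PSI-BLAST search has been run. It assumes the hits are already available as a list of database identifiers; running PSI-BLAST itself is not shown, and the identifiers `d1` … `d6` and the function name are hypothetical.

```python
def functional_domain_vector(hit_ids, database_ids):
    """One entry per database sequence: 1 if PSI-BLAST reported a hit, else 0."""
    hits = set(hit_ids)
    return [1 if db_id in hits else 0 for db_id in database_ids]


# Hypothetical functional protein database with six entries.
database_ids = ["d1", "d2", "d3", "d4", "d5", "d6"]
# Hypothetical PSI-BLAST hits for the query protein sequence.
hit_ids = ["d2", "d4"]

print(functional_domain_vector(hit_ids, database_ids))  # [0, 1, 0, 1, 0, 0]
```

The dimension of the vector equals the number of sequences in the functional protein database.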
Primary-sequence based
• Evolution information
Figure: A query protein sequence → PSI-BLAST against the protein database → Position-Specific Score Matrix (PSSM)
Primary-sequence based
• AAC features (20-D features)
Compute
Obtain
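The formula for the AAC features is missing from the transcript. One common definition, assumed here, takes the mean of each of the 20 PSSM columns over all sequence positions, which gives the 20-D vector mentioned on the slide. The function name and the toy matrix are hypothetical.

```python
def pssm_aac(pssm):
    """Column means of an L x 20 PSSM (L = sequence length) -> 20-D vector."""
    length = len(pssm)
    if length == 0:
        raise ValueError("empty PSSM")
    n_cols = len(pssm[0])
    return [sum(row[j] for row in pssm) / length for j in range(n_cols)]


# Hypothetical toy PSSM: 4 positions x 20 columns, entry (i, j) = i + j.
toy_pssm = [[i + j for j in range(20)] for i in range(4)]
aac = pssm_aac(toy_pssm)
print(len(aac), aac[0], aac[19])
```

Because it averages over positions, the vector has the same length for every protein regardless of sequence length.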
Primary-sequence based
• Auto-covariance (AC) transformation (20*g-D features)
Compute
Obtain
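The AC formula is also missing from the transcript. The sketch below assumes the standard auto-covariance of each PSSM column: for a lag of 1 up to a maximum lag, it averages the product of mean-centred scores that are that many positions apart. With 20 columns this yields 20 values per lag, matching the 20*g-D size on the slide if g is read as the maximum lag (an assumption). The names `pssm_auto_covariance` and `max_lag` are hypothetical.

```python
def pssm_auto_covariance(pssm, max_lag):
    """AC features of an L x 20 PSSM: one value per (lag, column) pair."""
    length = len(pssm)
    if not 0 < max_lag < length:
        raise ValueError("max_lag must satisfy 0 < max_lag < sequence length")
    n_cols = len(pssm[0])
    # Mean score of each column over all positions.
    means = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - means[j]) * (pssm[i + lag][j] - means[j])
                for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Hypothetical toy PSSM: every column holds the scores 0, 1, 2, 3.
toy_pssm = [[i] * 20 for i in range(4)]
ac = pssm_auto_covariance(toy_pssm, max_lag=2)
print(len(ac))  # 40
```

Features are ordered lag by lag, so the first 20 values belong to lag 1 and the next 20 to lag 2.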
Primary-sequence based
• Consensus sequence
Figure labels: PSSM profile; Frequency profile; A query sequence; Consensus sequence
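The slide does not spell out the construction rule. A common choice, assumed here, is to replace each residue of the query by the amino acid with the highest value in the corresponding row of the profile (PSSM or frequency profile). The column order below is the one used in PSI-BLAST ASCII PSSM output; the function name and the toy profile are hypothetical.

```python
# Column order of the PSI-BLAST ASCII PSSM output.
PSSM_COLUMNS = "ARNDCQEGHILKMFPSTWYV"


def consensus_sequence(profile, alphabet=PSSM_COLUMNS):
    """Pick, for every position, the amino acid with the largest profile value.

    Ties are resolved in favour of the first column (an arbitrary choice).
    """
    residues = []
    for row in profile:
        best = max(range(len(row)), key=lambda j: row[j])
        residues.append(alphabet[best])
    return "".join(residues)


# Hypothetical 3-position frequency profile peaking at columns 0, 4 and 19.
toy_profile = [
    [1.0 if j == k else 0.0 for j in range(20)] for k in (0, 4, 19)
]
print(consensus_sequence(toy_profile))  # ACV
```

The consensus sequence has the same length as the query, so any primary-sequence feature can be extracted from it in the same way.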
Secondary structure based
• Secondary structure sequence
Example of a query protein sequence:
SLFEQLGGQAAVQAVTAQFYANIQADA
Secondary structure sequence predicted by PSI-PRED, which has three states, C (coil), H (helix), E (strand):
CCHEHEEEEECCCCHHHHHHEEEEECC
Secondary structure based
• Structure state confidence matrix
An example of a structure state confidence matrix:
Figure labels: A query protein sequence; Predicted structure sequence; Predicted confidence
Secondary structure based
• Global structural features
Compute Obtain
Structure state confidence matrix:
Secondary structure based
• Local structural features
Compute Obtain
Structure state confidence matrix:
Sequence-structure based
The framework of triple sequence-structure feature extraction method
Classification algorithms
Commonly used classification algorithms
e.g. Support Vector Machine (SVM), Random Forest (RF), SMO, Naive Bayes, etc.
Ensemble classification algorithms
e.g. Majority Vote, Average Probability, Selective Ensemble, etc.
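A brief sketch of the two simplest ensemble rules named above, majority vote and average probability, applied to the outputs of already trained base classifiers. The class labels `alpha` and `beta`, the function names and the toy outputs are hypothetical, and selective ensemble is not covered.

```python
from collections import Counter


def majority_vote(predicted_labels):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]


def average_probability(probabilities):
    """Average the per-class probabilities and return (best label, averages).

    `probabilities` holds one {label: probability} dict per base classifier;
    all dicts are assumed to share the same labels.
    """
    n = len(probabilities)
    averaged = {
        label: sum(p[label] for p in probabilities) / n
        for label in probabilities[0]
    }
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of three base classifiers for one query protein.
votes = ["alpha", "beta", "alpha"]
probs = [
    {"alpha": 0.6, "beta": 0.4},
    {"alpha": 0.2, "beta": 0.8},
]
print(majority_vote(votes))
print(average_probability(probs)[0])
```

The two rules can disagree: a vote counts every classifier equally, while averaging lets one very confident classifier outweigh several hesitant ones.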
Experiments
The framework of RF_PSCP
Web server: http://59.77.16.70:8080/RF_PSCP/Index.html
Datasets
Three benchmark datasets
Three updated large-scale datasets
Sequence similarity
• Protein structural class prediction
Results
Comparison with existing methods on three benchmark datasets
Results
Tests of the proposed method on three updated large-scale datasets
Results
Comparison with different combinations of feature subsets on three benchmark datasets
Results
Optimization of the Random Forest classifier
Q&A!