Biomarker and Classifier Selection in Diverse Genetic Datasets
DESCRIPTION
Biomarker and Classifier Selection in Diverse Genetic Datasets. University of Connecticut: (1) Department of Computer Science and Engineering, (2) Department of Molecular and Cell Biology. James Lindsay (1), Ed Hemphill (2), Chih Lee (1), Ion Mandoiu (1), Craig Nelson (2).
TRANSCRIPT
![Page 1: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/1.jpg)
Biomarker and Classifier Selection in Diverse Genetic Datasets
James Lindsay (1), Ed Hemphill (2), Chih Lee (1), Ion Mandoiu (1), Craig Nelson (2)
University of Connecticut
(1) Department of Computer Science and Engineering
(2) Department of Molecular and Cell Biology
![Page 2: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/2.jpg)
Motivation 1: Cell-type Identification
• The question: what is the smallest number of genes needed to identify each cluster?
  • B: Bone
  • C: Myeloid
  • D: Endothelial
• Available data: literature-annotated present/absent calls for 50 cell types and 600 genes in the mesoderm lineage.
In collaboration with: Dr. Hector Leonardo Aguila, UCHC
![Page 3: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/3.jpg)
Motivation 2: Clinical Diagnostics
• Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis, PLoS One 2012

| Study | # genes | Sensitivity (%) | Specificity (%) |
|---|---|---|---|
| Lequerre | 20 | 71 | 61 |
| Stuhlmuller | 11 | 79 | 56 |
| Stuhlmuller | 82 | 67 | 56 |
| Lequerre | 8 | 71 | 28 |
| Sekiguchi | 18 | 71 | 28 |
| Julia | 8 | 92 | 17 |
| Stuhlmuller | 3 | 71 | 17 |
| Tanio | 8 | 67 | 33 |
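The sensitivity and specificity columns follow the usual definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). A minimal sketch of that arithmetic is below; the labels are hypothetical (1 = responder, 0 = non-responder is an assumption, not taken from the study) and the function name is illustrative only.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) as fractions for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # fraction of true positives recovered
    specificity = tn / (tn + fp)  # fraction of true negatives recovered
    return sensitivity, specificity


# Hypothetical example data, not from the validation study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.0%}, specificity={spec:.0%}")
```

A signature can score well on one measure and poorly on the other, which is the pattern visible in several rows of the table.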
![Page 4: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/4.jpg)
Multi-class Classification Problem
Multi-class Classification
• There are 2 or more classes
• Supervised learning
Key Problems:
1. Feature Selection: What are the most predictive biomarkers?
2. Classification: What is the best classification algorithm?
![Page 5: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/5.jpg)
Challenges
• Different types of data
  • Gene expression
  • Epigenetic data
    • Methylation
    • Histone modification
  • Proteomics
  • Metabolomics
  • Phenotypes
• Different Platforms
  • Microarray
  • Sequencing
  • In-situ hybridization
• Different Resolutions
  • Discrete vs Continuous
  • Sparse vs Complete
![Page 6: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/6.jpg)
Minimal Unique Marker Panel Selection (Mumps) Pipeline
• Input: # of biomarkers
• Parameterize each combination of feature selection and classification algorithms
  • Feature Selection
  • Classification
• Nested Cross Validation
  • Inner Cross-validation
  • Rank Models by AUC
  • Outer Cross-validation
• Output: the best features and classifier
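The nested cross-validation loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the Mumps implementation: it assumes scikit-learn, uses synthetic data from `make_classification`, and picks an arbitrary small grid (ANOVA F-value selection with KNN or Random Forests). The inner loop ranks candidate models by one-vs-rest AUC; the outer loop estimates how well the selected model generalizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for an expression matrix (samples x genes), 3 classes.
X, y = make_classification(n_samples=90, n_features=50, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

n_markers = 8  # pipeline input: the number of biomarkers to select

# Feature selection lives inside the pipeline so it is refit on every
# training fold and never sees the held-out samples.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=n_markers)),
    ("clf", KNeighborsClassifier()),
])

# Each dict is one classifier family with its own parameters (illustrative grid).
param_grid = [
    {"clf": [KNeighborsClassifier()], "clf__n_neighbors": [3, 5]},
    {"clf": [RandomForestClassifier(n_estimators=50, random_state=0)],
     "clf__max_depth": [None, 5]},
]

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: rank parameterized models by multi-class (one-vs-rest) AUC.
search = GridSearchCV(pipe, param_grid, scoring="roc_auc_ovr", cv=inner)

# Outer loop: score the whole selection procedure on folds it never tuned on.
outer_auc = cross_val_score(search, X, y, scoring="roc_auc_ovr", cv=outer)
mean_auc = outer_auc.mean()
print("outer-fold AUC:", outer_auc, "mean:", mean_auc)
```

Keeping feature selection inside the cross-validated pipeline matters: selecting markers on the full dataset before splitting would leak information into the test folds and inflate the AUC.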
![Page 7: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/7.jpg)
Algorithms
Feature Selection
• Support Vector Machine (SVM) recursive feature elimination (RFE)
• ANOVA F-value
• Random Forests
• Extra Trees
Classification
• Correlation
• Cosine
• K-Nearest Neighbors (KNN)
• Support Vector Machine (SVM)
• Decision Tree
• Random Forests
• Extra Trees
• Gradient Boosting
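Two of the feature selection methods listed above can be sketched with scikit-learn, assuming that library and synthetic data; the slides do not say which implementation was used. SVM-RFE repeatedly fits a linear SVM and drops the lowest-weighted feature, while the ANOVA F-value method scores each feature independently.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.svm import SVC

# Synthetic stand-in data: 60 samples, 40 candidate markers, 3 classes.
X, y = make_classification(n_samples=60, n_features=40, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
n_markers = 8

# SVM-RFE: a linear kernel is needed so that the SVM exposes per-feature
# weights (coef_); step=1 removes one feature per iteration.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=n_markers, step=1)
rfe.fit(X, y)
rfe_markers = rfe.get_support(indices=True)

# ANOVA F-value: univariate ranking, keeps the k highest-scoring features.
anova = SelectKBest(score_func=f_classif, k=n_markers).fit(X, y)
anova_markers = anova.get_support(indices=True)

print("SVM-RFE markers:", rfe_markers)
print("ANOVA markers:  ", anova_markers)
```

The two methods need not agree: RFE accounts for how features work together in the model, whereas the F-value treats each feature in isolation.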
![Page 8: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/8.jpg)
Datasets
• From Broad Institute
  • Affymetrix gene expression microarray
  • 15 hematopoietic cell types
  • 82 samples
  • 4-7 samples per cell type
• Multiple Sources
  • 70 samples
  • Approximately 3-7 samples per cell type
  • Affymetrix & Illumina Bead Array
  • Different labs
![Page 9: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/9.jpg)
Experiments
• Complete
  • Complete gene expression profile from microarray datasets.
• Simulated Sparse
  • 70% and 50% missing data
  • Coverage of a marker (the fraction of cell types having a known expression status for that marker) followed a Beta distribution.
  • Fifteen simulations
• Cross-validation
  • 3-fold, stratified
  • # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384
  • Best set of features and classifier for each # features
• External validation
  • Use Broad data as training
  • Test against external datasets
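One way the simulated sparse setting could be generated is sketched below. The Beta parameters are assumptions (the slides give only the distribution family and the overall missing rates), as are the matrix size and the helper name `simulate_sparse`. Each marker draws its own coverage from a Beta distribution whose mean is 1 minus the target missing rate, and each cell type's entry is then kept with that probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete present/absent matrix: 15 cell types x 200 markers.
n_cell_types, n_markers = 15, 200
calls = rng.integers(0, 2, size=(n_cell_types, n_markers)).astype(float)


def simulate_sparse(matrix, mean_coverage, concentration=10.0, rng=rng):
    """Mask entries so per-marker coverage follows Beta(a, b) with the given mean.

    concentration = a + b controls the spread around the mean (assumed value).
    """
    a = mean_coverage * concentration
    b = (1.0 - mean_coverage) * concentration
    coverage = rng.beta(a, b, size=matrix.shape[1])  # one draw per marker
    known = rng.random(matrix.shape) < coverage      # broadcasts across cell types
    sparse = matrix.copy()
    sparse[~known] = np.nan                          # NaN marks unknown status
    return sparse, coverage


# 70% missing corresponds to a mean coverage of 0.30 under this assumption.
sparse70, coverage70 = simulate_sparse(calls, mean_coverage=0.30)
missing_fraction = np.isnan(sparse70).mean()
print(f"fraction missing: {missing_fraction:.2f}")
```

Repeating the call with different seeds would give independent simulations, and `mean_coverage=0.50` would give the 50% missing setting.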
![Page 10: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/10.jpg)
Performance: Complete Data
[Figure: AUC (y-axis, 0.45 to 0.95) versus Number of Markers (x-axis: 2, 8, 16, 32, 64, 96, 128, 256, 384). Series: Continuous CV, Discrete CV, Continuous EV, Discrete EV.]
![Page 11: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/11.jpg)
By Algorithm: Complete Data
[Figure: AUC (y-axis, 0.45 to 0.95) versus Number of Markers (x-axis: 2, 8, 16, 32, 64, 96, 128, 256, 384). Series: RFE - Extra Trees, RFE - Random Forest, RFE - Correlation, RFE - Cosine, RFE - Decision Tree, RFE - Gradient Boosting, RFE - KNN, RFE - SVM.]
![Page 12: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/12.jpg)
Performance: 70% Missing
[Figure: AUC (y-axis, 0.45 to 0.95) versus Number of Markers (x-axis: 2, 8, 16, 32, 64, 96, 128, 256, 384). Series: Discrete CV, Continuous CV, Discrete EV, Continuous EV.]
![Page 13: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/13.jpg)
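Summary: Best Algorithms

| # of markers | Complete: FS | Complete: CL | 70% missing: FS | 70% missing: CL |
|---|---|---|---|---|
| 2 | RFE | KNN | RFE | Extra Trees |
| 8 | RFE | Cosine | RFE | Cosine |
| 16 | RFE | Cosine | RFE | Cosine |
| 32 | RFE | Cosine | RFE | Cosine |
| 64 | RFE | Cosine | RFE | Cosine |
| 96 | RFE | Cosine | RFE | Correlation |
| 128 | RFE | Cosine | RFE | Correlation |
| 256 | RFE | Cosine | RFE | Correlation |
| 384 | RFE | Correlation | RFE | Correlation |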
![Page 14: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/14.jpg)
Why the Big Gap?
• Cross-platform normalization
• Similarities in cell-types
• Over-fitting
[Figure: Correlation: Broad vs External]
![Page 15: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/15.jpg)
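Motivation: Results

Mesoderm Cell-type Identification

| # genes | AUC |
|---|---|
| 8 | 73% |
| 16 | 74% |
| 32 | 76% |
| 64 | 78% |
| 96 | 87% |
| 128 | 91% |
| 256 | 91% |
| 384 | 92% |

Anti-TNF Responsiveness

| Study | # genes | Sensitivity (%) | Specificity (%) |
|---|---|---|---|
| Lequerre | 20 | 71 | 61 |
| Stuhlmuller | 11 | 79 | 56 |
| Stuhlmuller | 82 | 67 | 56 |
| Lequerre | 8 | 71 | 28 |
| Sekiguchi | 18 | 71 | 28 |
| Julia | 8 | 92 | 17 |
| Stuhlmuller | 3 | 71 | 17 |
| Tanio | 8 | 67 | 33 |
| UCONN | 8 | 83 | 83 |
| UCONN | 2048 | 94 | 96 |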
![Page 16: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/16.jpg)
Future Work
• Broader data types
  • NCI-60
    • microarray mRNA
    • microarray microRNA
    • copy number variation
    • protein array
    • SNPs
    • …
• Minimizing overfitting
  • Cross-platform normalization
• Different data types
  • Integrate multiple data types simultaneously
![Page 17: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/17.jpg)
Conclusion and Thanks
• Thanks to:
  • Ed Hemphill
  • Chih Lee
  • Ion Mandoiu
  • Craig Nelson
Smpl Bio: a commercial service coming in late 2013
![Page 18: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/18.jpg)
DON’T GO BEYOND, TIS A SILLY PLACE
Extra Slides
![Page 19: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/19.jpg)
Experiment Overview
• Broad Data: Nested Cross Validation
  • Input: # of biomarkers
  • Parameterize each combination of feature selection and classification algorithms
    • Feature Selection
    • Classification
  • Inner Cross-validation
  • Rank Models by AUC
  • Outer Cross-validation
  • Output the best features and classifier
• External Testing
  • Test Best Model
  • Output: AUC of best features / classifier
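The external testing step can be sketched as: fit the chosen feature selector and classifier on one dataset, then compute AUC on a dataset the model never saw. The example below is an assumption-laden illustration using scikit-learn and synthetic data. The "external" set is the held-out part of the same synthetic data with an artificial scale, offset, and noise applied to mimic a platform difference; it is not a model of the real Broad or external datasets.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=150, n_features=50, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Stratified split so both sets contain every class.
X_train, X_ext, y_train, y_ext = train_test_split(
    X, y, test_size=70, stratify=y, random_state=0)

# Crude, hypothetical platform effect applied to the external set only.
rng = np.random.default_rng(0)
X_ext_shifted = X_ext * 1.5 + 2.0 + rng.normal(scale=0.5, size=X_ext.shape)

# "Best model" placeholder: in practice this would be the winner of the
# nested cross-validation on the training data.
model = Pipeline([
    ("select", SelectKBest(f_classif, k=8)),
    ("clf", KNeighborsClassifier(n_neighbors=5)),
]).fit(X_train, y_train)

# One-vs-rest AUC on the same held-out samples, with and without the shift.
auc_same_platform = roc_auc_score(y_ext, model.predict_proba(X_ext),
                                  multi_class="ovr")
auc_shifted = roc_auc_score(y_ext, model.predict_proba(X_ext_shifted),
                            multi_class="ovr")
print("AUC, same platform:", auc_same_platform)
print("AUC, shifted platform:", auc_shifted)
```

Comparing the two printed values shows how a systematic platform difference alone can change external AUC, even when the biological signal is unchanged.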
![Page 20: Biomarker and Classifier Selection in Diverse Genetic Datasets](https://reader036.vdocuments.us/reader036/viewer/2022062521/56816685550346895dda3065/html5/thumbnails/20.jpg)
Performance: 50% Missing
[Figure: AUC (y-axis, 0.45 to 0.95) versus Number of Markers (x-axis categories 1 to 9). Series: Continuous CV, Continuous EV.]