Data Mining in Genomics: the dawn of personalized medicine
Gregory Piatetsky-Shapiro
KDnuggets
www.KDnuggets.com/gps.html
Connecticut College, October 15, 2003
© 2003 KDnuggets
Overview
Data Mining and Knowledge Discovery
Genomics and Microarrays
Microarray Data Mining
Trends leading to Data Flood
More data is generated:
Bank, telecom, other business transactions ...
Scientific Data: astronomy, biology, etc
Web, text, and e-commerce
More data is captured:
Storage technology faster and cheaper
DBMS capable of handling bigger DB
Knowledge Discovery Process
[Diagram labels: Raw Data, Integration, Data Warehouse, Selection & Cleaning, Target Data, Transformation, Transformed Data, Data Mining, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]
Major Data Mining Tasks
Classification: predicting an item class
Clustering: finding clusters in data
Associations: e.g. A & B & C occur frequently
Visualization: to facilitate human discovery
Summarization: describing a group
Estimation: predicting a continuous value
Deviation Detection: finding changes
Link Analysis: finding relationships
Major Application Areas for Data Mining Solutions
Advertising
Bioinformatics
Customer Relationship Management (CRM)
Database Marketing
Fraud Detection
eCommerce
Health Care
Investment/Securities
Manufacturing, Process Control
Sports and Entertainment
Telecommunications
Web
Genome, DNA & Gene Expression
An organism’s genome is the “program” for making the organism, encoded in DNA
  Human DNA has about 30,000-35,000 genes
  A gene is a segment of DNA that specifies how to make a protein
Cells are different because of differential gene expression
  About 40% of human genes are expressed at one time
Microarray devices measure gene expression
Molecular Biology Overview
[Figure labels: Cell, Nucleus, Chromosome, Gene (DNA), Gene (mRNA), single strand, Gene expression, Protein]
Graphics courtesy of the National Human Genome Research Institute
Affymetrix Microarrays
[Image labels: 50 µm, 1.28 cm]
~10^7 oligonucleotides, half Perfectly Match mRNA (PM), half have one Mismatch (MM)
Gene expression computed from PM and MM
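The slide does not say how expression is computed from PM and MM. As a minimal sketch, assuming one simple scheme (the average PM minus MM difference over a gene's probe pairs), the calculation could look like this. The function name `average_difference` and the input layout are hypothetical, chosen only for illustration.

```python
from statistics import mean


def average_difference(pm, mm):
    """Mean of PM - MM over a gene's probe pairs.

    pm, mm: intensities of the perfect-match and mismatch probes,
    one value per probe pair, in the same order.
    The result can be negative when mismatch probes read higher.
    """
    if len(pm) != len(mm):
        raise ValueError("PM and MM must have one value per probe pair")
    return mean(p - m for p, m in zip(pm, mm))
```

Real expression summaries typically add outlier trimming and normalization, which this sketch leaves out.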
Affymetrix Microarray Raw Image
[Figure labels: Scanner, enlarged section of raw image, raw data]
Gene              Value
D26528_at         193
D26561_cds1_at    -70
D26561_cds2_at    144
D26561_cds3_at    33
D26579_at         318
D26598_at         1764
D26599_at         1537
D26600_at         1204
D28114_at         707
Microarray Potential Applications
New and better molecular diagnostics
New molecular targets for therapy
  few new drugs, large pipeline, …
Outcome depends on genetic signature
  best treatment?
Fundamental Biological Discovery
  finding and refining biological pathways
Personalized medicine ?!
Microarray Data Mining Challenges
Avoiding false positives, due to
  too few records (samples), usually < 100
  too many columns (genes), usually > 1,000
Model needs to be robust in presence of noise
For reliability need large gene sets; for diagnostics or drug targets, need small gene sets
Estimate class probability
Model needs to be explainable to biologists
False Positives in Astronomy
cartoon used with permission
CATs: Clementine Application Templates
CATs: examples of complete data mining processes
Microarray CAT
[Diagram labels: Preparation, 2-Class, Multi-Class, Clustering]
Key Ideas
Capture the complete process
Cross-validation loop with feature selection inside
Randomization to select significant genes
Internal iterative feature selection loop
For each class, separate selection of optimal gene sets
Neural nets – robust in presence of noise
Bagging of neural nets
Microarray Classification
[Diagram labels: Data, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation]
Classification: External X-val
[Diagram labels: Gene Data, Train, Final Test, Train data, Test data, Feature and Parameter Selection, Model Building, Evaluation, Final Model, Final Results]
Measuring false positives with randomization
[Figure: table of Gene values with Class labels (1, 1, 2, 2); the Class column is randomized into RandClass (2, 1, 1, 2)]
Randomize 500 times
Bottom 1% T-value = -2.08
Select potentially interesting genes at 1%
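The randomization step can be sketched as follows. This is a minimal illustration, not the Clementine stream used in the talk: it assumes a Welch-form two-sample T-value and class labels coded 1 and 2, and the names `t_value` and `null_threshold` are hypothetical.

```python
import random
from statistics import mean, variance


def t_value(values, labels):
    """Two-sample T-value (Welch form) of class 1 vs. class 2 for one gene.

    Each class needs at least two samples so its variance is defined.
    """
    a = [v for v, c in zip(values, labels) if c == 1]
    b = [v for v, c in zip(values, labels) if c == 2]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def null_threshold(genes, labels, n_perm=500, quantile=0.01, seed=0):
    """Shuffle the class labels n_perm times and return the T-value found at
    the given lower quantile of all T-values computed on the shuffled labels.

    genes: list of per-gene value lists, one value per sample.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    null_ts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any real link between gene and class
        null_ts.extend(t_value(g, shuffled) for g in genes)
    null_ts.sort()
    return null_ts[int(quantile * len(null_ts))]
```

A gene whose T-value on the real labels falls below this threshold is more extreme than 99% of what random labelings produce, which is one way to read "select potentially interesting genes at 1%".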
Gene Reduction improves Classification
Most learning algorithms look for non-linear combinations of features, and can easily find many spurious combinations given a small number of records and a large number of genes
Classification accuracy improves if we first reduce the number of genes by a linear method, e.g. T-values of mean difference
Heuristic: select an equal number of genes from each class
Then apply a favorite machine learning algorithm
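A minimal sketch of the gene-reduction heuristic, assuming a Welch-form T-value of each class against all other samples. The function names and the tiny data set (gene names `g_all`, `g_aml`, `g_flat`) are made up for illustration.

```python
from statistics import mean, variance


def t_value(values, labels, positive):
    """T-value (Welch form) of the `positive` class against all other samples."""
    a = [v for v, c in zip(values, labels) if c == positive]
    b = [v for v, c in zip(values, labels) if c != positive]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_genes_per_class(expr, labels, k):
    """Return {class: names of the k genes with the highest T-value for it}.

    expr maps a gene name to its list of expression values, one per sample.
    """
    selected = {}
    for cls in sorted(set(labels)):
        ranked = sorted(expr, key=lambda g: t_value(expr[g], labels, cls),
                        reverse=True)
        selected[cls] = ranked[:k]
    return selected


# Hypothetical toy data: three genes, three samples per class.
expr = {
    "g_all": [9, 8, 9, 1, 2, 1],   # high in ALL
    "g_aml": [1, 2, 1, 9, 8, 9],   # high in AML
    "g_flat": [5, 4, 6, 5, 6, 4],  # uninformative
}
labels = ["ALL"] * 3 + ["AML"] * 3
selected = top_genes_per_class(expr, labels, k=1)
print(selected)  # {'ALL': ['g_all'], 'AML': ['g_aml']}
```

Taking the same `k` for every class keeps one class with many strong genes from crowding out the others.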
Iterative Wrapper approach to selecting the best gene set
Test models using 1, 2, 3, …, 10, 20, 30, 40, ..., 100 top genes with cross-validation
Heuristic 1: evaluate errors from each class; select the number of genes from each class that minimizes error for that class
For randomized algorithms, average 10+ cross-validation runs!
Select gene set with lowest average error
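The wrapper loop can be sketched in plain Python as below. It is an illustration under stated assumptions: a nearest-centroid classifier stands in for the neural nets used in the talk, genes are ranked by a Welch-form T-value, and it picks one gene count for all classes rather than a separate count per class. All function names are hypothetical. The key point it shows is that gene selection is redone inside every fold.

```python
import random
from statistics import mean, variance


def t_value(values, labels, positive):
    """T-value (Welch form) of the `positive` class against all other samples."""
    a = [v for v, c in zip(values, labels) if c == positive]
    b = [v for v, c in zip(values, labels) if c != positive]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def select_genes(X, y, k):
    """Column indices of the top-k genes per class by T-value (union of classes)."""
    chosen = set()
    for cls in set(y):
        scores = [(t_value([row[j] for row in X], y, cls), j)
                  for j in range(len(X[0]))]
        chosen.update(j for _, j in sorted(scores, reverse=True)[:k])
    return sorted(chosen)


def centroid_predict(X_train, y_train, genes, x):
    """Nearest-centroid prediction using only the selected genes."""
    best, best_dist = None, float("inf")
    for cls in set(y_train):
        rows = [r for r, c in zip(X_train, y_train) if c == cls]
        dist = sum((x[j] - mean(r[j] for r in rows)) ** 2 for j in genes)
        if dist < best_dist:
            best, best_dist = cls, dist
    return best


def cv_error(X, y, k, n_folds=5, seed=0):
    """Cross-validated error rate with gene selection redone inside every fold.

    Every class needs at least two samples in each training fold.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(n_folds):
        test = idx[f::n_folds]
        train = [i for i in idx if i not in test]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        genes = select_genes(X_tr, y_tr, k)  # test samples are never seen here
        errors += sum(centroid_predict(X_tr, y_tr, genes, X[i]) != y[i]
                      for i in test)
    return errors / len(X)


def best_gene_count(X, y, sizes=(1, 2, 3, 5, 10), n_runs=10):
    """Average the CV error over n_runs shuffles; return (best k, {k: avg error})."""
    avg = {k: mean(cv_error(X, y, k, seed=run) for run in range(n_runs))
           for k in sizes}
    return min(avg, key=avg.get), avg
```

Selecting genes on the full data set before cross-validating would leak test information into the model and make the error estimate too optimistic.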
Clementine stream for subset selection by x-validation
Microarrays: ALL/AML Example
Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v. 286, 1999
72 examples (38 train, 34 test), about 7,000 genes
well-studied (CAMDA-2000), good test example
[Images: ALL, AML]
Visually similar, but genetically very different
Gene subset selection: one X-validation
[Chart: average error for 10-fold cross-validation (axis 0% to 30%) against genes per class (1, 2, 3, 4, 5, 10, 20, 30, 40), single cross-validation run]
Gene subset selection: multiple cross-validation runs
For ALL/AML data, 10 genes per class had the lowest error (<1%)
Point in the center is the average error from 10 cross-validation runs
Bars indicate 1 st. dev. above and below
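As a small worked sketch of how each point and its bars could be computed from repeated runs, assuming the sample standard deviation is used (the slide does not say which form). The function name `error_bar` is hypothetical.

```python
from statistics import mean, stdev


def error_bar(run_errors):
    """Return (low, center, high) for one gene-set size.

    center is the average error over the cross-validation runs;
    low and high lie one sample standard deviation below and above it.
    """
    center = mean(run_errors)
    spread = stdev(run_errors)  # needs at least two runs
    return center - spread, center, center + spread
```

The lower end is left unclipped here, so it can fall below zero when the average error is close to 0%.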
ALL/AML: Results on the test data
Genes selected and model trained on Train set ONLY!
Best Net with 10 top genes per class (20 overall) was applied to the test data (34 samples): 33 correct predictions (97% accuracy), 1 error on sample 66
  Actual Class AML, Net prediction: ALL
  other methods consistently misclassify sample 66; misclassified by a pathologist?
Pediatric Brain Tumour Data
92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children’s Hospital
Outer cross-validation with gene selection inside the loop
Ranking by absolute T-test value (selects top positive and negative genes)
Select best genes by adjusted error for each class
Bagging of 100 neural nets
Selecting Best Gene Set
Minimizing Combined Error for all classes is not optimal
Average, high and low error rate for all classes
Error rates for each class
[Chart: error rate against genes per class]
Evaluating One Network
Averaged over 100 Networks:
Class    Error rate
MED      2.1%
MGL      17%
RHB      24%
EPD      9%
JPA      19%
*ALL*    8.3%
Bagging 100 Networks
Class    Individual Error Rate    Bag Error Rate    Bag Avg Conf
MED      2.1%                     2% (0)*           98%
MGL      17%                      10%               83%
RHB      24%                      11%               76%
EPD      9%                       0                 91%
JPA      19%                      0                 81%
*ALL*    8.3%                     3% (2)*           92%
Note: suspected error on one sample (labeled as MED but consistently classified as RHB)
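Bagging itself can be sketched as follows. This is an illustration under assumptions, not the setup from the talk: a nearest-centroid learner stands in for the neural nets, and the share of models agreeing with the majority vote is used as a stand-in for confidence, since the slides do not say how Bag Avg Conf was computed. All names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean


def fit_centroids(X, y):
    """Stand-in base learner: one mean vector (centroid) per class."""
    return {c: [mean(col) for col in zip(*(r for r, l in zip(X, y) if l == c))]
            for c in set(y)}


def predict(centroids, x):
    """Class whose centroid is closest to x in squared Euclidean distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))


def bag(X, y, n_models=100, seed=0):
    """Train n_models learners, each on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in X]  # sample with replacement
        models.append(fit_centroids([X[i] for i in idx], [y[i] for i in idx]))
    return models


def bag_predict(models, x):
    """Majority vote, plus the share of models that agree with it."""
    votes = Counter(predict(m, x) for m in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)
```

A sample that the bag consistently assigns to a class other than its label, as with the suspected MED sample in the note, shows up here as a high vote share for the "wrong" class.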
AF1q: New Marker for Medulloblastoma?
AF1Q
  ALL1-fused gene from chromosome 1q
  transmembrane protein
  Related to leukemia (3 PUBMED entries) but not to Medulloblastoma
Future directions for Microarray Analysis
Algorithms optimized for small samples
Integration with other data
  biological networks
  medical text
  protein data
Cost-sensitive classification algorithms
  error cost depends on outcome (don’t want to miss treatable cancer), treatment side effects, etc.
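One common way to make a classifier cost-sensitive is to choose the prediction with the lowest expected cost instead of the most probable class. The sketch below assumes the model outputs class probabilities. The class names, probabilities and cost values are hypothetical and only illustrate the idea that missing a treatable cancer costs far more than a false alarm.

```python
def min_cost_decision(class_probs, cost):
    """Pick the prediction with the lowest expected cost.

    class_probs: {true_class: probability estimated by the model}
    cost: {(true_class, predicted_class): cost of that outcome}
    """
    def expected(pred):
        return sum(p * cost[(true, pred)] for true, p in class_probs.items())

    return min(class_probs, key=expected)


# Hypothetical numbers: a missed cancer costs 20 times a false alarm.
probs = {"cancer": 0.2, "healthy": 0.8}
cost = {
    ("cancer", "cancer"): 0, ("cancer", "healthy"): 20,
    ("healthy", "cancer"): 1, ("healthy", "healthy"): 0,
}
decision = min_cost_decision(probs, cost)
print(decision)  # cancer
```

With these costs, predicting "healthy" has expected cost 0.2 × 20 = 4.0 and predicting "cancer" has 0.8 × 1 = 0.8, so the sample is flagged even though cancer is the less probable class.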
Acknowledgements
Eric Bremer, Children’s Hospital (Chicago) & Northwestern U.
Greg Cooper, U. Pittsburgh
Tom Khabaza, SPSS
Sridhar Ramaswamy, MIT/Whitehead Institute
Pablo Tamayo, MIT/Whitehead Institute
Thank you
Further resources on Data Mining: www.KDnuggets.com
Microarrays:
www.KDnuggets.com/websites/microarray.html
Contact:
Gregory Piatetsky-Shapiro: www.kdnuggets.com/gps.html