supervised learning for gene expression microarray data david page university of wisconsin
TRANSCRIPT
![Page 1: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/1.jpg)
Supervised Learning for Gene Supervised Learning for Gene Expression Microarray DataExpression Microarray Data
David Page
University of Wisconsin
![Page 2: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/2.jpg)
Joint Work with:Joint Work with:
Mike Waddell, James Cussens, Jo HardinFrank Zhan, Bart Barlogie, John
Shaughnessy
![Page 3: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/3.jpg)
Common ApproachesCommon Approaches
Comparing two measurements at a time– Person 1, gene G: 1000– Person 2, gene G: 3200– Greater than 3-fold change: flag this gene
Comparing one measurement with a population of measurements… is it unlikely that the new measurement was drawn from same distribution?
![Page 4: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/4.jpg)
Approaches (Continued)Approaches (Continued)
Clustering or Unsupervised Data Mining– Hierarchical Clustering, Self-Organizing
(Kohonen) Maps (SOMs), K-Means Clustering– Cluster patients with similar expression patterns– Cluster genes with similar patterns across
patients or samples (genes that go up or down together)
![Page 5: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/5.jpg)
Approaches (Continued)Approaches (Continued)
Classification or Supervised Data Mining.– Use our knowledge of class values… myeloma
vs. normal, positive response vs. no response to treatment, etc., to gain added insight.
– Find genes that are best predictors of class. Can provide useful tests, e.g. for choosing
treatment. If predictor is comprehensible, may provide novel
insight, e.g., point to a new therapeutic target.
![Page 6: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/6.jpg)
Approaches (Continued)Approaches (Continued)
Classification or Supervised Learning.– UC Santa Cruz: Furey et al. 2001 (support
vector machines).– MIT Whitehead: Golub et al. 1999, Slonim et
al. 2000 (voting).– SNPs and Proteomics are coming.
![Page 7: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/7.jpg)
OutlineOutline
Data and TaskSupervised Learning Approaches and
Results– Tree Models and Boosting– Support Vector Machines– Voting– Bayesian Networks
Conclusions
![Page 8: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/8.jpg)
DataData
Publicly-available from Lambert Lab at http://lambertlab.uams.edu/publicdata.htm
105 samples run on Affymetrix HuGenFL– 74 Myeloma samples– 31 Normal samples
![Page 9: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/9.jpg)
Two Ways to View the DataTwo Ways to View the Data
Data points are genes.– Represented by expression levels across
different samples.– Goal: find related genes.
Data points are samples (e.g., patients).– Represented by expression levels of different
genes.– Goal: find related samples.
![Page 10: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/10.jpg)
Two Ways to View The DataTwo Ways to View The Data
Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 P 1142.0 A 321.0 P 2567.2 . . . Person 2 A -586.3 P 586.1 P 759.0 . . . Person 3 A 105.2 A 559.3 P 3210.7 . . . Person 4 P -42.8 P 692.1 P 812.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
![Page 11: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/11.jpg)
Data Points are GenesData Points are Genes
Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 P 1142.0 A 321.0 P 2567.2 . . . Person 2 A -586.3 P 586.1 P 759.0 . . . Person 3 A 105.2 A 559.3 P 3210.7 . . . Person 4 P -42.8 P 692.1 P 812.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
![Page 12: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/12.jpg)
Data Points are SamplesData Points are Samples
Person Gene A28202_ac AB00014_at AB00015_at . . . Person 1 P 1142.0 A 321.0 P 2567.2 . . . Person 2 A -586.3 P 586.1 P 759.0 . . . Person 3 A 105.2 A 559.3 P 3210.7 . . . Person 4 P -42.8 P 692.1 P 812.0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
![Page 13: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/13.jpg)
Supervision: Add ClassesSupervision: Add Classes
Person Gene A28202_ac AB00014_at AB00015_at . . . CLASS Person 1 P 1142.0 A 321.0 P 2567.2 . . . myeloma Person 2 A -586.3 P 586.1 P 759.0 . . . normal Person 3 A 105.2 A 559.3 P 3210.7 . . . myeloma Person 4 P -42.8 P 692.1 P 812.0 . . . normal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
![Page 14: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/14.jpg)
Data Points are: Genes Patients
Clustering Supervised Data Mining
Predict the class value for a patient
based on the expression levels for his/her genes
The TaskThe Task
![Page 15: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/15.jpg)
OutlineOutline
Data and TaskSupervised Data Mining Algorithms
– Tree Models and Boosting– Support Vector Machines– Voting– Bayesian Networks
Conclusions
![Page 16: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/16.jpg)
Decision Trees in One PictureDecision Trees in One Picture
Myeloma Normal
AvgDiff of G5 < 1000 > 1000
![Page 17: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/17.jpg)
C5.0 (Quinlan) ResultC5.0 (Quinlan) Result
Decision tree:
AD_X57809_at <= 20343.4: myeloma (74)
AD_X57809_at > 20343.4: normal (31)
Leave-one-out cross-validation accuracy estimate: 97.1%
X57809: IGL (immunoglobulin lambda locus)
![Page 18: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/18.jpg)
Problem with ResultProblem with Result
Easy to predict accurately with genes related to immune function, such as IGL, but this gives us no new insight.
Eliminate these genes prior to training.
![Page 19: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/19.jpg)
Ignoring Genes Associated Ignoring Genes Associated with Immune functionwith Immune function
Decision tree:
AD_X04898_rna1_at <= -1453.4: normal (30)
AD_X04898_rna1_at > -1453.4: myeloma (74/1)
X04898: APOA2 (Apolipoprotein AII)
Leave-one-out accuracy estimate: 98.1%.
![Page 20: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/20.jpg)
Next-Best TreeNext-Best TreeAD_M15881_at > 992: normal (28)
AD_M15881_at <= 992:
AC_D82348_at = A: normal (3)
AC_D82348_at = P: myeloma (74)
M15881: UMOD (uromodulin…Tamm-Horsfall glycoprotein, uromucoid)
D82348: purH
Leave-one-out accuracy estimate: 93.3%
![Page 21: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/21.jpg)
GeneCards Reveals…GeneCards Reveals…UROM_HUMAN: uromodulin precursor (tamm-horsfall urinaryglycoprotein) (thp).--gene: umod. [640 amino acids; 69 kd] function: not known. may play a role in regulating thecirculating activity of cytokines as it binds to il-1, il-2and tnf with high affinity. subcellular location: attached to the membrane by a gpi-anchor, then cleaved to produce a soluble form which isSecreted in urine. tissue specificity: synthesized by the kidneys and is themost abundant protein in normal human urine.
![Page 22: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/22.jpg)
BoostingBoosting
After building a tree, give added weight to any data point the tree mislabels.
Learn a new tree from re-weighted data.Repeat 10 times.To classify a new data point, let trees vote
(weighted by their accuracies on the training data).
![Page 23: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/23.jpg)
Boosting ResultsBoosting Results
Leave-one-out accuracy estimate: 99.0%.With Absolute Calls only: 96.2%.But it is much harder to understand, or gain
insight from, a weighted set of trees than from a single tree.
![Page 24: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/24.jpg)
Summary of AccuraciesSummary of Accuracies
AC Only AC + AD
Trees
Boosted Trees
SVMs
Vote
Bayes Nets
90.5 98.1
95.2 99.0
![Page 25: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/25.jpg)
OutlineOutline
Data and TaskSupervised Data Mining Algorithms
– Tree Models and Boosting– Support Vector Machines– Voting– Bayesian Networks
Conclusions
![Page 26: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/26.jpg)
Support Vector MachinesMaximizing the Margin between Bounding
Planes (Mangasarian & Fung)
Myeloma
wNormal
![Page 27: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/27.jpg)
SVM Results (Defaults)SVM Results (Defaults)
Accuracy using Absolute Call only is better than accuracy using AC + AD.– AC: 95.2%– AC + AD: 93.3%
Difficult to interpret results… open research area to extract most important genes from SVM.
Might be useful for choosing a therapy but not yet for gaining insight into disease.
![Page 28: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/28.jpg)
Summary of AccuraciesSummary of Accuracies
AC Only AC + AD
Trees
Boosted Trees
SVMs
Vote
Bayes Nets
90.5 98.1
96.2 99.0
95.2 93.3 100.0
![Page 29: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/29.jpg)
OutlineOutline
Data and TaskSupervised Data Mining Algorithms
– Tree Models and Boosting– Support Vector Machines– Voting– Bayesian Networks
Conclusions
![Page 30: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/30.jpg)
Voting ApproachVoting Approach
Score genes using information gain.Choose top 1% (or other number) scoring
genes.To classify a new case, let these genes vote
(majority or weighted majority vote).We use majority vote here.
![Page 31: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/31.jpg)
Voting Results (Absolute Call)Voting Results (Absolute Call)
Using only Absolute Calls, accuracy is 94.0%.
Appears we can improve accuracy by requiring only 40% of genes to predict myeloma in order to make a myeloma prediction.
Would be interesting to test this on new Lambert Lab data.
![Page 32: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/32.jpg)
Top Voters (AC Only)Top Voters (AC Only)SCORE GENE MP MA NP NA0.446713 H1F2 57 17 0 310.446713 NCBP2 57 17 0 310.432706 SM15 56 18 0 310.432706 GCN5L2 56 18 0 310.412549 maj hist comp 12 62 29 20.411956 RNASE6 15 59 30 10.411956 TNFRSF7 15 59 30 10.411956 SDF1 15 59 30 1
![Page 33: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/33.jpg)
Voting Results (AC + AD)Voting Results (AC + AD)
All top 1% splits are based on AD.Leave-one-out results appear to be 100%…
double-checking this to be sure.35 is cutoff point for myeloma vote. No
normal gets more than 15 votes, and no myeloma gets fewer than 55.
![Page 34: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/34.jpg)
Top Voters (AD)Top Voters (AD)SCORE GENE SPLIT MH ML NH NL0.802422 APOA2 -777 74 0 1 300.735975 HERV K22 pol 637 3 71 31 00.704489 TERT -1610 70 4 0 310.701219 UMOD 1119.1 0 74 28 30.701219 CDH4 -278 74 0 3 280.664859 ACTR1A 3400.6 3 71 30 10.664859 MASP1 -536.6 71 3 1 300.650059 PTPN21 1256.1 6 68 31 0
![Page 35: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/35.jpg)
Summary of AccuraciesSummary of Accuracies
AC Only AC + AD
Trees
Boosted Trees
SVMs
Vote
Bayes Nets
90.5 98.1
96.2 99.0
95.2
94.0 100.0
100.0
![Page 36: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/36.jpg)
OutlineOutline
Data and TaskSupervised Data Mining Algorithms
– Tree Models and Boosting– Support Vector Machines– Voting– Bayesian Networks
Conclusions
![Page 37: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/37.jpg)
Bayes Nets for Gene Bayes Nets for Gene Expression DataExpression Data
Friedman et al. 1999 has been followed by much work on this approach.
Up to now, primarily used to discovery dependencies among genes, not to predict class values.
Recent experience suggests using Bayes nets to predict class values.
![Page 38: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/38.jpg)
KDDKDD--2001 Cup2001 CupThe Genomics ChallengeThe Genomics Challenge
Christos Hatzis, Silico InsightsChristos Hatzis, Silico InsightsDavid Page, University of WisconsinDavid Page, University of Wisconsin
CoCo--chairschairs
August 26, 2001August 26, 2001
Special thanks: DuPont Pharmaceuticals Research Laboratories for providing data set 1, Chris Kostas from Silico Insights for cleaning and organizing data sets 2 and 3
http:/ / www.cs.wisc.edu/ ~dpage/ kddcup2001/
![Page 39: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/39.jpg)
The model & its The model & its performanceperformance
356356128128negneg
55559595pospos
negnegpospospredicted
Actual
Accuracy: 0.711Weighted Accuracy: 0.684
Activity
10695
16794 79651
91839
![Page 40: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/40.jpg)
Bayes Nets ResultBayes Nets Result
Network with 23 genes selected.Diagnosis node is parent of 20 others.
Others have at most three other parents.Leave-one-out accuracy estimate is 97%.Software is not capable of handling
numerical values at this time.
![Page 41: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/41.jpg)
NCBP2
MHC2 beta W52 RNASE6
GCN5L2
LAMC1
CTSH
H1F2
TNFRSF7
PTPRK
SSA2
DEFA1
NOT56L
EIF3S9 CDKN1A
APOC1
GYS1
S100A9
STIP1 ABL1
diagnosis SDF1
IFRD2 DPYSL2
![Page 42: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/42.jpg)
Summary of AccuraciesSummary of Accuracies
AC Only AC + AD
Trees
Boosted Trees
SVMs
Vote
Bayes Nets
90.5 98.1
96.2 99.0
95.2 100.0
94.0
95.2 NA
100.0
![Page 43: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/43.jpg)
Further WorkFurther Work
Interpreting SVMs.Analyzing new, larger data sets.Other classification tasks: prognosis,
treatment selection, MGUS vs. Myeloma.
![Page 44: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/44.jpg)
ConclusionsConclusions
Supervised learning produces highly accurate predictions for this task. Noise not a problem.
Don’t throw out negative average differences! So far the ability of SVMs to consider magnitude
of differences in expression level has not yielded benefit over voting, which just uses consistency.
Domain experts like readability of trees, voting, Bayes nets, but trees give worse accuracy.
Many of the most predictive genes line up with expectations of domain experts.
![Page 45: Supervised Learning for Gene Expression Microarray Data David Page University of Wisconsin](https://reader031.vdocuments.us/reader031/viewer/2022012916/56649da05503460f94a8b7ee/html5/thumbnails/45.jpg)
Using Absolute Calls OnlyUsing Absolute Calls OnlyU78525_at = A: normal (21/1)
U78525_at = P:
M62505_at = P: normal (5)
M62505_at = A:
AF002700_at = M: normal (2)
AF002700_at = A:
U97188_at = P: normal (2)
U97188_at = A:
HG415-HT415_at = A: myeloma (72)
HG415-HT415_at = P: normal (3/1)