![Page 1: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/1.jpg)
Computational Discrete Mathematics and Statistics for
Molecular Array Data
Bill Shannon
Washington University
School of Medicine
![Page 2: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/2.jpg)
Molecular Biology
“How Genes Work”, http://www.nigms.nih.gov
![Page 3: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/3.jpg)
Microarrays
[Figure: messenger RNA* levels for genes A, B, and C in a normal cell versus a tumor cell]
*Brenner, Jacob, Meselson (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190:576-581.
![Page 4: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/4.jpg)
Microarrays (Leukemia PPG)
35 Probes Selected from ~50,000
![Page 5: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/5.jpg)
Array Data Present New Data Analysis Challenges
(Curse of Dimensionality)
• Inaccuracy, or error, of a model becomes large very fast
– sparseness (description of the data is impossible)
– model complexity (too many interaction terms, non-linear effects, etc. to consider)
– random multicollinearity (spurious correlations)
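The spurious-correlation point can be illustrated with a small simulation. This is only a sketch under assumed sizes (20 samples, 10,000 genes, chosen for illustration and not taken from the slides): every gene is pure noise, yet some gene correlates strongly with an equally random outcome.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes = 20, 10_000   # hypothetical sizes with P >> N

# Pure noise: no gene is truly related to the outcome.
expression = rng.standard_normal((n_samples, n_genes))
outcome = rng.standard_normal(n_samples)

# Pearson correlation of every gene with the outcome.
xc = expression - expression.mean(axis=0)
yc = outcome - outcome.mean()
corr = (xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
)

max_abs_corr = float(np.abs(corr).max())
print(f"largest |r| among {n_genes} noise genes: {max_abs_corr:.2f}")
```

With so many candidate genes and so few samples, the largest absolute correlation is far from zero even though no real relationship exists.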
![Page 6: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/6.jpg)
Regression (Curse of Dimensionality)
• y = f(x) + error
• sparseness = little local signal
– model parameters not estimated accurately
– unstable models over-fit data (not generalizable)
• Non-parametric methods (e.g., CART, neural nets)
– require a lot of model searching
– use up degrees of freedom rapidly
– little or no information left to determine significance
![Page 7: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/7.jpg)
Cluster Analysis (Curse of Dimensionality)
• Find structure in data
• Many cluster results with same goodness-of-fit
• Deciding among the models is impossible.
![Page 8: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/8.jpg)
Classification Models (Curse of Dimensionality)
• Predict group membership (e.g., tumor versus normal)
• Three broad categories
– geometric methods (discriminant analysis, CART)
– probabilistic methods (Bayesian)
– algorithmic methods (neural networks, k-NN)
• Require training/validation datasets
![Page 9: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/9.jpg)
Other Methods (Curse of Dimensionality)
• Resampling (cross validation, bootstrapping), model averaging (bagging), or iterative re-weighting (boosting)
• Multiple testing adjustment such as false discovery rate or permutation testing
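As one concrete instance of a false discovery rate adjustment, the Benjamini-Hochberg step-up procedure can be sketched as follows. The slides do not say which FDR method is meant, so this choice, the function name `benjamini_hochberg`, and the example p-values are assumptions for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of the hypotheses rejected at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # ranks p-values from smallest to largest
    thresholds = q * np.arange(1, m + 1) / m     # rank-dependent cutoffs q * i / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        last = np.nonzero(below)[0].max()        # largest rank that meets its cutoff
        rejected[order[: last + 1]] = True       # reject everything up to that rank
    return rejected

p_vals = [0.001, 0.008, 0.039, 0.041, 0.6]       # hypothetical per-gene p-values
mask = benjamini_hochberg(p_vals, q=0.05)
print(mask)
```

Here only the two smallest p-values fall under their cutoffs (0.01 and 0.02), so only those two genes are declared significant.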
![Page 10: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/10.jpg)
Mantel Statistics
• Transform standard NxP data matrices into NxN subject pairwise distances or similarities
• Instead of analyzing NxP data matrix (P >> N) avoid the curse of dimensionality problem and analyze the NxN matrix
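The NxP to NxN transformation can be sketched in a few lines. Euclidean distance is assumed here as the pairwise measure, and the function name `pairwise_distances` and the data sizes are illustrative only.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distances between the rows of an N x P matrix, as an N x N matrix."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None, :] - x[None, :, :]      # N x N x P array of gene-wise differences
    return np.sqrt((diff ** 2).sum(axis=2))

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 5000))            # N = 6 samples, P = 5000 genes (hypothetical)
d = pairwise_distances(x)
print(d.shape)                                # (6, 6)
```

However many genes are measured, the object analyzed afterwards has only N rows and N columns.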
![Page 11: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/11.jpg)
Mantel Statistics
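Distance between samples $i$ and $i'$ over all $P$ genes:

$$d_{i,i'} = (x_{i,1} - x_{i',1})^2 + (x_{i,2} - x_{i',2})^2 + \cdots + (x_{i,P} - x_{i',P})^2$$

$$
\begin{array}{c|cccc}
\text{Sample} & G_1 & G_2 & \cdots & G_P \\
\hline
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,P} \\
2 & x_{2,1} & x_{2,2} & \cdots & x_{2,P} \\
3 & x_{3,1} & x_{3,2} & \cdots & x_{3,P} \\
\vdots & \vdots & \vdots & & \vdots \\
N & x_{N,1} & x_{N,2} & \cdots & x_{N,P}
\end{array}
\quad\longrightarrow\quad
D_P =
\begin{pmatrix}
0 & d_{1,2} & d_{1,3} & \cdots & d_{1,N} \\
  & 0 & d_{2,3} & \cdots & d_{2,N} \\
  &   & 0 & \cdots & d_{3,N} \\
  &   &   & \ddots & \vdots \\
  &   &   &        & 0
\end{pmatrix}
$$
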
Shannon (2008) Cluster Analysis, in Handbook of Statistics, Vol. 27, eds. Rao, Rao, Miller.
![Page 12: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/12.jpg)
Mantel Statistics
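Using only a subset of $k \ll P$ genes $G_{(1)}, G_{(2)}, \dots, G_{(k)}$:

$$
\begin{array}{c|cccc}
\text{Sample} & G_{(1)} & G_{(2)} & \cdots & G_{(k)} \\
\hline
1 & x_{1,(1)} & x_{1,(2)} & \cdots & x_{1,(k)} \\
2 & x_{2,(1)} & x_{2,(2)} & \cdots & x_{2,(k)} \\
3 & x_{3,(1)} & x_{3,(2)} & \cdots & x_{3,(k)} \\
\vdots & \vdots & \vdots & & \vdots \\
N & x_{N,(1)} & x_{N,(2)} & \cdots & x_{N,(k)}
\end{array}
\quad\longrightarrow\quad
D_{k \ll P} =
\begin{pmatrix}
0 & d_{1,2} & d_{1,3} & \cdots & d_{1,N} \\
  & 0 & d_{2,3} & \cdots & d_{2,N} \\
  &   & 0 & \cdots & d_{3,N} \\
  &   &   & \ddots & \vdots \\
  &   &   &        & 0
\end{pmatrix}
$$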
![Page 13: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/13.jpg)
Mantel Statistics
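$$
\underbrace{D_P =
\begin{pmatrix}
0 & d_{1,2} & d_{1,3} & \cdots & d_{1,N} \\
  & 0 & d_{2,3} & \cdots & d_{2,N} \\
  &   & 0 & \cdots & d_{3,N} \\
  &   &   & \ddots & \vdots \\
  &   &   &        & 0
\end{pmatrix}}_{\text{Signal + Noise Genes}}
\qquad
\underbrace{D_{k \ll P} =
\begin{pmatrix}
0 & d_{1,2} & d_{1,3} & \cdots & d_{1,N} \\
  & 0 & d_{2,3} & \cdots & d_{2,N} \\
  &   & 0 & \cdots & d_{3,N} \\
  &   &   & \ddots & \vdots \\
  &   &   &        & 0
\end{pmatrix}}_{\text{Signal Genes Only}}
$$

Mantel correlation between the two distance matrices, where $d^{P}_{i,j}$ and $d^{k \ll P}_{i,j}$ are the entries of $D_P$ and $D_{k \ll P}$ and bars denote means over all pairs $i < j$:

$$
r(D_P, D_{k \ll P}) =
\frac{\sum_{i<j} \left(d^{P}_{i,j} - \bar{d}^{P}\right)\left(d^{k \ll P}_{i,j} - \bar{d}^{k \ll P}\right)}
{\sqrt{\sum_{i<j} \left(d^{P}_{i,j} - \bar{d}^{P}\right)^2 \, \sum_{i<j} \left(d^{k \ll P}_{i,j} - \bar{d}^{k \ll P}\right)^2}}
$$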
![Page 14: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/14.jpg)
Mantel Statistics
Correlating $D_P$ with $D_{k \ll P}$ avoids the curse of dimensionality!
A positive Mantel correlation indicates that the genes in $D_{k \ll P}$ contain the same information as the genes in $D_P$
Shannon, Watson, et al. (2002). Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genetic Epidemiology 23: 87-96.
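A minimal sketch of the Mantel correlation between a full-gene distance matrix and a subset distance matrix follows. It assumes Euclidean distances and a Pearson correlation over the off-diagonal entries; the simulated data, the group shift of 2.0, and the function names are hypothetical and not taken from the cited paper.

```python
import numpy as np

def distance_matrix(x):
    """N x N Euclidean distances between the rows (samples) of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def mantel_correlation(d_a, d_b):
    """Pearson correlation between the upper triangles of two N x N distance matrices."""
    upper = np.triu_indices_from(d_a, k=1)
    return float(np.corrcoef(d_a[upper], d_b[upper])[0, 1])

rng = np.random.default_rng(1)
n, p, n_signal = 40, 500, 20                  # hypothetical sizes
x = rng.standard_normal((n, p))
x[n // 2:, :n_signal] += 2.0                  # second group shifted on the signal genes

d_full = distance_matrix(x)                   # plays the role of D_P
r_signal = mantel_correlation(d_full, distance_matrix(x[:, :n_signal]))
r_noise = mantel_correlation(d_full, distance_matrix(x[:, -n_signal:]))
print(f"signal subset r = {r_signal:.2f}, noise subset r = {r_noise:.2f}")
```

In this toy setting the signal-gene subset reproduces the sample structure of the full matrix much better than an equally sized noise-gene subset, which is the behavior the Mantel score is meant to reward.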
![Page 15: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/15.jpg)
GA-Mantel
• Search algorithm to find signal genes
• Solution representation
– list of genes (10 123 456 798 835 888 923)
– binary vector {0000100110000….00010}
• Each solution maps to a Mantel correlation value
– Assumption: the larger the correlation, the more signal genes in the solution
• Selection keeps solutions with high Mantel correlation
Grefenstette, Thompson, Shannon, and Steinmeyer (2005): Genetic algorithms for feature selection using Mantel correlation scoring. Interface: Classification and Clustering 37th Symposium on the Interface. St. Louis, MO
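The two solution representations are interchangeable, as this small sketch shows. The helper names and the total of 1,000 genes are assumptions made for the example; only the gene list comes from the slide.

```python
import numpy as np

def indices_to_mask(gene_indices, n_genes):
    """Convert a list of selected gene indices into a binary (boolean) vector."""
    mask = np.zeros(n_genes, dtype=bool)
    mask[list(gene_indices)] = True
    return mask

def mask_to_indices(mask):
    """Convert a binary vector back into a sorted list of gene indices."""
    return np.flatnonzero(mask).tolist()

subset = [10, 123, 456, 798, 835, 888, 923]      # gene list from the slide
mask = indices_to_mask(subset, n_genes=1000)     # n_genes=1000 is hypothetical
print(int(mask.sum()), mask_to_indices(mask) == subset)
```

The list form is compact when subsets are small relative to the number of genes, while the binary form makes bitwise recombination and mutation straightforward.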
![Page 16: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/16.jpg)
Recombination
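The slide shows recombination only as a figure, so the operator below is an assumed example and not necessarily the one used in GA-Mantel. It keeps the genes both parents share and fills the remaining slots at random from genes found in only one parent, so the child has the same length as its parents.

```python
import numpy as np

def recombine(parent_a, parent_b, rng):
    """Build a child subset from two parent gene subsets of equal size."""
    size = len(parent_a)
    shared = set(parent_a) & set(parent_b)              # genes both parents agree on
    unshared = sorted(set(parent_a) ^ set(parent_b))    # genes found in only one parent
    n_fill = size - len(shared)
    fill = rng.choice(unshared, size=n_fill, replace=False).tolist() if n_fill else []
    return sorted(shared | set(fill))

rng = np.random.default_rng(0)
parent_a = [10, 123, 456, 798, 835, 888, 923]   # gene list from the GA-Mantel slide
parent_b = [10, 77, 456, 502, 640, 888, 990]    # hypothetical second parent
child = recombine(parent_a, parent_b, rng)
print(child)
```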
![Page 17: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/17.jpg)
Mutation
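Mutation is likewise shown only as a figure. One plausible operator for fixed-length gene subsets, given here purely as an assumption, swaps each selected gene for a randomly chosen unselected gene with a small probability (`rate` is a hypothetical parameter).

```python
import numpy as np

def mutate(subset, n_genes, rng, rate=0.1):
    """Swap each selected gene for a random unselected gene with probability `rate`."""
    current = set(subset)
    for gene in list(subset):
        if rng.random() < rate:
            # Candidates are all genes not currently in the subset.
            candidates = np.setdiff1d(np.arange(n_genes), sorted(current))
            current.remove(gene)
            current.add(int(rng.choice(candidates)))
    return sorted(current)

rng = np.random.default_rng(0)
subset = [10, 123, 456, 798, 835, 888, 923]     # gene list from the GA-Mantel slide
mutant = mutate(subset, n_genes=1000, rng=rng, rate=0.3)   # n_genes and rate are hypothetical
print(mutant)
```

Because every removal is paired with one addition, the subset length never changes.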
![Page 18: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/18.jpg)
Gene Subset Selection
• Given:
– a data set comprising N microarray experiments with g genes
• Find:
– a subset of genes that captures relevant relationships among the experiments
• Goal:
– reduce data for further analysis
– identify meaningful biological markers for diagnosis
![Page 19: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/19.jpg)
Genetic Algorithm
1. Randomly generate an initial population.
2. Do until the stopping criterion is met:
Select individuals to be parents (biased by fitness).
Produce offspring by recombination/mutation.
Select individuals to die (biased by fitness).
End Do.
3. Return a result.
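The three steps of the genetic algorithm can be sketched as a small steady-state loop over gene subsets. Everything specific here is an assumption for illustration: tournament selection of parents, sampling the child from the union of its parents, a single-gene swap as mutation, replacing the worst individual, and the toy fitness that counts genes from a hidden "signal" set.

```python
import numpy as np

def genetic_algorithm(fitness, n_genes, subset_size, pop_size=30, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Randomly generate an initial population of gene subsets.
    population = [sorted(rng.choice(n_genes, subset_size, replace=False).tolist())
                  for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]

    def tournament():
        # Parent selection biased by fitness: best of three random individuals.
        idx = rng.choice(pop_size, size=3, replace=False)
        return population[max(idx, key=lambda i: scores[i])]

    # 2. Repeat until the stopping criterion (a fixed number of steps) is met.
    for _ in range(steps):
        a, b = tournament(), tournament()
        # Recombination: sample the child from the union of both parents.
        pool = sorted(set(a) | set(b))
        child = set(rng.choice(pool, subset_size, replace=False).tolist())
        # Mutation: half the time, swap one gene for a gene outside the child.
        if rng.random() < 0.5:
            child.remove(int(rng.choice(sorted(child))))
            outside = np.setdiff1d(np.arange(n_genes), sorted(child))
            child.add(int(rng.choice(outside)))
        child = sorted(child)
        # Death selection biased by fitness: the child replaces the worst individual.
        worst = int(np.argmin(scores))
        population[worst], scores[worst] = child, fitness(child)

    # 3. Return the best subset found.
    best = int(np.argmax(scores))
    return population[best], scores[best]

hidden_signal = set(range(10))                       # hypothetical "signal" genes

def toy_fitness(subset):
    return len(hidden_signal & set(subset))          # number of signal genes selected

best_subset, best_score = genetic_algorithm(toy_fitness, n_genes=200, subset_size=10)
print(best_subset, best_score)
```

In GA-Mantel the toy fitness would be replaced by the Mantel correlation between the subset distance matrix and the full distance matrix.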
![Page 20: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/20.jpg)
Fitness Evaluation for Gene Selection
• Calculate $D_P$ using all genes
• For each Subset(k) in current population:
– Calculate $D_{k \ll P}$
– Correlate $D_P$ with $D_{k \ll P}$
• Use Mantel Correlation as fitness to select next population of solutions
• Permute to compute P-values
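The permutation step can be sketched as a standard Mantel test: relabel the samples of one distance matrix (rows and columns together), recompute the correlation, and count how often the permuted value reaches the observed one. The one-sided p-value, the number of permutations, the Euclidean distance, and the simulated data are all assumptions of this sketch.

```python
import numpy as np

def distance_matrix(x):
    """N x N Euclidean distances between the rows (samples) of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def mantel_test(d_a, d_b, n_perm=999, seed=0):
    """Return the Mantel correlation and a one-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    upper = np.triu_indices_from(d_a, k=1)
    observed = float(np.corrcoef(d_a[upper], d_b[upper])[0, 1])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(d_a.shape[0])
        permuted = d_b[np.ix_(perm, perm)]           # permute sample labels of d_b
        if np.corrcoef(d_a[upper], permuted[upper])[0, 1] >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(2)
x = rng.standard_normal((30, 200))                   # hypothetical 30 samples x 200 genes
x[15:, :20] += 2.0                                   # group shift on 20 signal genes

d_full = distance_matrix(x)                          # D_P
d_subset = distance_matrix(x[:, :20])                # D_{k << P} for one candidate subset
r, p_value = mantel_test(d_full, d_subset)
print(f"Mantel r = {r:.2f}, permutation p = {p_value:.3f}")
```

Permuting whole samples rather than individual distances preserves the dependence among entries that share a sample, which is why an ordinary correlation test is not used here.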
![Page 21: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/21.jpg)
[Figure: the N x P data matrix with a 0/1 weight (Wt) on each gene column. With all weights equal to 1, every gene is used and the distances form $D_P$. With some weights set to 0 (the slide shows the pattern 1101), only the K genes with weight 1 are used and the distances form $D_{k \ll P}$.]
![Page 22: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/22.jpg)
GA on Artificial Data
• Simulated data:
– 100 experiments with 10,000 genes
  • 100 signal genes
  • 9900 noise genes
– Two groups
  • Group 1 has signal genes sampled from N(0, 1)
  • Group 2 has signal genes sampled from N(1, 1)
• GA Parameters
– population size 200
– generations 200
• Outcome measures (averaged over 10 runs of the GA)
– prevalence (signal, noise) – number of signal and noise genes in GA answer
– correlation (signal, noise) – correlation of best subset distance matrix with the ‘full’ distance matrix
– coverage - number of signal genes identified over all GA runs
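The simulated data set can be generated along these lines. The slides do not state the group sizes, the noise-gene distribution, or where the signal genes sit in the matrix, so the 50/50 split, N(0, 1) noise in both groups, and signal genes in the first 100 columns are assumptions of this sketch, as is the helper `prevalence`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experiments, n_genes, n_signal = 100, 10_000, 100

# Assumption: two equally sized groups of 50 experiments each.
group = np.repeat([0, 1], n_experiments // 2)

# Assumption: every gene starts as N(0, 1); noise genes stay that way in both groups.
data = rng.standard_normal((n_experiments, n_genes))

# Group 2 signal genes are shifted to N(1, 1); group 1 signal genes remain N(0, 1).
# Assumption: the signal genes occupy the first n_signal columns.
data[group == 1, :n_signal] += 1.0

def prevalence(subset):
    """Count (signal, noise) genes in a GA answer given as a list of gene indices."""
    n_sig = sum(g < n_signal for g in subset)
    return n_sig, len(subset) - n_sig

print(data.shape, prevalence([0, 5, 150, 9999]))
```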
![Page 23: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/23.jpg)
GA on Artificial Data
Length = 30
Prevalence: mean number of signal genes = 22.9 (0.7)
76.3% (std 0.53%)
Correlation: mean rho for best subsets = 0.787 (0.009)
p-value < 0.0001
Coverage: total signal genes identified across 10 runs = 65/100
Observation: solutions tend to converge to similar subsets. The same 4 signal genes appear in 90% of runs
![Page 24: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/24.jpg)
GA on Golub Data Set
• Data set: Golub training set (38 x 7129)
• Two Groups:
– 27 samples from ALL patients
– 11 samples from AML patients
• GA searched for subsets of fixed length (10 to 50)
• population = 200, generations = 200
• Mantel correlation tends to increase with subset size

| Length | Final Mantel Corr | p-value |
|---|---|---|
| 10 | 0.926 (0.005) | < 0.00001 |
| 20 | 0.954 (0.004) | < 0.00001 |
| 30 | 0.967 (0.002) | < 0.00001 |
| 40 | 0.975 (0.002) | < 0.00001 |
| 50 | 0.979 (0.002) | < 0.00001 |
![Page 25: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/25.jpg)
Significant Feature Subsets
[Figure, left panel: Clustering of Samples using all genes]
[Figure, right panel: Clustering of Samples using 50 genes from GA]
![Page 26: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/26.jpg)
Letting GA Select Subset Length
• Data set: Golub training set (38 x 7129)
• GA searched over variable length subsets (min=5 max=50)
• Fitness penalty = d * length / 50
• population = 200, generations = 200
• Tradeoff between length of subsets and correlation score

| Length Penalty d | Pop Final Len | Best Final Length | Final Mantel Corr |
|---|---|---|---|
| 0.00 | 48.6 | 49.0 | 0.979 (0.001) |
| 0.25 | 17.4 | 25.0 | 0.954 (0.005) |
| 0.50 | 10.7 | 16.1 | 0.939 (0.009) |
| 1.00 | 7.7 | 10.6 | 0.922 (0.009) |
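The length penalty can be written as a one-line function. The slide gives the penalty as d * length / 50 but does not say how it enters the fitness; subtracting it from the Mantel correlation is an assumption of this sketch, and the function name is hypothetical.

```python
def penalized_fitness(mantel_corr, length, d, max_length=50):
    """Mantel correlation minus the length penalty d * length / max_length."""
    return mantel_corr - d * length / max_length

# Worked example with d = 0.25, a 25-gene subset, and correlation 0.954:
# penalty = 0.25 * 25 / 50 = 0.125, so the penalized fitness is 0.954 - 0.125 = 0.829.
score = penalized_fitness(0.954, length=25, d=0.25)
print(round(score, 3))
```

With d = 0 the penalty vanishes and the GA is free to grow subsets toward the maximum length, which matches the first row of the table.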
![Page 27: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/27.jpg)
Data Reduction
• Observation: GA appears to repeatedly converge to same regions of feature space
– In 50 runs, 954/1546 (61%) of "noise" genes appear more than once in feature sets
• GA can also be used to find feature subsets that minimize rho
• pop = 200
• length = 50
• data set = Golub
• GA finds subsets with rho = 0 within 50 generations
![Page 28: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/28.jpg)
GA in Experimental Data Analysis
• Graft Versus Host Disease (GVHD) in bone marrow transplantation (leukemia)
• T-cells in the transplanted bone marrow see the recipient as foreign and initiate an immune response that destroys host organs
• Regulatory T-cells (Treg) suppress immune response
• Choi and DiPersio are studying the genetic mechanisms of Treg regulation
![Page 29: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/29.jpg)
Mouse Array Experiment
| GROUP | TREATMENT | ARRAYS |
|---|---|---|
| 1 | Naïve Treg | dec1, dec5 |
| 2 | Activated Treg | dec2, dec6, dec10 |
| 3 | PBST (Control) | dec3, dec7, dec11 |
| 4 | Decitabine treated | dec4, dec8, dec12 |

~$1,000 per array in total costs: $12,000 worth of data, including dec9, which did not work
![Page 30: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/30.jpg)
Mouse Array Experiment
• Identify probes (genes) with similar mRNA levels between groups (gene by phenotype analysis)
![Page 31: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/31.jpg)
![Page 32: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/32.jpg)
![Page 33: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/33.jpg)
![Page 34: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/34.jpg)
Act+Dec Vs Naïve Vs Control
![Page 35: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/35.jpg)
![Page 36: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/36.jpg)
Naïve+Dec Vs Act Vs Control
![Page 37: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/37.jpg)
![Page 38: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/38.jpg)
Summary
• GA-Mantel effective at identifying signal genes
• Longer gene subsets associated with higher scores
– tradeoff: higher correlations vs. smaller subsets
– requires constraining growth of subsets in GA
• GA effective at identifying noise genes
• GA-Mantel can find genes associated with phenotype
![Page 39: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/39.jpg)
Future Directions
• RFA CA-08-005 (under review)
– Optimize algorithm to improve coverage of solution space
– Multiple solutions
– Combine solutions (weak hierarchies)
• Lung disease R01 (to be submitted)
– Microarrays to identify disease subgroups across the bronchitis/emphysema continuum
![Page 40: Computational Discrete Mathematics and Statistics for Molecular Array Data](https://reader030.vdocuments.us/reader030/viewer/2022012922/56814691550346895db3af21/html5/thumbnails/40.jpg)
Weak Hierarchies
Day, McMorris (2003) Axiomatic Consensus Theory in Group Choice and Biomathematics, SIAM Frontiers in Applied Mathematics, Philadelphia, PA.