swiss score
DESCRIPTION
SWISS Score. Nice Graphical Introduction:. SWISS Score. Toy Examples (2-d): Which are “More Clustered?”. SWISS Score. Toy Examples (2-d): Which are “More Clustered?”. SWISS Score. Avg. Pairwise SWISS – Toy Examples. Hiearchical Clustering. Aggregate or Split, to get Dendogram - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/1.jpg)
SWISS ScoreNice Graphical Introduction:
n
ii
Ciji
K
j
K
XX
XX
CCCISWISS j
1
2
2
1
1 ,,
![Page 2: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/2.jpg)
SWISS ScoreToy Examples (2-d): Which are “More Clustered?”
![Page 3: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/3.jpg)
SWISS ScoreToy Examples (2-d): Which are “More Clustered?”
![Page 4: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/4.jpg)
SWISS Score
Avg. Pairwise SWISS – Toy Examples
![Page 5: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/5.jpg)
Hiearchical Clustering
Aggregate or Split, to get Dendogram
Thanks to US EPA: water.epa.gov
![Page 6: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/6.jpg)
SigClust
• Statistical Significance of Clusters• in HDLSS Data• When is a cluster “really there”?
Liu et al (2007), Huang et al (2014)
![Page 7: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/7.jpg)
DiProPerm Hypothesis Test
Suggested Approach: Find a DIrection
(separating classes) PROject the data
(reduces to 1 dim) PERMute
(class labels, to assess significance,
with recomputed direction)
![Page 8: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/8.jpg)
DiProPerm Hypothesis Test
Finds
Significant
Difference
Despite Weak
Visual
Impression
Thanks to Josh Cates
![Page 9: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/9.jpg)
DiProPerm Hypothesis Test
Also Compare: Developmentally Delayed
No
Significant
Difference
But Strong
Visual
ImpressionThanks to Josh Cates
![Page 10: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/10.jpg)
DiProPerm Hypothesis Test
Two
Examples
Which Is
“More
Distinct”?
Visually Better Separation? Thanks to Katie Hoadley
![Page 11: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/11.jpg)
DiProPerm Hypothesis Test
Two
Examples
Which Is
“More
Distinct”?
Stronger Statistical Significance! Thanks to Katie Hoadley
![Page 12: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/12.jpg)
DiProPerm Hypothesis Test
Value of DiProPerm: Visual Impression is Easily Misleading
(onto HDLSS projections,
e.g. Maximal Data Piling) Really Need to Assess Significance DiProPerm used routinely
(even for variable selection)
![Page 13: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/13.jpg)
Interesting Statistical Problem
For HDLSS data:When clusters seem to appear
E.g. found by clustering method
How do we know they are really there?Question asked by Neil Hayes
Define appropriate statistical significance?
Can we calculate it?
![Page 14: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/14.jpg)
Simple Gaussian Example
Results:Random relabelling T-stat is not significantBut extreme T-stat is strongly significantThis comes from clustering operationConclude sub-populations are differentNow see that:
Not the same as clusters really thereNeed a new approach to study clusters
![Page 15: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/15.jpg)
Statistical Significance of Clusters
Basis of SigClust Approach:
What defines: A Single Cluster?A Gaussian distribution (Sarle & Kou 1993)
So define SigClust test based on:2-means cluster index (measure) as statisticGaussian null distributionCurrently compute by simulationPossible to do this analytically???
![Page 16: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/16.jpg)
SigClust Statistic – 2-Means Cluster Index
Measure of non-Gaussianity:2-means Cluster IndexFamiliar Criterion from k-means ClusteringWithin Class Sum of Squared Distances to
Class MeansPrefer to divide (normalize) by Overall Sum
of Squared Distances to MeanPuts on scale of proportions
![Page 17: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/17.jpg)
SigClust Gaussian null distribut’n
Which Gaussian (for null)?
Standard (sphered) normal?No, not realisticRejection not strong evidence for clusteringCould also get that from a-spherical Gaussian
Need Gaussian more like data:Need Full modelChallenge: Parameter EstimationRecall HDLSS Context
![Page 18: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/18.jpg)
SigClust Gaussian null distribut’n
Estimated Mean, (of Gaussian dist’n)?1st Key Idea: Can ignore thisBy appealing to shift invariance of CI
When Data are (rigidly) shiftedCI remains the same
So enough to simulate with mean 0Other uses of invariance ideas?
![Page 19: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/19.jpg)
SigClust Gaussian null distribut’n
2nd Key Idea: Mod Out RotationsReplace full Cov. by diagonal matrixAs done in PCA eigen-analysis
But then “not like data”???OK, since k-means clustering (i.e. CI) is
rotation invariant
(assuming e.g. Euclidean Distance)
tMDM
![Page 20: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/20.jpg)
SigClust Gaussian null distribut’n
2nd Key Idea: Mod Out Rotations
Only need to estimate diagonal matrix
But still have HDLSS problems?
E.g. Perou 500 data:
Dimension
Sample Size
Still need to estimate param’s9674d533n9674d
![Page 21: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/21.jpg)
SigClust Gaussian null distribut’n
3rd Key Idea: Factor Analysis Model
![Page 22: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/22.jpg)
SigClust Gaussian null distribut’n
3rd Key Idea: Factor Analysis Model
Model Covariance as: Biology + Noise
Where
is “fairly low dimensional”
is estimated from background noise2NB
INB 2
![Page 23: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/23.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise :2N
![Page 24: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/24.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise :
Reasonable model (for each gene):
Expression = Signal + Noise
2N
![Page 25: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/25.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise :
Reasonable model (for each gene):
Expression = Signal + Noise
“noise” is roughly Gaussian
“noise” terms essentially independent
(across genes)
2N
![Page 26: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/26.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise :
Model OK, since
data come from
light intensities
at colored spots
2N
![Page 27: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/27.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise :
For all expression values (as numbers)
(Each Entry of dxn Data matrix)
2N
![Page 28: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/28.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise :
For all expression values (as numbers)
Use robust estimate of scale
Median Absolute Deviation (MAD)
(from the median)
2N
![Page 29: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/29.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise :
For all expression values (as numbers)
Use robust estimate of scale
Median Absolute Deviation (MAD)
(from the median)
2N
![Page 30: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/30.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise :
For all expression values (as numbers)
Use robust estimate of scale
Median Absolute Deviation (MAD)
(from the median)
Rescale to put on same scale as s. d.:
2N
)1,0(
ˆN
data
MADMAD
![Page 31: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/31.jpg)
SigClust Estimation of Background Noise
n = 533, d = 9456
![Page 32: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/32.jpg)
SigClust Estimation of Background Noise
Hope: MostEntries are“Pure Noise, (Gaussian)”
![Page 33: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/33.jpg)
SigClust Estimation of Background Noise
Hope: MostEntries are“Pure Noise, (Gaussian)”
A Few (<< ¼)Are BiologicalSignal –Outliers
![Page 34: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/34.jpg)
SigClust Estimation of Background Noise
Hope: MostEntries are“Pure Noise, (Gaussian)”
A Few (<< ¼)Are BiologicalSignal –Outliers
How to Check?
![Page 35: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/35.jpg)
Q-Q plots
An aside:
Fitting probability distributions to data
![Page 36: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/36.jpg)
Q-Q plots
An aside:
Fitting probability distributions to data
• Does Gaussian distribution “fit”???
• If not, why not?
![Page 37: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/37.jpg)
Q-Q plots
An aside:
Fitting probability distributions to data
• Does Gaussian distribution “fit”???
• If not, why not?
• Fit in some part of the distribution?
(e.g. in the middle only?)
![Page 38: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/38.jpg)
Q-Q plotsApproaches to:
Fitting probability distributions to data
• Histograms
• Kernel Density Estimates
![Page 39: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/39.jpg)
Q-Q plotsApproaches to:
Fitting probability distributions to data
• Histograms
• Kernel Density Estimates
Drawbacks: often not best view
(for determining goodness of fit)
![Page 40: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/40.jpg)
Q-Q plots
Consider Testbed of 4 Toy Examples:
non-Gaussian!
non-Gaussian(?)
Gaussian
Gaussian?
(Will use these names several times)
![Page 41: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/41.jpg)
Q-Q plotsSimple Toy Example, non-Gaussian!
![Page 42: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/42.jpg)
Q-Q plotsSimple Toy Example, non-Gaussian(?)
![Page 43: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/43.jpg)
Q-Q plotsSimple Toy Example, Gaussian
![Page 44: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/44.jpg)
Q-Q plotsSimple Toy Example, Gaussian?
![Page 45: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/45.jpg)
Q-Q plotsNotes:
• Bimodal see non-Gaussian with histo
• Other cases: hard to see
• Conclude:
Histogram poor at assessing Gauss’ity
![Page 46: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/46.jpg)
Q-Q plotsStandard approach to checking Gaussianity
• QQ – plots
Background:
Graphical Goodness of Fit
Fisher (1983)
![Page 47: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/47.jpg)
Q-Q plots
Background: Graphical Goodness of Fit
Basis:
Cumulative Distribution Function (CDF)
xXPxF
![Page 48: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/48.jpg)
Q-Q plots
Background: Graphical Goodness of Fit
Basis:
Cumulative Distribution Function (CDF)
Probability quantile notation:
for "probability” and "quantile"
xXPxF
p q
qFp pFq 1
![Page 49: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/49.jpg)
Q-Q plots
Probability quantile notation:
for "probability” and "quantile“
Thus is called the quantile function 1F
p q
qFp pFq 1
![Page 50: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/50.jpg)
Q-Q plots
Two types of CDF:
1. Theoretical qXPqFp
![Page 51: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/51.jpg)
Q-Q plots
Two types of CDF:
1. Theoretical
2. Empirical, based on data nXX ,,1
qXPqFp
n
qXiqFp i
:#ˆˆ
![Page 52: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/52.jpg)
Q-Q plots
Direct Visualizations:
1. Empirical CDF plot:
plot
vs. grid of (sorted data) valuesq̂
nin
ip ,,1,ˆ
![Page 53: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/53.jpg)
Q-Q plots
Direct Visualizations:
1. Empirical CDF plot:
plot
vs. grid of (sorted data) values
2. Quantile plot (inverse):
plot
vs.
q̂
nin
ip ,,1,ˆ
q̂
p̂
![Page 54: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/54.jpg)
Q-Q plots
Comparison Visualizations:
(compare a theoretical with an empirical)
3.P-P plot:
plot vs.
for a grid of valuesq
p̂ p
![Page 55: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/55.jpg)
Q-Q plots
Comparison Visualizations:
(compare a theoretical with an empirical)
3.P-P plot:
plot vs.
for a grid of values
4.Q-Q plot:
plot vs.
for a grid of values
q
q̂
p̂ p
p
q
![Page 56: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/56.jpg)
Q-Q plotsIllustrative graphic (toy data set):
![Page 57: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/57.jpg)
Q-Q plotsIllustrative graphic (toy data set):
From Distribution
![Page 58: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/58.jpg)
Q-Q plotsIllustrative graphic (toy data set):
Generated data points
![Page 59: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/59.jpg)
Q-Q plotsIllustrative graphic (toy data set):
So Consider
![Page 60: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/60.jpg)
Q-Q plotsEmpirical Quantiles (sorted data points)
1q̂2q̂
3q̂
4q̂5q̂
![Page 61: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/61.jpg)
Q-Q plotsCorresponding ( matched) Theoretical Quantiles
1q̂2q̂
3q̂
4q̂5q̂ 1q
2q
3q
4q
5q
p
![Page 62: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/62.jpg)
Q-Q plotsIllustrative graphic (toy data set):
Main goal of Q-Q Plot:
Display how well quantiles compare
vs.iq̂ iq ni ,,1
![Page 63: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/63.jpg)
Q-Q plotsIllustrative graphic (toy data set):
1q̂2q̂
3q̂
4q̂5q̂ 1q
2q
3q
4q
5q
![Page 64: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/64.jpg)
Q-Q plotsIllustrative graphic (toy data set):
1q̂2q̂
3q̂
4q̂5q̂ 1q
2q
3q
4q
5q
![Page 65: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/65.jpg)
Q-Q plotsIllustrative graphic (toy data set):
1q̂2q̂
3q̂
4q̂5q̂ 1q
2q
3q
4q
5q
![Page 66: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/66.jpg)
Q-Q plotsIllustrative graphic (toy data set):
1q̂2q̂
3q̂
4q̂5q̂ 1q
2q
3q
4q
5q
![Page 67: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/67.jpg)
Q-Q plotsIllustrative graphic (toy data set):
![Page 68: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/68.jpg)
Q-Q plotsIllustrative graphic (toy data set):
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 450 line
(general use of Q-Q plots)
![Page 69: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/69.jpg)
Alternate TerminologyQ-Q Plots = ROC Curves
Recall “Receiver Operator Characteristic”
But Different Goals:
Q-Q Plots: Look for “Equality”
ROC curves: Look for “Differences”
![Page 70: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/70.jpg)
Alternate TerminologyQ-Q Plots = ROC Curves
P-P Plots = “Precision-Recall” Curves
Highlights Different Distributional Aspects
Statistical Folklore: Q-Q Highlights Tails,
So Usually More Useful
![Page 71: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/71.jpg)
Q-Q plotsnon-Gaussian! departures from line?
![Page 72: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/72.jpg)
Q-Q plotsnon-Gaussian! departures from line?
• Seems different from line?
• 2 modes turn into wiggles?
• Less strong feature
• Been proposed to study modality
![Page 73: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/73.jpg)
Q-Q plotsnon-Gaussian (?) departures from line?
![Page 74: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/74.jpg)
Q-Q plotsnon-Gaussian (?) departures from line?
• Seems different from line?
• Harder to say this time?
• What is signal & what is noise?
• Need to understand sampling variation
![Page 75: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/75.jpg)
Q-Q plotsGaussian? departures from line?
![Page 76: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/76.jpg)
Q-Q plotsGaussian? departures from line?
• Looks much like?
• Wiggles all random variation?
• But there are n = 10,000 data points…
• How to assess signal & noise?
• Need to understand sampling variation
![Page 77: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/77.jpg)
Q-Q plotsNeed to understand sampling variation
• Approach: Q-Q envelope plot
![Page 78: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/78.jpg)
Q-Q plotsNeed to understand sampling variation
• Approach: Q-Q envelope plot– Simulate from Theoretical Dist’n
– Samples of same size
![Page 79: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/79.jpg)
Q-Q plotsNeed to understand sampling variation
• Approach: Q-Q envelope plot– Simulate from Theoretical Dist’n
– Samples of same size
– About 100 samples gives
“good visual impression”
![Page 80: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/80.jpg)
Q-Q plotsNeed to understand sampling variation
• Approach: Q-Q envelope plot– Simulate from Theoretical Dist’n
– Samples of same size
– About 100 samples gives
“good visual impression”
– Overlay resulting 100 QQ-curves
– To visually convey natural sampling variation
![Page 81: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/81.jpg)
Q-Q plotsnon-Gaussian! departures from line?
![Page 82: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/82.jpg)
Q-Q plotsnon-Gaussian! departures from line?
• Envelope Plot shows:
• Departures are significant
• Clear these data are not Gaussian
• Q-Q plot gives clear indication
![Page 83: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/83.jpg)
Q-Q plotsnon-Gaussian (?) departures from line?
![Page 84: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/84.jpg)
Q-Q plotsnon-Gaussian (?) departures from line?
• Envelope Plot shows:
• Departures are significant
• Clear these data are not Gaussian
• Recall not so clear from e.g. histogram
• Q-Q plot gives clear indication
• Envelope plot reflects sampling variation
![Page 85: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/85.jpg)
Q-Q plotsGaussian? departures from line?
![Page 86: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/86.jpg)
Q-Q plotsGaussian? departures from line?
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10,000 data points…
(why bigger sample size was used)
• Envelope plot reflects sampling variation
![Page 87: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/87.jpg)
Q-Q plotsWhat were these distributions?
• Non-Gaussian!– 0.5 N(-1.5,0.752) + 0.5 N(1.5,0.752)
• Non-Gaussian (?)– 0.4 N(0,1) + 0.3 N(0,0.52) + 0.3 N(0,0.252)
• Gaussian
• Gaussian?– 0.7 N(0,1) + 0.3 N(0,0.52)
![Page 88: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/88.jpg)
Q-Q plotsNon-Gaussian! .5 N(-1.5,0.752) + 0.5 N(1.5,0.752)
![Page 89: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/89.jpg)
Q-Q plotsNon-Gaussian (?) 0.4 N(0,1) + 0.3 N(0,0.52) + 0.3 N(0,0.252)
![Page 90: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/90.jpg)
Q-Q plotsGaussian
![Page 91: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/91.jpg)
Q-Q plotsGaussian? 0.7 N(0,1) + 0.3 N(0,0.52)
![Page 92: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/92.jpg)
Q-Q plotsVariations on Q-Q Plots:
For theoretical distribution: 2,N
qqXPp
![Page 93: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/93.jpg)
Q-Q plotsVariations on Q-Q Plots:
For theoretical distribution:
Solving for gives
Where is the Standard Normal Quantile
2,N
qqXPp
q
z
zpq 1
![Page 94: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/94.jpg)
Q-Q plots
Variations on Q-Q Plots:
Solving for gives
So Q-Q plot against Standard Normal
is linear
With slope and intercept
q
zpq 1
![Page 95: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/95.jpg)
Q-Q plots
Variations on Q-Q Plots:
• Can replace Gaussian with other dist’ns
• Can compare 2 theoretical distn’s
• Can compare 2 empirical distn’s
(i.e. 2 sample version of Q-Q Plot)
( = ROC curve)
![Page 96: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/96.jpg)
SigClust Estimation of Background Noise
n = 533, d = 9456
![Page 97: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/97.jpg)
SigClust Estimation of Background Noise
• Overall distribution has strong kurtosis• Shown by height of kde relative to
MAD based Gaussian fit• Mean and Median both ~ 0• SD ~ 1, driven by few large values• MAD ~ 0.7, driven by bulk of data
![Page 98: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/98.jpg)
SigClust Estimation of Background Noise
• Central part of distribution
“seems to look Gaussian”• But recall density does not provide
useful diagnosis of Gaussianity• Better to look at Q-Q plot
![Page 99: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/99.jpg)
SigClust Estimation of Background Noise
![Page 100: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/100.jpg)
SigClust Estimation of Background Noise
• Distribution clearly not Gaussian• Except near the middle• Q-Q curve is very linear there
(closely follows 45o line)• Suggests Gaussian approx. is good there• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
![Page 101: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/101.jpg)
SigClust Estimation of Background Noise
Now Check Effect of Using SD, not MAD
![Page 102: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/102.jpg)
SigClust Estimation of Background Noise
• Checks that estimation of matters• Show sample s.d. is indeed too large• As expected• Variation assessed by Q-Q envelope plot• Shows variation not negligible• Not surprising with n ~ 5 million
![Page 103: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/103.jpg)
SigClust Gaussian null distribut’n
Estimation of Biological Covariance :
Keep only “large” eigenvalues
Defined as
So for null distribution, use eigenvalues:
B
d ,,, 21 2Nj
),max(,),,max( 221 NdN
![Page 104: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/104.jpg)
SigClust Estimation of Eigenval’s
![Page 105: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/105.jpg)
SigClust Estimation of Eigenval’s
All eigenvalues > !Suggests biology is very strong here!I.e. very strong signal to noise ratioHave more structure than can analyze
(with only 533 data points)Data are very far from pure noiseSo don’t actually use Factor Anal. ModelInstead end up with estim’d eigenvalues
2N
![Page 106: SWISS Score](https://reader035.vdocuments.us/reader035/viewer/2022081513/56812b0b550346895d8ef194/html5/thumbnails/106.jpg)
SigClust Estimation of Eigenval’s
Do we need the factor model? Explore this with another data set
(with fewer genes) This time:
n = 315 cases d = 306 genes