
Upload: stanley-day

Post on 17-Dec-2015


Interesting Statistical Problem

For HDLSS data:
When clusters seem to appear
(e.g. found by clustering method)

How do we know they are really there?
Question asked by Neil Hayes

Define appropriate statistical significance?

Can we calculate it?

Statistical Significance of Clusters

Basis of SigClust Approach

What defines a single cluster?
A Gaussian distribution (Sarle & Kou, 1993)

So define SigClust test based on:
• 2-means cluster index (measure) as statistic
• Gaussian null distribution
• Currently compute by simulation
• Possible to do this analytically?
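The test statistic can be sketched in a few lines. This is a minimal illustration, not the SigClust implementation: it assumes the cluster index (CI) is the within-class sum of squares divided by the total sum of squares about the overall mean, and it uses plain Lloyd iterations with random restarts for the 2-means step. The function names `cluster_index` and `two_means` are hypothetical.

```python
import numpy as np

def cluster_index(X, labels):
    """Within-class sum of squares divided by total sum of squares."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    within = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        within += np.sum((Xk - Xk.mean(axis=0)) ** 2)
    return within / total

def two_means(X, n_init=10, n_iter=100, seed=0):
    """Lloyd iterations for k = 2; returns the labels with the smallest CI found."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    best_labels, best_ci = None, np.inf
    for _ in range(n_init):
        # Start from two distinct data points chosen at random.
        centers = X[rng.choice(len(X), size=2, replace=False)]
        labels = None
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)
            if new_labels.min() == new_labels.max():
                break  # one class emptied out; keep the previous labels
            if labels is not None and np.array_equal(new_labels, labels):
                break  # converged
            labels = new_labels
            centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
        if labels is None:
            continue
        ci = cluster_index(X, labels)
        if ci < best_ci:
            best_labels, best_ci = labels, ci
    return best_labels, best_ci
```

A CI near 0 means the two classes are tight relative to the overall spread, which is why small values count as evidence of clustering.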

SigClust Gaussian null distribut'n

Which Gaussian (for null)?

Standard (sphered) normal?
• No, not realistic
• Rejection not strong evidence for clustering
• Could also get that from a-spherical Gaussian

Need Gaussian more like data:
• Need full model
• Challenge: parameter estimation
• Recall HDLSS context

SigClust Gaussian null distribut'n

Estimated Mean (of Gaussian dist'n)?
1st Key Idea: Can ignore this
By appealing to shift invariance of CI

When data are (rigidly) shifted,
CI remains the same

So enough to simulate with mean 0
Other uses of invariance ideas?

SigClust Gaussian null distribut'n

2nd Key Idea: Mod Out Rotations
Replace full Cov. by diagonal matrix
As done in PCA eigen-analysis: $\Sigma = M D M^t$

But then "not like data"?
OK, since k-means clustering (i.e. CI) is rotation invariant

(assuming e.g. Euclidean distance)
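Both invariance ideas are easy to check numerically. The sketch below assumes the CI is the within-class over total sum of squares (an assumption, as is the helper name `cluster_index`) and compares the CI of one fixed labelling before and after a rigid shift and a random rotation. Since every labelling keeps its CI, the 2-means minimum is unchanged as well.

```python
import numpy as np

def cluster_index(X, labels):
    """Within-class sum of squares divided by total sum of squares."""
    X = np.asarray(X, dtype=float)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    within = sum(
        np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
        for k in np.unique(labels)
    )
    return within / total

rng = np.random.default_rng(0)
n, d = 20, 50                      # HDLSS toy setting: d > n
X = rng.normal(size=(n, d))
labels = np.arange(n) % 2          # any fixed two-class labelling

shift = 5.0 * rng.normal(size=d)   # rigid shift applied to every point
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix

ci_orig = cluster_index(X, labels)
ci_shift = cluster_index(X + shift, labels)
ci_rot = cluster_index(X @ Q, labels)
```

The three values agree up to floating-point error, so simulating with mean 0 and a diagonal covariance loses nothing as far as the CI is concerned.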

SigClust Gaussian null distribut'n

2nd Key Idea: Mod Out Rotations

Only need to estimate diagonal matrix

But still have HDLSS problems?

E.g. Perou 500 data:

Dimension $d = 9674$

Sample size $n = 533$

Still need to estimate $d = 9674$ param's

SigClust Gaussian null distribut'n

3rd Key Idea: Factor Analysis Model

Model covariance as: Biology + Noise

$\Sigma = \Sigma_B + \sigma_N^2 I$

Where
• $\Sigma_B$ is "fairly low dimensional"
• $\sigma_N^2$ is estimated from background noise

SigClust Gaussian null distribut'n

Estimation of Background Noise $\sigma_N^2$:

Reasonable model (for each gene):

Expression = Signal + Noise

• "noise" is roughly Gaussian
• "noise" terms essentially independent (across genes)

SigClust Estimation of Background Noise

Hope: most entries are "pure noise" (Gaussian)

A few (<< ¼) are biological signal – outliers

How to check?

Q-Q plots

Background: Graphical Goodness of Fit

Basis:

Cumulative Distribution Function (CDF): $F(x) = P\{X \le x\}$

Probability-quantile notation: $p = F(q)$, $q = F^{-1}(p)$

for "probability" $p$ and "quantile" $q$

Q-Q plots

Two types of CDF:

1. Theoretical: $p = F(q) = P\{X \le q\}$

2. Empirical, based on data $X_1, \ldots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$

Q-Q plots

Comparison Visualizations:

(compare a theoretical with an empirical)

3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values

4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
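The two comparison plots can be computed as follows. This is a sketch against a standard normal reference; the helper names `qq_points` and `pp_points` are hypothetical, and the probability grid $(i - 0.5)/n$ is one common plotting-position convention chosen here as an assumption.

```python
import numpy as np
from statistics import NormalDist

def qq_points(x, dist=NormalDist()):
    """Theoretical vs. empirical quantiles on a grid of p values."""
    q_emp = np.sort(np.asarray(x, dtype=float))   # sorted data = empirical quantiles
    n = len(q_emp)
    p = (np.arange(1, n + 1) - 0.5) / n           # assumed plotting positions
    q_theory = np.array([dist.inv_cdf(pi) for pi in p])
    return q_theory, q_emp

def pp_points(x, dist=NormalDist()):
    """Theoretical vs. empirical probabilities on a grid of q values (the data)."""
    q = np.sort(np.asarray(x, dtype=float))
    n = len(q)
    p_theory = np.array([dist.cdf(qi) for qi in q])  # p = F(q)
    p_emp = np.arange(1, n + 1) / n                  # p-hat = #{X_i <= q} / n
    return p_theory, p_emp
```

Plotting `q_emp` against `q_theory` gives the Q-Q curve; points near the 45° line indicate agreement between the data and the reference distribution.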

Q-Q plots
Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs

when

Q-Q curve is near 45° line

(general use of Q-Q plots)

Alternate Terminology
Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails

So Usually More Useful

Q-Q plots
Gaussian departures from line

• Looks much like
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots
Need to understand sampling variation

• Approach: Q-Q envelope plot
  – Simulate from Theoretical Dist'n
  – Samples of same size
  – About 100 samples gives "good visual impression"
  – Overlay resulting 100 QQ-curves
  – To visually convey natural sampling variation
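The envelope recipe above can be sketched as follows, assuming a standard normal theoretical distribution. The names `qq_envelope` and `fraction_outside` are hypothetical, and summarizing the overlay by the share of order statistics outside the pointwise min/max band is an illustrative choice, not part of the original method.

```python
import numpy as np
from statistics import NormalDist

def qq_envelope(n, n_sim=100, seed=0):
    """Simulate n_sim Gaussian samples of size n; each sorted row is one Q-Q curve."""
    rng = np.random.default_rng(seed)
    p = (np.arange(1, n + 1) - 0.5) / n
    q_theory = np.array([NormalDist().inv_cdf(pi) for pi in p])
    sims = np.sort(rng.standard_normal((n_sim, n)), axis=1)
    return q_theory, sims

def fraction_outside(x, sims):
    """Share of the data's order statistics outside the simulated pointwise band."""
    q_emp = np.sort(np.asarray(x, dtype=float))
    lo, hi = sims.min(axis=0), sims.max(axis=0)
    return float(np.mean((q_emp < lo) | (q_emp > hi)))
```

To draw the plot, overlay each row of `sims` against `q_theory` in a light color and add the data's Q-Q curve on top; departures that leave the cloud of simulated curves are more than sampling variation.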

Q-Q plots
Gaussian departures from line

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
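A MAD-based scale estimate of the kind mentioned above might look like this. It is a sketch under two assumptions: that all entries of the data matrix are pooled, and that the MAD is divided by $\Phi^{-1}(0.75) \approx 0.6745$ so that it estimates the standard deviation under Gaussian noise. The function name `mad_noise_sd` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def mad_noise_sd(X):
    """Robust noise sd: median absolute deviation, rescaled for Gaussian data."""
    x = np.asarray(X, dtype=float).ravel()        # pool all matrix entries
    mad = np.median(np.abs(x - np.median(x)))
    return mad / NormalDist().inv_cdf(0.75)       # Gaussian consistency factor
```

Because the median ignores a small fraction of large entries, the estimate tracks the "pure noise" bulk of the distribution even when some entries carry strong signal, which the ordinary standard deviation does not.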

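SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution, use eigenvalues:

$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$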
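The thresholding rule can be sketched as below. The sketch assumes rows are cases and columns are genes, takes the $\lambda_j$ to be eigenvalues of the sample covariance matrix (computed through singular values so that $d > n$ is cheap), and raises everything below the noise variance to $\sigma_N^2$. The function name `null_eigenvalues` is hypothetical.

```python
import numpy as np

def null_eigenvalues(X, sigma2_noise):
    """Sample covariance eigenvalues, with small ones raised to the noise variance."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    Xc = X - X.mean(axis=0)                       # center each column (gene)
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values, descending
    lam = np.zeros(d)
    lam[: len(s)] = s ** 2 / (n - 1)              # at most min(n, d) are nonzero
    return np.maximum(lam, sigma2_noise)          # max(lambda_j, sigma_N^2)
```

In the HDLSS setting at most $n - 1$ sample eigenvalues are nonzero, so the remaining directions all receive the background noise variance.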

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$
• Suggests biology is very strong here
• I.e. very strong signal to noise ratio
• Have more structure than can analyze (with only 533 data points)
• Data are very far from pure noise
• So don't actually use Factor Anal. Model
• Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes

SigClust Estimation of Eigenval's

Try another data set, with fewer genes
This time:

• First ~110 eigenvalues > $\sigma_N^2$
• Rest are negligible
• So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$X_i = (X_{i,1}, \ldots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work

(and location invariance)
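A simulation of the null CIs might be sketched as follows. The assumptions are the same as elsewhere in these examples: the CI is the within-class over total sum of squares, and the 2-means step is a basic Lloyd iteration with a few random restarts. The names `two_means_ci` and `simulate_null_cis` are hypothetical, and the input `eigenvalues` stands for the (thresholded) $\lambda_j$.

```python
import numpy as np

def two_means_ci(X, n_init=5, n_iter=50, rng=None):
    """Smallest 2-means cluster index found over a few random starts."""
    rng = np.random.default_rng() if rng is None else rng
    Xc = X - X.mean(axis=0)
    total = np.sum(Xc ** 2)
    best = 1.0
    for _ in range(n_init):
        centers = Xc[rng.choice(len(Xc), size=2, replace=False)]
        for _ in range(n_iter):
            d2 = ((Xc[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            if labels.min() == labels.max():
                break                              # degenerate split
            new_centers = np.array([Xc[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break                              # converged
            centers = new_centers
        if labels.min() == labels.max():
            continue
        within = sum(
            np.sum((Xc[labels == k] - Xc[labels == k].mean(axis=0)) ** 2)
            for k in (0, 1)
        )
        best = min(best, within / total)
    return best

def simulate_null_cis(eigenvalues, n, n_sim=100, seed=0):
    """Draw n_sim data sets with X_ij ~ N(0, lambda_j) and record each 2-means CI."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.asarray(eigenvalues, dtype=float))
    cis = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.standard_normal((n, len(sd))) * sd   # mean 0, diagonal covariance
        cis[b] = two_means_ci(X, rng=rng)
    return cis
```

Each simulated data set has the same sample size as the real data, so the null CIs are directly comparable with the observed CI.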

SigClust Gaussian null distribution - Simulation

Then compare data CI

with simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI

An example (details to follow)

P-val = 0.0045


SigClust Modalities

Two major applications:

I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)

II. Test if known cluster can be further split
(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters

Perou 500 DWD Dir'ns View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for far different survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:
• Empirical Quantile, from simulated sample CIs
• Fit Gaussian Quantile
  – Don't "believe these"
  – But useful for comparison
  – Especially when Empirical Quantile = 0
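The two p-values can be sketched as below. The empirical version is the share of simulated null CIs at or below the data CI (small CI is the significant direction); the Gaussian-fit version evaluates a normal CDF fitted to the null CIs by their mean and standard deviation. The function name `sigclust_pvalues` is hypothetical, and fitting by sample mean and sd is an assumption about how the Gaussian fit is done.

```python
import numpy as np
from statistics import NormalDist

def sigclust_pvalues(ci_data, null_cis):
    """Empirical and Gaussian-fit p-values for an observed cluster index."""
    null_cis = np.asarray(null_cis, dtype=float)
    p_empirical = float(np.mean(null_cis <= ci_data))
    fit = NormalDist(mu=float(null_cis.mean()), sigma=float(null_cis.std(ddof=1)))
    p_gaussian = fit.cdf(ci_data)
    return p_empirical, p_gaussian
```

When the data CI lies below every simulated value the empirical p-value is exactly 0, and only the Gaussian fit still separates a borderline result from an overwhelming one.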

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
• Luminal A vs. B: p-val = 0.0045
• Her 2 vs. Basal: p-val = $10^{-10}$
• Split Luminal A: p-val = $10^{-7}$
• Split Luminal B: p-val = 0.058
• Split Her 2: p-val = 0.10
• Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

• Smaller data sets: less power
• Gene filtering: more power
• Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

• Experienced Analysts Impressively Good
• SigClust can save them time
• SigClust can help them with skeptics
• SigClust essential for non-experts


SigClust Overview

• Works Well When Factor Part Not Used
• Sample Eigenvalues Always Valid
• But Can be Too Conservative
• Above Factor Threshold Anti-Conservative
• Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

• Improved Eigenvalue Estimation
• More attention to Local Minima in 2-means Clustering
• Theoretical Null Distributions
• Inference for k > 2 means Clustering
• Multiple Comparison Issues


HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
• Laws of Large Numbers (Consistency)
• Central Limit Theorems (Quantify Errors)


HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Occasional misconceptions:
  – Indicates behavior for large samples
  – Thus only makes sense for "large" samples
  – Models phenomenon of "increasing data"
  – So other flavors are useless


HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• Real Reasons:
  – Approximation provides insights
  – Can find simple underlying structure
  – In complex situations
• Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\,\cdot\, \to 0}$
• Even desirable (find additional insights)


HDLSS Asymptotics

Personal Observations:

HDLSS world is…

• Surprising (many times)
  [Think I've got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are data?

Near peak of density?

(Figure: Gaussian density; thanks to psycnet.apa.org)

Euclidean distance to origin (as $d \to \infty$)

(measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
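A quick simulation illustrates the distance result. This is a sketch only; the function name `norm_summary` and the choice of 200 replicates are assumptions made for the example.

```python
import numpy as np

def norm_summary(d, n_rep=200, seed=0):
    """Mean of ||Z|| / sqrt(d) and sd of ||Z|| over n_rep draws of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(rng.standard_normal((n_rep, d)), axis=1)
    return float(norms.mean() / np.sqrt(d)), float(norms.std())

# Compare low and high dimensions.
for d in (2, 20, 200, 20000):
    ratio, spread = norm_summary(d)
    print(f"d = {d:6d}   mean ||Z|| / sqrt(d) = {ratio:.3f}   sd of ||Z|| = {spread:.3f}")
```

As $d$ grows the ratio settles at 1 while the spread of $\|Z\|$ stays bounded, which is the $\sqrt{d} + O_p(1)$ statement in numerical form.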


HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density?
- Paradox resolved by: density w.r.t. Lebesgue measure


HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue measure
  • Consider volume of unit sphere in $\mathbb{R}^d$
  • Find as integral in sph'l coordinates
  • Look at integrand w.r.t. radius $r$
  • Can show: puts ~all weight near $r = 1$

- Lebesgue measure pushes mass out
- Density pulls data in
- $\sqrt{d}$ is the balance point
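The balance between volume and density can be written out in two lines. This is a sketch using standard facts about the $d$-dimensional ball and the chi distribution; the symbols $V_d$, $r$, $\varepsilon$ and $f_R$ are notation introduced here for illustration, with $R = \|Z\|$.

```latex
\begin{aligned}
V_d(r) &= V_d(1)\, r^{d}
  &&\Longrightarrow\quad
  \frac{V_d(1-\varepsilon)}{V_d(1)} = (1-\varepsilon)^{d} \to 0
  \quad (d \to \infty), \\
f_R(r) &\propto r^{d-1} e^{-r^{2}/2}
  &&\Longrightarrow\quad
  \frac{d}{dr}\log f_R(r) = \frac{d-1}{r} - r = 0
  \iff r = \sqrt{d-1} \approx \sqrt{d}.
\end{aligned}
```

The first line says that almost all of the unit ball's volume sits in a thin shell near its surface. The second line combines the volume factor $r^{d-1}$ (pushing mass out) with the Gaussian factor $e^{-r^2/2}$ (pulling data in), and the radial density peaks near $\sqrt{d}$.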


HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' lament:

Why can't I have average children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

Ever wonder why?

o Perceptual system from ancestors
o They needed to find food
o Food exists in 3-d world

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al angles (as $d \to \infty$), as vectors from origin:

(Figure: angle between vectors $Z_1$ and $Z_2$; thanks to members.tripod.com)

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal?
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
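The distance and angle results can both be checked by simulation. This is a sketch; the function name `pair_summary` and the replicate count are assumptions made for the example.

```python
import numpy as np

def pair_summary(d, n_rep=100, seed=0):
    """For independent Z1, Z2 ~ N_d(0, I_d): scaled distance and angle statistics."""
    rng = np.random.default_rng(seed)
    Z1 = rng.standard_normal((n_rep, d))
    Z2 = rng.standard_normal((n_rep, d))
    dist = np.linalg.norm(Z1 - Z2, axis=1)
    cos = np.sum(Z1 * Z2, axis=1) / (
        np.linalg.norm(Z1, axis=1) * np.linalg.norm(Z2, axis=1)
    )
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    # Returns: mean distance / sqrt(2d), mean angle (degrees), sd of angle (degrees).
    return float(dist.mean() / np.sqrt(2 * d)), float(angle.mean()), float(angle.std())

for d in (2, 20, 200, 20000):
    ratio, mean_angle, sd_angle = pair_summary(d)
    print(f"d = {d:6d}   dist / sqrt(2d) = {ratio:.3f}   "
          f"angle = {mean_angle:.1f} deg (sd {sd_angle:.2f})")
```

In low dimensions the angles are spread widely, while in high dimensions they concentrate tightly around 90 degrees and the scaled distance settles at 1.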

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study subspace generated by data:

• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist. $\sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (modulo rotation)
• "Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study hyperplane generated by data:

• $n - 1$ dimensional hyperplane
• Points are pairwise equidistant, dist. $\sim \sqrt{2d}$
• Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
• Again "randomness in data" is only in rotation
• Surprisingly rigid structure in random data


HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
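The rigidity in this simulation view can also be seen without any plotting, since a triangle is determined up to rotation by its three side lengths. The sketch below draws 10 three-point Gaussian data sets per dimension and reports the sides scaled by $\sqrt{2d}$; the function name `triangle_sides` is hypothetical, and using side lengths in place of the rotated scatterplots is a simplification made here.

```python
import numpy as np

def triangle_sides(d, n_rep=10, seed=0):
    """Side lengths of n_rep random 3-point data sets in R^d, scaled by sqrt(2d)."""
    rng = np.random.default_rng(seed)
    sides = np.empty((n_rep, 3))
    for r in range(n_rep):
        Z = rng.standard_normal((3, d))           # three points from N_d(0, I_d)
        sides[r] = [
            np.linalg.norm(Z[i] - Z[j]) for i, j in ((0, 1), (0, 2), (1, 2))
        ]
    return sides / np.sqrt(2 * d)

for d in (2, 20, 200, 20000):
    s = triangle_sides(d)
    print(f"d = {d:6d}   scaled sides range from {s.min():.3f} to {s.max():.3f}")
```

For d = 2 the triangles vary widely in shape, while for d = 20000 every scaled side is close to 1, so all ten triangles are nearly the same equilateral triangle.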

Page 2: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

Statistical Significance of Clusters

Basis of SigClust Approach

What defines A Single ClusterA Gaussian distribution (Sarle amp Kou 1993)

So define SigClust test based on2-means cluster index (measure) as statisticGaussian null distributionCurrently compute by simulationPossible to do this analytically

SigClust Gaussian null distributrsquon

Which Gaussian (for null)

Standard (sphered) normalNo not realisticRejection not strong evidence for clusteringCould also get that from a-spherical Gaussian

Need Gaussian more like dataNeed Full modelChallenge Parameter EstimationRecall HDLSS Context

SigClust Gaussian null distributrsquon

Estimated Mean (of Gaussian distrsquon)1st Key Idea Can ignore thisBy appealing to shift invariance of CI

When Data are (rigidly) shiftedCI remains the same

So enough to simulate with mean 0Other uses of invariance ideas

SigClust Gaussian null distributrsquon

2nd Key Idea Mod Out RotationsReplace full Cov by diagonal matrixAs done in PCA eigen-analysis

But then ldquonot like datardquoOK since k-means clustering (ie CI) is

rotation invariant

(assuming eg Euclidean Distance)

tMDM

SigClust Gaussian null distributrsquon

2nd Key Idea Mod Out Rotations

Only need to estimate diagonal matrix

But still have HDLSS problems

Eg Perou 500 data

Dimension

Sample Size

Still need to estimate paramrsquos9674d533n

9674d

SigClust Gaussian null distributrsquon

3rd Key Idea Factor Analysis Model

Model Covariance as Biology + Noise

Where

is ldquofairly low dimensionalrdquo

is estimated from background noise2NB

INB 2

SigClust Gaussian null distributrsquon

Estimation of Background Noise 2N

SigClust Gaussian null distributrsquon

Estimation of Background Noise

Reasonable model (for each gene)

Expression = Signal + Noise

ldquonoiserdquo is roughly Gaussian

ldquonoiserdquo terms essentially independent

(across genes)

2N

SigClust Estimation of Background Noise

Hope MostEntries areldquoPure Noise (Gaussian)rdquo

A Few (ltlt frac14)Are BiologicalSignal ndashOutliers

How to Check

Q-Q plots

Background Graphical Goodness of Fit

Basis

Cumulative Distribution Function (CDF)

Probability quantile notation

for probabilityrdquo and quantile

xXPxF

p q

qFp pFq 1

Q-Q plots

Two types of CDF

1 Theoretical

2 Empirical based on data nXX 1

qXPqFp

n

qXiqFp i

ˆˆ

Q-Q plots

Comparison Visualizations

(compare a theoretical with an empirical)

3P-P plot

plot vs

for a grid of values

4Q-Q plot

plot vs

for a grid of values

q

q

p p

p

q

Q-Q plotsIllustrative graphic (toy data set)

Q-Q plotsIllustrative graphic (toy data set)

Q-Q plotsIllustrative graphic (toy data set)

Empirical Qs near Theoretical Qs

when

Q-Q curve is near 450 line

(general use of Q-Q plots)

Alternate TerminologyQ-Q Plots = ROC curves

P-P Plots = ldquoPrecision Recallrdquo Curves

Highlights Different Distributional Aspects

Statistical Folklore Q-Q Highlights Tails

So Usually More Useful

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Looks much like

bull Wiggles all random variation

bull But there are n = 10000 data pointshellip

bull How to assess signal amp noise

bull Need to understand sampling variation

Q-Q plotsNeed to understand sampling variation

bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon

ndash Samples of same size

ndash About 100 samples gives

ldquogood visual impressionrdquo

ndash Overlay resulting 100 QQ-curves

ndash To visually convey natural sampling variation

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Harder to see

bull But clearly there

bull Conclude non-Gaussian

bull Really needed n = 10000 data pointshellip

(why bigger sample size was used)

bull Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533 d = 9456

SigClust Estimation of Background Noise

SigClust Estimation of Background Noise

bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there

(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

SigClust Gaussian null distributrsquon

Estimation of Biological Covariance

Keep only ldquolargerdquo eigenvalues

Defined as

So for null distribution use eigenvalues

B

d 21 2Nj

)max()max( 221 NdN

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze

(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues

2N

SigClust Estimation of Eigenval's

Do we need the factor model? Explore this with another data set (with fewer genes). This time:

n = 315 cases, d = 306 genes

First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI with simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
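The comparison step can be sketched end to end on toy data. This is an illustrative sketch under stated assumptions, not the SigClust implementation: the cluster index is taken to be within-cluster sum of squares over total sum of squares, 2-means is a plain Lloyd iteration with a few random restarts, and the function names and the toy data are invented for the example.

```python
import numpy as np

def cluster_index(x, labels):
    """2-means cluster index: within-cluster SS divided by total SS."""
    total = np.sum((x - x.mean(axis=0)) ** 2)
    within = sum(
        np.sum((x[labels == k] - x[labels == k].mean(axis=0)) ** 2) for k in (0, 1)
    )
    return within / total

def two_means_labels(x, rng, n_starts=5, n_iter=100):
    """Plain Lloyd iteration for k = 2, keeping the restart with the smallest CI."""
    best_labels, best_ci = None, np.inf
    for _ in range(n_starts):
        centers = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            new_centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        ci = cluster_index(x, labels)
        if ci < best_ci:
            best_labels, best_ci = labels, ci
    return best_labels

def simulate_null_cis(eigvals, n, n_sim, rng):
    """CIs of n_sim mean-zero Gaussian samples with diagonal covariance diag(eigvals)."""
    cis = np.empty(n_sim)
    for s in range(n_sim):
        x = rng.standard_normal((n, len(eigvals))) * np.sqrt(eigvals)
        cis[s] = cluster_index(x, two_means_labels(x, rng))
    return cis

rng = np.random.default_rng(0)
n, d = 40, 50
data = rng.standard_normal((n, d))
data[: n // 2, 0] += 8.0                      # toy data with two shifted groups
data_ci = cluster_index(data, two_means_labels(data, rng))
eigvals = np.maximum(np.linalg.eigvalsh(np.cov(data, rowvar=False)), 1.0)
null_cis = simulate_null_cis(eigvals, n, n_sim=50, rng=rng)
print(f"data CI = {data_ci:.3f}, null CI range = "
      f"[{null_cis.min():.3f}, {null_cis.max():.3f}]")
```

Simulating with mean 0 and a diagonal covariance is enough here because the cluster index is unchanged by shifts and rotations of the data, which is the invariance argument made on the slides.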

An example (details to follow)

p-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map) (Use given class labels)

II. Test if known cluster can be further split (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters

Perou 500 DWD Dir'ns View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

p-val = 0.0045

Get p-values from:

Empirical Quantile (from simulated sample CIs)

Fit Gaussian Quantile: Don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
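The two ways of turning simulated null CIs into a p-value can be sketched as below. The function name `sigclust_pvalues` and the five toy null values are made up for illustration; the Gaussian-fit p-value is the lower tail of a normal distribution fitted to the simulated CIs by mean and standard deviation, which is one reading of "Fit Gaussian Quantile".

```python
import math
import numpy as np

def sigclust_pvalues(data_ci, null_cis):
    """Empirical and Gaussian-fit p-values; small CI is evidence for clustering."""
    null_cis = np.asarray(null_cis, dtype=float)
    # Empirical quantile: share of simulated CIs at or below the data CI
    p_emp = float(np.mean(null_cis <= data_ci))
    # Gaussian fit: lower-tail probability under N(mean, sd) of the simulated CIs
    z = (data_ci - null_cis.mean()) / null_cis.std(ddof=1)
    p_gauss = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return p_emp, p_gauss

toy_null = [0.80, 0.82, 0.81, 0.79, 0.83]     # hypothetical simulated CIs
p_emp, p_gauss = sigclust_pvalues(0.70, toy_null)
print(p_emp, p_gauss)
```

When the data CI lies below every simulated CI the empirical p-value is exactly 0, and the Gaussian-fit value is what still allows different strongly significant results to be ranked against each other.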

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

I.e. Uses limiting operations, almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:

Approximation provides insights
Can find simple underlying structure
In complex situations

Thus various flavors are fine:

$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times) [Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

Thanks to: psycnet.apa.org

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
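A quick simulation makes the statement that the distance to the origin is $\sqrt{d} + O_p(1)$ concrete. The dimensions and the sample count of 200 below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
summary = {}
for d in (10, 100, 10000):
    z = rng.standard_normal((200, d))          # 200 draws from N_d(0, I_d)
    norms = np.linalg.norm(z, axis=1)          # Euclidean distance to the origin
    # centre at sqrt(d): the offset and spread should stay bounded as d grows
    summary[d] = (norms.mean() - np.sqrt(d), norms.std(ddof=1))
    print(f"d = {d:6d}: mean(||Z||) - sqrt(d) = {summary[d][0]:+.3f}, "
          f"sd(||Z||) = {summary[d][1]:.3f}")
```

The radius itself grows like $\sqrt{d}$, yet the spread around it does not grow with $d$, which is why the data look like they sit on the surface of a sphere.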

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density?

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. Radius $r$
Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
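One way to make the balance explicit, as a sketch that is not taken from the slides: in spherical coordinates the density of the radius $r = \|Z\|$ is proportional to a surface-area factor, which grows with $r$, times the Gaussian factor, which decays with $r$. Maximizing the product locates the balance point.

```latex
f_{\|Z\|}(r) \;\propto\;
  \underbrace{r^{\,d-1}}_{\text{Lebesgue measure pushes mass out}}
  \cdot
  \underbrace{e^{-r^{2}/2}}_{\text{density pulls data in}},
  \qquad r > 0,
\\[4pt]
\frac{d}{dr}\,\log f_{\|Z\|}(r) \;=\; \frac{d-1}{r} - r \;=\; 0
  \quad\Longrightarrow\quad
  r^{*} = \sqrt{d-1} \;\approx\; \sqrt{d}.
```

The same factor $r^{d-1}$, taken on its own over $0 \le r \le 1$, is why almost all of the volume of the unit sphere sits near its boundary when $d$ is large.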

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
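The near-constancy of pairwise distances among $Z_1, \dots, Z_n$ can be checked with a short simulation. The helper name `pairwise_distance_ratios` and the choices n = 10 and d in {2, 200, 20000} are assumptions made for this sketch.

```python
import numpy as np

def pairwise_distance_ratios(n, d, rng):
    """All pairwise distances among n draws from N_d(0, I_d), divided by sqrt(2d)."""
    z = rng.standard_normal((n, d))
    gram = z @ z.T
    sq_norms = np.diag(gram)
    # squared distances from the Gram matrix: |a|^2 + |b|^2 - 2 a.b
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    upper = np.triu_indices(n, k=1)
    return np.sqrt(np.maximum(sq_dist[upper], 0.0)) / np.sqrt(2.0 * d)

rng = np.random.default_rng(0)
for d in (2, 200, 20000):
    ratios = pairwise_distance_ratios(10, d, rng)
    print(f"d = {d:5d}: min ratio = {ratios.min():.3f}, max ratio = {ratios.max():.3f}")
high_dim_ratios = pairwise_distance_ratios(10, 20000, rng)
```

In low dimension the ratios vary widely, while in high dimension every pair is separated by nearly the same distance $\sqrt{2d}$.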

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

Thanks to: members.tripod.com

- Everything is orthogonal?

- Where do they all go? (again our perceptual limitations)

- Again 1st order structure is non-random
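The angle statement can be illustrated the same way. The function name `random_angles_deg` and the pair counts are invented for this sketch; the angle is computed from the normalized inner product.

```python
import numpy as np

def random_angles_deg(d, n_pairs, rng):
    """Angles (degrees) between independent pairs of N_d(0, I_d) vectors."""
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    angles = random_angles_deg(d, 200, rng)
    print(f"d = {d:5d}: mean angle = {angles.mean():6.2f}, sd = {angles.std(ddof=1):6.2f}")
high_dim_angles = random_angles_deg(10000, 200, rng)
```

The spread of the angles shrinks roughly like $d^{-1/2}$, so in high dimension independent Gaussian vectors are nearly perpendicular.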

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
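The simulation steps listed above can be approximated without any plotting. In this sketch the "rotation to the plane of the screen" is done implicitly by rebuilding a congruent triangle from its three side lengths, with the first point at the origin and the second on the horizontal axis; that construction, the rescaling by $\sqrt{2d}$ and the function name `triangle_in_plane` are choices made for the example, not the method used for the slides.

```python
import numpy as np

def triangle_in_plane(d, rng):
    """Three N_d(0, I_d) points, rigidly moved into the 2-d plane they span.

    Returns 2-d coordinates divided by sqrt(2d): vertex 0 at the origin,
    vertex 1 on the positive x-axis, vertex 2 in the upper half-plane.
    """
    z = rng.standard_normal((3, d))
    a = np.linalg.norm(z[0] - z[1])
    b = np.linalg.norm(z[0] - z[2])
    c = np.linalg.norm(z[1] - z[2])
    x = (a**2 + b**2 - c**2) / (2.0 * a)       # law of cosines
    y = np.sqrt(max(b**2 - x**2, 0.0))
    return np.array([[0.0, 0.0], [a, 0.0], [x, y]]) / np.sqrt(2.0 * d)

rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    for rep in range(3):                       # a few repeats per dimension
        print(d, np.round(triangle_in_plane(d, rng), 3).tolist())
tri = triangle_in_plane(20000, rng)
```

For small d the triangles differ a lot between repeats; for d = 20000 they are all close to the equilateral triangle with vertices (0, 0), (1, 0) and (0.5, 0.866), which is the rigidity the slide refers to.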

bull But there are n = 10000 data pointshellip

bull How to assess signal amp noise

bull Need to understand sampling variation

Q-Q plotsNeed to understand sampling variation

bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon

ndash Samples of same size

ndash About 100 samples gives

ldquogood visual impressionrdquo

ndash Overlay resulting 100 QQ-curves

ndash To visually convey natural sampling variation

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Harder to see

bull But clearly there

bull Conclude non-Gaussian

bull Really needed n = 10000 data pointshellip

(why bigger sample size was used)

bull Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533 d = 9456

SigClust Estimation of Background Noise

SigClust Estimation of Background Noise

bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there

(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

SigClust Gaussian null distributrsquon

Estimation of Biological Covariance

Keep only ldquolargerdquo eigenvalues

Defined as

So for null distribution use eigenvalues

B

d 21 2Nj

)max()max( 221 NdN

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze

(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues

2N

SigClust Estimation of Eigenvalrsquos

Do we need the factor model Explore this with another data set

(with fewer genes) This time

n = 315 cases d = 306 genes

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

Try another data set with fewer genesThis time

First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at

2N

2N

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

where (indep)

Again rotation invariance makes this work

(and location invariance)

jij NX 0~

id

i

X

X

1

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

bull Spirit similar to DiProPermbull But now significance happens for

smaller values of CI

An example (details to follow)

P-val = 00045

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

IITest if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal

Perou 500 PCA View ndash real clusters

Perou 500 DWD Dirrsquons View ndash real clusters

Perou 500 ndash Fundamental Question

Are Luminal A amp Luminal B really distinct clusters

Famous forFar Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 00045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs Luminal B

I Test significance of given clusteringsbull Empirical p-val = 0

ndash Definitely 2 clusters

bull Gaussian fit p-val = 00045ndash same strong evidence

bull Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II Test if known cluster can be further split

bull Empirical p-val = 0ndash definitely 2 clusters

bull Gaussian fit p-val = 10-10

ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)

bull Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19

Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10

Split Luminal A p-val = 10-7

Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005

SigClust Real Data Results

Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition

(insight about signal vs noise)bull How good are others

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

HDLSS Asymptotics

Personal Observations: HDLSS world is…
• Surprising (many times) [Think I've got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

(Image: thanks to psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
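The concentration of the distance to the origin can be checked numerically. The following is a minimal sketch, not part of the original slides: the function name, the seed and the 20 draws per dimension are illustrative assumptions.

```python
import math
import random

def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
deviations = {}
for d in (2, 20, 200, 20000):
    norms = [norm_of_standard_normal(d, rng) for _ in range(20)]
    # Largest gap between ||Z|| and sqrt(d) over the 20 draws
    deviations[d] = max(abs(r - math.sqrt(d)) for r in norms)
    print(d, round(math.sqrt(d), 2), round(deviations[d], 2))
```

If the $O_p(1)$ statement holds, the printed gap should stay of order 1 while $\sqrt{d}$ grows, so the relative gap shrinks.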

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

Paradox resolved by: density w.r.t. Lebesgue Measure
- Consider Volume of Unit Sphere in $\mathbb{R}^d$
- Find As Integral In Sph'l Coordinates
- Look At Integrand w.r.t. the radius $r$
- Can Show: Puts ~ All Weight Near $r = 1$
- Lebesgue Measure Pushes Mass Out
- Density Pulls Data In
- $\sqrt{d}$ Is The Balance Point
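The "Lebesgue measure pushes mass out" effect can be made concrete with a small calculation. Volume in $d$ dimensions scales like $r^d$, so the ball of radius $r$ holds a fraction $r^d$ of the unit ball. The sketch below is an illustration only; the function name and the inner radius 0.9 are assumptions.

```python
def outer_shell_fraction(d, inner_radius=0.9):
    """Share of the unit ball's volume in R^d lying beyond inner_radius.

    Volume scales like r**d, so the ball of radius r holds a fraction
    r**d of the unit ball; the outer shell holds the rest.
    """
    return 1.0 - inner_radius ** d

for d in (2, 20, 200):
    print(d, outer_shell_fraction(d))
```

Already at $d = 200$ essentially all of the volume sits in the thin outer shell, which is why high density at the origin does not translate into data near the origin.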

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant:
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$ (for independent $X_1$, $X_2$)
• Can extend to $Z_1, \ldots, Z_n$
• Where do they all go? (we can only perceive 3 dim'ns)
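The $\sqrt{2}$ factor can be traced coordinate by coordinate. The short derivation below is a sketch; the double-subscript notation $Z_{1,j}$ for the $j$-th coordinate of the vector $Z_1$ is an assumption introduced here, not notation from the slides.

```latex
\begin{align*}
Z_{1,j} - Z_{2,j} &\sim N(0, 2) \quad \text{independently for } j = 1, \ldots, d, \\
\|Z_1 - Z_2\|^2 &= \sum_{j=1}^{d} (Z_{1,j} - Z_{2,j})^2 = 2d + O_p(d^{1/2}), \\
\|Z_1 - Z_2\| &= \sqrt{2d} + O_p(1).
\end{align*}
```

This is the same argument as for the distance to the origin, with each coordinate variance doubled.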

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

(Figure: vectors $Z_1$ and $Z_2$ from the origin; thanks to members.tripod.com)

$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
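A quick numerical look at high dimensional angles, as a minimal sketch rather than anything taken from the slides. The helper name and the seed are illustrative assumptions; only one pair of vectors is drawn per dimension.

```python
import math
import random

def angle_degrees(u, v):
    """Angle between two vectors, viewed as vectors from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cosine))

rng = random.Random(1)  # fixed seed, for reproducibility only
angles = {}
for d in (2, 20, 200, 20000):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    angles[d] = angle_degrees(z1, z2)
    print(d, round(angles[d], 1))
```

In low dimension the angle can land anywhere between 0 and 180 degrees; for large $d$ it should sit within a fraction of a degree of 90.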

HDLSS Asy's: Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:
• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d}$ Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
• "Randomness" appears only in rotation of simplex

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:
• $n - 1$ dimensional hyperplane
• Points are pairwise equidistant, dist $\sim \sqrt{2d}$
• Points lie at vertices of $\sqrt{2d}$ "regular $n$-hedron"
• Again "randomness in data" is only in rotation
• Surprisingly rigid structure in random data
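One way to see why the simplex structure appears is through scaled inner products. The following is a sketch under the stated Gaussian assumptions; the coordinate notation $Z_{i,k}$ and the Kronecker delta $\delta_{ij}$ are introduced here for illustration.

```latex
\begin{align*}
\frac{1}{d} Z_i^t Z_j &= \frac{1}{d} \sum_{k=1}^{d} Z_{i,k} Z_{j,k}
  \;\xrightarrow{\;P\;}\; \delta_{ij}
  = \begin{cases} 1, & i = j \\ 0, & i \ne j \end{cases} \\
\frac{1}{d} \|Z_i - Z_j\|^2 &= \frac{1}{d} \left( Z_i^t Z_i + Z_j^t Z_j - 2 Z_i^t Z_j \right)
  \;\xrightarrow{\;P\;}\; 2 \qquad (i \ne j)
\end{align*}
```

Both limits follow from the law of large numbers applied across coordinates. After dividing by $\sqrt{d}$, the $n$ points are therefore nearly orthonormal, with all pairwise distances near $\sqrt{2}$, which is the vertex configuration of a regular simplex up to rotation.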

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
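A rough numerical stand-in for this simulation is sketched below. Instead of rotating each 3 point data set into the plane of the screen, it summarises each triangle by its side lengths, which rotations leave unchanged; that substitution, the function name and the seed are assumptions made for illustration.

```python
import math
import random

def triangle_sides(d, rng):
    """Sorted side lengths of a 3 point N_d(0, I_d) data set, over sqrt(2d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    sides = [math.dist(pts[i], pts[j]) for i, j in ((0, 1), (0, 2), (1, 2))]
    return sorted(s / math.sqrt(2 * d) for s in sides)

rng = random.Random(2)  # fixed seed, for reproducibility only
spread = {}
for d in (2, 20, 200, 20000):
    triangles = [triangle_sides(d, rng) for _ in range(10)]  # 10 repetitions
    all_sides = [s for tri in triangles for s in tri]
    # Range of the rescaled side lengths across all 10 triangles
    spread[d] = max(all_sides) - min(all_sides)
    print(d, round(spread[d], 3))
```

A spread near 0 means every triangle is close to the same equilateral triangle with side $\sqrt{2d}$, which is the "rigidity" the slides refer to.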


SigClust Gaussian null distribut'n

Estimated Mean (of Gaussian dist'n)?
• 1st Key Idea: Can ignore this
• By appealing to shift invariance of CI
• When Data are (rigidly) shifted, CI remains the same
• So enough to simulate with mean 0
• Other uses of invariance ideas?

SigClust Gaussian null distribut'n

2nd Key Idea: Mod Out Rotations
• Replace full Cov. by diagonal matrix
• As done in PCA eigen-analysis: $\Sigma = M D M^t$
• But then "not like data"?
• OK, since k-means clustering (i.e. CI) is rotation invariant (assuming e.g. Euclidean Distance)
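Both invariance ideas can be checked on a toy example. The sketch below assumes the usual definition of the 2-means cluster index (within-class sum of squares about the class means, divided by the total sum of squares about the overall mean); the data, labels, shift and rotation angle are made up for illustration.

```python
import math

def cluster_index(points, labels):
    """Within-class sum of squares about the class means, divided by the
    total sum of squares about the overall mean."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sum_sq(rows, center):
        return sum(sum((x - c) ** 2 for x, c in zip(row, center)) for row in rows)

    total = sum_sq(points, mean(points))
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        within += sum_sq(members, mean(members))
    return within / total

# Toy 2-d data with given class labels (made up for illustration)
points = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.3), (3.0, 2.9), (3.2, 3.1), (2.9, 3.3)]
labels = [0, 0, 0, 1, 1, 1]

shifted = [(x + 5.0, y - 2.0) for x, y in points]  # rigid shift
theta = 0.7  # arbitrary rotation angle
rotated = [(math.cos(theta) * x - math.sin(theta) * y,
            math.sin(theta) * x + math.cos(theta) * y) for x, y in points]

ci = cluster_index(points, labels)
ci_shifted = cluster_index(shifted, labels)
ci_rotated = cluster_index(rotated, labels)
print(ci, ci_shifted, ci_rotated)
```

The three printed values agree up to rounding, because the index depends only on Euclidean distances to means, and those are preserved by shifts and rotations.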

SigClust Gaussian null distribut'n

2nd Key Idea: Mod Out Rotations
• Only need to estimate diagonal matrix
• But still have HDLSS problems?
• E.g. Perou 500 data: Dimension $d = 9674$, Sample Size $n = 533$
• Still need to estimate $d = 9674$ param's

SigClust Gaussian null distribut'n

3rd Key Idea: Factor Analysis Model

Model Covariance as: Biology + Noise

$\Sigma = \Sigma_B + \sigma_N^2 I$

Where
• $\Sigma_B$ is "fairly low dimensional"
• $\sigma_N^2$ is estimated from background noise

SigClust Gaussian null distribut'n

Estimation of Background Noise $\sigma_N^2$:
• Reasonable model (for each gene): Expression = Signal + Noise
• "noise" is roughly Gaussian
• "noise" terms essentially independent (across genes)

SigClust Estimation of Background Noise

Hope: Most Entries are "Pure Noise (Gaussian)"

A Few ($\ll$ 1/4) Are Biological Signal - Outliers

How to Check?

Q-Q plots

Background: Graphical Goodness of Fit

Basis: Cumulative Distribution Function (CDF)

$F(x) = P\{X \le x\}$

Probability quantile notation: $p = F(q)$, $q = F^{-1}(p)$, for "probability" $p$ and "quantile" $q$

Q-Q plots

Two types of CDF:

1. Theoretical: $p = F(q) = P\{X \le q\}$

2. Empirical, based on data $X_1, \ldots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$

Q-Q plots

Comparison Visualizations (compare a theoretical with an empirical):

3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values

4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
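The two comparison plots can be computed with the standard library alone. This is a minimal sketch against a standard Gaussian; the function names, the toy data and the probability grid $(i - 0.5)/n$ (one common plotting convention) are assumptions, not taken from the slides.

```python
from statistics import NormalDist

def qq_points(data, dist=NormalDist()):
    """Pairs (theoretical quantile q, empirical quantile q_hat) on the
    probability grid p_i = (i - 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    return [(dist.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(ordered, start=1)]

def pp_points(data, dist=NormalDist()):
    """Pairs (theoretical p, empirical p_hat) on the grid of observed q values."""
    ordered = sorted(data)
    n = len(ordered)
    return [(dist.cdf(x), i / n) for i, x in enumerate(ordered, start=1)]

toy = [-1.3, -0.4, 0.1, 0.6, 1.5]  # made-up toy data set
for q, q_hat in qq_points(toy):
    print(round(q, 3), q_hat)
```

Plotting the second coordinate against the first gives the Q-Q (or P-P) curve; points near the 45 degree line indicate agreement between the empirical and theoretical distributions.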

Q-Q plots

Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line (general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails, So Usually More Useful

Q-Q plots

Gaussian departures from line
• Looks much like
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots

Need to understand sampling variation
• Approach: Q-Q envelope plot
  - Simulate from Theoretical Dist'n
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 QQ-curves
  - To visually convey natural sampling variation
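The envelope steps above can be sketched numerically. Rather than overlaying 100 curves, this illustration records the pointwise minimum and maximum across them, which traces the outer edge of the overlay; the function name, seed and sample size 200 are assumptions.

```python
import random
from statistics import NormalDist

def qq_envelope(n, n_sim=100, seed=0):
    """Pointwise min/max of the sorted values of n_sim simulated N(0, 1)
    samples of size n: the band spanned by the overlaid Q-Q curves."""
    rng = random.Random(seed)
    curves = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n_sim)]
    lower = [min(c[i] for c in curves) for i in range(n)]
    upper = [max(c[i] for c in curves) for i in range(n)]
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, lower, upper

theoretical, lower, upper = qq_envelope(n=200)
print(lower[0], theoretical[0], upper[0])      # band is wide in the tail
print(lower[100], theoretical[100], upper[100])  # and narrow in the middle
```

A data Q-Q curve that leaves this band is departing from the theoretical distribution by more than natural sampling variation would suggest.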

Q-Q plots

Gaussian departures from line
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
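A MAD-based scale estimate can be sketched as follows. This is an illustration of the general technique (median absolute deviation, rescaled by the Gaussian 0.75 quantile), not the authors' code; the function name and the simulated "noise plus a few outlying signal entries" are assumptions.

```python
import random
from statistics import NormalDist, median

def mad_sigma(values):
    """Robust estimate of the noise standard deviation: the median absolute
    deviation from the median, rescaled to be consistent for a Gaussian."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return mad / NormalDist().inv_cdf(0.75)  # divide by about 0.6745

rng = random.Random(3)  # fixed seed, for reproducibility only
noise = [rng.gauss(0.0, 2.0) for _ in range(5000)]         # "pure noise", sd 2
signal = [rng.gauss(0.0, 2.0) + 30.0 for _ in range(250)]  # a few outlying entries
sigma_hat = mad_sigma(noise + signal)
print(round(sigma_hat, 2))
```

Because it is driven by the middle of the distribution, the estimate stays close to the noise standard deviation even though the outlying entries would inflate an ordinary sample standard deviation considerably.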

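SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:
• Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$
• Defined as $\lambda_j > \sigma_N^2$
• So for null distribution, use eigenvalues: $\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$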
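The thresholding rule is a one-liner in code. In this sketch the function name, the eigenvalues and the noise variance are hypothetical values chosen for illustration.

```python
def null_eigenvalues(sample_eigenvalues, sigma2_noise):
    """Raise every eigenvalue below the noise level up to that level."""
    return [max(lam, sigma2_noise) for lam in sample_eigenvalues]

sample_eigenvalues = [9.0, 4.0, 1.5, 0.6, 0.2, 0.0]  # hypothetical values
sigma2_noise = 1.0                                    # hypothetical noise variance
print(null_eigenvalues(sample_eigenvalues, sigma2_noise))
# → [9.0, 4.0, 1.5, 1.0, 1.0, 1.0]
```

In an HDLSS setting many sample eigenvalues are exactly zero (at most $n - 1$ can be positive), so this step is what keeps the null distribution from collapsing onto a low dimensional subspace.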

SigClust Estimation of Eigenval's

• All eigenvalues > $\sigma_N^2$
• Suggests biology is very strong here
• I.e. very strong signal to noise ratio
• Have more structure than can analyze (with only 533 data points)
• Data are very far from pure noise
• So don't actually use Factor Anal. Model
• Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model?
• Explore this with another data set (with fewer genes)
• This time: n = 315 cases, d = 306 genes

SigClust Estimation of Eigenval's

Try another data set, with fewer genes. This time:
• First ~110 eigenvalues > $\sigma_N^2$
• Rest are negligible
• So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$X_i = (X_{i,1}, \ldots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI, With simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
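The simulate-then-compare step might be sketched as below. This is a small pure-Python illustration under stated assumptions, not the SigClust implementation: the plain Lloyd's 2-means with a few random starts, the eigenvalues, the sample size and the observed CI are all hypothetical.

```python
import math
import random

def two_means_ci(points, rng, n_starts=5, n_iter=20):
    """Smallest cluster index found by Lloyd's 2-means over random starts."""
    d = len(points[0])
    overall = [sum(p[j] for p in points) / len(points) for j in range(d)]
    total = sum(math.dist(p, overall) ** 2 for p in points)
    best = 1.0
    for _ in range(n_starts):
        centers = rng.sample(points, 2)
        for _ in range(n_iter):
            labels = [0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
                      for p in points]
            groups = [[p for p, lab in zip(points, labels) if lab == k] for k in (0, 1)]
            if not groups[0] or not groups[1]:
                break  # degenerate split, abandon this start
            centers = [[sum(p[j] for p in g) / len(g) for j in range(d)] for g in groups]
        if groups[0] and groups[1]:
            # centers are the means of the final groups, so this is the
            # within-class sum of squares of that labelling
            within = sum(math.dist(p, centers[k]) ** 2 for k in (0, 1) for p in groups[k])
            best = min(best, within / total)
    return best

def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """Cluster indices of n_sim data sets with independent N(0, lambda_j) entries."""
    sds = [math.sqrt(lam) for lam in eigenvalues]
    return [two_means_ci([[rng.gauss(0.0, s) for s in sds] for _ in range(n)], rng)
            for _ in range(n_sim)]

rng = random.Random(4)                   # fixed seed, for reproducibility only
eigenvalues = [4.0, 2.0, 1.0, 1.0, 1.0]  # hypothetical, already thresholded
null_cis = simulate_null_cis(eigenvalues, n=30, n_sim=50, rng=rng)
observed_ci = 0.25                       # hypothetical cluster index of real data
p_value = sum(ci <= observed_ci for ci in null_cis) / len(null_cis)
print(min(null_cis), p_value)
```

Small cluster indices indicate strong clustering, so the p-value counts simulated null CIs at or below the observed one (lower tail).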

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map) (Use given class labels)

II. Test if known cluster can be further split (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)

Current folklore: 5 classes
• Luminal A
• Luminal B
• Normal
• Her 2
• Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Dir'ns View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:
• Empirical Quantile, From simulated sample CIs
• Fit Gaussian Quantile
  - Don't "believe these"
  - But useful for comparison
  - Especially when Empirical Quantile = 0
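The two kinds of p-value can be contrasted in a few lines. This is a sketch with made-up numbers; the function names are assumptions, and fitting the Gaussian by sample mean and standard deviation is one natural reading of "Fit Gaussian", not a confirmed detail of the method.

```python
from statistics import NormalDist, mean, stdev

def empirical_p_value(observed_ci, null_cis):
    """Share of simulated null cluster indices at or below the observed one."""
    return sum(ci <= observed_ci for ci in null_cis) / len(null_cis)

def gaussian_fit_p_value(observed_ci, null_cis):
    """Lower-tail probability of the observed CI under a Gaussian fitted
    to the simulated null cluster indices by mean and standard deviation."""
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return fitted.cdf(observed_ci)

null_cis = [0.62, 0.65, 0.66, 0.68, 0.70, 0.71, 0.73, 0.74]  # made-up values
p_emp = empirical_p_value(0.55, null_cis)
p_fit = gaussian_fit_p_value(0.55, null_cis)
print(p_emp, p_fit)
```

When the observed CI lies below every simulated value, the empirical p-value is exactly 0 and cannot rank results, while the Gaussian fit still returns a small positive number that can be compared across tests.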

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
• Luminal A vs. B, p-val = 0.0045
• Her 2 vs. Basal, p-val = $10^{-10}$
• Split Luminal A, p-val = $10^{-7}$
• Split Luminal B, p-val = 0.058
• Split Her 2, p-val = 0.10
• Split Basal, p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar
• Smaller data sets: less power
• Gene filtering: more power
• Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:
• Experienced Analysts Impressively Good
• SigClust can save them time
• SigClust can help them with skeptics
• SigClust essential for non-experts

SigClust Overview

• Works Well When Factor Part Not Used
• Sample Eigenvalues Always Valid
• But Can be Too Conservative
• Above Factor Threshold Anti-Conservative
• Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

• Improved Eigenvalue Estimation
• More attention to Local Minima in 2-means Clustering
• Theoretical Null Distributions
• Inference for k > 2 means Clustering
• Multiple Comparison Issues


HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo


SigClust Gaussian null distribut'n

2nd Key Idea: Mod Out Rotations
Replace full Cov. by diagonal matrix
As done in PCA eigen-analysis: $M D M^t$
But then "not like data"?
OK, since k-means clustering (i.e. CI) is rotation invariant (assuming e.g. Euclidean Distance)
Only need to estimate diagonal matrix
But still have HDLSS problems?
E.g. Perou 500 data: Dimension $d = 9674$, Sample Size $n = 533$
Still need to estimate $d = 9674$ param's
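The invariance that justifies replacing the full covariance by a diagonal matrix can be checked numerically. The sketch below assumes the usual definition of the 2-means cluster index (within-class sum of squares divided by total sum of squares about the overall mean); the toy data, the helper names, and the use of composed plane rotations as the orthogonal transformation are illustrative assumptions, not part of the slides.

```python
import math
import random

def sum_sq_about_mean(points):
    """Sum of squared Euclidean distances from the points to their mean."""
    n = len(points)
    mean = [sum(col) / n for col in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, mean))

def cluster_index(points, labels):
    """Within-class sum of squares divided by total sum of squares."""
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        within += sum_sq_about_mean(members)
    return within / sum_sq_about_mean(points)

def rotate_in_plane(points, i, j, theta):
    """Rotate all points by angle theta in the (i, j) coordinate plane."""
    c, s = math.cos(theta), math.sin(theta)
    rotated = []
    for p in points:
        q = list(p)
        q[i] = c * p[i] - s * p[j]
        q[j] = s * p[i] + c * p[j]
        rotated.append(q)
    return rotated

rng = random.Random(0)
n, d = 20, 6
labels = [0] * (n // 2) + [1] * (n // 2)
# Toy data (hypothetical): class 0 is shifted by 3 in the first coordinate.
points = [[rng.gauss(0.0, 1.0) + (3.0 if lab == 0 and j == 0 else 0.0)
           for j in range(d)] for lab in labels]

ci_original = cluster_index(points, labels)

# Rigid shift of every point by the same vector.
ci_shifted = cluster_index([[x + 5.0 for x in p] for p in points], labels)

# Compose several plane rotations into one orthogonal transformation.
rotated = points
for _ in range(10):
    i, j = rng.sample(range(d), 2)
    rotated = rotate_in_plane(rotated, i, j, rng.uniform(0.0, 2.0 * math.pi))
ci_rotated = cluster_index(rotated, labels)

print(ci_original, ci_shifted, ci_rotated)
```

The three printed values agree up to floating-point rounding, which is why the null distribution only needs the eigenvalues and a zero mean.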

SigClust Gaussian null distribut'n

3rd Key Idea: Factor Analysis Model
Model Covariance as: Biology + Noise
$\Sigma = \Sigma_B + \sigma_N^2 I$
Where
$\Sigma_B$ is "fairly low dimensional"
$\sigma_N^2$ is estimated from background noise

SigClust Gaussian null distribut'n

Estimation of Background Noise $\sigma_N^2$:
Reasonable model (for each gene): Expression = Signal + Noise
"noise" is roughly Gaussian
"noise" terms essentially independent (across genes)

SigClust Estimation of Background Noise

Hope: Most Entries are "Pure Noise (Gaussian)"
A Few (<< 1/4) Are Biological Signal - Outliers
How to Check?

Q-Q plots

Background: Graphical Goodness of Fit
Basis: Cumulative Distribution Function (CDF): $F(x) = P\{X \le x\}$
Probability quantile notation: $p = F(q)$, $q = F^{-1}(p)$, for "probability" $p$ and "quantile" $q$

Q-Q plots

Two types of CDF:
1. Theoretical: $p = F(q) = P\{X \le q\}$
2. Empirical, based on data $X_1, \dots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$

Q-Q plots

Comparison Visualizations (compare a theoretical with an empirical):
3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values
4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
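The two comparison plots can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the plotting positions $(i + 0.5)/n$ for the Q-Q grid, the function names, and the simulated data are assumptions made here, not choices stated in the slides.

```python
import random
from statistics import NormalDist

def empirical_cdf(data, q):
    """Proportion of data values that are <= q."""
    return sum(1 for x in data if x <= q) / len(data)

def pp_points(data, dist, q_grid):
    """(theoretical p, empirical p) pairs over a grid of q values."""
    return [(dist.cdf(q), empirical_cdf(data, q)) for q in q_grid]

def qq_points(data, dist):
    """(theoretical q, empirical q) pairs over the grid p = (i + 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # simulated toy data
std_normal = NormalDist(0.0, 1.0)

qq = qq_points(data, std_normal)
pp = pp_points(data, std_normal, [-2.0, -1.0, 0.0, 1.0, 2.0])
for p_theory, p_hat in pp:
    print(round(p_theory, 3), round(p_hat, 3))
```

Plotting the pairs in `qq` gives the Q-Q curve; for data that really follow the theoretical distribution the points should track the 45° line.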

Q-Q plots

Illustrative graphic (toy data set)
Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line
(general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves
P-P Plots = "Precision Recall" Curves
Highlights Different Distributional Aspects
Statistical Folklore: Q-Q Highlights Tails, So Usually More Useful

Q-Q plots

Gaussian? departures from line?
• Looks much like
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots

Need to understand sampling variation
• Approach: Q-Q envelope plot
  - Simulate from Theoretical Dist'n
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 QQ-curves
  - To visually convey natural sampling variation
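The envelope recipe above can be sketched as follows, assuming a standard normal theoretical distribution. Summarizing the 100 overlaid curves by their pointwise minimum and maximum is a simplification introduced here for illustration (the slides overlay the curves visually), and the uniform comparison sample is made up.

```python
import random
from statistics import NormalDist

def simulated_qq_curves(n, n_sim=100, seed=0):
    """Theoretical quantiles plus n_sim sorted N(0, 1) samples of size n."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    theo = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    curves = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n_sim)]
    return theo, curves

def pointwise_envelope(curves):
    """Lowest and highest simulated order statistic at each rank."""
    lower = [min(vals) for vals in zip(*curves)]
    upper = [max(vals) for vals in zip(*curves)]
    return lower, upper

def fraction_outside(sample, lower, upper):
    """Share of the sample's order statistics that leave the envelope."""
    ordered = sorted(sample)
    outside = sum(1 for x, lo, hi in zip(ordered, lower, upper) if x < lo or x > hi)
    return outside / len(ordered)

n = 200
theo, curves = simulated_qq_curves(n)
lower, upper = pointwise_envelope(curves)

# A clearly non-Gaussian sample (hypothetical): uniform on [-3, 3].
rng = random.Random(42)
uniform_sample = [rng.uniform(-3.0, 3.0) for _ in range(n)]
print(fraction_outside(uniform_sample, lower, upper))
```

A data Q-Q curve that stays inside the band is consistent with pure sampling variation; large stretches outside it point to a real departure from the theoretical distribution.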

Q-Q plots

Gaussian? departures from line?
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
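A MAD-based scale estimate of the kind mentioned above can be sketched like this. It assumes the common convention of pooling all matrix entries and rescaling the median absolute deviation by the MAD of a standard normal (about 0.6745); the slides do not spell out the exact recipe, and the simulated noise and "signal" entries are made up.

```python
import random
from statistics import NormalDist, median, pstdev

def mad_sigma(values):
    """Median absolute deviation, rescaled to estimate a Gaussian sd."""
    values = list(values)
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return mad / NormalDist().inv_cdf(0.75)  # MAD of N(0, 1) is about 0.6745

rng = random.Random(0)
noise = [rng.gauss(0.0, 0.5) for _ in range(5000)]   # pure noise, sd 0.5
signal = [rng.uniform(4.0, 8.0) for _ in range(100)]  # a few large outliers
entries = noise + signal

sigma_mad = mad_sigma(entries)
sigma_naive = pstdev(entries)
print(round(sigma_mad, 3), round(sigma_naive, 3))
```

The MAD estimate stays close to the noise sd of 0.5, while the ordinary standard deviation is inflated by the outlying entries, which is the robustness the approach relies on.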

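SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:
Keep only "large" eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution, use eigenvalues:
$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$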
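As a small worked instance of the thresholding rule, the sketch below raises every sample eigenvalue that falls below the noise level up to $\sigma_N^2$. The function name and all numbers are hypothetical.

```python
def null_eigenvalues(sample_eigenvalues, sigma_n_sq):
    """Raise every eigenvalue below the noise level up to sigma_n_sq."""
    return [max(lam, sigma_n_sq) for lam in sample_eigenvalues]

sample_eigenvalues = [9.0, 4.0, 1.2, 0.3, 0.05]  # made-up sample eigenvalues
sigma_n_sq = 0.5                                   # made-up noise variance
thresholded = null_eigenvalues(sample_eigenvalues, sigma_n_sq)
print(thresholded)  # [9.0, 4.0, 1.2, 0.5, 0.5]
```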

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$!
Suggests biology is very strong here!
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:
$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)
Again rotation invariance makes this work (and location invariance)
Then compare data CI,
With simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
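The simulation step can be sketched as below: draw samples with independent coordinates $X_{i,j} \sim N(0, \lambda_j)$, run 2-means on each, and record the cluster index. The plain Lloyd-style 2-means with a few random starts, the function names, and the eigenvalues are all assumptions made for illustration; a real analysis would use far more simulated samples and a more careful clustering routine.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def two_means_ci(points, rng, n_starts=5, n_iter=50):
    """Smallest 2-means cluster index found over a few random starts."""
    overall = centroid(points)
    total = sum(sq_dist(p, overall) for p in points)
    best = 1.0
    for _ in range(n_starts):
        centers = rng.sample(points, 2)
        groups = [[], []]
        for _ in range(n_iter):
            groups = [[], []]
            for p in points:
                nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                groups[nearest].append(p)
            if not groups[0] or not groups[1]:
                break
            new_centers = [centroid(g) for g in groups]
            if new_centers == centers:
                break
            centers = new_centers
        if groups[0] and groups[1]:
            within = sum(sq_dist(p, c) for g, c in zip(groups, centers) for p in g)
            best = min(best, within / total)
    return best

def simulate_null_cis(eigenvalues, n, n_sim, seed=0):
    """Cluster indices of n_sim Gaussian samples with diagonal covariance."""
    rng = random.Random(seed)
    sds = [lam ** 0.5 for lam in eigenvalues]
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, sd) for sd in sds] for _ in range(n)]
        cis.append(two_means_ci(sample, rng))
    return cis

eigenvalues = [4.0, 2.0] + [0.5] * 8  # made-up, already thresholded
null_cis = simulate_null_cis(eigenvalues, n=30, n_sim=20)
print(min(null_cis), max(null_cis))
```

The observed data CI is then compared with this simulated population; a data CI well below the simulated values is evidence of clustering.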

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:
I. Test significance of given clusterings (e.g. for those found in heat map) (Use given class labels)
II. Test if known cluster can be further split (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Dir'ns View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045
Get p-values from:
Empirical Quantile, From simulated sample CIs
Fit Gaussian Quantile: Don't "believe these", But useful for comparison, Especially when Empirical Quantile = 0
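The two p-value flavors can be sketched as follows: the empirical quantile is the share of simulated null CIs at or below the data CI, and the Gaussian-fit version reads the same lower tail from a normal distribution fitted to the simulated CIs. The function names and all numbers are hypothetical illustrations, not values from the Perou 500 analysis.

```python
from statistics import NormalDist, mean, stdev

def empirical_p_value(data_ci, null_cis):
    """Share of simulated null CIs that are as small as the data CI."""
    return sum(1 for c in null_cis if c <= data_ci) / len(null_cis)

def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability under a Gaussian fitted to the null CIs."""
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return fitted.cdf(data_ci)

null_cis = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.70, 0.71]  # made-up values
data_ci = 0.60                                               # made-up value

p_empirical = empirical_p_value(data_ci, null_cis)
p_gaussian = gaussian_fit_p_value(data_ci, null_cis)
print(p_empirical, p_gaussian)
```

Here the empirical p-value is exactly 0 because no simulated CI is as small as the data CI, while the Gaussian fit still returns a small positive number that can be compared across tests.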

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure, In complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable! (find additional insights)

HDLSS Asymptotics

Personal Observations:
HDLSS world is…
Surprising (many times) [Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$
Where are Data? Near Peak of Density?
(Figure: Thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$), (measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
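A quick simulation illustrates the distance result: the norm of a standard normal vector grows like $\sqrt{d}$, while its deviation from $\sqrt{d}$ stays of order one. The dimensions, number of repetitions, and names below are arbitrary choices made for this sketch.

```python
import math
import random

def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

rng = random.Random(0)
max_dev = {}
for d in (10, 1000, 100000):
    norms = [norm_of_standard_normal(d, rng) for _ in range(10)]
    # Largest observed | ||Z|| - sqrt(d) | over the 10 draws.
    max_dev[d] = max(abs(r - math.sqrt(d)) for r in norms)
    print(d, round(math.sqrt(d), 2), round(max_dev[d], 2))
```

The second printed column grows with $d$ while the third stays bounded, so relative to the radius the data sit on an ever thinner shell.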

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Spherical Coordinates
Look At Integrand w.r.t. the radius $r$
Can Show: Puts ~ All Weight Near the boundary, $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
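The balance point can be made explicit with the radial density of $R = \|Z\|$, which is the chi distribution with $d$ degrees of freedom. The short derivation below is a standard fact added for illustration; the notation $f_R$ is introduced here and does not appear in the slides.

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d):
\[
  f_R(r) \;=\; \frac{1}{2^{d/2-1}\,\Gamma(d/2)}\; r^{\,d-1}\, e^{-r^2/2},
  \qquad r > 0 .
\]
% r^{d-1}    : surface area of the sphere of radius r (Lebesgue measure pushes mass out)
% e^{-r^2/2} : Gaussian density, largest at the origin (pulls data in)
\[
  \frac{d}{dr}\,\log f_R(r) \;=\; \frac{d-1}{r} - r \;=\; 0
  \quad\Longrightarrow\quad
  r \;=\; \sqrt{d-1} \;\approx\; \sqrt{d} .
\]
```

So the mode of the radius sits at about $\sqrt{d}$ even though the density of $Z$ itself peaks at the origin.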

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
Important Philosophical Consequence: "Average People"
Parents Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2} = \sqrt{2}$
Can extend to $Z_1, \dots, Z_n$
Where do they all go? (we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), As Vectors From Origin:
(Figure: Thanks to members.tripod.com)
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
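The near-orthogonality of independent Gaussian vectors is easy to see in simulation. The sketch below draws pairs of standard normal vectors and records how far their angle is from 90 degrees; the dimensions, the 10 repetitions, and the names are arbitrary illustrative choices.

```python
import math
import random

def angle_degrees(u, v):
    """Angle between two vectors from the origin, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cosine))

rng = random.Random(0)
worst = {}
for d in (2, 200, 20000):
    deviations = []
    for _ in range(10):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        deviations.append(abs(angle_degrees(z1, z2) - 90.0))
    # Largest observed deviation from a right angle at this dimension.
    worst[d] = max(deviations)
    print(d, round(worst[d], 2))
```

The deviation shrinks roughly like $d^{-1/2}$, matching the $O_p(d^{-1/2})$ term.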

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sqrt{d}$
Within plane, can "rotate towards $\sqrt{d}$ Unit Simplex"
All Gaussian data sets are: "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
Points are pairwise equidistant, dist ~ $\sqrt{2d}$
Points lie at vertices of: $\sqrt{2d}$ "regular $n$-hedron"
Again "randomness in data" is only in rotation
Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
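A numerical stand-in for the simulation view: since the side lengths of the 3-point triangle do not change under rotation, rigidity can be checked by comparing the pairwise distances, rescaled by $\sqrt{2d}$, with 1. This sketch skips the rotation to the screen plane and only looks at those lengths; the names and seed are assumptions.

```python
import math
import random

def scaled_triangle_sides(d, rng):
    """Pairwise distances of 3 standard normal points, divided by sqrt(2d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    scale = math.sqrt(2 * d)
    return [math.dist(pts[i], pts[j]) / scale for i, j in ((0, 1), (0, 2), (1, 2))]

rng = random.Random(0)
spread = {}
for d in (2, 20, 200, 20000):
    triangles = [scaled_triangle_sides(d, rng) for _ in range(10)]
    # Largest deviation of any rescaled side from 1 over the 10 data sets.
    spread[d] = max(abs(side - 1.0) for tri in triangles for side in tri)
    print(d, round(spread[d], 3))
```

For d = 2 the triangles vary wildly, while for d = 20000 every rescaled side is within a few percent of 1, i.e. the data sets are nearly the same equilateral triangle up to rotation.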

Page 6: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

SigClust Gaussian null distribution

2nd Key Idea: Mod Out Rotations

Only need to estimate diagonal matrix

But still have HDLSS problems

E.g. Perou 500 data:

Dimension $d = 9674$

Sample Size $n = 533$

Still need to estimate $d = 9674$ parameters

SigClust Gaussian null distribution

3rd Key Idea: Factor Analysis Model

Model Covariance as Biology + Noise:

$\Sigma = \Sigma_B + \sigma_N^2 I$

Where

$\Sigma_B$ is "fairly low dimensional"

$\sigma_N^2$ is estimated from background noise

SigClust Gaussian null distribution

Estimation of Background Noise $\sigma_N^2$:

Reasonable model (for each gene):

Expression = Signal + Noise

"noise" is roughly Gaussian

"noise" terms essentially independent

(across genes)

SigClust Estimation of Background Noise

Hope: Most Entries are "Pure Noise" (Gaussian)

A Few ($\ll 1/4$) Are Biological Signal – Outliers

How to Check?

Q-Q plots

Background: Graphical Goodness of Fit

Basis:

Cumulative Distribution Function (CDF)

$F(x) = P\{X \le x\}$

Probability quantile notation:

$p$ for "probability" and $q$ for "quantile"

$p = F(q)$, $q = F^{-1}(p)$

Q-Q plots

Two types of CDF:

1. Theoretical: $p = F(q) = P\{X \le q\}$

2. Empirical, based on data $X_1, \ldots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$
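The empirical CDF can be sketched in a few lines of Python. This is only an illustration of the counting definition (the fraction of data points at or below q); the function name `empirical_cdf` and the four toy data values are made up for the example.

```python
from bisect import bisect_right


def empirical_cdf(data):
    """Return F_hat with F_hat(q) = #{i : X_i <= q} / n."""
    xs = sorted(data)
    n = len(xs)

    def f_hat(q):
        # bisect_right counts how many sorted values are <= q
        return bisect_right(xs, q) / n

    return f_hat


f_hat = empirical_cdf([2.0, -1.0, 0.5, 3.0])
print(f_hat(0.5))  # 0.5: two of the four points are <= 0.5
```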

Q-Q plots

Comparison Visualizations:

(compare a theoretical with an empirical)

3. P-P plot:

plot $\hat{p}$ vs. $p$ for a grid of $q$ values

4. Q-Q plot:

plot $\hat{q}$ vs. $q$ for a grid of $p$ values
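A hedged sketch of how Q-Q plot coordinates might be computed against a standard Gaussian. The probability grid $p_i = (i - 0.5)/n$ is one common convention and is an assumption here, as are the function name `qq_points` and the toy data; the slides do not specify a grid.

```python
from statistics import NormalDist


def qq_points(data):
    """Pairs (theoretical q, empirical q_hat) on the grid p_i = (i - 0.5) / n."""
    xs = sorted(data)  # the i-th order statistic is the empirical quantile
    n = len(xs)
    std_normal = NormalDist()  # theoretical distribution: N(0, 1)
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]


points = qq_points([0.3, -1.2, 0.9, 2.1, -0.4])
for q_theory, q_data in points:
    print(round(q_theory, 3), q_data)
```

Plotting the second coordinate against the first gives the Q-Q curve; a curve near the 45° line indicates agreement with the theoretical distribution.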

Q-Q plots

Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs

when

Q-Q curve is near 45° line

(general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails

So Usually More Useful

Q-Q plots

Gaussian departures from line:

• Looks much like

• Wiggles all random variation?

• But there are n = 10000 data points…

• How to assess signal & noise?

• Need to understand sampling variation

Q-Q plots

Need to understand sampling variation

• Approach: Q-Q envelope plot

– Simulate from Theoretical Distribution

– Samples of same size

– About 100 samples gives "good visual impression"

– Overlay resulting 100 QQ-curves

– To visually convey natural sampling variation
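The envelope idea can be sketched as follows, assuming a standard Gaussian as the theoretical distribution. The slides overlay all 100 simulated Q-Q curves; this sketch summarizes them by a pointwise minimum and maximum instead, which is a simplification. The names `qq_envelope` and `outside_envelope` are hypothetical.

```python
import random
from statistics import NormalDist


def qq_envelope(n, n_sim=100, seed=0):
    """Theoretical N(0, 1) quantiles plus a pointwise envelope from n_sim samples."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    theo = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    # Each simulated sample has the same size n as the data and is sorted,
    # so curve[i] is its i-th empirical quantile.
    curves = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n_sim)]
    lower = [min(curve[i] for curve in curves) for i in range(n)]
    upper = [max(curve[i] for curve in curves) for i in range(n)]
    return theo, lower, upper


def outside_envelope(data, lower, upper):
    """Indices of order statistics of the data that leave the simulated envelope."""
    xs = sorted(data)
    return [i for i, x in enumerate(xs) if not lower[i] <= x <= upper[i]]


theo, lower, upper = qq_envelope(200)
```

Order statistics of the data that fall outside the envelope suggest departures larger than natural sampling variation.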

Q-Q plots

Gaussian departures from line:

• Harder to see

• But clearly there

• Conclude non-Gaussian

• Really needed n = 10000 data points…

(why bigger sample size was used)

• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian

• Except near the middle

• Q-Q curve is very linear there (closely follows 45° line)

• Suggests Gaussian approx is good there

• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
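A minimal sketch of a MAD-based scale estimate for the background noise level, assuming all entries of the data matrix are pooled into one list. Dividing by the 0.75 quantile of the standard normal (about 0.6745) makes the MAD consistent for the Gaussian standard deviation. The function name `mad_sigma` and the simulated "noise plus a few outliers" data are illustrative only.

```python
import random
from statistics import NormalDist, median


def mad_sigma(values):
    """Median absolute deviation from the median, rescaled to estimate a Gaussian sd."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return mad / NormalDist().inv_cdf(0.75)


rng = random.Random(0)
noise = [rng.gauss(0.0, 2.0) for _ in range(5000)]    # "pure noise" entries, sd = 2
signal = [rng.gauss(20.0, 2.0) for _ in range(250)]   # a few "biological signal" outliers
sigma_hat = mad_sigma(noise + signal)
print(round(sigma_hat, 2))  # stays close to 2 despite the outliers
```

The robustness to a small fraction of outliers is the reason a MAD-type estimate suits the "mostly noise, a few signal entries" model.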

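SigClust Gaussian null distribution

Estimation of Biological Covariance $\Sigma_B$:

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution use eigenvalues:

$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$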
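The thresholding rule amounts to a one-line operation on the sample eigenvalues. A small sketch, with the function name `null_eigenvalues` and the toy numbers chosen only for illustration:

```python
def null_eigenvalues(sample_eigenvalues, sigma_n_sq):
    """Hard threshold: replace every eigenvalue below sigma_N^2 by sigma_N^2."""
    return [max(lam, sigma_n_sq) for lam in sample_eigenvalues]


lams = null_eigenvalues([5.0, 2.0, 0.3, 0.0], sigma_n_sq=1.0)
print(lams)  # [5.0, 2.0, 1.0, 1.0]
```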

SigClust Estimation of Eigenvalues

All eigenvalues > $\sigma_N^2$

Suggests biology is very strong here

I.e. very strong signal to noise ratio

Have more structure than can analyze (with only 533 data points)

Data are very far from pure noise

So don't actually use Factor Analysis Model

Instead end up with estimated eigenvalues

SigClust Estimation of Eigenvalues

Do we need the factor model?

Explore this with another data set (with fewer genes)

This time:

n = 315 cases, d = 306 genes

First ~110 eigenvalues > $\sigma_N^2$

Rest are negligible

So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$X_i = (X_{i,1}, \ldots, X_{i,d})^T$

where $X_{i,j} \sim N(0, \lambda_j)$ (indep)

Again rotation invariance makes this work

(and location invariance)
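A rough Python sketch of this simulation step, under stated assumptions. The cluster index is taken to be within-class sum of squares over total sum of squares, which is the usual 2-means CI but is not spelled out on these slides. The 2-means fit is plain Lloyd iteration from a single random start, so it may land in a local minimum. All function names and the toy eigenvalues are hypothetical.

```python
import random


def simulate_null(n, eigenvalues, rng):
    """n independent draws X_i with X_ij ~ N(0, lambda_j), independent over j."""
    return [[rng.gauss(0.0, lam ** 0.5) for lam in eigenvalues] for _ in range(n)]


def _mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def _ss(rows, center):
    """Sum of squared Euclidean distances from the rows to center."""
    return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))


def cluster_index(data, labels):
    """Within-class sum of squares over total sum of squares (assumed CI definition)."""
    total = _ss(data, _mean(data))
    within = 0.0
    for k in (0, 1):
        rows = [row for row, lab in zip(data, labels) if lab == k]
        within += _ss(rows, _mean(rows))
    return within / total


def two_means_labels(data, rng, n_iter=50):
    """Lloyd iterations for k = 2 from one random start (local minima are possible)."""
    centers = rng.sample(data, 2)
    labels = None
    for _ in range(n_iter):
        new = [0 if _ss([row], centers[0]) <= _ss([row], centers[1]) else 1
               for row in data]
        if new == labels or len(set(new)) < 2:
            break  # converged, or one class would become empty
        labels = new
        centers = [_mean([r for r, lab in zip(data, labels) if lab == k])
                   for k in (0, 1)]
    return labels


rng = random.Random(0)
eigenvalues = [10.0, 4.0] + [1.0] * 18  # toy null eigenvalues, d = 20
null_cis = []
for _ in range(20):
    x = simulate_null(30, eigenvalues, rng)
    null_cis.append(cluster_index(x, two_means_labels(x, rng)))
```

Simulating with mean 0 and a diagonal covariance is enough here because the CI is unchanged by shifts and rotations of the data.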

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

• Spirit similar to DiProPerm

• But now significance happens for smaller values of CI

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings

(e.g. for those found in heat map)

(Use given class labels)

II. Test if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters?

Perou 500 DWD Directions View – real clusters?

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:

Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile

Don't "believe these"

But useful for comparison

Especially when Empirical Quantile = 0
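Both kinds of p-value can be sketched in Python as below, assuming the simulated null CIs are already available. The empirical version is the fraction of null CIs at or below the data CI (small CI is the significant direction); the Gaussian-fit version evaluates a normal CDF fitted by the mean and standard deviation of the null CIs. The name `sigclust_p_values` and every number in the example are made up for illustration and are not results from the talk.

```python
from statistics import NormalDist, mean, stdev


def sigclust_p_values(data_ci, null_cis):
    """Return (empirical p-value, Gaussian-fit p-value) for a lower-tail test."""
    empirical = sum(ci <= data_ci for ci in null_cis) / len(null_cis)
    fitted = NormalDist(mean(null_cis), stdev(null_cis)).cdf(data_ci)
    return empirical, fitted


# Hypothetical simulated null CIs and a hypothetical data CI.
toy_null = [0.80, 0.82, 0.79, 0.81, 0.83, 0.78]
p_emp, p_fit = sigclust_p_values(0.70, toy_null)
print(p_emp, p_fit)
```

When the data CI lies below every simulated CI the empirical p-value is exactly 0, and the Gaussian fit is what still allows different strongly significant splits to be compared.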

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings

• Empirical p-val = 0

– Definitely 2 clusters

• Gaussian fit p-val = 0.0045

– same strong evidence

• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split

• Empirical p-val = 0

– definitely 2 clusters

• Gaussian fit p-val = $10^{-10}$

– Stronger evidence than above

– Such comparison is value of Gaussian fit

– Makes sense (since CI is min possible)

• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$

Luminal A vs. B: p-val = 0.0045

Her 2 vs. Basal: p-val = $10^{-10}$

Split Luminal A: p-val = $10^{-7}$

Split Luminal B: p-val = 0.058

Split Her 2: p-val = 0.10

Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

• All previous splits were real

• Most not able to split further

• Exception is Basal, already known

• Chuck Perou has good intuition (insight about signal vs. noise)

• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

I.e. Uses limiting operations

Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:

Laws of Large Numbers (Consistency)

Central Limit Theorems (Quantify Errors)

Occasional misconceptions:

Indicates behavior for large samples

Thus only makes sense for "large" samples

Models phenomenon of "increasing data"

So other flavors are useless?

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:

Approximation provides insights

Can find simple underlying structure

In complex situations

Thus various flavors are fine:

$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, and a fourth limit as a quantity tends to $0$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)

[Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

(Image thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$):

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
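A quick simulation can make the distance-to-origin result concrete: the norm of a d-dimensional standard normal vector stays within a roughly constant-width band around the square root of d as d grows. This is only an illustrative sketch; the function name `norms` and the chosen dimensions are arbitrary.

```python
import math
import random


def norms(d, n_draws, rng):
    """Euclidean norms of n_draws independent N_d(0, I_d) vectors."""
    return [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
            for _ in range(n_draws)]


rng = random.Random(0)
for d in (2, 20, 200, 2000):
    r = norms(d, 200, rng)
    # Compare sqrt(d) with the observed range of ||Z||
    print(d, round(math.sqrt(d), 2), round(min(r), 2), round(max(r), 2))
```

The spread of the norms does not grow with d, while the center moves out like the square root of d, so in relative terms the data concentrate near a sphere.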

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$

- Yet origin is point of highest density?

- Paradox resolved by:

density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Spherical Coordinates

Look At Integrand w.r.t. Radius $r$

Can Show Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
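The balance between the two effects can be written out as a short worked calculation. This is a standard fact about the radial (chi) distribution rather than a formula taken from the slides; the symbol $R$ for the radius is introduced here for illustration.

```latex
% Radial density of R = ||Z|| for Z ~ N_d(0, I_d):
% r^{d-1} comes from Lebesgue measure in spherical coordinates (pushes mass out),
% e^{-r^2/2} comes from the Gaussian density (pulls data in).
f_R(r) \;\propto\; r^{d-1}\, e^{-r^2/2}, \qquad r > 0 .

% Maximize the log of the radial density:
\frac{d}{dr}\Bigl[(d-1)\log r - \tfrac{1}{2} r^2\Bigr]
  \;=\; \frac{d-1}{r} - r \;=\; 0
\quad\Longrightarrow\quad
r \;=\; \sqrt{d-1} \;\approx\; \sqrt{d}.
```

So although the density of $Z$ itself peaks at the origin, the density of the radius peaks near $\sqrt{d}$.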

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Distance Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dimensions)
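The pairwise-distance statement can be checked numerically with a small sketch like the following. The function name `pairwise_distance` and the chosen dimensions are illustrative assumptions.

```python
import math
import random


def pairwise_distance(d, rng):
    """Euclidean distance between two independent N_d(0, I_d) draws."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return math.dist(z1, z2)


rng = random.Random(0)
for d in (2, 20, 200, 2000):
    dists = [pairwise_distance(d, rng) for _ in range(100)]
    # Ratio to sqrt(2d) should tighten around 1 as d grows
    ratios = [x / math.sqrt(2 * d) for x in dists]
    print(d, round(min(ratios), 3), round(max(ratios), 3))
```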

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors

o They Needed to Find Food

o Food Exists in 3-d World

(we can only perceive 3 dimensions)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dimensional Angles (as $d \to \infty$), As Vectors From Origin:

(Image thanks to members.tripod.com; vectors labeled $Z_1$ and $Z_2$)

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal?

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
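The near-orthogonality of independent Gaussian vectors can likewise be illustrated by simulation. This is a sketch only; `angle_degrees` is a hypothetical helper, and the cosine is clamped to [-1, 1] purely to guard against floating-point rounding.

```python
import math
import random


def angle_degrees(d, rng):
    """Angle at the origin between two independent N_d(0, I_d) vectors."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    cos = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cos))


rng = random.Random(0)
for d in (2, 20, 200, 2000):
    angles = [angle_degrees(d, rng) for _ in range(100)]
    # The range of angles should shrink towards 90 degrees as d grows
    print(d, round(min(angles), 1), round(max(angles), 1))
```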

HDLSS Asymptotics: Geometrical Representation

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices"

(Modulo Rotation)

"Randomness" appears only in rotation of simplex

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$(n-1)$-dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asymptotics: Geometrical Representation

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets

• In dimensions d = 2, 20, 200, 20000

• Generate hyperplane of dimension 2

• Rotate that to plane of screen

• Rotate within plane, to make "comparable"

• Repeat 10 times, use different colors

HDLSS Asymptotics: Geometrical Representation

Simulation View: Shows "Rigidity after Rotation"
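A numerical counterpart to this simulation view, as a hedged sketch: instead of rotating each 3-point data set to the plane of the screen, it is enough to look at the three side lengths of the triangle, since those determine its shape up to rotation. Side lengths are divided by the square root of 2d so that an equilateral limit shows up as values near 1. The function name `scaled_side_lengths` is illustrative.

```python
import itertools
import math
import random


def scaled_side_lengths(d, rng, n_points=3):
    """Pairwise distances of n_points N_d(0, I_d) draws, divided by sqrt(2d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_points)]
    return [math.dist(a, b) / math.sqrt(2 * d)
            for a, b in itertools.combinations(pts, 2)]


rng = random.Random(0)
for d in (2, 20, 200, 20000):
    for _ in range(10):  # repeat 10 times, as in the simulation described above
        sides = scaled_side_lengths(d, rng)
        print(d, [round(s, 3) for s in sides])
```

For small d the triangles vary widely in shape; for large d all three scaled sides sit close to 1, which is the rigidity the slides describe.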


SigClust Gaussian null distributrsquon

3rd Key Idea Factor Analysis Model

Model Covariance as Biology + Noise

Where

is ldquofairly low dimensionalrdquo

is estimated from background noise2NB

INB 2

SigClust Gaussian null distributrsquon

Estimation of Background Noise 2N

SigClust Gaussian null distributrsquon

Estimation of Background Noise

Reasonable model (for each gene)

Expression = Signal + Noise

ldquonoiserdquo is roughly Gaussian

ldquonoiserdquo terms essentially independent

(across genes)

2N

SigClust Estimation of Background Noise

Hope MostEntries areldquoPure Noise (Gaussian)rdquo

A Few (ltlt frac14)Are BiologicalSignal ndashOutliers

How to Check

Q-Q plots

Background Graphical Goodness of Fit

Basis

Cumulative Distribution Function (CDF)

Probability quantile notation

for probabilityrdquo and quantile

xXPxF

p q

qFp pFq 1

Q-Q plots

Two types of CDF

1 Theoretical

2 Empirical based on data nXX 1

qXPqFp

n

qXiqFp i

ˆˆ

Q-Q plots

Comparison Visualizations

(compare a theoretical with an empirical)

3P-P plot

plot vs

for a grid of values

4Q-Q plot

plot vs

for a grid of values

q

q

p p

p

q

Q-Q plotsIllustrative graphic (toy data set)

Q-Q plotsIllustrative graphic (toy data set)

Q-Q plotsIllustrative graphic (toy data set)

Empirical Qs near Theoretical Qs

when

Q-Q curve is near 450 line

(general use of Q-Q plots)

Alternate TerminologyQ-Q Plots = ROC curves

P-P Plots = ldquoPrecision Recallrdquo Curves

Highlights Different Distributional Aspects

Statistical Folklore Q-Q Highlights Tails

So Usually More Useful

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Looks much like

bull Wiggles all random variation

bull But there are n = 10000 data pointshellip

bull How to assess signal amp noise

bull Need to understand sampling variation

Q-Q plotsNeed to understand sampling variation

bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon

ndash Samples of same size

ndash About 100 samples gives

ldquogood visual impressionrdquo

ndash Overlay resulting 100 QQ-curves

ndash To visually convey natural sampling variation

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Harder to see

bull But clearly there

bull Conclude non-Gaussian

bull Really needed n = 10000 data pointshellip

(why bigger sample size was used)

bull Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533 d = 9456

SigClust Estimation of Background Noise

SigClust Estimation of Background Noise

bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there

(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

SigClust Gaussian null distributrsquon

Estimation of Biological Covariance

Keep only ldquolargerdquo eigenvalues

Defined as

So for null distribution use eigenvalues

B

d 21 2Nj

)max()max( 221 NdN

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze

(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues

2N

SigClust Estimation of Eigenvalrsquos

Do we need the factor model Explore this with another data set

(with fewer genes) This time

n = 315 cases d = 306 genes

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

Try another data set with fewer genesThis time

First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at

2N

2N

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

where (indep)

Again rotation invariance makes this work

(and location invariance)

jij NX 0~

id

i

X

X

1

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

bull Spirit similar to DiProPermbull But now significance happens for

smaller values of CI

An example (details to follow)

P-val = 00045

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

IITest if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal

Perou 500 PCA View ndash real clusters

Perou 500 DWD Dirrsquons View ndash real clusters

Perou 500 ndash Fundamental Question

Are Luminal A amp Luminal B really distinct clusters

Famous forFar Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 00045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - Same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her 2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid
But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding (Huang et al. 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: based on asymptotic analysis

I.e. uses limiting operations, almost always $\lim_{n \to \infty}$

Workhorse method for much insight:
• Laws of Large Numbers (Consistency)
• Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for "large" samples
• Models phenomenon of "increasing data"
• So other flavors are useless?

HDLSS Asymptotics

Modern Mathematical Statistics: based on asymptotic analysis

Real Reasons:
• Approximation provides insights
• Can find simple underlying structure in complex situations

Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…

• Surprising (many times) [Think I've got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall, Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

(Graphic: thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
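
A quick simulation makes the distance to the origin concrete. This is a sketch under assumptions: the function name, the 200 replications and the grid of dimensions are choices made here, not taken from the slides.

```python
import math
import random
from statistics import mean, stdev

def norm_summary(d, n_rep=200, seed=0):
    """Mean and sd of ||Z|| over n_rep draws of Z ~ N_d(0, I_d)."""
    rng = random.Random(seed)
    norms = [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
             for _ in range(n_rep)]
    return mean(norms), stdev(norms)

for d in (2, 20, 200, 2000):
    avg, sd = norm_summary(d)
    print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.3f}  "
          f"mean ||Z||={avg:7.3f}  sd={sd:.3f}")
```

The mean tracks $\sqrt{d}$ while the spread stays of order 1, which is the $O_p(1)$ term.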

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider volume of unit sphere in $\mathbb{R}^d$
Find as integral in sph'l coordinates
Look at integrand w.r.t. $r$
Can show puts ~ all weight near $r = 1$

Lebesgue Measure pushes mass out
Density pulls data in
$\sqrt{d}$ is the balance point
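
The balance point can be worked out from the density of the radius $R = \|Z\|$. The derivation below is a standard calculation added here for illustration (the chi distribution), not a formula taken from the slides.

```latex
% Density of the radius R = \|Z\| for Z \sim N_d(0, I_d) (chi distribution):
f_R(r) = \frac{2^{1 - d/2}}{\Gamma(d/2)} \; r^{d-1} \, e^{-r^2/2}, \qquad r > 0.
% r^{d-1}: Lebesgue measure (surface area of the sphere of radius r), pushes mass out.
% e^{-r^2/2}: Gaussian density, pulls data in.
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\quad \Longrightarrow \quad r^{*} = \sqrt{d - 1} \approx \sqrt{d}.
```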

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why can't I have average children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
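
The same kind of check works for the distance between two independent points. Again a sketch: the function name, replication count and dimensions are assumptions made for the example.

```python
import math
import random
from statistics import mean, stdev

def distance_summary(d, n_rep=200, seed=0):
    """Mean and sd of ||Z1 - Z2|| for independent Z1, Z2 ~ N_d(0, I_d)."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_rep):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        dists.append(math.dist(z1, z2))
    return mean(dists), stdev(dists)

for d in (2, 20, 200, 2000):
    avg, sd = distance_summary(d)
    print(f"d={d:5d}  sqrt(2d)={math.sqrt(2 * d):7.3f}  "
          f"mean dist={avg:7.3f}  sd={sd:.3f}")
```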

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), as vectors from origin:

(Graphic: thanks to members.tripod.com)

$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
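
The near-orthogonality can be checked numerically. A minimal sketch, with function names and simulation sizes chosen here for illustration:

```python
import math
import random
from statistics import mean, stdev

def angle_deg(u, v):
    """Angle in degrees between u and v, viewed as vectors from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norms))))

def angle_summary(d, n_rep=200, seed=0):
    """Mean and sd of Angle(Z1, Z2) for independent Z1, Z2 ~ N_d(0, I_d)."""
    rng = random.Random(seed)
    angles = []
    for _ in range(n_rep):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_deg(z1, z2))
    return mean(angles), stdev(angles)

for d in (2, 20, 200, 2000):
    avg, sd = angle_summary(d)
    print(f"d={d:5d}  mean angle={avg:7.2f}  sd={sd:6.2f}")
```

The spread of the angle shrinks like $d^{-1/2}$ around 90 degrees.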

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
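
The steps of the simulation view can be sketched without graphics. The conventions here are assumptions made for the example, not necessarily those of the original figure: distances are rescaled by $\sqrt{2d}$, the first point is placed at the origin, and the second on the horizontal axis.

```python
import math
import random

def triangle_in_plane(d, rng):
    """Draw 3 points from N_d(0, I_d) and return planar coordinates of the
    triangle they span, with pairwise distances rescaled by sqrt(2d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    scale = math.sqrt(2 * d)
    a = math.dist(pts[0], pts[1]) / scale
    b = math.dist(pts[0], pts[2]) / scale
    c = math.dist(pts[1], pts[2]) / scale
    # Rigid motion: point 1 -> origin, point 2 -> (a, 0), point 3 above the axis.
    x = (a * a + b * b - c * c) / (2 * a)
    y = math.sqrt(max(b * b - x * x, 0.0))
    return [(0.0, 0.0), (a, 0.0), (x, y)]

rng = random.Random(0)
for d in (2, 20, 200, 20000):
    worst = 0.0
    for _ in range(10):                      # repeat 10 times
        (_, _), (a, _), (x, y) = triangle_in_plane(d, rng)
        sides = (a, math.hypot(x, y), math.hypot(x - a, y))
        worst = max(worst, max(abs(s - 1.0) for s in sides))
    print(f"d={d:6d}  largest |side - 1| over 10 triangles: {worst:.3f}")
```

For large $d$ every rescaled triangle is close to the equilateral one with vertices $(0, 0)$, $(1, 0)$, $(1/2, \sqrt{3}/2)$.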


SigClust Gaussian null distribut'n

Estimation of Background Noise $\sigma_N^2$:

Reasonable model (for each gene):

Expression = Signal + Noise

"noise" is roughly Gaussian

"noise" terms essentially independent (across genes)

SigClust Estimation of Background Noise

Hope: most entries are "pure noise" (Gaussian)

A few (<< 1/4) are biological signal - outliers

How to check?

Q-Q plots

Background: Graphical Goodness of Fit

Basis: Cumulative Distribution Function (CDF)

$F(x) = P\{X \le x\}$

Probability quantile notation: for "probability" $p$ and "quantile" $q$

$p = F(q)$, $q = F^{-1}(p)$

Q-Q plots

Two types of CDF:

1. Theoretical: $p = F(q) = P\{X \le q\}$

2. Empirical, based on data $X_1, \dots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$

Q-Q plots

Comparison Visualizations
(compare a theoretical with an empirical)

3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values

4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
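
The two plots differ only in which axis carries the grid. A minimal sketch of the point sets, assuming the sorted data as the grid of $q$ values and the plotting positions $(i - 0.5)/n$ as the grid of $p$ values (one common convention, chosen here for illustration):

```python
from statistics import NormalDist

def pp_points(data, dist):
    """(theoretical p, empirical p-hat) pairs on the grid of sorted data values."""
    xs = sorted(data)
    n = len(xs)
    return [(dist.cdf(q), (i + 1) / n) for i, q in enumerate(xs)]

def qq_points(data, dist):
    """(theoretical q, empirical q-hat) pairs on the grid p_i = (i - 0.5) / n."""
    xs = sorted(data)
    n = len(xs)
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

# Toy check: data placed exactly at the N(0, 1) quantiles lie on the 45 degree line.
std_normal = NormalDist()
toy = [std_normal.inv_cdf((i + 0.5) / 50) for i in range(50)]
qq = qq_points(toy, std_normal)
pp = pp_points(toy, std_normal)
print(qq[0], qq[-1])
```

Plotting the second coordinate against the first gives the Q-Q (or P-P) curve.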

Q-Q plots

Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line
(general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails, So Usually More Useful

Q-Q plots

Gaussian? Departures from line?

• Looks much like
• Wiggles all random variation?
• But there are n = 10,000 data points…
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots

Need to understand sampling variation

• Approach: Q-Q envelope plot
  - Simulate from Theoretical Dist'n
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 QQ-curves
  - To visually convey natural sampling variation
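
The envelope idea can be sketched numerically. The slides overlay the 100 simulated Q-Q curves; summarising them by a pointwise min/max band, as done below, is a simplification chosen here, and the function names and the toy samples are assumptions for the example.

```python
import random
from statistics import NormalDist

def qq_envelope(n, n_sim=100, seed=0, dist=NormalDist()):
    """Pointwise min and max of n_sim sorted samples of size n drawn from dist."""
    sims = [sorted(dist.samples(n, seed=seed + k)) for k in range(n_sim)]
    lower = [min(col) for col in zip(*sims)]
    upper = [max(col) for col in zip(*sims)]
    return lower, upper

def count_outside(data, lower, upper):
    """Number of order statistics of data falling outside the envelope."""
    return sum(not (lo <= x <= hi)
               for x, lo, hi in zip(sorted(data), lower, upper))

n = 200
lower, upper = qq_envelope(n)
rng = random.Random(123)
gaussian_like = [rng.gauss(0.0, 1.0) for _ in range(n)]
skewed = [rng.expovariate(1.0) - 1.0 for _ in range(n)]   # mean 0, not Gaussian
print(count_outside(gaussian_like, lower, upper),
      count_outside(skewed, lower, upper))
```

Departures that leave the band exceed what natural sampling variation of this sample size would produce.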

Q-Q plots

Gaussian? Departures from line?

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10,000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

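SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:

Keep only "large" eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution use eigenvalues:

$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$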
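
The noise estimate and the thresholding step can be sketched together. The rescaling of the MAD by the MAD of a standard normal (about 0.6745) is the usual way to turn it into a standard deviation and is an assumption here, as are the function names and the toy numbers.

```python
import random
from statistics import NormalDist, median

# MAD of N(0, 1): its upper quartile, about 0.6745.
MAD_OF_STANDARD_NORMAL = NormalDist().inv_cdf(0.75)

def mad_noise_sd(entries):
    """Robust noise sd: median absolute deviation, rescaled for Gaussian data."""
    med = median(entries)
    mad = median([abs(x - med) for x in entries])
    return mad / MAD_OF_STANDARD_NORMAL

def null_eigenvalues(sample_eigenvalues, noise_var):
    """Eigenvalues for the null Gaussian: max(lambda_j, sigma_N^2)."""
    return [max(lam, noise_var) for lam in sample_eigenvalues]

# Toy data: mostly N(0, 2^2) noise plus a few large "signal" entries.
rng = random.Random(0)
entries = [rng.gauss(0.0, 2.0) for _ in range(5000)] + [50.0] * 100
sd_hat = mad_noise_sd(entries)
print(f"estimated noise sd = {sd_hat:.3f}")
print(null_eigenvalues([10.0, 5.0, 0.5, 0.1], noise_var=4.0))
```

The median-based estimate is barely moved by the outlying entries, which is why it suits a background of pure noise with a few signal values.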

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model?

Explore this with another data set (with fewer genes)

This time: n = 315 cases, d = 306 genes

SigClust Estimation of Eigenval's

Try another data set, with fewer genes. This time:

First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work
(and location invariance)


HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

SigClust Gaussian null distribut'n

Estimation of Background Noise $\sigma_N^2$:

Reasonable model (for each gene):

Expression = Signal + Noise

"noise" is roughly Gaussian

"noise" terms essentially independent

(across genes)

SigClust Estimation of Background Noise

Hope: Most Entries are "Pure Noise" (Gaussian)

A Few (<< ¼) Are Biological Signal - Outliers

How to Check?

Q-Q plots

Background: Graphical Goodness of Fit

Basis:

Cumulative Distribution Function (CDF): $F(x) = P\{X \le x\}$

Probability quantile notation:

$p = F(q)$, $q = F^{-1}(p)$

for "probability" $p$ and quantile $q$

Q-Q plots

Two types of CDF:

1. Theoretical: $p = F(q) = P\{X \le q\}$

2. Empirical, based on data $X_1, \ldots, X_n$: $\hat{p} = \hat{F}(q) = \#\{i : X_i \le q\} / n$

Q-Q plots

Comparison Visualizations:

(compare a theoretical with an empirical)

3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values

4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values

Q-Q plots: Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs

when

Q-Q curve is near 45° line

(general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails

So Usually More Useful

Q-Q plots: Gaussian? Departures from line?

• Looks much like?
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots: Need to understand sampling variation

• Approach: Q-Q envelope plot
  – Simulate from Theoretical Dist'n
  – Samples of same size
  – About 100 samples gives "good visual impression"
  – Overlay resulting 100 QQ-curves
  – To visually convey natural sampling variation
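The envelope idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the grid $p_i = (i - 0.5)/n$, the sample size n = 500 and the heavy-tailed toy data are all hypothetical choices, not taken from the slides, and the envelope is summarized by its pointwise min and max rather than drawn.

```python
import random
from statistics import NormalDist

def theoretical_quantiles(n, dist=NormalDist()):
    """Theoretical quantiles q = F^{-1}(p) on the (assumed) grid p_i = (i - 0.5) / n."""
    return [dist.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def qq_envelope(n, n_sim=100, seed=0):
    """Simulate n_sim standard Gaussian samples of size n and return the
    pointwise min and max of their sorted values (band of overlaid Q-Q curves)."""
    rng = random.Random(seed)
    lower = [float("inf")] * n
    upper = [float("-inf")] * n
    for _ in range(n_sim):
        sample = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        lower = [min(a, b) for a, b in zip(lower, sample)]
        upper = [max(a, b) for a, b in zip(upper, sample)]
    return lower, upper

def fraction_outside(data, lower, upper):
    """Fraction of empirical quantiles of `data` that leave the envelope."""
    emp = sorted(data)
    outside = sum(1 for q, lo, hi in zip(emp, lower, upper) if q < lo or q > hi)
    return outside / len(emp)

n = 500
theo = theoretical_quantiles(n)
lower, upper = qq_envelope(n)

# Hypothetical heavy-tailed toy data: a Gaussian divided by sqrt(uniform).
rng = random.Random(1)
heavy_tailed = [rng.gauss(0.0, 1.0) / max(rng.random(), 1e-9) ** 0.5 for _ in range(n)]
print("fraction of quantiles outside envelope:", fraction_outside(heavy_tailed, lower, upper))
```

Plotting `theo` against each simulated sorted sample would give the overlaid curves; the min/max band is just a compact stand-in for that picture.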

Q-Q plots: Gaussian? Departures from line?

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
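A minimal sketch of a MAD scale estimate, for readers who have not used one. It assumes the usual Gaussian-consistent rescaling (divide the median absolute deviation by $\Phi^{-1}(0.75) \approx 0.6745$); the function name `mad_sigma` and the toy mixture (95% unit-variance noise, 5% large "signal" entries) are hypothetical and not from the slides.

```python
import random
from statistics import NormalDist, median, stdev

def mad_sigma(values):
    """Robust estimate of the noise standard deviation:
    median absolute deviation from the median, rescaled so that it
    estimates sigma when the bulk of the data is Gaussian."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return mad / NormalDist().inv_cdf(0.75)  # divide by about 0.6745

# Hypothetical toy data: mostly pure noise (sd 1), a few large "signal" entries (sd 10).
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(9500)]
signal = [rng.gauss(0.0, 10.0) for _ in range(500)]
entries = noise + signal

print("MAD-based sigma:", round(mad_sigma(entries), 2))
print("ordinary sd    :", round(stdev(entries), 2))
```

The MAD-based value stays close to the noise level, while the ordinary standard deviation is inflated by the few large entries, which is the reason a robust estimate is attractive here.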

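SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution, use eigenvalues:

$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$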
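The thresholding rule above is a one-liner in code. The sketch below assumes the sample eigenvalues and the noise variance are already available; the function name and the numbers are hypothetical.

```python
def null_eigenvalues(sample_eigenvalues, sigma2_noise):
    """Hard-threshold eigenvalues at the background noise level:
    lambda_j is kept if it exceeds sigma_N^2, otherwise replaced by sigma_N^2."""
    return [max(lam, sigma2_noise) for lam in sample_eigenvalues]

# Hypothetical sample eigenvalues and noise variance, for illustration only.
lambdas = [9.0, 4.0, 0.5, 0.1, 0.0]
print(null_eigenvalues(lambdas, sigma2_noise=1.0))
```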

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$!
Suggests biology is very strong here!
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes

SigClust Estimation of Eigenval's

First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$X_i = (X_{i,1}, \ldots, X_{i,d})^T$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work

(and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI,

With simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
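A small sketch of this simulation step, under assumptions that are not from the slides: the cluster index is taken as within-cluster sum of squares over total sum of squares, 2-means is a bare-bones Lloyd iteration with a few random starts, and the data, eigenvalues and sizes (n = 40, d = 10, 50 simulations) are hypothetical toy values.

```python
import random

def two_means_ci(points, rng, n_starts=3, n_iter=10):
    """2-means cluster index: within-cluster SS / total SS (smaller = stronger clustering)."""
    n = len(points)
    mean = [sum(col) / n for col in zip(*points)]
    total_ss = sum((x - m) ** 2 for p in points for x, m in zip(p, mean))
    best_within = total_ss
    for _ in range(n_starts):
        centers = [list(c) for c in rng.sample(points, 2)]
        for _ in range(n_iter):
            groups = ([], [])
            for p in points:
                d0 = sum((x - c) ** 2 for x, c in zip(p, centers[0]))
                d1 = sum((x - c) ** 2 for x, c in zip(p, centers[1]))
                groups[d1 < d0].append(p)  # index 1 if closer to second center
            if not groups[0] or not groups[1]:
                break  # degenerate split: skip this start
            centers = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
        else:
            within = sum((x - m) ** 2
                         for g, ctr in zip(groups, centers)
                         for p in g for x, m in zip(p, ctr))
            best_within = min(best_within, within)
    return best_within / total_ss

def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """Draw n_sim data sets with X_ij ~ N(0, lambda_j), independent, and return their CIs."""
    sds = [lam ** 0.5 for lam in eigenvalues]
    return [two_means_ci([[rng.gauss(0.0, s) for s in sds] for _ in range(n)], rng)
            for _ in range(n_sim)]

rng = random.Random(0)
n, d = 40, 10
# Hypothetical observed data: two groups shifted by +/-5 in the first coordinate.
observed = [[rng.gauss(5.0 if i < n // 2 else -5.0, 1.0)]
            + [rng.gauss(0.0, 1.0) for _ in range(d - 1)] for i in range(n)]
observed_ci = two_means_ci(observed, rng)

# Hypothetical null eigenvalues, assumed already thresholded at sigma_N^2 = 1.
eigenvalues = [26.0] + [1.0] * (d - 1)
null_cis = simulate_null_cis(eigenvalues, n, n_sim=50, rng=rng)

# Significance corresponds to SMALL CI: count null CIs at or below the data CI.
p_value = sum(ci <= observed_ci for ci in null_cis) / len(null_cis)
print("data CI:", round(observed_ci, 3), " empirical p-value:", p_value)
```

Because the null is simulated with mean 0 and a diagonal covariance, the sketch relies on the same location and rotation invariance of the cluster index that the slide mentions.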

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings

(e.g. for those found in heat map)

(Use given class labels)

II. Test if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters?

Perou 500 DWD Dir'ns View - real clusters?

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

SigClust Results for Luminal A vs. Luminal B

Get p-values from:

• Empirical Quantile: from simulated sample CIs

• Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
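The two kinds of p-value can be sketched as follows. This assumes the Gaussian fit simply uses the mean and standard deviation of the simulated CIs; the function names and the numbers are hypothetical, chosen so that the empirical quantile is exactly 0.

```python
from statistics import NormalDist, mean, stdev

def empirical_p_value(observed_ci, null_cis):
    """Proportion of simulated null CIs that are at or below the data CI."""
    return sum(ci <= observed_ci for ci in null_cis) / len(null_cis)

def gaussian_fit_p_value(observed_ci, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fitted
    to the simulated null CIs (mean and sd estimated from them)."""
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return fitted.cdf(observed_ci)

# Hypothetical simulated null CIs and data CI, for illustration only.
null_cis = [0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.80, 0.81]
observed_ci = 0.70

print("empirical p-value   :", empirical_p_value(observed_ci, null_cis))
print("Gaussian fit p-value:", gaussian_fit_p_value(observed_ci, null_cis))
```

When no simulated CI falls below the data CI the empirical value is 0 regardless of how extreme the data are, whereas the Gaussian fit still gives a positive number that can be compared across tests.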

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings

• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – Same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split

• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:

• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  – Laws of Large Numbers (Consistency)
  – Central Limit Theorems (Quantify Errors)
• Occasional misconceptions:
  – Indicates behavior for large samples
  – Thus only makes sense for "large" samples
  – Models phenomenon of "increasing data"
  – So other flavors are useless???

HDLSS Asymptotics

Modern Mathematical Statistics:

• Based on asymptotic analysis
• Real Reasons:
  – Approximation provides insights
  – Can find simple underlying structure in complex situations
• Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
• Even desirable! (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

• Surprising (many times) [Think I've got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

(Thanks to psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$):

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^d Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
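A quick numerical check of the distance-to-origin statement, as a sketch only: the dimensions, the 20 replicates per dimension and the names are hypothetical choices made for illustration.

```python
import math
import random

def gaussian_norm(d, rng):
    """Euclidean norm of one draw from the d-dimensional standard normal."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

rng = random.Random(0)
results = {}
for d in (2, 20, 200, 20000):
    results[d] = [gaussian_norm(d, rng) for _ in range(20)]
    print(f"d={d:6d}  sqrt(d)={math.sqrt(d):8.2f}  "
          f"min |Z|={min(results[d]):8.2f}  max |Z|={max(results[d]):8.2f}")
```

The spread between the smallest and largest norm stays of order 1 while $\sqrt{d}$ grows, so relative to the radius the points concentrate on a thin shell.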

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density???

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. $r$

Can Show: Puts ~ All Weight Near $r = 1$

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
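One way to make the balance point concrete is a short worked calculation, offered as a sketch: write $R = \|Z\|$ and $f_R$ for its density (both symbols introduced here for illustration). In spherical coordinates the volume element contributes a factor $r^{d-1}$, and the Gaussian density contributes $e^{-r^2/2}$.

```latex
f_R(r) \;\propto\;
\underbrace{r^{d-1}}_{\text{Lebesgue measure pushes mass out}}
\cdot
\underbrace{e^{-r^2/2}}_{\text{density pulls data in}},
\qquad r > 0,
\\[4pt]
\frac{d}{dr}\log f_R(r) \;=\; \frac{d-1}{r} - r \;=\; 0
\quad\Longrightarrow\quad
r \;=\; \sqrt{d-1} \;\approx\; \sqrt{d}.
```

So the radial density peaks near $\sqrt{d}$ even though the $d$-dimensional density itself is largest at the origin.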

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go???

(we can only perceive 3 dim'ns)
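The pairwise-distance claim, including the extension to $Z_1, \ldots, Z_n$, can be checked numerically. The sketch below uses hypothetical sizes (n = 5 points, d = 20000) and reports each distance divided by $\sqrt{2d}$.

```python
import math
import random
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distances between all pairs of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]

rng = random.Random(0)
n, d = 5, 20000
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

# Ratios close to 1 mean every pair is about sqrt(2d) apart.
ratios = [dist / math.sqrt(2 * d) for dist in pairwise_distances(points)]
print([round(r, 3) for r in ratios])
```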

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):

As Vectors From Origin: $Z_1$, $Z_2$

(Thanks to members.tripod.com)

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal???

- Where do they all go???

(again our perceptual limitations)

- Again 1st order structure is non-random
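A small sketch of the angle statement: draw two independent standard normal vectors and compute the angle between them, for a few hypothetical dimensions. The function name and the choice of dimensions are illustrative only.

```python
import math
import random

def angle_degrees(u, v):
    """Angle between vectors u and v (as vectors from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cosine))

rng = random.Random(0)
angles = {}
for d in (2, 20, 200, 20000):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    angles[d] = angle_degrees(z1, z2)
    print(f"d={d:6d}  angle={angles[d]:7.2f} degrees")
```

In low dimension the angle can be anything between 0 and 180 degrees; in high dimension the deviation from 90 degrees shrinks at the rate $d^{-1/2}$.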

HDLSS Asy's: Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist. $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist. $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors
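The simulation steps listed above can be sketched without any plotting. Assumptions made here, not stated on the slides: the "screen" coordinates place point 1 at the origin, point 2 on the positive x-axis and point 3 above the axis (a translation plus rotation that preserves all distances), and coordinates are divided by $\sqrt{2d}$ so that different dimensions are comparable. All names are hypothetical.

```python
import math
import random

def screen_coordinates(points):
    """Map 3 points in R^d to 2-d coordinates with the same pairwise distances:
    point 1 at the origin, point 2 on the positive x-axis, point 3 with y >= 0."""
    z1, z2, z3 = points
    u = [b - a for a, b in zip(z1, z2)]   # z2 - z1
    v = [b - a for a, b in zip(z1, z3)]   # z3 - z1
    len_u = math.sqrt(sum(x * x for x in u))
    x3 = sum(a * b for a, b in zip(u, v)) / len_u            # component of v along u
    y3 = math.sqrt(max(0.0, sum(x * x for x in v) - x3 * x3))  # orthogonal component
    return [(0.0, 0.0), (len_u, 0.0), (x3, y3)]

rng = random.Random(0)
triangles = {}
for d in (2, 20, 200, 20000):
    scale = math.sqrt(2 * d)  # assumed rescaling, so the limiting side length is 1
    triangles[d] = []
    for _ in range(10):       # 10 repetitions per dimension
        pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
        triangles[d].append([(x / scale, y / scale) for x, y in screen_coordinates(pts)])
    first = [tuple(round(c, 2) for c in p) for p in triangles[d][0]]
    print(f"d={d:6d}  first triangle: {first}")
```

For d = 2 the ten triangles differ wildly; for d = 20000 they nearly coincide with an equilateral triangle of side 1, with vertices close to (0, 0), (1, 0) and (0.5, 0.866).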

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"

SigClust Estimation of Background Noise

Hope: Most Entries are "Pure Noise" (Gaussian)

A Few (<< 1/4) Are Biological Signal - Outliers

How to Check?

Q-Q plots

Background: Graphical Goodness of Fit

Basis: Cumulative Distribution Function (CDF)

$F(x) = P\{X \le x\}$

Probability-quantile notation, for "probability" $p$ and quantile $q$:

$p = F(q)$, $q = F^{-1}(p)$

Q-Q plots

Two types of CDF:

1. Theoretical: $p = F(q) = P\{X \le q\}$

2. Empirical, based on data $X_1, \ldots, X_n$: $\hat{p} = \hat{F}(q) = \frac{\#\{i : X_i \le q\}}{n}$

Q-Q plots

Comparison Visualizations (compare a theoretical with an empirical):

3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values

4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values
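The two comparison plots above can be sketched in a few lines. This is only an illustration using Python's standard library, not code from the talk: the function names `qq_points` and `pp_points`, the plotting positions (i - 0.5)/n, and the simulated sample are all assumptions made for the example.

```python
import random
from statistics import NormalDist

def qq_points(data, dist):
    """Pairs (theoretical q, empirical q-hat) over a grid of p values."""
    xs = sorted(data)
    n = len(xs)
    # Grid p_i = (i - 0.5) / n; this plotting position is an assumed convention.
    ps = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(dist.inv_cdf(p), x) for p, x in zip(ps, xs)]

def pp_points(data, dist):
    """Pairs (theoretical p, empirical p-hat) over a grid of q values."""
    xs = sorted(data)
    n = len(xs)
    # Grid q = i-th order statistic: p = F(q), p-hat = #{X_j <= q} / n = i / n
    # (assumes no ties, which holds almost surely for continuous data).
    return [(dist.cdf(x), i / n) for i, x in enumerate(xs, start=1)]

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]   # simulated toy data
qq = qq_points(sample, NormalDist(0.0, 1.0))
pp = pp_points(sample, NormalDist(0.0, 1.0))
max_pp_gap = max(abs(p - p_hat) for p, p_hat in pp)   # small when the fit is good
```

Plotting the pairs in `qq` (or `pp`) against the 45° line gives the visual comparison; the code stops at the coordinates so that it needs no plotting library.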

Q-Q plots: Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line

(general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails

So Usually More Useful

Q-Q plots: Gaussian? Departures from line?

• Looks much like
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots: Need to understand sampling variation

• Approach: Q-Q envelope plot
  – Simulate from Theoretical Dist'n
  – Samples of same size
  – About 100 samples gives "good visual impression"
  – Overlay resulting 100 QQ-curves
  – To visually convey natural sampling variation
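A minimal sketch of the envelope idea, assuming a Gaussian theoretical distribution and the same (i - 0.5)/n plotting positions as a convention. The helper names are hypothetical, and the pointwise min/max band is an added summary for checking the result; the slide itself only overlays the curves.

```python
import random
from statistics import NormalDist

def qq_curve(sample, dist):
    """Sorted sample paired with theoretical quantiles at (i - 0.5) / n."""
    xs = sorted(sample)
    n = len(xs)
    return [(dist.inv_cdf((i - 0.5) / n), x) for i, x in enumerate(xs, start=1)]

def qq_envelope(n, dist, n_sim=100, seed=0):
    """Simulate n_sim samples of size n from dist and return their Q-Q curves."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_sim):
        sim = [rng.gauss(dist.mean, dist.stdev) for _ in range(n)]
        curves.append(qq_curve(sim, dist))
    return curves

def pointwise_band(curves):
    """Lowest and highest simulated quantile at each grid point."""
    m = len(curves[0])
    lows = [min(c[i][1] for c in curves) for i in range(m)]
    highs = [max(c[i][1] for c in curves) for i in range(m)]
    return lows, highs

null = NormalDist(0.0, 1.0)
curves = qq_envelope(n=200, dist=null, n_sim=100)
lows, highs = pointwise_band(curves)
```

A data Q-Q curve that leaves the cloud of simulated curves is showing more than natural sampling variation.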

Q-Q plots: Gaussian? Departures from line?

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
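To illustrate why a MAD scale estimate suits "mostly noise plus a few outliers", here is a small sketch on simulated entries. The 95%/5% mixture and all names are assumptions for the example, not the data from the talk; the rescaling constant 1/Φ⁻¹(0.75) ≈ 1.4826 is the standard one that makes the MAD consistent for a Gaussian standard deviation.

```python
import random
from statistics import NormalDist, median, pstdev

# For Gaussian data, MAD = sigma * Phi^{-1}(0.75), so divide by that quantile.
MAD_TO_SIGMA = 1.0 / NormalDist().inv_cdf(0.75)   # about 1.4826

def mad_sigma(values):
    """Robust estimate of the noise standard deviation via the MAD."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    return MAD_TO_SIGMA * mad

rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(9500)]    # "pure noise" entries
signal = [rng.gauss(0.0, 10.0) for _ in range(500)]   # a few large "signal" entries
entries = noise + signal

sigma_mad = mad_sigma(entries)   # stays close to the noise level 1
sigma_sd = pstdev(entries)       # inflated by the outlying entries
```

The ordinary standard deviation is pulled up by the outliers, while the MAD-based estimate stays near the background noise level.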

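SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:

Keep only "large" eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution use eigenvalues:

$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$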
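The thresholding rule is a one-liner. A small sketch, with hypothetical eigenvalues and a hypothetical noise variance `sigma_n2`:

```python
def null_eigenvalues(sample_eigs, sigma_n2):
    """Raise every eigenvalue below the background noise variance up to it."""
    return [max(lam, sigma_n2) for lam in sorted(sample_eigs, reverse=True)]

eigs = [9.0, 4.0, 1.5, 0.3, 0.05]                    # made-up sample eigenvalues
null_eigs = null_eigenvalues(eigs, sigma_n2=1.0)     # [9.0, 4.0, 1.5, 1.0, 1.0]
```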

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$

Suggests biology is very strong here

I.e. very strong signal to noise ratio

Have more structure than can analyze (with only 533 data points)

Data are very far from pure noise

So don't actually use Factor Anal. Model

Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model? Explore this with another data set (with fewer genes)

This time: n = 315 cases, d = 306 genes

First ~110 eigenvalues > $\sigma_N^2$

Rest are negligible

So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

$X_i = (X_{i,1}, \ldots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI with simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
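A rough sketch of the simulation step, in plain Python. It assumes the 2-means cluster index (CI) is the within-cluster sum of squares divided by the total sum of squares, uses a basic Lloyd iteration with random restarts as a stand-in for whatever 2-means routine is actually used, and takes made-up eigenvalues and toy data; none of these specifics come from the talk.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vec(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def cluster_index(points, labels):
    """CI = within-cluster sum of squares / total sum of squares."""
    overall = mean_vec(points)
    total = sum(sq_dist(p, overall) for p in points)
    within = 0.0
    for k in (0, 1):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            center = mean_vec(members)
            within += sum(sq_dist(p, center) for p in members)
    return within / total

def two_means(points, rng, n_starts=5, n_iter=20):
    """Lloyd iteration with random restarts; keeps the smallest CI found."""
    best_labels, best_ci = None, float("inf")
    for _ in range(n_starts):
        centers = rng.sample(points, 2)
        labels = None
        for _ in range(n_iter):
            new_labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                          for p in points]
            if new_labels == labels:
                break
            labels = new_labels
            for k in (0, 1):
                members = [p for p, lab in zip(points, labels) if lab == k]
                if members:
                    centers[k] = mean_vec(members)
        ci = cluster_index(points, labels)
        if ci < best_ci:
            best_labels, best_ci = labels, ci
    return best_labels, best_ci

def simulate_null_cis(n, eigenvalues, n_sim, rng):
    """Draw X_{i,j} ~ N(0, lambda_j) independently; return each draw's 2-means CI."""
    sds = [lam ** 0.5 for lam in eigenvalues]
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        cis.append(two_means(sample, rng)[1])
    return cis

rng = random.Random(2)
eigenvalues = [4.0, 2.0, 1.0, 1.0, 1.0]   # hypothetical thresholded eigenvalues
null_cis = simulate_null_cis(n=40, eigenvalues=eigenvalues, n_sim=50, rng=rng)

# Toy "data" with two well separated groups along the first coordinate.
data = ([[rng.gauss(-4.0, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(20)]
        + [[rng.gauss(4.0, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(20)])
data_ci = two_means(data, rng)[1]
```

In this toy setup the data CI falls below every simulated null CI, which is the "significance for smaller values of CI" pattern. When class labels are given, `cluster_index(data, labels)` can be used directly instead of `two_means`.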

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map)

(Use given class labels)

II. Test if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters?

Perou 500 DWD Dir'ns View – real clusters?

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:

Empirical Quantile (from simulated sample CIs)

Fit Gaussian Quantile

Don't "believe these"

But useful for comparison

Especially when Empirical Quantile = 0
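The two kinds of p-value can be sketched as follows. The simulated CIs below are invented numbers chosen only to show the case where the empirical quantile is 0 but the Gaussian fit still gives a usable comparison value; the function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def empirical_p(data_ci, null_cis):
    """Proportion of simulated null CIs at or below the data CI."""
    return sum(1 for c in null_cis if c <= data_ci) / len(null_cis)

def gaussian_fit_p(data_ci, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fitted to the null CIs."""
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return fitted.cdf(data_ci)

null_cis = [0.62, 0.65, 0.66, 0.68, 0.70, 0.71, 0.73, 0.74, 0.76, 0.78]  # made-up values
data_ci = 0.55

p_emp = empirical_p(data_ci, null_cis)        # 0.0: no simulated CI is this small
p_gauss = gaussian_fit_p(data_ci, null_cis)   # small but positive
```

The Gaussian-fit value is not to be taken literally, but unlike the empirical quantile it still distinguishes "far in the tail" from "very far in the tail".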

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings

• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split

• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:

Based on asymptotic analysis

I.e. Uses limiting operations

Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:

Laws of Large Numbers (Consistency)

Central Limit Theorems (Quantify Errors)

Occasional misconceptions:

Indicates behavior for large samples

Thus only makes sense for "large" samples

Models phenomenon of "increasing data"

So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:

Based on asymptotic analysis

Real Reasons:

Approximation provides insights

Can find simple underlying structure

In complex situations

Thus various flavors are fine:

$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times) [Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

(Thanks to: psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
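A quick simulation sketch of this concentration, using only the standard library. The dimensions, the number of replications and the seed are arbitrary choices for illustration.

```python
import math
import random

def gaussian_vector(d, rng):
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

rng = random.Random(3)
results = {}
for d in (2, 20, 200, 2000):
    norms = [norm(gaussian_vector(d, rng)) for _ in range(200)]
    root_d = math.sqrt(d)
    results[d] = {
        "min_ratio": min(norms) / root_d,            # ratio ||Z|| / sqrt(d) tends to 1
        "max_ratio": max(norms) / root_d,
        "max_gap": max(abs(r - root_d) for r in norms),   # ||Z|| - sqrt(d) stays O_p(1)
    }
```

The ratio to $\sqrt{d}$ tightens as $d$ grows, while the absolute gap stays of order 1, matching $\|Z\| = \sqrt{d} + O_p(1)$.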

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. radius $r$

Can Show: Puts ~All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
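The balance point can be made explicit with a short worked equation. This is a standard calculation for the radius of a standard Gaussian vector, added here as an illustration; the symbols $R$, $f_R$ and $c_d$ are notation introduced for it.

```latex
% Density of the radius R = \|Z\| for Z \sim N_d(0, I_d), in spherical coordinates
f_R(r) = c_d \, r^{d-1} e^{-r^2/2}, \qquad r > 0, \qquad c_d = \frac{2^{1 - d/2}}{\Gamma(d/2)}

% Lebesgue measure contributes r^{d-1} (pushes mass out);
% the Gaussian density contributes e^{-r^2/2} (pulls data in)
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\quad \Longrightarrow \quad r = \sqrt{d-1} \approx \sqrt{d}
```

So the most likely radius is $\sqrt{d-1}$, which is where the outward push of the volume element and the inward pull of the density cancel.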

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
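A small numerical check of the pairwise-distance claim, as a sketch with arbitrary choices (5 points, d = 5000, fixed seed):

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(4)
d = 5000
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(5)]

# All 10 pairwise distances, relative to the limiting value sqrt(2d)
pair_ratios = [dist(points[i], points[j]) / math.sqrt(2 * d)
               for i in range(5) for j in range(i + 1, 5)]
```

Every ratio is close to 1, so the five points are nearly pairwise equidistant even though each was drawn independently.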

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random

(Thanks to: members.tripod.com)
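The angle result can be checked the same way. A sketch, with arbitrary dimensions, 30 random pairs per dimension and a fixed seed:

```python
import math
import random

def angle_deg(a, b):
    """Angle in degrees between a and b, viewed as vectors from the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / (na * nb)))   # guard against rounding
    return math.degrees(math.acos(cos))

rng = random.Random(5)
max_dev = {}
for d in (2, 200, 20000):
    devs = []
    for _ in range(30):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        devs.append(abs(angle_deg(z1, z2) - 90.0))
    max_dev[d] = max(devs)   # largest departure from 90 degrees at this d
```

The largest departure from 90° shrinks roughly like $d^{-1/2}$, in line with the $O_p(d^{-1/2})$ term.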

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist. $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist. $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represent'n

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
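The simulation can be imitated without any plotting. In this sketch the two rotation steps are replaced by an equivalent shortcut: a triangle is determined up to rotation by its side lengths, so the three points are placed in the plane with the first at the origin and the second on the x-axis. That placement convention, the rescaling by $\sqrt{2d}$ and the helper names are assumptions of the example.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triangle_in_plane(points):
    """Place 3 points in the 2-d plane, preserving all pairwise distances."""
    a, b, c = points
    ab, ac, bc = dist(a, b), dist(a, c), dist(b, c)
    x = (ab ** 2 + ac ** 2 - bc ** 2) / (2.0 * ab)   # law of cosines
    y = math.sqrt(max(ac ** 2 - x ** 2, 0.0))
    return [(0.0, 0.0), (ab, 0.0), (x, y)]

rng = random.Random(6)
max_side_error = {}
for d in (2, 20, 200, 20000):
    worst = 0.0
    for _ in range(10):                              # "repeat 10 times"
        pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
        scale = math.sqrt(2.0 * d)
        tri = [(u / scale, v / scale) for u, v in triangle_in_plane(pts)]
        sides = [dist(tri[0], tri[1]), dist(tri[0], tri[2]), dist(tri[1], tri[2])]
        worst = max(worst, max(abs(s - 1.0) for s in sides))
    max_side_error[d] = worst   # 0 would mean an exactly equilateral unit triangle
```

For d = 2 the ten triangles vary wildly, while for d = 20000 they are all nearly the same equilateral triangle, which is the "rigidity after rotation" the slide describes.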

Q-Q plots

Background Graphical Goodness of Fit

Basis

Cumulative Distribution Function (CDF)

Probability quantile notation

for probabilityrdquo and quantile

xXPxF

p q

qFp pFq 1

Q-Q plots

Two types of CDF

1 Theoretical

2 Empirical based on data nXX 1

qXPqFp

n

qXiqFp i

ˆˆ

Q-Q plots

Comparison Visualizations

(compare a theoretical with an empirical)

3P-P plot

plot vs

for a grid of values

4Q-Q plot

plot vs

for a grid of values

q

q

p p

p

q

Q-Q plotsIllustrative graphic (toy data set)

Q-Q plotsIllustrative graphic (toy data set)

Q-Q plotsIllustrative graphic (toy data set)

Empirical Qs near Theoretical Qs

when

Q-Q curve is near 450 line

(general use of Q-Q plots)

Alternate TerminologyQ-Q Plots = ROC curves

P-P Plots = ldquoPrecision Recallrdquo Curves

Highlights Different Distributional Aspects

Statistical Folklore Q-Q Highlights Tails

So Usually More Useful

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Looks much like

bull Wiggles all random variation

bull But there are n = 10000 data pointshellip

bull How to assess signal amp noise

bull Need to understand sampling variation

Q-Q plotsNeed to understand sampling variation

bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon

ndash Samples of same size

ndash About 100 samples gives

ldquogood visual impressionrdquo

ndash Overlay resulting 100 QQ-curves

ndash To visually convey natural sampling variation

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Harder to see

bull But clearly there

bull Conclude non-Gaussian

bull Really needed n = 10000 data pointshellip

(why bigger sample size was used)

bull Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533 d = 9456

SigClust Estimation of Background Noise

SigClust Estimation of Background Noise

bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there

(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

SigClust Gaussian null distributrsquon

Estimation of Biological Covariance

Keep only ldquolargerdquo eigenvalues

Defined as

So for null distribution use eigenvalues

B

d 21 2Nj

)max()max( 221 NdN

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze

(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues

2N

SigClust Estimation of Eigenvalrsquos

Do we need the factor model Explore this with another data set

(with fewer genes) This time

n = 315 cases d = 306 genes

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

Try another data set with fewer genesThis time

First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at

2N

2N

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

where (indep)

Again rotation invariance makes this work

(and location invariance)

jij NX 0~

id

i

X

X

1

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

bull Spirit similar to DiProPermbull But now significance happens for

smaller values of CI

An example (details to follow)

P-val = 00045

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

IITest if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters

Perou 500 DWD Directions View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for: Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

• Get p-values from Empirical Quantile, from simulated sample CIs
• Fit Gaussian Quantile
  – Don't "believe these"
  – But useful for comparison
  – Especially when Empirical Quantile = 0
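The two kinds of p-value can be computed from a vector of simulated null CIs roughly as follows. This is a minimal sketch: the function name `sigclust_pvalues` is hypothetical, and the Gaussian fit is assumed to use the sample mean and standard deviation of the simulated CIs.

```python
from statistics import NormalDist, mean, stdev

def sigclust_pvalues(ci_data, null_cis):
    """Return (empirical, Gaussian-fit) p-values; small CI is evidence of clustering."""
    null_cis = list(null_cis)
    # empirical quantile: share of simulated null CIs at or below the data CI
    empirical = sum(ci <= ci_data for ci in null_cis) / len(null_cis)
    # Gaussian fit: lower-tail probability under a normal fitted to the null CIs
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return empirical, fitted.cdf(ci_data)
```

When the data CI falls below every simulated CI the empirical p-value is exactly 0, and the Gaussian-fit value is what allows two such results to be compared.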

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results
• Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
• Luminal A vs. B: p-val = 0.0045
• Her 2 vs. Basal: p-val = $10^{-10}$
• Split Luminal A: p-val = $10^{-7}$
• Split Luminal B: p-val = 0.058
• Split Her 2: p-val = 0.10
• Split Basal: p-val = 0.005

Summary of Perou 500 SigClust Results
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar
• Smaller data sets: less power
• Gene filtering: more power
• Lung Cancer: more distinct clusters

Some Personal Observations
• Experienced Analysts Impressively Good
• SigClust can save them time
• SigClust can help them with skeptics
• SigClust essential for non-experts

SigClust Overview

• Works Well When Factor Part Not Used
• Sample Eigenvalues Always Valid
• But Can be Too Conservative
• Above Factor Threshold Anti-Conservative
• Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

• Improved Eigenvalue Estimation
• More attention to Local Minima in 2-means Clustering
• Theoretical Null Distributions
• Inference for k > 2 means Clustering
• Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. Uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  – Laws of Large Numbers (Consistency)
  – Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for "large" samples
• Models phenomenon of "increasing data"
• So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis

Real Reasons:
• Approximation provides insights
• Can find simple underlying structure
• In complex situations

Thus various flavors are fine:

$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…
• Surprising (many times) [Think I've got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dimensional Standard Normal distribution:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

(Image thanks to: psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
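A quick numerical check of the distance-to-origin statement, offered as a sketch rather than as part of the original slides (the function name `norm_summary` and the sample sizes are arbitrary choices): the mean of the norm should track $\sqrt{d}$ while its spread stays bounded as $d$ grows.

```python
import numpy as np

def norm_summary(d, n_draws=2000, seed=0):
    """Mean offset of |Z| from sqrt(d), and sd of |Z|, for Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(rng.standard_normal((n_draws, d)), axis=1)
    return norms.mean() - np.sqrt(d), norms.std()

for d in (2, 20, 200, 2000):
    offset, spread = norm_summary(d)
    print(f"d={d:5d}  mean|Z| - sqrt(d) = {offset:+.3f}   sd|Z| = {spread:.3f}")
```

The spread does not shrink to zero; it settles near a constant, which is what the $O_p(1)$ term expresses.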

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
  – Consider Volume of Unit Sphere in $\mathbb{R}^d$
  – Find As Integral In Spherical Coordinates
  – Look At Integrand w.r.t. radius $r$
  – Can Show Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out, Density Pulls Data In, $\sqrt{d}$ Is The Balance Point
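One way to make the balance point concrete, as a sketch under the standard normal assumption of these slides: the radius $R = \|Z\|$ has a density that is the product of a surface-area factor (from Lebesgue measure) and the Gaussian density factor. The notation $f_R$ is introduced here for illustration only.

```latex
\begin{align*}
f_R(r) &\propto
  \underbrace{r^{\,d-1}}_{\text{surface area grows with } r}
  \cdot
  \underbrace{e^{-r^2/2}}_{\text{Gaussian density decays with } r},
  \qquad r > 0, \\
\frac{d}{dr} \log f_R(r) &= \frac{d-1}{r} - r = 0
  \iff r = \sqrt{d-1} \approx \sqrt{d}.
\end{align*}
```

The first factor pushes mass outward and the second pulls it inward; the mode of the radial density sits where the two balance.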

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dimensional Standard Normal distribution:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dimensions)

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dimensions)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dimensional Standard Normal distribution:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dimensional Angles (as $d \to \infty$), As Vectors From Origin:

(Image thanks to: members.tripod.com)

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
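Both the pairwise-distance and the angle statements are easy to check by simulation. The sketch below is illustrative only (the function name `distance_and_angle` and the sample sizes are assumptions): it draws independent pairs of standard normal vectors and reports how the distance compares with $\sqrt{2d}$ and how tightly the angle concentrates at 90 degrees.

```python
import numpy as np

def distance_and_angle(d, n_pairs=500, seed=0):
    """Distances and angles (degrees) between independent N_d(0, I_d) pairs."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    dist = np.linalg.norm(z1 - z2, axis=1)
    cos = (z1 * z2).sum(axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return dist, angle

for d in (2, 20, 200, 2000):
    dist, angle = distance_and_angle(d)
    print(f"d={d:5d}  mean dist / sqrt(2d) = {dist.mean() / np.sqrt(2 * d):.3f}"
          f"   angle mean = {angle.mean():6.2f}  sd = {angle.std():5.2f}")
```

The standard deviation of the angle should shrink roughly like $d^{-1/2}$, matching the rate in the displayed formula.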

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
• "Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

• $n - 1$ dimensional hyperplane
• Points are pairwise equidistant, dist $\sim \sqrt{2d}$
• Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
• Again "randomness in data" is only in rotation
• Surprisingly rigid structure in random data

HDLSS Asymptotics: Geometrical Representation

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
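The rigidity in this simulation can also be seen without any plotting. Since the three side lengths determine a triangle up to rotation, a rough numerical stand-in for the picture is to look at the side lengths of each 3 point data set, scaled by $\sqrt{2d}$. The sketch below takes that shortcut; the function name `scaled_triangle_sides` is hypothetical and the rotation steps of the slide are deliberately skipped.

```python
import numpy as np

def scaled_triangle_sides(d, n_rep=10, seed=0):
    """Side lengths / sqrt(2d) for n_rep independent 3-point N_d(0, I_d) data sets."""
    rng = np.random.default_rng(seed)
    sides = np.empty((n_rep, 3))
    for r in range(n_rep):
        Z = rng.standard_normal((3, d))
        sides[r] = [np.linalg.norm(Z[i] - Z[j]) for i, j in ((0, 1), (0, 2), (1, 2))]
    return sides / np.sqrt(2 * d)

for d in (2, 20, 200, 20000):
    s = scaled_triangle_sides(d)
    print(f"d={d:6d}  shortest side = {s.min():.3f}   longest side = {s.max():.3f}")
```

As $d$ grows all scaled sides approach 1, so every replicate is close to the same equilateral triangle and only its orientation is random.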


Q-Q plots

Two types of CDF:

1. Theoretical: $p = F(q) = P\{X \le q\}$

2. Empirical, based on data $X_1, \ldots, X_n$: $\hat{p} = \hat{F}(q) = \frac{\#\{i : X_i \le q\}}{n}$

Q-Q plots

Comparison Visualizations (compare a theoretical with an empirical):

3. P-P plot: plot $\hat{p}$ vs. $p$ for a grid of $q$ values

4. Q-Q plot: plot $\hat{q}$ vs. $q$ for a grid of $p$ values

Q-Q plots

Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line (general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails, So Usually More Useful

Q-Q plots

Gaussian? Departures from line?

• Looks much like?
• Wiggles all random variation?
• But there are n = 10,000 data points…
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots

Need to understand sampling variation

• Approach: Q-Q envelope plot
  – Simulate from Theoretical Distribution
  – Samples of same size
  – About 100 samples gives "good visual impression"
  – Overlay resulting 100 QQ-curves
  – To visually convey natural sampling variation
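The envelope idea can be sketched numerically as follows. This is an illustration under assumptions that are not in the slides: the data are standardized by sample mean and standard deviation before comparison with a standard normal, the envelope is the pointwise min and max over the simulated Q-Q curves, and the names `qq_envelope` and `fraction_outside` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def qq_envelope(data, n_sim=100, seed=0):
    """Return theoretical quantiles, data quantiles, and a pointwise envelope."""
    rng = np.random.default_rng(seed)
    x = np.asarray(data, dtype=float)
    z = np.sort((x - x.mean()) / x.std(ddof=1))       # standardized empirical quantiles
    n = z.size
    probs = (np.arange(1, n + 1) - 0.5) / n           # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    # n_sim Gaussian samples of the same size, each giving one Q-Q curve
    sims = rng.standard_normal((n_sim, n))
    sims = (sims - sims.mean(axis=1, keepdims=True)) / sims.std(axis=1, ddof=1, keepdims=True)
    sims.sort(axis=1)
    return theo, z, sims.min(axis=0), sims.max(axis=0)

def fraction_outside(data, **kwargs):
    """Share of the data's Q-Q curve lying outside the simulated envelope."""
    _, z, lower, upper = qq_envelope(data, **kwargs)
    return float(np.mean((z < lower) | (z > upper)))
```

Plotting `z`, `lower` and `upper` against `theo` gives the overlay described above; `fraction_outside` is just a crude one-number summary of how much of the curve escapes the envelope.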

Q-Q plots

Gaussian? Departures from line?

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10,000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)


For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo


Q-Q plots

Comparison Visualizations

(compare a theoretical with an empirical)

3. P-P plot: plot $\hat{p}$ vs. $p$, for a grid of $q$ values

4. Q-Q plot: plot $\hat{q}$ vs. $q$, for a grid of $p$ values
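A minimal Python sketch of the Q-Q construction, assuming a standard normal as the theoretical distribution. The helper name `qq_points` and the probability grid $(i + 0.5)/n$ are illustrative choices, not taken from the slides.

```python
import random
from statistics import NormalDist

def qq_points(data, dist=NormalDist()):
    """Return (theoretical q, empirical q-hat) pairs on the grid p_i = (i + 0.5) / n."""
    xs = sorted(data)                      # empirical quantiles are the order statistics
    n = len(xs)
    return [(dist.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(xs)]

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]
points = qq_points(sample)

# For Gaussian data the points should stay close to the 45-degree line q-hat = q,
# at least away from the extreme tails.
max_gap_middle = max(abs(q - qhat) for q, qhat in points[200:1800])
print(round(max_gap_middle, 3))
```

Plotting the second coordinate against the first gives the Q-Q curve; a P-P plot would instead pair the two CDF values on a grid of quantiles.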

Q-Q plots: Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line

(general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails, So Usually More Useful

Q-Q plots: Gaussian departures from line

• Looks much like
• Wiggles all random variation?
• But there are n = 10000 data points...
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots: Need to understand sampling variation

• Approach: Q-Q envelope plot
  - Simulate from Theoretical Dist'n
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 QQ-curves
  - To visually convey natural sampling variation
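The envelope idea can be sketched in a few lines of Python. This is an illustrative sketch, assuming a standard normal theoretical distribution; the function names `qq_envelope` and `outside_envelope` are hypothetical, and the pointwise min/max band is only one simple way to summarize the 100 overlaid curves.

```python
import random
from statistics import NormalDist

def qq_envelope(n, n_sim=100, seed=0):
    """Simulate n_sim sorted N(0,1) samples of size n; return the quantile grid and curves."""
    rng = random.Random(seed)
    grid = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    curves = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n_sim)]
    return grid, curves

def outside_envelope(data, curves):
    """Count order statistics of the data that fall outside the simulated band."""
    xs = sorted(data)
    lo = [min(c[i] for c in curves) for i in range(len(xs))]
    hi = [max(c[i] for c in curves) for i in range(len(xs))]
    return sum(1 for x, a, b in zip(xs, lo, hi) if x < a or x > b)

grid, curves = qq_envelope(n=500)

# Heavy-tailed test data (ratio of two normals, i.e. Cauchy-like), same sample size.
rng = random.Random(1)
heavy = [rng.gauss(0, 1) / max(abs(rng.gauss(0, 1)), 1e-9) for _ in range(500)]
print(outside_envelope(heavy, curves))
```

Overlaying each entry of `curves` against `grid` gives the visual envelope; data whose Q-Q curve leaves that band show more than natural sampling variation.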

Q-Q plots: Gaussian departures from line

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points... (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
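A hedged sketch of a MAD-based scale estimate of the background noise. It assumes the usual Gaussian rescaling (divide the MAD by the 0.75 quantile of N(0,1), about 0.6745); the function name `mad_sigma` and the toy mixture are illustrative, not the deck's data.

```python
import random
from statistics import NormalDist, median

def mad_sigma(values):
    """Median absolute deviation, rescaled to estimate a Gaussian standard deviation."""
    m = median(values)
    mad = median(abs(v - m) for v in values)
    return mad / NormalDist().inv_cdf(0.75)   # MAD of N(0,1) is about 0.6745

rng = random.Random(0)
noise = [rng.gauss(0.0, 0.5) for _ in range(20000)]    # background noise, sd 0.5
signal = [rng.gauss(0.0, 5.0) for _ in range(1000)]    # a few large "signal" entries
sigma_hat = mad_sigma(noise + signal)
print(round(sigma_hat, 2))
```

Because the MAD depends only on the middle of the distribution, the estimate stays close to the noise level even though the tails are far from Gaussian, which matches the Q-Q diagnostic above.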

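SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution, use eigenvalues:

$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$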
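The thresholding rule is a one-liner. In this sketch the function name `null_eigenvalues` and the example numbers are made up for illustration.

```python
def null_eigenvalues(sample_eigenvalues, sigma_n_sq):
    """Hard threshold: replace each lambda_j by max(lambda_j, sigma_N^2)."""
    return [max(lam, sigma_n_sq) for lam in sample_eigenvalues]

lams = [9.0, 4.0, 1.2, 0.3, 0.0]          # illustrative sample eigenvalues
thresholded = null_eigenvalues(lams, sigma_n_sq=1.0)
print(thresholded)  # → [9.0, 4.0, 1.2, 1.0, 1.0]
```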

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$

• Suggests biology is very strong here
• I.e. very strong signal to noise ratio
• Have more structure than can analyze (with only 533 data points)
• Data are very far from pure noise
• So don't actually use Factor Anal. Model
• Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model? Explore this with another data set (with fewer genes)

This time: n = 315 cases, d = 306 genes

• First ~110 eigenvalues > $\sigma_N^2$
• Rest are negligible
• So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

$X_i = (X_{i1}, \dots, X_{id})^t$, where $X_{ij} \sim N(0, \lambda_j)$ (indep)

Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI with simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
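The simulation step can be sketched as follows. This is a simplified illustration under stated assumptions, not the SigClust implementation: the 2-means step is plain Lloyd iteration from a single random start, the cluster index (CI) is taken as within-cluster sum of squares over total sum of squares, and the names `two_means_ci` and `simulate_null_cis` are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def two_means_ci(points, rng, n_iter=20):
    """2-means cluster index: within-cluster SS / total SS (one random start)."""
    centers = rng.sample(points, 2)
    for _ in range(n_iter):
        groups = ([], [])
        for p in points:
            # index 0 if closer to centers[0], index 1 otherwise
            groups[sq_dist(p, centers[0]) > sq_dist(p, centers[1])].append(p)
        if not groups[0] or not groups[1]:
            break
        centers = [mean(g) for g in groups]
    within = sum(min(sq_dist(p, c) for c in centers) for p in points)
    total = sum(sq_dist(p, mean(points)) for p in points)
    return within / total

def simulate_null_cis(eigenvalues, n, n_sim, seed=0):
    """CIs of n_sim Gaussian data sets with independent X_ij ~ N(0, lambda_j), mean 0."""
    rng = random.Random(seed)
    sds = [lam ** 0.5 for lam in eigenvalues]
    cis = []
    for _ in range(n_sim):
        data = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        cis.append(two_means_ci(data, rng))
    return cis

# Illustrative null eigenvalues, sample size and number of simulations.
null_cis = simulate_null_cis([4.0, 1.0, 1.0, 1.0, 1.0], n=40, n_sim=50)
print(round(min(null_cis), 3), round(max(null_cis), 3))
```

A diagonal covariance with mean 0 suffices here because the CI is invariant to rotations and shifts. A data CI well below the simulated null CIs is the evidence for real clusters.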

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map)

(Use given class labels)

II. Test if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters?

Perou 500 DWD Dir'ns View - real clusters?

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:

• Empirical Quantile, from simulated sample CIs
• Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
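Both kinds of p-value can be computed from a list of simulated null CIs. This is an illustrative sketch: the function name `sigclust_pvalues` is hypothetical, the null CIs are made-up numbers, and the Gaussian fit is assumed to use the sample mean and standard deviation of the simulated CIs.

```python
from statistics import NormalDist, mean, stdev

def sigclust_pvalues(data_ci, null_cis):
    """Empirical and Gaussian-fit p-values; small CI counts as evidence of clusters."""
    empirical = sum(ci <= data_ci for ci in null_cis) / len(null_cis)
    gaussian = NormalDist(mean(null_cis), stdev(null_cis)).cdf(data_ci)
    return empirical, gaussian

null_cis = [0.60, 0.62, 0.65, 0.66, 0.68, 0.70, 0.71, 0.73]   # made-up illustration values
emp, gau = sigclust_pvalues(0.50, null_cis)
print(emp, gau)
```

When the data CI lies below every simulated CI the empirical p-value is exactly 0, while the Gaussian-fit value stays positive and so can still be compared across tests.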

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings

• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split

• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

• I.e. Uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  - Laws of Large Numbers (Consistency)
  - Central Limit Theorems (Quantify Errors)

Occasional misconceptions:

• Indicates behavior for large samples
• Thus only makes sense for "large" samples
• Models phenomenon of "increasing data"
• So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:

• Approximation provides insights
• Can find simple underlying structure, in complex situations
• Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\ldots \to 0}$
• Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is...

• Surprising (many times) [Think I've got it, and then ...]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For d-dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

(Graphic: thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$), (measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
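A quick simulation makes the distance-to-origin statement concrete. This is a sketch only: the helper name `norm_of_standard_normal`, the seed, and the number of draws are arbitrary choices.

```python
import math
import random

def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

rng = random.Random(0)
for d in (2, 20, 200, 20000):
    norms = [norm_of_standard_normal(d, rng) for _ in range(20)]
    # The spread around sqrt(d) stays O(1) while sqrt(d) itself grows.
    print(d, round(math.sqrt(d), 2), round(min(norms), 2), round(max(norms), 2))
```

For small d the norms range widely relative to $\sqrt{d}$; for d = 20000 all draws sit within a few units of $\sqrt{d} \approx 141.4$.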

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

• Find As Integral In Sph'l Coordinates
• Look At Integrand w.r.t. radius $r$
• Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out, Density Pulls Data In, $\sqrt{d}$ Is The Balance Point
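The volume argument can be written out in a few lines. This is a worked sketch rather than the slide's own derivation; the symbol $\omega_{d-1}$ (surface area of the unit sphere in $\mathbb{R}^d$) and the radial density $f$ are notation introduced here.

```latex
% Volume of the unit ball B_d in spherical coordinates
\mathrm{Vol}(B_d) = \omega_{d-1} \int_0^1 r^{d-1} \, dr = \frac{\omega_{d-1}}{d}

% Share of the volume inside radius 1 - \epsilon: almost all volume is near r = 1
\frac{\int_0^{1-\epsilon} r^{d-1} \, dr}{\int_0^1 r^{d-1} \, dr}
  = (1 - \epsilon)^d \longrightarrow 0 \quad (d \to \infty)

% Radial density of \|Z\| for Z \sim N_d(0, I_d):
% volume factor (pushes mass out) times Gaussian factor (pulls data in)
f(r) \propto r^{d-1} e^{-r^2/2},
\qquad
\frac{d}{dr} \log f(r) = \frac{d-1}{r} - r = 0
\iff r = \sqrt{d-1} \approx \sqrt{d}
```

The mode of the radial density is thus where the two effects balance, which is consistent with $\|Z\| = \sqrt{d} + O_p(1)$.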

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For d-dim'al Standard Normal dist'n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$
• Can extend to $Z_1, \dots, Z_n$
• Where do they all go? (we can only perceive 3 dim'ns)
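The pairwise distance result follows from the distance-to-origin result in one step. A short worked sketch, where $\overset{\mathrm{dist}}{=}$ (equality in distribution) is notation introduced here:

```latex
% Difference of two independent standard normal vectors
Z_1 - Z_2 \sim N_d(0, 2 I_d)
\quad \Longrightarrow \quad
Z_1 - Z_2 \overset{\mathrm{dist}}{=} \sqrt{2} \, Z,
\qquad Z \sim N_d(0, I_d)

% Apply \|Z\| = \sqrt{d} + O_p(1)
\|Z_1 - Z_2\| \overset{\mathrm{dist}}{=} \sqrt{2} \, \|Z\|
  = \sqrt{2} \left( \sqrt{d} + O_p(1) \right)
  = \sqrt{2d} + O_p(1)
```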

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics Simple Paradoxes

For d-dim'al Standard Normal dist'n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

(Graphic of vectors $Z_1$ and $Z_2$: thanks to members.tripod.com)

$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
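The near-orthogonality of independent Gaussian vectors is easy to check numerically. A minimal sketch, with the helper name `angle_degrees` and the seed chosen only for illustration:

```python
import math
import random

def angle_degrees(z1, z2):
    """Angle at the origin between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(z1, z2))
    n1 = math.sqrt(sum(a * a for a in z1))
    n2 = math.sqrt(sum(b * b for b in z2))
    cos = max(-1.0, min(1.0, dot / (n1 * n2)))   # guard against rounding error
    return math.degrees(math.acos(cos))

rng = random.Random(0)
for d in (2, 20, 200, 20000):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    print(d, round(angle_degrees(z1, z2), 1))
```

In low dimension the angle can be anything between 0 and 180 degrees; as d grows it settles near 90 degrees, with deviations shrinking like $d^{-1/2}$.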

HDLSS Asy's Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
• "Randomness" appears only in rotation of simplex

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

• $n - 1$ dimensional hyperplane
• Points are pairwise equidistant, dist ~ $\sqrt{2d}$
• Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
• Again "randomness in data" is only in rotation
• Surprisingly rigid structure in random data
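A small check of the "regular n-hedron" picture: all pairwise distances among n Gaussian points, divided by $\sqrt{2d}$, should approach 1. This is a sketch; the function name `scaled_pairwise_distances` and the choice n = 5 are illustrative.

```python
import itertools
import math
import random

def scaled_pairwise_distances(n, d, rng):
    """All pairwise distances among n draws from N_d(0, I_d), divided by sqrt(2d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(2 * d)
    return [math.dist(a, b) / scale for a, b in itertools.combinations(pts, 2)]

rng = random.Random(0)
for d in (2, 20, 200, 20000):
    ratios = scaled_pairwise_distances(5, d, rng)
    # Near-equal ratios mean the points are close to vertices of a regular simplex.
    print(d, round(min(ratios), 3), round(max(ratios), 3))
```

Pairwise distances are unchanged by rotation, so the shrinking gap between the smallest and largest ratio shows the rigidity directly, without having to rotate anything.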

HDLSS Asy's Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
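One way to carry out the steps of this simulation is sketched below. It is an assumption-laden illustration, not the code behind the slides: the plane through the 3 points is found by Gram-Schmidt on two edge vectors, the first point is placed at the origin with the first edge along the horizontal axis (one possible way to make the triangles "comparable"), and coordinates are divided by $\sqrt{2d}$. The name `triangle_in_plane` is hypothetical.

```python
import math
import random

def triangle_in_plane(d, rng):
    """2-d coordinates of 3 Gaussian points after rotating their plane to the screen."""
    p = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    u = [b - a for a, b in zip(p[0], p[1])]          # edge p0 -> p1
    v = [b - a for a, b in zip(p[0], p[2])]          # edge p0 -> p2
    nu = math.sqrt(sum(x * x for x in u))
    e1 = [x / nu for x in u]                          # first in-plane axis
    v1 = sum(x * y for x, y in zip(v, e1))            # component of v along e1
    w = [x - v1 * y for x, y in zip(v, e1)]           # Gram-Schmidt remainder
    v2 = math.sqrt(sum(x * x for x in w))             # component along second axis
    scale = math.sqrt(2 * d)                          # asymptotic pairwise distance
    return [(0.0, 0.0), (nu / scale, 0.0), (v1 / scale, v2 / scale)]

rng = random.Random(0)
for d in (2, 20, 200, 20000):
    for _ in range(3):                                # a few repeats per dimension
        tri = triangle_in_plane(d, rng)
        print(d, [(round(x, 2), round(y, 2)) for x, y in tri])
```

For large d the repeats should all land near the equilateral triangle with vertices (0, 0), (1, 0) and (0.5, 0.87), which is the rigidity the slide refers to.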


Q-Q plots: Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs

when

Q-Q curve is near 45° line

(general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails

So Usually More Useful

Q-Q plots: Gaussian departures from line

• Looks much like
• Wiggles all random variation
• But there are n = 10000 data points…
• How to assess signal & noise
• Need to understand sampling variation

Q-Q plots: Need to understand sampling variation

• Approach: Q-Q envelope plot
  – Simulate from Theoretical Dist'n
  – Samples of same size
  – About 100 samples gives "good visual impression"
  – Overlay resulting 100 QQ-curves
  – To visually convey natural sampling variation
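The envelope recipe above can be sketched in a few lines. This is a minimal illustration, not the code behind the slides: it assumes a standard normal as the theoretical distribution, uses the plotting positions (i + 0.5)/n, and summarizes the 100 overlaid curves by their pointwise minimum and maximum rather than drawing them. The function names are hypothetical.

```python
import random
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical, empirical) quantile pairs against N(0, 1)."""
    n = len(sample)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n are one common convention (an assumption here).
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return theoretical, sorted(sample)


def qq_envelope(n, n_sim=100, seed=0):
    """Pointwise min and max over n_sim simulated Q-Q curves, each of size n."""
    rng = random.Random(seed)
    curves = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n_sim)]
    lower = [min(curve[i] for curve in curves) for i in range(n)]
    upper = [max(curve[i] for curve in curves) for i in range(n)]
    return lower, upper


if __name__ == "__main__":
    rng = random.Random(1)
    n = 200
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    theoretical, empirical = qq_points(data)
    lower, upper = qq_envelope(n)
    outside = sum(1 for e, lo, hi in zip(empirical, lower, upper) if e < lo or e > hi)
    print(f"{outside} of {n} empirical quantiles fall outside the envelope")
```

Empirical quantiles that leave the band are the departures worth taking seriously; wiggles inside it are consistent with sampling variation.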

Q-Q plots: Gaussian departures from line

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
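For reference, a MAD scale estimate can be sketched as below. The slides only name the estimator; the rescaling by the N(0, 1) third quartile (so that the result estimates a Gaussian standard deviation) is the usual convention and is an assumption here, as is the function name.

```python
from statistics import median

# 0.75 quantile of N(0, 1); dividing by it makes MAD consistent for the Gaussian sd.
NORMAL_Q75 = 0.6744897501960817


def mad_sigma(values):
    """Median absolute deviation from the median, rescaled to the Gaussian sd scale."""
    values = list(values)
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return mad / NORMAL_Q75


if __name__ == "__main__":
    # One gross outlier barely moves the estimate, unlike the sample sd.
    print(mad_sigma([1.0, 2.0, 3.0, 4.0, 100.0]))
```

The robustness is the point: the estimate is driven by the middle of the distribution, which is where the Q-Q diagnostic says the Gaussian approximation holds.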

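SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution use eigenvalues

$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$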
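The thresholding rule is a one-liner. A minimal sketch, assuming the sample eigenvalues and the background noise variance have already been estimated (the function and argument names are hypothetical):

```python
def null_eigenvalues(sample_eigenvalues, noise_variance):
    """Raise every eigenvalue below the background noise variance up to that level."""
    return [max(lam, noise_variance) for lam in sample_eigenvalues]


if __name__ == "__main__":
    # Eigenvalues above the noise level are kept; the rest are set to it.
    print(null_eigenvalues([5.0, 2.0, 0.5, 0.0], 1.0))
```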

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes

SigClust Estimation of Eigenval's

Try another data set, with fewer genes
This time:
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

$X_i = (X_{i,1}, \ldots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
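The simulation step can be sketched as follows. This is an illustration under stated assumptions, not the SigClust implementation: the cluster index is taken to be the within-class sum of squares divided by the total sum of squares (this part of the deck does not restate the formula), 2-means is plain Lloyd iteration from a single random start, and all function names are hypothetical.

```python
import math
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vec(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def cluster_index(points, labels):
    """Within-class sum of squares over total sum of squares (assumed CI definition)."""
    overall = mean_vec(points)
    total = sum(sq_dist(p, overall) for p in points)
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = mean_vec(members)
        within += sum(sq_dist(p, center) for p in members)
    return within / total


def two_means_labels(points, rng, n_iter=20):
    """Lloyd iterations for k = 2 from one random start (may stop at a local minimum)."""
    centers = rng.sample(points, 2)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                  for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = mean_vec(members)
    return labels


def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """Draw n_sim samples of size n from N(0, diag(eigenvalues)); return 2-means CIs."""
    sds = [math.sqrt(lam) for lam in eigenvalues]
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, sd) for sd in sds] for _ in range(n)]
        cis.append(cluster_index(sample, two_means_labels(sample, rng)))
    return cis


def empirical_p_value(data_ci, null_cis):
    """Small CI is evidence of clustering, so count null CIs at or below the data CI."""
    return sum(ci <= data_ci for ci in null_cis) / len(null_cis)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical eigenvalues and data CI, for illustration only.
    null_cis = simulate_null_cis([4.0, 1.0, 1.0], n=30, n_sim=200, rng=rng)
    print(empirical_p_value(0.3, null_cis))
```

Simulating with mean zero and a diagonal covariance is enough because the CI does not change under shifts or rotations of the data.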

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map)
   (Use given class labels)

II. Test if known cluster can be further split
   (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters

Perou 500 DWD Dir'ns View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

SigClust Results for Luminal A vs. Luminal B

Get p-values from:

Empirical Quantile
  From simulated sample CIs

Fit Gaussian Quantile
  Don't "believe these"
  But useful for comparison
  Especially when Empirical Quantile = 0
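A Gaussian-fit p-value of this kind can be sketched as below: fit a normal distribution to the simulated null CIs by mean and standard deviation, then read off the lower-tail probability at the data CI. The moment fit and the function name are assumptions; the slides do not say how the fit is done.

```python
from statistics import NormalDist, mean, stdev


def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability of data_ci under a normal fit to the null CIs."""
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return fitted.cdf(data_ci)


if __name__ == "__main__":
    # Hypothetical null CIs; the data CI lies below all of them.
    null_cis = [0.70, 0.72, 0.74, 0.76, 0.78]
    empirical = sum(ci <= 0.5 for ci in null_cis) / len(null_cis)
    print(empirical, gaussian_fit_p_value(0.5, null_cis))
```

When the data CI lies below every simulated CI the empirical p-value is exactly 0, while the fitted value stays positive and so still allows two strong results to be ranked.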

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
  Laws of Large Numbers (Consistency)
  Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Occasional misconceptions:
  Indicates behavior for large samples
  Thus only makes sense for "large" samples
  Models phenomenon of "increasing data"
  So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
Real Reasons:
  Approximation provides insights
  Can find simple underlying structure
  In complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n,d \to \infty}$, $\lim_{\cdot \to 0}$
Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)
  [Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

(Thanks to: psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$):

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^d Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
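The step from the squared norm to the norm itself is a square-root expansion. A brief sketch, using only that the $Z_j^2$ are independent with mean 1 and variance 2, so that the central limit theorem gives $\sum_{j=1}^d Z_j^2 = d + O_p(d^{1/2})$:

```latex
\begin{align*}
\|Z\| &= \sqrt{d + O_p(d^{1/2})}
       = \sqrt{d}\,\sqrt{1 + O_p(d^{-1/2})} \\
      &= \sqrt{d}\,\bigl(1 + O_p(d^{-1/2})\bigr)
       = \sqrt{d} + O_p(1).
\end{align*}
```

So the radius grows like $\sqrt{d}$ while its random fluctuation stays bounded in probability, which is why the data look like they sit on a sphere.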

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. radius $r$
Can Show: Puts ~ All Weight Near $r = 1$

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
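The volume effect is easy to check numerically. A ball of radius r in d dimensions has volume proportional to r^d, so the share of the unit ball lying within eps of its surface is 1 - (1 - eps)^d. The sketch below just evaluates that expression; the function name is hypothetical.

```python
def outer_shell_fraction(d, eps):
    """Fraction of the unit ball's volume at radius greater than 1 - eps."""
    return 1.0 - (1.0 - eps) ** d


if __name__ == "__main__":
    for d in (1, 2, 20, 200, 2000):
        print(d, outer_shell_fraction(d, 0.01))
```

In low dimensions a 1% shell holds about 1% of the volume; in high dimensions it holds nearly all of it, which is the sense in which Lebesgue measure pushes mass outward.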

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):

As Vectors From Origin

(Thanks to: members.tripod.com)

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
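The norm, distance and angle statements can be checked with a small simulation. This is only a sketch: the dimensions are chosen for illustration, and the ratios to $\sqrt{d}$ and $\sqrt{2d}$ are printed so that values near 1 indicate agreement with the limits; the helper names are hypothetical.

```python
import math
import random


def gaussian_vector(d, rng):
    """One draw from the d-dimensional standard normal."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angle_degrees(u, v):
    """Angle between u and v, as vectors from the origin."""
    cosine = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        z1, z2 = gaussian_vector(d, rng), gaussian_vector(d, rng)
        print(d,
              norm(z1) / math.sqrt(d),
              distance(z1, z2) / math.sqrt(2 * d),
              angle_degrees(z1, z2))
```

For small d the three quantities vary from draw to draw; for large d the two ratios settle near 1 and the angle near 90 degrees.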

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist. $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d}$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist. ~ $\sqrt{2d}$

Points lie at vertices of $\sqrt{2d}$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
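The simulation can be reproduced in outline without any plotting. Because distances are unchanged by rotation, each 3 point data set can be drawn in the plane from its three pairwise distances alone. The sketch below is not the original simulation code: the convention of putting the first point at the origin and the second on the positive x-axis, the rescaling by $\sqrt{2d}$, and the function names are all assumptions made for illustration.

```python
import math
import random


def distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def planar_triangle(points, scale=1.0):
    """Planar copy of a 3 point data set with the same (rescaled) pairwise distances."""
    a, b, c = points
    ab = distance(a, b) / scale
    ac = distance(a, c) / scale
    bc = distance(b, c) / scale
    # Law of cosines places the third point relative to the first two.
    x = (ab ** 2 + ac ** 2 - bc ** 2) / (2.0 * ab)
    y = math.sqrt(max(ac ** 2 - x ** 2, 0.0))
    return [(0.0, 0.0), (ab, 0.0), (x, y)]


def scaled_sides(d, rng):
    """Side lengths of a random Gaussian triangle in dimension d, divided by sqrt(2d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    s = math.sqrt(2.0 * d)
    return [distance(pts[0], pts[1]) / s,
            distance(pts[0], pts[2]) / s,
            distance(pts[1], pts[2]) / s]


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        for _ in range(10):
            print(d, [round(side, 3) for side in scaled_sides(d, rng)])
```

For d = 2 the ten triangles have very different shapes; for d = 20000 all three rescaled sides are close to 1, so every repetition is nearly the same equilateral triangle.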


Q-Q plotsIllustrative graphic (toy data set)

Q-Q plotsIllustrative graphic (toy data set)

Empirical Qs near Theoretical Qs

when

Q-Q curve is near 450 line

(general use of Q-Q plots)

Alternate TerminologyQ-Q Plots = ROC curves

P-P Plots = ldquoPrecision Recallrdquo Curves

Highlights Different Distributional Aspects

Statistical Folklore Q-Q Highlights Tails

So Usually More Useful

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Looks much like

bull Wiggles all random variation

bull But there are n = 10000 data pointshellip

bull How to assess signal amp noise

bull Need to understand sampling variation

Q-Q plotsNeed to understand sampling variation

bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon

ndash Samples of same size

ndash About 100 samples gives

ldquogood visual impressionrdquo

ndash Overlay resulting 100 QQ-curves

ndash To visually convey natural sampling variation

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Harder to see

bull But clearly there

bull Conclude non-Gaussian

bull Really needed n = 10000 data pointshellip

(why bigger sample size was used)

bull Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533 d = 9456

SigClust Estimation of Background Noise

SigClust Estimation of Background Noise

bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there

(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

SigClust Gaussian null distributrsquon

Estimation of Biological Covariance

Keep only ldquolargerdquo eigenvalues

Defined as

So for null distribution use eigenvalues

B

d 21 2Nj

)max()max( 221 NdN

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze

(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues

2N

SigClust Estimation of Eigenvalrsquos

Do we need the factor model Explore this with another data set

(with fewer genes) This time

n = 315 cases d = 306 genes

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

Try another data set with fewer genesThis time

First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at

2N

2N

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

where (indep)

Again rotation invariance makes this work

(and location invariance)

jij NX 0~

id

i

X

X

1

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

bull Spirit similar to DiProPermbull But now significance happens for

smaller values of CI

An example (details to follow)

P-val = 00045

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

IITest if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal

Perou 500 PCA View ndash real clusters

Perou 500 DWD Dirrsquons View ndash real clusters

Perou 500 ndash Fundamental Question

Are Luminal A amp Luminal B really distinct clusters

Famous forFar Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 00045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs Luminal B

I Test significance of given clusteringsbull Empirical p-val = 0

ndash Definitely 2 clusters

bull Gaussian fit p-val = 00045ndash same strong evidence

bull Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II Test if known cluster can be further split

bull Empirical p-val = 0ndash definitely 2 clusters

bull Gaussian fit p-val = 10-10

ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)

bull Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• Lum & Norm vs. Her 2 & Basal: p-val = $10^{-19}$
• Luminal A vs. B: p-val = 0.0045
• Her 2 vs. Basal: p-val = $10^{-10}$
• Split Luminal A: p-val = $10^{-7}$
• Split Luminal B: p-val = 0.058
• Split Her 2: p-val = 0.10
• Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar
• Smaller data sets: less power
• Gene filtering: more power
• Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:
• Experienced Analysts Impressively Good
• SigClust can save them time
• SigClust can help them with skeptics
• SigClust essential for non-experts

SigClust Overview

• Works Well When Factor Part Not Used
• Sample Eigenvalues Always Valid
• But Can be Too Conservative
• Above Factor Threshold: Anti-Conservative
• Problem Fixed by Soft Thresholding (Huang et al. 2014)

SigClust Open Problems

• Improved Eigenvalue Estimation
• More attention to Local Minima in 2-means Clustering
• Theoretical Null Distributions
• Inference for k > 2 means Clustering
• Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  - Laws of Large Numbers (Consistency)
  - Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• Real Reasons:
  - Approximation provides insights
  - Can find simple underlying structure in complex situations
• Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, and a limit in which a parameter tends to 0
• Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is...
• Surprising (many times) [Think I've got it, and then ...]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$$

Where are Data? Near Peak of Density?

(Figure; thanks to psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n, $Z \sim N_d(0, I_d)$:

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):

$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$$

$$\|Z\| = \sqrt{d} + O_p(1)$$
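A quick numerical check of the distance-to-origin claim, $\|Z\| = \sqrt{d} + O_p(1)$, can be sketched as follows. The function name, the number of draws and the list of dimensions are illustrative assumptions; the point is that the mean norm tracks $\sqrt{d}$ while the spread stays bounded as $d$ grows.

```python
import numpy as np

def norm_summary(d, n_draws=200, seed=0):
    """Mean and standard deviation of ||Z|| over n_draws draws of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))
    norms = np.linalg.norm(z, axis=1)
    return float(norms.mean()), float(norms.std())

for d in (2, 20, 200, 20000):
    mean_norm, sd_norm = norm_summary(d)
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.3f}  mean ||Z||={mean_norm:8.3f}  sd ||Z||={sd_norm:.3f}")
```

The standard deviation column should settle near a constant (about 0.7) rather than grow with $d$, which is the $O_p(1)$ part of the statement.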

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
  • Consider Volume of Unit Sphere in $\mathbb{R}^d$
  • Find As Integral In Sph'l Coordinates
  • Look At Integrand w.r.t. ...
  • Can Show Puts ~ All Weight Near ...

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
  • Lebesgue Measure Pushes Mass Out
  • Density Pulls Data In
  • Radius $\sqrt{d}$ Is The Balance Point
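One way to make the balance concrete is a standard calculation (not taken from the slides): for $Z \sim N_d(0, I_d)$, the radius $R = \|Z\|$ has the chi density with $d$ degrees of freedom. The factor $r^{d-1}$ is the surface-area growth of the sphere of radius $r$ (Lebesgue measure pushing mass out), and $e^{-r^2/2}$ is the Gaussian density (pulling data in).

```latex
% Density of the radius R = ||Z|| for Z ~ N_d(0, I_d)
f_R(r) = \frac{1}{2^{d/2 - 1}\,\Gamma(d/2)}\; r^{d-1}\, e^{-r^2/2}, \qquad r > 0

% Balance point: set the derivative of the log-density to zero
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\quad\Longrightarrow\quad
r^{*} = \sqrt{d-1} \approx \sqrt{d}
```

So the most likely radius is essentially $\sqrt{d}$, even though the density of $Z$ itself is highest at the origin.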

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$$Z_1 \sim N_d(0, I_d) \text{ indep. of } Z_2 \sim N_d(0, I_d)$$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

Distance tends to non-random constant:

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
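The pairwise-distance claim, $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$, and its extension to $Z_1, \dots, Z_n$ can be checked with a small simulation. This is a sketch with hypothetical names and sample sizes; it reports each pairwise distance divided by $\sqrt{2d}$, which should concentrate at 1 as $d$ grows.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all distinct pairs of rows of `points`."""
    points = np.asarray(points, dtype=float)
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    upper = np.triu_indices(points.shape[0], k=1)
    # Clip tiny negative values caused by floating point round-off
    return np.sqrt(np.maximum(sq_dists[upper], 0.0))

rng = np.random.default_rng(1)
n = 5  # number of independent N_d(0, I_d) points (illustrative choice)
for d in (2, 20, 200, 20000):
    ratios = pairwise_distances(rng.standard_normal((n, d))) / np.sqrt(2 * d)
    print(f"d={d:6d}  min ratio={ratios.min():.3f}  max ratio={ratios.max():.3f}")
```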

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$$Z_1 \sim N_d(0, I_d) \text{ indep. of } Z_2 \sim N_d(0, I_d)$$

High dim'al Angles (as $d \to \infty$), with $Z_1$, $Z_2$ As Vectors From Origin:

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

(Figure of $Z_1$ and $Z_2$ as vectors from the origin; thanks to members.tripod.com)

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
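The angle result, $\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$, can be illustrated the same way. The sketch below uses hypothetical names and 200 repetitions per dimension; the spread of the angles around 90 degrees should shrink roughly like $d^{-1/2}$.

```python
import numpy as np

def angle_degrees(u, v):
    """Angle in degrees between vectors u and v, both viewed as vectors from the origin."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against round-off pushing the cosine slightly outside [-1, 1]
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

rng = np.random.default_rng(2)
for d in (2, 20, 200, 20000):
    angles = np.array([
        angle_degrees(rng.standard_normal(d), rng.standard_normal(d))
        for _ in range(200)
    ])
    print(f"d={d:6d}  mean angle={angles.mean():7.2f}  sd={angles.std():6.2f}")
```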

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:
• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
• "Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:
• $n - 1$ dimensional hyperplane
• Points are pairwise equidistant, dist $\sim \sqrt{2d}$
• Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
• Again "randomness in data" is only in rotation
• Surprisingly rigid structure in random data
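The rigid structure can also be seen through the Gram matrix of the data. If $Z_1, \dots, Z_n \sim N_d(0, I_d)$, then the matrix of inner products divided by $d$ should approach the identity: diagonal entries near 1 (all points at distance about $\sqrt{d}$ from 0) and off-diagonal entries near 0 (all pairs nearly orthogonal), which together give pairwise distances near $\sqrt{2d}$. The sketch below is an illustrative check with hypothetical names and sizes, not a computation from the slides.

```python
import numpy as np

def scaled_gram(points):
    """Matrix of inner products between rows of `points`, divided by the dimension d."""
    points = np.asarray(points, dtype=float)
    return points @ points.T / points.shape[1]

rng = np.random.default_rng(3)
n = 4  # number of data points (illustrative choice)
for d in (20, 2000, 200000):
    gram = scaled_gram(rng.standard_normal((n, d)))
    max_dev = np.max(np.abs(gram - np.eye(n)))
    print(f"d={d:7d}  max |Gram/d - I| = {max_dev:.4f}")
```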

HDLSS Asy's: Geometrical Represent'n

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represent'n

Simulation View: Shows "Rigidity after Rotation"
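The simulation steps listed above can be sketched numerically, without the plotting. This is one possible reading of "rotate to plane of screen" and "rotate within plane to make comparable", chosen here as an assumption: the first point is moved to the origin, the second is placed on the positive x-axis, and the third is placed in the upper half-plane, so pairwise distances are preserved. Coordinates are divided by $\sqrt{2d}$ so that a rigid equilateral triangle would have its third vertex near $(0.5, 0.866)$.

```python
import numpy as np

def triangle_in_plane(points):
    """Map 3 points in R^d to 2-d coordinates with the same pairwise distances.

    Vertex 0 goes to the origin, vertex 1 to the positive x-axis,
    vertex 2 to the upper half-plane (a hypothetical choice of alignment).
    """
    a, b, c = np.asarray(points, dtype=float)
    u, v = b - a, c - a
    len_u = np.linalg.norm(u)
    e1 = u / len_u                              # unit vector along the first edge
    x_c = float(np.dot(v, e1))                  # component of the third vertex along e1
    y_c = float(np.linalg.norm(v - x_c * e1))   # component orthogonal to e1
    return np.array([[0.0, 0.0], [len_u, 0.0], [x_c, y_c]])

rng = np.random.default_rng(4)
for d in (2, 20, 200, 20000):
    # Repeat 10 times, as on the slide, and track where the third vertex lands
    thirds = np.array([
        triangle_in_plane(rng.standard_normal((3, d)))[2] / np.sqrt(2 * d)
        for _ in range(10)
    ])
    print(f"d={d:6d}  third vertex mean={thirds.mean(axis=0).round(3)}  sd={thirds.std(axis=0).round(3)}")
```

As $d$ increases, the ten triangles should nearly coincide, which is the "rigidity after rotation" the slide refers to.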


Q-Q plots

Illustrative graphic (toy data set)

Empirical Qs near Theoretical Qs when Q-Q curve is near 45° line (general use of Q-Q plots)

Alternate Terminology

Q-Q Plots = ROC curves

P-P Plots = "Precision Recall" Curves

Highlights Different Distributional Aspects

Statistical Folklore: Q-Q Highlights Tails, So Usually More Useful

Q-Q plots

Gaussian departures from line:
• Looks much like
• Wiggles all random variation
• But there are n = 10000 data points...
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots

Need to understand sampling variation
• Approach: Q-Q envelope plot
  - Simulate from Theoretical Dist'n
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 QQ-curves
  - To visually convey natural sampling variation
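The envelope idea can be sketched numerically as follows, assuming a standard normal theoretical distribution and a sample that has already been centered and scaled. The function names are hypothetical, and the overlay plot itself is left to whatever plotting tool is at hand: each row of the envelope array is one simulated Q-Q curve to draw against the theoretical quantiles.

```python
import numpy as np
from statistics import NormalDist

def theoretical_quantiles(n):
    """Standard normal quantiles at probabilities (i - 0.5) / n, for i = 1..n."""
    probs = (np.arange(1, n + 1) - 0.5) / n
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(p) for p in probs])

def qq_envelope(n, n_sim=100, seed=0):
    """Sorted values of n_sim simulated N(0, 1) samples of size n (one Q-Q curve per row)."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.standard_normal((n_sim, n)), axis=1)

def fraction_outside_envelope(sample, envelope):
    """Share of the sample's order statistics outside the pointwise min/max of the simulated curves."""
    ordered = np.sort(np.asarray(sample, dtype=float))
    lower, upper = envelope.min(axis=0), envelope.max(axis=0)
    return float(np.mean((ordered < lower) | (ordered > upper)))

n = 1000
envelope = qq_envelope(n)
rng = np.random.default_rng(10)
gaussian_sample = rng.standard_normal(n)
heavy_tailed_sample = rng.standard_t(df=3, size=n)  # illustrative non-Gaussian data
print("Gaussian sample, fraction outside    :", fraction_outside_envelope(gaussian_sample, envelope))
print("heavy-tailed sample, fraction outside:", fraction_outside_envelope(heavy_tailed_sample, envelope))
```

A sample whose Q-Q curve stays inside the band of simulated curves is consistent with sampling variation alone; a curve that leaves the band, typically in the tails, is evidence of a real departure.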

Q-Q plots

Gaussian departures from line:
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points... (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

SigClust Estimation of Background Noise

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

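SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution, use eigenvalues:

$$\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$$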

SigClust Estimation of Eigenval's

• All eigenvalues > $\sigma_N^2$
• Suggests biology is very strong here
• I.e. very strong signal to noise ratio
• Have more structure than can analyze (with only 533 data points)
• Data are very far from pure noise
• So don't actually use Factor Anal. Model
• Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model? Explore this with another data set (with fewer genes). This time:

n = 315 cases, d = 306 genes

SigClust Estimation of Eigenval's

Try another data set, with fewer genes. This time:
• First ~110 eigenvalues > $\sigma_N^2$
• Rest are negligible
• So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$$X_i = (X_{i,1}, \dots, X_{i,d})^T, \quad \text{where } X_{i,j} \sim N(0, \lambda_j) \text{ (indep.)}$$

Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI with simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
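The simulation step can be sketched as below. Several things here are assumptions rather than the SigClust implementation: the cluster index (CI) is taken to be the 2-means within-cluster sum of squares divided by the total sum of squares about the overall mean, the 2-means fit is a plain Lloyd iteration with a few random restarts, and the eigenvalues passed in are simply given (the thresholding at the background noise level is not repeated here). All function names are hypothetical.

```python
import numpy as np

def two_means_ci(data, n_starts=5, n_iter=50, seed=0):
    """Approximate 2-means cluster index: within-cluster SS / total SS (smaller = more clustered)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    total_ss = np.sum((data - data.mean(axis=0)) ** 2)
    best_within = np.inf
    for _ in range(n_starts):
        # Start from two distinct data points chosen at random
        centers = data[rng.choice(len(data), size=2, replace=False)]
        for _ in range(n_iter):
            dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster became empty; give up on this start
                break
            new_centers = np.array([data[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dists = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best_within = min(best_within, dists.min(axis=1).sum())
    return float(best_within / total_ss)

def simulate_null_cis(eigenvalues, n, n_sim=100, seed=0):
    """CIs of n_sim data sets of n points with independent coordinates X_ij ~ N(0, lambda_j)."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(np.asarray(eigenvalues, dtype=float))
    return np.array([
        two_means_ci(rng.standard_normal((n, sd.size)) * sd, seed=s)
        for s in range(n_sim)
    ])

# Toy illustration (not real data): two tight, well separated groups in d = 5
rng = np.random.default_rng(11)
toy = np.vstack([rng.normal(+5.0, 0.1, size=(15, 5)), rng.normal(-5.0, 0.1, size=(15, 5))])
null_cis = simulate_null_cis(np.ones(5), n=30, n_sim=20)
data_ci = two_means_ci(toy)
print("data CI:", round(data_ci, 4), " smallest null CI:", round(float(null_cis.min()), 4))
print("empirical p-val:", float(np.mean(null_cis <= data_ci)))
```

Because the null data are generated with a diagonal covariance and mean 0, the sketch relies on the rotation and location invariance of the CI noted on the slide.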

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map)

(Use given class labels)

II. Test if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal

Perou 500 PCA View ndash real clusters

Perou 500 DWD Dirrsquons View ndash real clusters

Perou 500 ndash Fundamental Question

Are Luminal A amp Luminal B really distinct clusters

Famous forFar Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 00045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs Luminal B

I Test significance of given clusteringsbull Empirical p-val = 0

ndash Definitely 2 clusters

bull Gaussian fit p-val = 00045ndash same strong evidence

bull Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II Test if known cluster can be further split

bull Empirical p-val = 0ndash definitely 2 clusters

bull Gaussian fit p-val = 10-10

ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)

bull Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19

Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10

Split Luminal A p-val = 10-7

Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005

SigClust Real Data Results

Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition

(insight about signal vs noise)bull How good are others

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

Personal Observations

HDLSS world ishellip

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquond

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Where are Data

Near Peak of Density

Thanks to psycnetapaorg

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

(measure how close to peak)

d

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

)1(pOdZ

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

- Paradox resolved by

density w r t Lebesgue Measure

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt Can Show Puts ~ All Weight Near

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

Theorem Impossible (over many factors)

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

d dd INZ 0~21Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

(as )

Distance tends to non-random constant

d

d

dd INZ 0~2

)1(221 pOdZZ

1Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

)1(221 pOdZZ

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$), with $Z_1$, $Z_2$ As Vectors From Origin:
$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- Everything is orthogonal?
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
(Figure: $Z_1$ and $Z_2$ drawn as vectors from the origin. Thanks to members.tripod.com)
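The near-orthogonality claim can be illustrated the same way. This is a small sketch under the same standard normal assumption; `angle_degrees` is a hypothetical helper introduced only for this example.

```python
import math
import random

def angle_degrees(d, rng):
    """Angle (in degrees) between two independent N_d(0, I_d) vectors."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    # Clamp to [-1, 1] to guard against rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cosine))

rng = random.Random(0)
for d in (2, 20, 200, 20000):
    angles = [angle_degrees(d, rng) for _ in range(20)]
    worst = max(abs(a - 90.0) for a in angles)
    # The largest departure from 90 degrees shrinks roughly like d^(-1/2).
    print(d, round(worst, 2))
```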

HDLSS Asy's: Geometrical Represent'n

Hall, Marron & Neeman (2005)
Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$
Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)
“Randomness” appears only in rotation of simplex

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
$n - 1$ dimensional hyperplane
Points are pairwise equidistant, dist $\sim \sqrt{2d}$
Points lie at vertices of $\sqrt{2d} \times$ “regular $n$-hedron”
Again “randomness in data” is only in rotation
Surprisingly rigid structure in random data!

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study “rigidity after rotation”
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make “comparable”
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows “Rigidity after Rotation”
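The rigidity can also be seen without any plotting. The sketch below does not perform the rotations described above; it relies on the fact that rotations leave pairwise distances unchanged, so the three side lengths of each 3 point data set determine the triangle's shape. The helper name `scaled_side_lengths` and the seed are assumptions made for illustration.

```python
import itertools
import math
import random

def scaled_side_lengths(n, d, rng):
    """Pairwise distances of n draws from N_d(0, I_d), divided by sqrt(2d)."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(2 * d)
    return [math.dist(p, q) / scale
            for p, q in itertools.combinations(points, 2)]

rng = random.Random(0)
for d in (2, 20, 200, 20000):
    # 10 repetitions of a 3 point data set, mirroring the slide's setup.
    sides = [s for _ in range(10) for s in scaled_side_lengths(3, d, rng)]
    print(d, round(min(sides), 3), round(max(sides), 3))
```

As d grows, the smallest and largest scaled side lengths both approach 1, so every triangle is close to the same equilateral triangle.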


Alternate Terminology

Q-Q Plots = ROC curves
P-P Plots = “Precision Recall” Curves
Highlights Different Distributional Aspects
Statistical Folklore: Q-Q Highlights Tails, So Usually More Useful

Q-Q plots

Gaussian? Departures from line?
• Looks much like
• Wiggles all random variation?
• But there are n = 10000 data points…
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots

Need to understand sampling variation
• Approach: Q-Q envelope plot
  – Simulate from Theoretical Dist'n
  – Samples of same size
  – About 100 samples gives “good visual impression”
  – Overlay resulting 100 QQ-curves
  – To visually convey natural sampling variation
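The envelope idea can be sketched without graphics. The example below assumes a standard normal theoretical distribution and summarises the 100 overlaid curves by their pointwise minimum and maximum; the function names, the sample size n = 1000 (smaller than the slide's 10000, to keep the run short), and the uniform comparison sample are illustrative assumptions.

```python
import math
import random

def qq_envelope(n, n_sim, rng):
    """Pointwise min and max over n_sim sorted N(0, 1) samples of size n."""
    curves = [sorted(rng.gauss(0.0, 1.0) for _ in range(n))
              for _ in range(n_sim)]
    lower = [min(curve[i] for curve in curves) for i in range(n)]
    upper = [max(curve[i] for curve in curves) for i in range(n)]
    return lower, upper

def fraction_outside(data, lower, upper):
    """Share of the data's order statistics falling outside the envelope."""
    ordered = sorted(data)
    outside = sum(1 for x, lo, hi in zip(ordered, lower, upper)
                  if x < lo or x > hi)
    return outside / len(ordered)

rng = random.Random(0)
n = 1000
lower, upper = qq_envelope(n, 100, rng)

gaussian_data = [rng.gauss(0.0, 1.0) for _ in range(n)]
half_width = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has variance 1
uniform_data = [rng.uniform(-half_width, half_width) for _ in range(n)]

print(fraction_outside(gaussian_data, lower, upper))
print(fraction_outside(uniform_data, lower, upper))
```

With real data one would first centre and scale the sample (or simulate from the fitted distribution) so that the data curve and the envelope are on the same footing.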

Q-Q plots

Gaussian? Departures from line?
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456
• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good
(Always a good idea to do such diagnostics)
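To illustrate why a MAD scale estimate suits this situation, here is a minimal sketch. It assumes the usual Gaussian consistency factor (dividing the MAD by the 0.75 quantile of the standard normal); whether the slides use exactly this normalisation is an assumption, and the simulated mixture of noise and strong-signal entries is hypothetical.

```python
import random
from statistics import NormalDist, median, pstdev

def mad_sigma(values):
    """MAD rescaled so that it estimates sigma for Gaussian data."""
    values = list(values)
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return mad / NormalDist().inv_cdf(0.75)

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(9000)]     # background noise, sigma = 1
signal = [rng.gauss(0.0, 10.0) for _ in range(1000)]   # hypothetical strong-signal entries
pooled = noise + signal

# The MAD-based estimate stays near the noise level; the plain sd does not.
print(round(mad_sigma(pooled), 3), round(pstdev(pooled), 3))
```

The MAD responds mainly to the middle of the distribution, which is the part the Q-Q plot shows to be close to Gaussian.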

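SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:
Keep only “large” eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_d$
Defined as $\lambda_j > \sigma_N^2$
So for null distribution, use eigenvalues: $\max(\lambda_1, \sigma_N^2), \dots, \max(\lambda_d, \sigma_N^2)$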
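The thresholding rule is a one-line operation. In this small sketch the function name `null_eigenvalues` and the numeric values are hypothetical, chosen only to show the effect.

```python
def null_eigenvalues(sample_eigenvalues, sigma_n_sq):
    """Raise every eigenvalue below the noise level sigma_N^2 up to sigma_N^2."""
    return [max(lam, sigma_n_sq) for lam in sample_eigenvalues]

# Hypothetical sample eigenvalues and noise variance, for illustration only.
sample_eigenvalues = [5.0, 2.0, 0.5, 0.1]
sigma_n_sq = 1.0
print(null_eigenvalues(sample_eigenvalues, sigma_n_sq))  # [5.0, 2.0, 1.0, 1.0]
```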

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$!
Suggests biology is very strong here!
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:
$X_i = (X_{i1}, \dots, X_{id})^T$, where $X_{ij} \sim N(0, \lambda_j)$ (indep.)
Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI
With simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
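The simulation step can be sketched as follows. This is not the SigClust implementation: it assumes the cluster index is the within-class sum of squares about class means divided by the total sum of squares about the overall mean, and it uses plain Lloyd iterations from a single random start for 2-means. All function names and the eigenvalues are hypothetical.

```python
import math
import random

def column_means(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sum_of_squares(rows, center):
    """Sum of squared Euclidean distances from each row to center."""
    return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))

def cluster_index(points, labels):
    """Within-class sum of squares over total sum of squares."""
    total = sum_of_squares(points, column_means(points))
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        within += sum_of_squares(members, column_means(members))
    return within / total

def two_means_labels(points, rng, max_iter=50):
    """Lloyd iterations for k = 2 from one random start (may hit a local minimum)."""
    centers = rng.sample(points, 2)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            0 if sum_of_squares([p], centers[0]) <= sum_of_squares([p], centers[1]) else 1
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = column_means(members)
    return labels

def simulate_null_ci(eigenvalues, n, rng):
    """Draw n points from N(0, diag(eigenvalues)) and return their 2-means CI."""
    points = [[rng.gauss(0.0, math.sqrt(lam)) for lam in eigenvalues]
              for _ in range(n)]
    return cluster_index(points, two_means_labels(points, rng))

rng = random.Random(0)
eigenvalues = [5.0, 2.0] + [1.0] * 8   # hypothetical null eigenvalues
null_cis = [simulate_null_ci(eigenvalues, 30, rng) for _ in range(50)]
print(round(min(null_cis), 3), round(max(null_cis), 3))
```

A data CI that falls below essentially all of the simulated values would count as evidence of clustering, since smaller CI means tighter clusters.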

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:
I. Test significance of given clusterings (e.g. for those found in heat map) (Use given class labels)
II. Test if known cluster can be further split (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters?

Perou 500 DWD Dir'ns View – real clusters?

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045
Get p-values from:
• Empirical Quantile: From simulated sample CIs
• Fit Gaussian Quantile: Don't “believe these”, But useful for comparison, Especially when Empirical Quantile = 0
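The two kinds of p-value can be computed from the simulated CIs as in the sketch below. The function names and the CI values are hypothetical; the lower tail is used because smaller CI indicates stronger clustering.

```python
from statistics import NormalDist, mean, stdev

def empirical_p_value(data_ci, null_cis):
    """Fraction of simulated null CIs that are at most the data CI."""
    return sum(1 for ci in null_cis if ci <= data_ci) / len(null_cis)

def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fit to the null CIs."""
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return fitted.cdf(data_ci)

# Hypothetical simulated null CIs and data CI, for illustration only.
null_cis = [0.78, 0.80, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.88, 0.83]
data_ci = 0.70

print(empirical_p_value(data_ci, null_cis))     # 0.0, no simulated CI is this small
print(gaussian_fit_p_value(data_ci, null_cis))  # small but positive
```

When the empirical p-value is exactly 0, the Gaussian-fit value still gives a number that can be compared across tests, which is the use suggested above.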

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
• Laws of Large Numbers (Consistency)
• Central Limit Theorems (Quantify Errors)
Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for “large” samples
• Models phenomenon of “increasing data”
• So other flavors are useless?

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
• Approximation provides insights
• Can find simple underlying structure In complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\,\cdot\, \to 0}$
Even desirable! (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…
Surprising (many times) [Think I've got it, and then …]
Mathematically Beautiful (?)
Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$
Where are Data? Near Peak of Density? (Figure thanks to psycnet.apa.org)
Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
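A quick numerical check of the norm result, as a sketch under the stated standard normal assumption; the helper name `norm_of_standard_normal` and the number of repetitions are illustrative.

```python
import math
import random

def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

rng = random.Random(0)
for d in (2, 20, 200, 20000):
    norms = [norm_of_standard_normal(d, rng) for _ in range(20)]
    deviations = [r - math.sqrt(d) for r in norms]
    # sqrt(d) grows with d, but the deviations from it stay of order 1.
    print(d, round(math.sqrt(d), 2),
          round(min(deviations), 2), round(max(deviations), 2))
```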

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density?
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. radius $r$
Can Show Puts ~ All Weight Near $r = 1$
Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
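The balance can be made explicit with a short calculation. This is a sketch using the standard polar-coordinates form of the $N_d(0, I_d)$ density (the chi distribution), which is not written out on the slides; $R = \|Z\|$ and $r^\ast$ are notation introduced here.

```latex
% Density of the radius R = ||Z||: the volume element contributes r^{d-1}
% (pushes mass out), the Gaussian density contributes e^{-r^2/2} (pulls data in).
f_R(r) \;=\; \frac{2^{\,1-d/2}}{\Gamma(d/2)}\; r^{\,d-1}\, e^{-r^2/2}, \qquad r > 0,
\\[6pt]
\frac{d}{dr}\,\log f_R(r) \;=\; \frac{d-1}{r} \;-\; r \;=\; 0
\quad\Longrightarrow\quad
r^\ast \;=\; \sqrt{d-1} \;\approx\; \sqrt{d}.
```

The factor $r^{d-1}$ is the same one that concentrates the volume of the unit sphere near $r = 1$, and its product with the decaying Gaussian term peaks at radius about $\sqrt{d}$.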


Q-Q plots: Gaussian departures from line

• Looks much like?
• Wiggles all random variation?
• But there are n = 10,000 data points…
• How to assess signal & noise?
• Need to understand sampling variation

Q-Q plots: Need to understand sampling variation

• Approach: Q-Q envelope plot
  - Simulate from Theoretical Dist'n
  - Samples of same size
  - About 100 samples gives "good visual impression"
  - Overlay resulting 100 QQ-curves
  - To visually convey natural sampling variation
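The envelope recipe above can be sketched in a few lines. This is only an illustration, assuming NumPy; the function names `qq_envelope` and `fraction_outside`, the plotting positions (i - 0.5)/n, and fitting the reference Gaussian by sample mean and standard deviation are choices made here, not taken from the slides.

```python
import numpy as np
from statistics import NormalDist

def qq_envelope(data, n_sim=100, seed=0):
    """Return theoretical quantiles, sorted data, and simulated Q-Q curves."""
    rng = np.random.default_rng(seed)
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    # Plotting positions (i - 0.5) / n mapped to standard normal quantiles.
    probs = (np.arange(1, n + 1) - 0.5) / n
    nd = NormalDist()
    theo = np.array([nd.inv_cdf(p) for p in probs])
    # Assumption: the theoretical dist'n is a Gaussian fitted to the data.
    mu, sigma = x.mean(), x.std(ddof=1)
    # Each row is one sorted sample of the same size as the data.
    sims = np.sort(rng.normal(mu, sigma, size=(n_sim, n)), axis=1)
    return theo, x, sims

def fraction_outside(x, sims):
    """Fraction of order statistics outside the pointwise min/max envelope."""
    lo, hi = sims.min(axis=0), sims.max(axis=0)
    return float(np.mean((x < lo) | (x > hi)))
```

Plotting each row of `sims` against `theo`, and then `x` against `theo` on top, gives the overlay of about 100 QQ-curves described above.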

Q-Q plots: Gaussian departures from line

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10,000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
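A small sketch of a MAD scale estimate for the background noise level, assuming NumPy. Pooling all entries of the data matrix and rescaling by the standard normal 0.75 quantile are assumptions made here for illustration; the name `mad_sigma` is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def mad_sigma(X):
    """Robust sd estimate from all entries of X, pooled: MAD / Phi^{-1}(0.75)."""
    v = np.asarray(X, dtype=float).ravel()
    mad = np.median(np.abs(v - np.median(v)))
    # Dividing by the N(0,1) MAD (about 0.6745) puts the estimate on the sd scale.
    return mad / NormalDist().inv_cdf(0.75)
```

Because the median ignores the tails, this estimate tracks the Gaussian-looking middle of the Q-Q curve, not the non-Gaussian extremes.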

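SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution use eigenvalues

$$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$$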
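The thresholding rule above can be sketched as follows, assuming NumPy and a data matrix with one case per row. The function name `null_eigenvalues` is hypothetical, and computing the sample eigenvalues through an SVD of the centered data is an implementation choice made here.

```python
import numpy as np

def null_eigenvalues(X, sigma_n):
    """Sample covariance eigenvalues of X (rows = cases), floored at sigma_n**2."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data give the nonzero covariance eigenvalues.
    s = np.linalg.svd(Xc, compute_uv=False)
    lam = np.zeros(d)
    k = min(n, d)
    lam[:k] = s[:k] ** 2 / (n - 1)
    # Eigenvalues below the background noise level are raised to sigma_n**2.
    return np.maximum(lam, sigma_n ** 2)
```

In the HDLSS case (d > n) most sample eigenvalues are zero, so the floor at the noise variance is what keeps the null Gaussian from being degenerate.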

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$
• Suggests biology is very strong here
• I.e. very strong signal to noise ratio
• Have more structure than can analyze (with only 533 data points)
• Data are very far from pure noise
• So don't actually use Factor Anal. Model
• Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model? Explore this with another data set (with fewer genes)

This time: n = 315 cases, d = 306 genes

• First ~110 eigenvalues > $\sigma_N^2$
• Rest are negligible
• So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$$X_i = (X_{i,1}, \ldots, X_{i,d})^t, \quad \text{where } X_{i,j} \sim N(0, \lambda_j) \text{ (indep.)}$$

Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI with simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI

An example (details to follow): P-val = 0.0045
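A minimal sketch of the simulate-and-compare step, assuming NumPy. It is not the SigClust software: the function names, the plain Lloyd iterations with random restarts, and the default settings are illustrative assumptions. The cluster index (CI) is taken here as within-cluster sum of squares divided by total sum of squares.

```python
import numpy as np

def cluster_index(X, labels):
    """2-means CI: within-class sum of squares over total sum of squares."""
    total = ((X - X.mean(axis=0)) ** 2).sum()
    within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                 for k in (0, 1))
    return within / total

def two_means(X, rng, n_start=5, n_iter=50):
    """Plain Lloyd iterations with random restarts; returns best labels found."""
    best, best_ci = None, np.inf
    for _ in range(n_start):
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():   # one cluster emptied: drop this start
                break
            new = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new, centers):
                break
            centers = new
        if labels.min() != labels.max():
            ci = cluster_index(X, labels)
            if ci < best_ci:
                best, best_ci = labels, ci
    return best

def simulate_null_cis(lam, n, n_sim, rng):
    """CIs of 2-means fits to n draws with independent N(0, lam_j) coordinates."""
    cis = np.empty(n_sim)
    for b in range(n_sim):
        Xsim = rng.normal(size=(n, len(lam))) * np.sqrt(lam)
        cis[b] = cluster_index(Xsim, two_means(Xsim, rng))
    return cis

def empirical_p(ci_data, cis_null):
    """Fraction of null CIs at or below the data CI (small CI = significant)."""
    return float(np.mean(cis_null <= ci_data))
```

Mean zero and a diagonal covariance suffice in `simulate_null_cis` because the CI is invariant to shifts and rotations, which is the point made on the slide.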

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map)
   (Use given class labels)

II. Test if known cluster can be further split
   (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Dir'ns View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:
• Empirical Quantile: from simulated sample CIs
• Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
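The two kinds of p-value can be sketched as below, assuming NumPy and an array of simulated null CIs. The function name `sigclust_pvalues` is hypothetical, and fitting the Gaussian by the sample mean and standard deviation of the null CIs is an assumption made for this illustration.

```python
import numpy as np
from statistics import NormalDist

def sigclust_pvalues(ci_data, cis_null):
    """Return (empirical p-value, Gaussian-fit p-value) for a data CI."""
    cis_null = np.asarray(cis_null, dtype=float)
    # Empirical quantile: share of simulated CIs at or below the data CI.
    p_emp = float(np.mean(cis_null <= ci_data))
    # Gaussian fit: lower tail of a normal matched to the simulated CIs.
    z = (ci_data - cis_null.mean()) / cis_null.std(ddof=1)
    p_gauss = NormalDist().cdf(z)
    return p_emp, p_gauss
```

When the data CI lies below every simulated CI the empirical p-value is exactly 0, while the Gaussian-fit value still orders such cases by how far out in the tail they fall.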

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
• Luminal A vs. B: p-val = 0.0045
• Her 2 vs. Basal: p-val = $10^{-10}$
• Split Luminal A: p-val = $10^{-7}$
• Split Luminal B: p-val = 0.058
• Split Her 2: p-val = 0.10
• Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  - Laws of Large Numbers (Consistency)
  - Central Limit Theorems (Quantify Errors)
• Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• Real Reasons:
  - Approximation provides insights
  - Can find simple underlying structure
  - In complex situations
• Thus various flavors are fine:
  $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\cdot \to 0}$
• Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…

• Surprising (many times) [Think I've got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$$

Where are Data? Near Peak of Density?

(Figure: Gaussian density. Thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):

$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$$

$$\|Z\| = \sqrt{d} + O_p(1)$$
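The distance-to-origin statement can be checked numerically. The following is a quick simulation sketch, assuming NumPy; the function name `norm_summary` and the chosen dimensions and replication count are illustrative.

```python
import numpy as np

def norm_summary(d, n_rep=2000, seed=0):
    """Mean and sd of the Euclidean norm of n_rep standard normal vectors in R^d."""
    rng = np.random.default_rng(seed)
    norms = np.linalg.norm(rng.normal(size=(n_rep, d)), axis=1)
    return norms.mean(), norms.std()

for d in (2, 20, 200, 2000):
    mean, sd = norm_summary(d)
    # The mean grows like sqrt(d) while the spread stays bounded.
    print(f"d={d:5d}  sqrt(d)={np.sqrt(d):7.2f}  mean |Z|={mean:7.2f}  sd |Z|={sd:5.2f}")
```

The mean of the norm tracks $\sqrt{d}$ while its standard deviation stays of order 1, which is what the $O_p(1)$ remainder expresses.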

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
  • Consider Volume of Unit Sphere in $\mathbb{R}^d$
  • Find As Integral In Sph'l Coordinates
  • Look At Integrand w.r.t. radius $r$
  • Can Show Puts ~ All Weight Near $r = 1$
  • Lebesgue Measure Pushes Mass Out
  • Density Pulls Data In
  • $\sqrt{d}$ Is The Balance Point
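One way to make the volume argument concrete is the following sketch, which is not the slides' own derivation. The symbols $V_d$ (volume of the unit ball in $\mathbb{R}^d$), $r$ (radius) and $\varepsilon$ are introduced here for illustration.

```latex
% Volume of the unit ball as a radial integral: the integrand grows like r^{d-1}.
V_d = \int_0^1 d \, V_d \, r^{d-1} \, dr ,
\qquad
\frac{\mathrm{Vol}\{ \|x\| \le 1 - \varepsilon \}}{\mathrm{Vol}\{ \|x\| \le 1 \}}
  = (1 - \varepsilon)^d \longrightarrow 0 \quad (d \to \infty).

% For Z ~ N_d(0, I_d), the density of the radius R = \|Z\| combines both effects:
f_R(r) \propto r^{d-1} e^{-r^2/2},
\qquad
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\;\Longrightarrow\; r = \sqrt{d-1} \approx \sqrt{d}.
```

The factor $r^{d-1}$ is the Lebesgue measure pushing mass outward, the factor $e^{-r^2/2}$ is the density pulling it in, and the mode of their product sits near $\sqrt{d}$.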

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
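A short simulation sketch of the pairwise-distance claim, assuming NumPy. The function name `pairwise_distance_spread` is hypothetical; distances are divided by $\sqrt{2d}$ so that the claimed limit is 1.

```python
import numpy as np

def pairwise_distance_spread(n, d, seed=0):
    """All pairwise distances among n standard normal points in R^d, over sqrt(2d)."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, d))
    # Squared distances via the Gram matrix, then keep each pair once.
    sq = (Z ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    iu = np.triu_indices(n, k=1)
    dist = np.sqrt(np.maximum(D2[iu], 0.0))
    return dist / np.sqrt(2.0 * d)

for d in (2, 20, 200, 20000):
    r = pairwise_distance_spread(10, d)
    print(f"d={d:6d}  min ratio={r.min():.3f}  max ratio={r.max():.3f}")
```

As $d$ grows, the smallest and largest ratios both approach 1, so all pairs become nearly equidistant.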

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

(Figure: angle between vectors $Z_1$ and $Z_2$. Thanks to members.tripod.com)

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
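The angle statement can be illustrated the same way. This is a sketch assuming NumPy; the function name `random_angles_deg` and the replication count are illustrative.

```python
import numpy as np

def random_angles_deg(d, n_rep=1000, seed=0):
    """Angles (degrees) between n_rep independent pairs of standard normal vectors."""
    rng = np.random.default_rng(seed)
    Z1 = rng.normal(size=(n_rep, d))
    Z2 = rng.normal(size=(n_rep, d))
    cos = (Z1 * Z2).sum(axis=1) / (
        np.linalg.norm(Z1, axis=1) * np.linalg.norm(Z2, axis=1))
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

for d in (2, 20, 200, 2000):
    a = random_angles_deg(d, n_rep=500)
    print(f"d={d:5d}  mean angle={a.mean():6.2f}  sd={a.std():6.2f}")
```

The spread of the angles around 90 degrees shrinks at rate $d^{-1/2}$, matching the $O_p(d^{-1/2})$ term.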

HDLSS Asy's: Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:
• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
• "Randomness" appears only in rotation of simplex

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:
• $n - 1$ dimensional hyperplane
• Points are pairwise equidistant, dist $\sim \sqrt{2d}$
• Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
• Again "randomness in data" is only in rotation
• Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
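A rough sketch of such a 3-point simulation, assuming NumPy. The function names `plane_coords` and `side_lengths` are hypothetical, the in-plane coordinates come from an SVD of the centered points, and the final within-plane alignment step ("rotate to make comparable") is omitted; only the triangle's side lengths are compared.

```python
import numpy as np

def plane_coords(Z):
    """2-d coordinates of 3 points in R^d within the plane they generate."""
    Zc = Z - Z.mean(axis=0)
    # The first two right singular vectors span the plane through the 3 points,
    # so projecting onto them is a rigid motion that preserves distances.
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:2].T

def side_lengths(P):
    """Lengths of the three sides of the triangle with vertices P[0], P[1], P[2]."""
    return np.array([np.linalg.norm(P[i] - P[j])
                     for i, j in ((0, 1), (0, 2), (1, 2))])

rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    for rep in range(3):
        P = plane_coords(rng.normal(size=(3, d)))
        # Dividing by sqrt(2d) puts the limiting equilateral triangle at side 1.
        sides = side_lengths(P) / np.sqrt(2.0 * d)
        print(f"d={d:6d} rep={rep}  sides/sqrt(2d) = {np.round(sides, 3)}")
```

For small $d$ the triangles vary widely from draw to draw; for large $d$ every draw is close to the same equilateral triangle, which is the rigidity the slide refers to.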

Page 19: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

Q-Q plotsGaussian departures from line

bull Looks much like

bull Wiggles all random variation

bull But there are n = 10000 data pointshellip

bull How to assess signal amp noise

bull Need to understand sampling variation

Q-Q plotsNeed to understand sampling variation

bull Approach Q-Q envelope plotndash Simulate from Theoretical Distrsquon

ndash Samples of same size

ndash About 100 samples gives

ldquogood visual impressionrdquo

ndash Overlay resulting 100 QQ-curves

ndash To visually convey natural sampling variation

Q-Q plotsGaussian departures from line

Q-Q plotsGaussian departures from line

bull Harder to see

bull But clearly there

bull Conclude non-Gaussian

bull Really needed n = 10000 data pointshellip

(why bigger sample size was used)

bull Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533 d = 9456

SigClust Estimation of Background Noise

SigClust Estimation of Background Noise

bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there

(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

SigClust Gaussian null distributrsquon

Estimation of Biological Covariance

Keep only ldquolargerdquo eigenvalues

Defined as

So for null distribution use eigenvalues

B

d 21 2Nj

)max()max( 221 NdN

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze

(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues

2N

SigClust Estimation of Eigenvalrsquos

Do we need the factor model Explore this with another data set

(with fewer genes) This time

n = 315 cases d = 306 genes

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

Try another data set with fewer genesThis time

First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at

2N

2N

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

where (indep)

Again rotation invariance makes this work

(and location invariance)

jij NX 0~

id

i

X

X

1

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

bull Spirit similar to DiProPermbull But now significance happens for

smaller values of CI

An example (details to follow)

P-val = 00045

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

IITest if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal

Perou 500 PCA View ndash real clusters

Perou 500 DWD Dirrsquons View ndash real clusters

Perou 500 ndash Fundamental Question

Are Luminal A amp Luminal B really distinct clusters

Famous forFar Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 00045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs Luminal B

I Test significance of given clusteringsbull Empirical p-val = 0

ndash Definitely 2 clusters

bull Gaussian fit p-val = 00045ndash same strong evidence

bull Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II Test if known cluster can be further split

bull Empirical p-val = 0ndash definitely 2 clusters

bull Gaussian fit p-val = 10-10

ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)

bull Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19

Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10

Split Luminal A p-val = 10-7

Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005

SigClust Real Data Results

Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition

(insight about signal vs noise)bull How good are others

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

I.e. Uses limiting operations

Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
• Laws of Large Numbers (Consistency)
• Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for “large” samples
• Models phenomenon of “increasing data”
• So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:
• Approximation provides insights
• Can find simple underlying structure
• In complex situations

Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)

[Think I’ve got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim’al Standard Normal dist’n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

[Figure. Thanks to psycnet.apa.org]

Euclidean Distance to Origin (as $d \to \infty$):

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
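The concentration of the distance to the origin can be checked by simulation. The following is a minimal sketch using only the Python standard library; the function names, the number of draws, and the seed are illustrative assumptions, not part of the slides.

```python
import math
import random


def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw Z ~ N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def norm_summary(d, n_draws=200, seed=0):
    """Sample mean and standard deviation of ||Z|| over n_draws vectors."""
    rng = random.Random(seed)
    norms = [norm_of_standard_normal(d, rng) for _ in range(n_draws)]
    mean = sum(norms) / n_draws
    sd = math.sqrt(sum((x - mean) ** 2 for x in norms) / (n_draws - 1))
    return mean, sd


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = norm_summary(d)
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.3f}  "
              f"mean ||Z||={mean:7.3f}  sd ||Z||={sd:.3f}")
```

The mean tracks $\sqrt{d}$ while the standard deviation stays bounded as $d$ grows, which is what the $O_p(1)$ remainder expresses.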

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

• Consider Volume of Unit Sphere in $\mathbb{R}^d$
• Find As Integral In Sph’l Coordinates
• Look At Integrand w.r.t. Radius $r$
• Can Show Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
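The two competing effects can be made concrete numerically. The sketch below is illustrative only (function names, shell thickness, and grid step are assumptions). It uses two standard facts: the volume of a ball of radius $r$ in $d$ dimensions scales as $r^d$, and the radius of a standard normal vector has density proportional to $r^{d-1} e^{-r^2/2}$.

```python
import math


def outer_shell_fraction(d, thickness=0.01):
    """Fraction of the unit ball's volume within `thickness` of its surface.

    Volume scales as r**d, so the inner ball of radius (1 - thickness)
    holds a fraction (1 - thickness)**d of the total.
    """
    return 1.0 - (1.0 - thickness) ** d


def log_radial_density(r, d):
    """Log of the unnormalized density of ||Z||: volume factor r**(d-1)
    (pushes mass out) times Gaussian factor exp(-r**2 / 2) (pulls data in)."""
    return (d - 1) * math.log(r) - 0.5 * r * r


def radial_mode(d, step=0.01):
    """Grid search for the radius at which the two factors balance."""
    r_max = 3.0 * math.sqrt(d)
    grid = [step * k for k in range(1, int(r_max / step) + 1)]
    return max(grid, key=lambda r: log_radial_density(r, d))


if __name__ == "__main__":
    for d in (1, 10, 100, 1000):
        print(d, round(outer_shell_fraction(d), 5), round(radial_mode(d), 2))
```

Setting the derivative of the log density to zero gives a mode at $\sqrt{d - 1}$, so the grid search should land close to $\sqrt{d}$ for large $d$.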

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

“Average People”

Parents’ Lament:

Why Can’t I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim’al Standard Normal dist’n:

$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim’ns)
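A quick simulation illustrates the pairwise-distance result. This is a minimal sketch under assumed names and sample sizes; it reports the distance between two independent standard normal vectors divided by $\sqrt{2d}$, which should be close to 1 in high dimension.

```python
import math
import random


def standard_normal_vector(d, rng):
    """One draw from N_d(0, I_d) as a list of d coordinates."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def scaled_pairwise_distance(d, rng):
    """||Z_1 - Z_2|| / sqrt(2 d) for independent Z_1, Z_2 ~ N_d(0, I_d)."""
    z1 = standard_normal_vector(d, rng)
    z2 = standard_normal_vector(d, rng)
    return math.dist(z1, z2) / math.sqrt(2.0 * d)


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        ratios = [scaled_pairwise_distance(d, rng) for _ in range(5)]
        print(d, [round(r, 3) for r in ratios])
```

For small $d$ the ratios scatter widely; for large $d$ they cluster tightly around 1.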

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

(we can only perceive 3 dim’ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim’al Standard Normal dist’n:

$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$

High dim’al Angles (as $d \to \infty$), As Vectors From Origin:

[Figure: vectors $Z_1$ and $Z_2$ from the origin. Thanks to members.tripod.com]

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
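The near-orthogonality of independent Gaussian vectors is also easy to check by simulation. The sketch below is illustrative (names and sample sizes are assumptions); it computes the angle at the origin between two independent standard normal draws.

```python
import math
import random


def angle_degrees(u, v):
    """Angle at the origin between vectors u and v, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding
    return math.degrees(math.acos(cosine))


def random_angle(d, rng):
    """Angle between two independent draws from N_d(0, I_d)."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return angle_degrees(z1, z2)


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        print(d, [round(random_angle(d, rng), 1) for _ in range(5)])
```

The spread around 90 degrees shrinks roughly like $d^{-1/2}$, in line with the stated rate.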

HDLSS Asy’s: Geometrical Represent’n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$

Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$

Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”

All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)

“Randomness” appears only in rotation of simplex

HDLSS Asy’s: Geometrical Represent’n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ “regular $n$-hedron”

Again “randomness in data” is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy’s: Geometrical Represen’tion

Simulation View: study “rigidity after rotation”
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make “comparable”
• Repeat 10 times, use different colors

HDLSS Asy’s: Geometrical Represen’tion

Simulation View: Shows “Rigidity after Rotation”
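The simulation described above can be sketched without plotting. The code below is an illustrative assumption about one way to do the rotation steps: it places each 3-point data set in a 2-d plane using only its pairwise distances (first point at the origin, second on the positive x-axis), which is a rigid motion of the plane through the three points, and then divides by $\sqrt{2d}$ so that different dimensions are comparable.

```python
import math
import random


def gaussian_points(n, d, rng):
    """n independent draws from N_d(0, I_d)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def triangle_on_screen(points):
    """2-d coordinates of three d-dim points, preserving pairwise distances.

    The first point goes to the origin and the second onto the positive
    x-axis, which fixes the rotation within the plane.
    """
    p0, p1, p2 = points
    d01, d02, d12 = math.dist(p0, p1), math.dist(p0, p2), math.dist(p1, p2)
    x = (d01 ** 2 + d02 ** 2 - d12 ** 2) / (2.0 * d01)
    y = math.sqrt(max(d02 ** 2 - x ** 2, 0.0))
    return [(0.0, 0.0), (d01, 0.0), (x, y)]


def scaled_triangle(d, rng):
    """Screen coordinates of a random 3-point data set, divided by sqrt(2 d)."""
    scale = math.sqrt(2.0 * d)
    coords = triangle_on_screen(gaussian_points(3, d, rng))
    return [(x / scale, y / scale) for x, y in coords]


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        print(f"d = {d}")
        for _ in range(10):  # 10 repetitions, as in the slide
            tri = scaled_triangle(d, rng)
            print("  ", [(round(x, 2), round(y, 2)) for x, y in tri])
```

For d = 2 the ten triangles look unrelated; for d = 20000 they are all close to the equilateral triangle with vertices (0, 0), (1, 0), and (0.5, 0.866).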

Page 20: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

Q-Q plots

Need to understand sampling variation

• Approach: Q-Q envelope plot
– Simulate from Theoretical Dist’n
– Samples of same size
– About 100 samples gives “good visual impression”
– Overlay resulting 100 QQ-curves
– To visually convey natural sampling variation
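The envelope idea can be sketched numerically. The code below is a minimal illustration assuming a standard normal theoretical distribution; the function names are hypothetical, and a pointwise min/max band stands in for the visual overlay of 100 Q-Q curves.

```python
import random
from statistics import NormalDist


def theoretical_quantiles(n):
    """Standard normal quantiles at plotting positions (i + 0.5) / n."""
    nd = NormalDist()
    return [nd.inv_cdf((i + 0.5) / n) for i in range(n)]


def qq_envelope(n, n_sim=100, seed=0):
    """Pointwise min and max of the sorted values of n_sim simulated
    standard normal samples, each of size n."""
    rng = random.Random(seed)
    sims = [sorted(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(n_sim)]
    lower = [min(s[i] for s in sims) for i in range(n)]
    upper = [max(s[i] for s in sims) for i in range(n)]
    return lower, upper


def count_outside(sample, lower, upper):
    """Number of order statistics of `sample` falling outside the band."""
    ordered = sorted(sample)
    return sum(1 for x, lo, hi in zip(ordered, lower, upper) if x < lo or x > hi)


if __name__ == "__main__":
    n = 200
    lower, upper = qq_envelope(n)
    rng = random.Random(1)
    gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
    skewed = [rng.expovariate(1.0) for _ in range(n)]
    print("Gaussian sample, points outside envelope:", count_outside(gaussian, lower, upper))
    print("Exponential sample, points outside envelope:", count_outside(skewed, lower, upper))
```

Plotting `lower` and `upper` against `theoretical_quantiles(n)` gives the band; a sample whose Q-Q curve leaves the band shows a departure larger than natural sampling variation.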

Q-Q plots

Gaussian departures from line

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points…
(why bigger sample size was used)
• Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
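A MAD-based scale estimate of the kind mentioned above can be sketched as follows. This is an illustration, not the SigClust implementation: the function name is hypothetical, and the rescaling by the 0.75 standard normal quantile is the usual convention that makes the MAD consistent for the standard deviation under a Gaussian.

```python
import random
from statistics import NormalDist, median, stdev


def mad_sigma(values):
    """Median absolute deviation from the median, rescaled so that it
    estimates the standard deviation when the bulk of the data is Gaussian."""
    values = list(values)
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return mad / NormalDist().inv_cdf(0.75)


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 2.0) for _ in range(5000)]    # background noise
    signal = [rng.gauss(50.0, 5.0) for _ in range(250)]   # a few large entries
    pooled = noise + signal
    print("MAD-based sigma:", round(mad_sigma(pooled), 3))
    print("plain std. dev.:", round(stdev(pooled), 3))
```

Because it depends on the middle of the distribution, the MAD estimate stays near the noise level even when a minority of entries carry strong signal, which fits the observation that the Q-Q curve is linear near the middle.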

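SigClust Gaussian null distribut’n

Estimation of Biological Covariance $\Sigma_B$:

Keep only “large” eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution, use eigenvalues:

$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$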
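The thresholding rule amounts to one line of code. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def null_eigenvalues(sample_eigenvalues, sigma_n2):
    """Raise every eigenvalue below the background noise variance up to it."""
    return [max(lam, sigma_n2) for lam in sample_eigenvalues]


if __name__ == "__main__":
    # Made-up eigenvalues and noise variance, for illustration only.
    print(null_eigenvalues([9.0, 4.0, 0.5, 0.1], sigma_n2=1.0))
```

Eigenvalues above the noise level are kept as they are; smaller ones are replaced by the noise variance.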

SigClust Estimation of Eigenval’s

All eigenvalues > $\sigma_N^2$

Suggests biology is very strong here

I.e. very strong signal to noise ratio

Have more structure than can analyze (with only 533 data points)

Data are very far from pure noise

So don’t actually use Factor Anal. Model

Instead end up with estim’d eigenvalues

SigClust Estimation of Eigenval’s

Do we need the factor model?

Explore this with another data set (with fewer genes)

This time: n = 315 cases, d = 306 genes

SigClust Estimation of Eigenval’s

Try another data set, with fewer genes

This time:

First ~110 eigenvalues > $\sigma_N^2$

Rest are negligible

So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$X_i = (X_{i,1}, \ldots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work

(and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
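The simulation step can be sketched end to end in plain Python. This is an illustration under stated assumptions, not the SigClust code: the cluster index is taken as within-cluster sum of squares over total sum of squares, 2-means is a basic Lloyd iteration with a few random restarts, and all names, sizes, and eigenvalues are made up.

```python
import math
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    return [sum(col) / len(points) for col in zip(*points)]


def cluster_index(points, labels):
    """Within-cluster sum of squares about the two cluster means,
    divided by the total sum of squares about the overall mean."""
    overall = mean_vector(points)
    total = sum(sq_dist(p, overall) for p in points)
    within = 0.0
    for k in (0, 1):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = mean_vector(members)
        within += sum(sq_dist(p, center) for p in members)
    return within / total


def two_means_ci(points, rng, n_starts=5, n_iter=50):
    """Smallest cluster index over several random starts of Lloyd's algorithm."""
    best = 1.0
    for _ in range(n_starts):
        centers = rng.sample(points, 2)
        labels = None
        for _ in range(n_iter):
            new = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                   for p in points]
            if new == labels or len(set(new)) < 2:
                break  # converged, or one cluster emptied: keep last valid labels
            labels = new
            centers = [mean_vector([p for p, lab in zip(points, labels) if lab == k])
                       for k in (0, 1)]
        if labels is not None:
            best = min(best, cluster_index(points, labels))
    return best


def simulate_null_cis(null_eigenvalues, n, n_sim, rng):
    """Cluster indices of n_sim samples of size n with X_ij ~ N(0, lambda_j)."""
    sds = [math.sqrt(lam) for lam in null_eigenvalues]
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, sd) for sd in sds] for _ in range(n)]
        cis.append(two_means_ci(sample, rng))
    return cis


if __name__ == "__main__":
    rng = random.Random(0)
    null_cis = simulate_null_cis([4.0, 1.0, 1.0], n=30, n_sim=20, rng=rng)
    print("smallest simulated null CI:", round(min(null_cis), 3))
```

A data set with two well-separated clusters gives a CI far below anything in `null_cis`, which is the sense in which significance corresponds to small values of CI.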

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)

II. Test if known cluster can be further split
(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data
(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters

Perou 500 DWD Dir’ns View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:
• Empirical Quantile: from simulated sample CIs
• Fit Gaussian Quantile: don’t “believe these”, but useful for comparison, especially when Empirical Quantile = 0

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
– Definitely 2 clusters
• Gaussian fit p-val = 0.0045
– same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
– definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
– Stronger evidence than above
– Such comparison is value of Gaussian fit
– Makes sense (since CI is min possible)
• Conclude these really are two clusters
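The two kinds of p-value can be sketched as follows. The function names and the toy numbers are illustrative assumptions; the point is only to show why a Gaussian fit still ranks results when the empirical quantile is exactly 0.

```python
from statistics import NormalDist, mean, stdev


def empirical_p_value(data_ci, null_cis):
    """Fraction of simulated null CIs at or below the data CI."""
    return sum(1 for ci in null_cis if ci <= data_ci) / len(null_cis)


def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fitted to the
    simulated null CIs (sample mean and standard deviation)."""
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return fitted.cdf(data_ci)


if __name__ == "__main__":
    null_cis = [0.50, 0.52, 0.54, 0.56, 0.58]  # toy values, not real output
    for data_ci in (0.55, 0.45, 0.40):
        print(data_ci,
              empirical_p_value(data_ci, null_cis),
              gaussian_fit_p_value(data_ci, null_cis))
```

Both data CIs of 0.45 and 0.40 give an empirical p-value of 0, but the Gaussian fit separates them, which is the comparison value the slides attribute to it.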

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

Q-Q plots

Gaussian? Departures from line?

• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10000 data points… (why bigger sample size was used)
• Envelope plot reflects sampling variation
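One way to build such an envelope is to overlay Q-Q curves from simulated Gaussian samples of the same size and see where the data curve leaves the band. The sketch below assumes that construction; the function names `qq_envelope` and `fraction_outside` are hypothetical, and taking the pointwise min and max over simulations is an illustrative choice, not necessarily the one behind the slides.

```python
import numpy as np
from statistics import NormalDist


def _standardize_rows(a):
    """Center and scale each row to mean 0, sd 1."""
    return (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, ddof=1, keepdims=True)


def qq_envelope(n, n_sim=100, seed=0):
    """Theoretical N(0,1) quantiles plus a pointwise min/max band of
    sorted, standardized Gaussian samples of size n."""
    rng = np.random.default_rng(seed)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    sims = np.sort(_standardize_rows(rng.standard_normal((n_sim, n))), axis=1)
    return theo, sims.min(axis=0), sims.max(axis=0)


def fraction_outside(x, n_sim=100, seed=0):
    """Share of sorted, standardized data points falling outside the band."""
    x = np.asarray(x, dtype=float)
    xs = np.sort((x - x.mean()) / x.std(ddof=1))
    _, lo, hi = qq_envelope(len(xs), n_sim, seed)
    return float(np.mean((xs < lo) | (xs > hi)))
```

A small sample gives a wide band, which is why a departure that is "harder to see" may only become convincing at n = 10000.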

SigClust Estimation of Background Noise

n = 533, d = 9456

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

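SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution use eigenvalues $\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$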

SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$

• Suggests biology is very strong here
• I.e. very strong signal to noise ratio
• Have more structure than can analyze (with only 533 data points)
• Data are very far from pure noise
• So don't actually use Factor Anal. Model
• Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model? Explore this with another data set (with fewer genes). This time: n = 315 cases, d = 306 genes

• First ~110 eigenvalues > $\sigma_N^2$
• Rest are negligible
• So threshold smaller ones at $\sigma_N^2$
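The thresholding rule above can be sketched in a few lines. This is a minimal illustration, assuming the background noise level $\sigma_N$ is estimated by the MAD of all entries of the data matrix (rescaled to be consistent for Gaussian data) and that eigenvalues come from the sample covariance; the function names are hypothetical.

```python
import numpy as np


def mad_sigma(x):
    """Median absolute deviation, rescaled so it estimates sd for Gaussian data."""
    x = np.asarray(x, dtype=float).ravel()
    return np.median(np.abs(x - np.median(x))) / 0.6744897501960817


def null_eigenvalues(X):
    """X is n x d (rows = cases). Returns (thresholded eigenvalues, sigma_N^2).

    Sample covariance eigenvalues below the noise level are raised to it,
    i.e. max(lambda_j, sigma_N^2); this is the hard-thresholding rule."""
    n, d = X.shape
    sigma2_N = mad_sigma(X) ** 2
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    lam = np.zeros(d)
    lam[: len(s)] = s ** 2 / (n - 1)          # at most min(n, d) nonzero eigenvalues
    return np.maximum(lam, sigma2_N), sigma2_N
```

When d exceeds n, the trailing d - n + 1 sample eigenvalues are zero, so they are all set to $\sigma_N^2$.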

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using $X_i = (X_{i,1}, \ldots, X_{i,d})^t$

where $X_{i,j} \sim N(0, \lambda_j)$ (indep)

Again rotation invariance makes this work (and location invariance)

Then compare data CI with simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
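A minimal sketch of this simulation step follows. It assumes the 2-means cluster index (CI) is the within-class sum of squares divided by the total sum of squares, and it uses a plain Lloyd iteration with random restarts as a stand-in for whatever 2-means routine is actually used; all function names are hypothetical.

```python
import numpy as np


def cluster_index(X, labels):
    """Within-class sum of squares over total sum of squares (smaller = tighter clusters)."""
    total = ((X - X.mean(axis=0)) ** 2).sum()
    within = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
                 for k in np.unique(labels))
    return within / total


def two_means(X, rng, n_init=5, n_iter=50):
    """Lloyd iterations from random starts; returns (labels, CI) of the best run."""
    best_labels, best_ci = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=2, replace=False)]
        labels = None
        for _step in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)
            if len(np.unique(new_labels)) < 2 or np.array_equal(new_labels, labels):
                break                          # degenerate split or converged
            labels = new_labels
            centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
        if labels is not None:
            ci = cluster_index(X, labels)
            if ci < best_ci:
                best_labels, best_ci = labels, ci
    return best_labels, best_ci


def simulate_null_cis(lam, n, n_sim, rng):
    """Draw n_sim data sets with X_ij ~ N(0, lam_j), independent; return their 2-means CIs."""
    lam = np.asarray(lam, dtype=float)
    cis = np.empty(n_sim)
    for b in range(n_sim):
        X = rng.standard_normal((n, len(lam))) * np.sqrt(lam)
        cis[b] = two_means(X, rng)[1]
    return cis
```

Because the CI is unchanged by shifts and rotations of the data, a mean-zero Gaussian with diagonal covariance is enough for the null draws.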

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map) (Use given class labels)

II. Test if known cluster can be further split (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Dir'ns View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for: Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:

• Empirical Quantile: from simulated sample CIs
• Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
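The two p-values can be computed from the simulated null CIs as below. This is a sketch under the assumption that the Gaussian fit simply uses the mean and standard deviation of the simulated CIs; the function name is hypothetical.

```python
import numpy as np
from math import erfc, sqrt


def sigclust_pvalues(ci_data, ci_null):
    """Return (empirical p-value, Gaussian-fit p-value); small CI is significant."""
    ci_null = np.asarray(ci_null, dtype=float)
    # Empirical quantile: share of simulated CIs at or below the data CI
    p_emp = float(np.mean(ci_null <= ci_data))
    # Gaussian fit: lower-tail probability under N(mean, sd^2) of the simulated CIs
    z = (ci_data - ci_null.mean()) / ci_null.std(ddof=1)
    p_gauss = 0.5 * erfc(-z / sqrt(2.0))
    return p_emp, p_gauss
```

With a finite number of simulations the empirical p-value cannot fall below 1/n_sim without hitting exactly 0, which is where the Gaussian fit still gives a number to compare across tests.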

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings

• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

II. Test if known cluster can be further split

• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

• Smaller data sets: less power
• Gene filtering: more power
• Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

• Works Well When Factor Part Not Used
• Sample Eigenvalues Always Valid
• But Can be Too Conservative
• Above Factor Threshold Anti-Conservative
• Problem Fixed by Soft Thresholding (Huang et al. 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:

• Based on asymptotic analysis
• I.e. Uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  - Laws of Large Numbers (Consistency)
  - Central Limit Theorems (Quantify Errors)
• Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:

• Based on asymptotic analysis
• Real Reasons:
  - Approximation provides insights
  - Can find simple underlying structure in complex situations
• Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
• Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…

• Surprising (many times) [Think I've got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are Data? Near Peak of Density? (Thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^d Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
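A quick numerical check of the distance-to-origin statement: the mean of $\|Z\|$ tracks $\sqrt{d}$ while its spread stays bounded as $d$ grows. The sample size, seed and dimensions below are arbitrary choices for illustration, and `norm_summary` is a hypothetical helper name.

```python
import numpy as np


def norm_summary(d, n=1000, seed=0):
    """Mean and sd of ||Z|| over n draws of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    r = np.linalg.norm(rng.standard_normal((n, d)), axis=1)
    return r.mean(), r.std()


for d in (2, 20, 200, 2000):
    m, s = norm_summary(d)
    print(f"d={d:5d}  sqrt(d)={np.sqrt(d):7.2f}  mean|Z|={m:7.2f}  sd|Z|={s:5.2f}")
```

The standard deviation column settling near a constant is the $O_p(1)$ term.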

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$:

• Find As Integral In Sph'l Coordinates
• Look At Integrand w.r.t. $r$
• Can Show Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
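The balance can be written out explicitly. The following is a standard calculation for the radius $R = \|Z\|$ of a standard normal vector, added here as a worked illustration rather than taken from the slides:

```latex
f_R(r) \;\propto\; \underbrace{r^{\,d-1}}_{\text{Lebesgue measure (shell area)}}
        \times \underbrace{e^{-r^2/2}}_{\text{Gaussian density}}, \qquad r > 0,

\frac{d}{dr}\,\log f_R(r) \;=\; \frac{d-1}{r} - r \;=\; 0
\quad\Longrightarrow\quad r = \sqrt{d-1} \;\approx\; \sqrt{d}.
```

The factor $r^{d-1}$ grows with $r$ and pushes mass outward, the factor $e^{-r^2/2}$ is largest at the origin and pulls mass inward, and their product peaks near $\sqrt{d}$.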

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin (Thanks to members.tripod.com):

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
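Both the pairwise-distance and the angle statements are easy to check by simulation. The sketch below draws independent pairs $Z_1, Z_2$ and returns their distances and angles; the number of pairs and the seed are arbitrary, and `distance_and_angle` is a hypothetical helper name.

```python
import numpy as np


def distance_and_angle(d, n_pairs=500, seed=0):
    """Distances ||Z1 - Z2|| and angles (degrees) for independent N_d(0, I_d) pairs."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    dist = np.linalg.norm(z1 - z2, axis=1)
    cos = (z1 * z2).sum(axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return dist, angle


for d in (2, 20, 200, 2000):
    dist, angle = distance_and_angle(d)
    print(f"d={d:5d}  mean dist/sqrt(2d)={dist.mean() / np.sqrt(2 * d):.3f}  "
          f"mean angle={angle.mean():6.2f}  sd angle={angle.std():5.2f}")
```

As $d$ grows the distance ratio approaches 1 and the angles concentrate around 90 degrees, with spread shrinking like $d^{-1/2}$.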

HDLSS Asy's: Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
• All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
• "Randomness" appears only in rotation of simplex

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

• $n - 1$ dimensional hyperplane
• Points are pairwise equidistant, dist $\sim \sqrt{2d}$
• Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
• Again "randomness in data" is only in rotation
• Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Representation

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane to make "comparable"
• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
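The steps listed above can be sketched as follows. Three Gaussian points in dimension $d$ span a 2-dimensional plane; an orthonormal basis of that plane (Gram-Schmidt here, as an assumed way of "rotating to the plane of the screen") gives 2-d coordinates with the first point at the origin and the second on the positive x-axis. Dividing by $\sqrt{2d}$ is an added normalization so that triangles from different $d$ are comparable; `triangle_on_screen` is a hypothetical name.

```python
import numpy as np


def triangle_on_screen(d, rng):
    """2-d coordinates of three N_d(0, I_d) points, scaled by 1/sqrt(2d)."""
    Z = rng.standard_normal((3, d)) / np.sqrt(2 * d)
    a = Z[1] - Z[0]
    b = Z[2] - Z[0]
    e1 = a / np.linalg.norm(a)            # first in-plane axis: towards point 2
    b_perp = b - (b @ e1) * e1
    e2 = b_perp / np.linalg.norm(b_perp)  # second in-plane axis
    return np.array([[0.0, 0.0], [a @ e1, 0.0], [b @ e1, b @ e2]])


def side_lengths(T):
    """Lengths of the three sides of the triangle with vertex rows T."""
    return np.array([np.linalg.norm(T[1] - T[0]),
                     np.linalg.norm(T[2] - T[1]),
                     np.linalg.norm(T[0] - T[2])])


rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    for rep in range(3):                  # the slides repeat 10 times
        print(d, np.round(side_lengths(triangle_on_screen(d, rng)), 3))
```

For small $d$ the triangles vary widely between repetitions; by $d = 20000$ every repetition is close to the same equilateral triangle with unit sides, which is the rigidity the slide shows.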

Q-Q plotsGaussian departures from line

bull Harder to see

bull But clearly there

bull Conclude non-Gaussian

bull Really needed n = 10000 data pointshellip

(why bigger sample size was used)

bull Envelope plot reflects sampling variation

SigClust Estimation of Background Noise

n = 533 d = 9456

SigClust Estimation of Background Noise

SigClust Estimation of Background Noise

bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there

(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

SigClust Gaussian null distributrsquon

Estimation of Biological Covariance

Keep only ldquolargerdquo eigenvalues

Defined as

So for null distribution use eigenvalues

B

d 21 2Nj

)max()max( 221 NdN

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

All eigenvalues gt Suggests biology is very strong hereIe very strong signal to noise ratioHave more structure than can analyze

(with only 533 data points)Data are very far from pure noiseSo donrsquot actually use Factor Anal ModelInstead end up with estimrsquod eigenvalues

2N

SigClust Estimation of Eigenvalrsquos

Do we need the factor model Explore this with another data set

(with fewer genes) This time

n = 315 cases d = 306 genes

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

Try another data set with fewer genesThis time

First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at

2N

2N

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

where (indep)

Again rotation invariance makes this work

(and location invariance)

jij NX 0~

id

i

X

X

1

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

bull Spirit similar to DiProPermbull But now significance happens for

smaller values of CI

An example (details to follow)

P-val = 00045

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

IITest if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal

Perou 500 PCA View ndash real clusters

Perou 500 DWD Dirrsquons View ndash real clusters

Perou 500 ndash Fundamental Question

Are Luminal A amp Luminal B really distinct clusters

Famous forFar Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 00045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs Luminal B

I Test significance of given clusteringsbull Empirical p-val = 0

ndash Definitely 2 clusters

bull Gaussian fit p-val = 00045ndash same strong evidence

bull Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II. Test if known cluster can be further split

• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. Uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  - Laws of Large Numbers (Consistency)
  - Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. Uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• Real Reasons:
  - Approximation provides insights
  - Can find simple underlying structure
  - In complex situations
• Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
• Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is...

• Surprising (many times) [Think I've got it, and then ...]
• Mathematically Beautiful
• Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall, Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

(Image thanks to psycnet.apa.org)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$)

(measure how close to peak)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$):

$\|Z\|^2 = \sum_{j=1}^d Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
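A quick numerical check of the distance-to-origin statement, as a sketch: the helper name `norm_summary`, the sample size, and the dimensions are illustrative choices, not taken from the slides.

```python
import numpy as np

def norm_summary(d, n_points, rng):
    """Return (mean of |Z| minus sqrt(d), sd of |Z|) for Z ~ N_d(0, I_d)."""
    z = rng.standard_normal((n_points, d))
    norms = np.linalg.norm(z, axis=1)
    return norms.mean() - np.sqrt(d), norms.std(ddof=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (2, 20, 200, 20000):
        shift, spread = norm_summary(d, 100, rng)
        print(f"d={d:6d}  mean|Z| - sqrt(d) = {shift:+.3f}  sd|Z| = {spread:.3f}")
```

The spread of the norms stays of order 1 while the norm itself grows like $\sqrt{d}$, which is the $O_p(1)$ remainder in the formula.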

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

• Consider Volume of Unit Sphere in $\mathbb{R}^d$
• Find As Integral In Sph'l Coordinates
• Look At Integrand w.r.t. radius $r$
• Can Show Puts ~ All Weight Near $r = 1$

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

• Lebesgue Measure Pushes Mass Out
• Density Pulls Data In
• $\sqrt{d}$ Is The Balance Point
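The balance can be made explicit with a short calculation. This is a sketch; the symbol $R$ for the radius $\|Z\|$ is introduced here for illustration and does not appear in the slides.

```latex
% Density of the radius R = \|Z\| for Z \sim N_d(0, I_d):
% shell volume (Lebesgue measure) times Gaussian density
f_R(r) \;\propto\; \underbrace{r^{d-1}}_{\text{pushes mass out}}
        \cdot \underbrace{e^{-r^2/2}}_{\text{pulls data in}}, \qquad r > 0 .

% Maximize the logarithm:
\frac{d}{dr}\Bigl[(d-1)\log r - \tfrac{r^2}{2}\Bigr]
  = \frac{d-1}{r} - r = 0
  \quad\Longrightarrow\quad r = \sqrt{d-1} \approx \sqrt{d}.
```

So the most likely radius is about $\sqrt{d}$, even though the density of $Z$ itself is largest at the origin.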

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
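The extension to $Z_1, \ldots, Z_n$ can be checked numerically. A minimal sketch: the function name `scaled_pairwise_distances` is hypothetical, and distances are divided by $\sqrt{2d}$ so that values near 1 indicate agreement with the formula.

```python
import numpy as np

def scaled_pairwise_distances(n, d, rng):
    """All pairwise distances among n draws from N_d(0, I_d), over sqrt(2d)."""
    z = rng.standard_normal((n, d))
    gram = z @ z.T
    sq_norms = np.diag(gram)
    # |z_i - z_j|^2 = |z_i|^2 + |z_j|^2 - 2 <z_i, z_j>
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    upper = np.triu_indices(n, k=1)          # each unordered pair once
    return np.sqrt(np.maximum(sq_dists[upper], 0.0)) / np.sqrt(2.0 * d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (2, 20, 200, 20000):
        ratios = scaled_pairwise_distances(10, d, rng)
        print(f"d={d:6d}  min={ratios.min():.3f}  max={ratios.max():.3f}")
```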

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$)

As Vectors From Origin

(Figure: vectors $Z_1$ and $Z_2$ from the origin; image thanks to members.tripod.com)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
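The angle statement can be illustrated in the same way. A sketch with a hypothetical helper `angles_in_degrees`; the number of pairs and the dimensions are arbitrary illustrative choices.

```python
import numpy as np

def angles_in_degrees(n_pairs, d, rng):
    """Angles between independent N_d(0, I_d) vectors, viewed from the origin."""
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cosines = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (2, 20, 200, 2000):
        a = angles_in_degrees(200, d, rng)
        print(f"d={d:5d}  mean angle={a.mean():6.2f}  sd={a.std(ddof=1):6.2f}")
```

The spread around 90 degrees shrinks at rate $d^{-1/2}$, so multiplying the standard deviation (in radians) by $\sqrt{d}$ should give a roughly constant value.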

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

Hall, Marron & Neeman (2005)

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices"

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's Geometrical Represent'n

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's Geometrical Represent'n

Simulation View: Shows "Rigidity after Rotation"
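The simulation steps listed above might be reproduced along these lines. This is a sketch under stated assumptions: "comparable" is interpreted here as placing the first point at the origin and the second on the positive horizontal axis, coordinates are divided by $\sqrt{2d}$, and the function names are hypothetical.

```python
import numpy as np

def triangle_in_screen_plane(points):
    """Rigid motion of 3 points in R^d into 2-d screen coordinates.

    Point 0 goes to the origin, point 1 to the positive x-axis and point 2
    to the upper half-plane; all three pairwise distances are preserved.
    """
    a = np.linalg.norm(points[1] - points[0])
    b = np.linalg.norm(points[2] - points[0])
    c = np.linalg.norm(points[2] - points[1])
    x = (a**2 + b**2 - c**2) / (2.0 * a)      # law of cosines
    y = np.sqrt(max(b**2 - x**2, 0.0))
    return np.array([[0.0, 0.0], [a, 0.0], [x, y]])

def simulate_triangles(d, n_rep, rng):
    """n_rep Gaussian 3-point data sets in R^d, rotated and scaled by sqrt(2d)."""
    triangles = []
    for _ in range(n_rep):
        pts = rng.standard_normal((3, d))
        triangles.append(triangle_in_screen_plane(pts) / np.sqrt(2.0 * d))
    return np.array(triangles)               # shape (n_rep, 3, 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (2, 20, 200, 20000):
        third = simulate_triangles(d, 10, rng)[:, 2, :]
        print(f"d={d:6d}  sd of third vertex (x, y) = {third.std(axis=0)}")
```

For large d the ten triangles nearly coincide with the unit equilateral triangle, whose third vertex is at $(1/2, \sqrt{3}/2)$.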

SigClust Estimation of Background Noise

n = 533, d = 9456

SigClust Estimation of Background Noise

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)
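A MAD-based scale estimate can be sketched as below. It assumes the usual rescaling by the standard normal 0.75 quantile so that the estimate matches the standard deviation for Gaussian data; whether the slides use exactly this constant is an assumption, and `mad_sigma` is a hypothetical name.

```python
import numpy as np

NORMAL_Q75 = 0.6744897501960817   # 0.75 quantile of the standard normal

def mad_sigma(values):
    """Median absolute deviation from the median, rescaled to estimate sd."""
    values = np.asarray(values, dtype=float).ravel()   # pool all matrix entries
    med = np.median(values)
    return np.median(np.abs(values - med)) / NORMAL_Q75
```

Because it depends only on the middle of the distribution, this estimate is driven by the part of the Q-Q plot that is close to linear and is little affected by heavy tails.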

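SigClust Gaussian null distribut'n

Estimation of Biological Covariance $\Sigma_B$:

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution, use eigenvalues:

$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$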
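The thresholding rule above could be coded as follows. This is a sketch, assuming rows are cases and columns are genes and that the sample covariance uses the divisor n - 1; `null_eigenvalues` is a hypothetical name.

```python
import numpy as np

def null_eigenvalues(data, sigma_n):
    """Sample covariance eigenvalues, raised to at least sigma_n**2."""
    n, d = data.shape
    centered = data - data.mean(axis=0)
    # Nonzero covariance eigenvalues come from the singular values; in the
    # HDLSS case (d > n) the remaining eigenvalues are exactly zero.
    singular = np.linalg.svd(centered, compute_uv=False)
    lam = np.zeros(d)
    lam[: len(singular)] = singular**2 / (n - 1)
    return np.maximum(lam, sigma_n**2)
```

The returned vector can be used directly as the variances $\lambda_j$ of the diagonal Gaussian null distribution.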

SigClust Estimation of Eigenval's

• All eigenvalues $> \sigma_N^2$
• Suggests biology is very strong here
• I.e. very strong signal to noise ratio
• Have more structure than can analyze (with only 533 data points)
• Data are very far from pure noise
• So don't actually use Factor Anal. Model
• Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model?
• Explore this with another data set (with fewer genes)
• This time: n = 315 cases, d = 306 genes

SigClust Estimation of Eigenval's

Try another data set, with fewer genes. This time:
• First ~110 eigenvalues $> \sigma_N^2$
• Rest are negligible
• So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$X_i = (X_{i1}, \ldots, X_{id})^t$, where $X_{ij} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work

(and location invariance)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

SigClust Estimation of Background Noise

• Distribution clearly not Gaussian
• Except near the middle
• Q-Q curve is very linear there (closely follows 45° line)
• Suggests Gaussian approx. is good there
• And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

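SigClust Gaussian null distribution

Estimation of Biological Covariance $\Sigma_B$

Keep only "large" eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$

Defined as $\lambda_j > \sigma_N^2$

So for null distribution use eigenvalues

$\max(\lambda_1, \sigma_N^2), \ldots, \max(\lambda_d, \sigma_N^2)$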
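The thresholding rule above can be sketched in a few lines. This is only an illustration, not the SigClust software: the function names `mad_sigma` and `null_eigenvalues` are hypothetical, and rescaling the MAD by the 0.75 quantile of the standard normal (so that it estimates a Gaussian standard deviation) is an assumption about how the background noise level $\sigma_N$ is obtained.

```python
import statistics


def mad_sigma(values):
    """Robust scale estimate: median absolute deviation, rescaled so that it
    matches the standard deviation when the values are Gaussian (assumption)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return mad / statistics.NormalDist().inv_cdf(0.75)


def null_eigenvalues(sample_eigenvalues, sigma_n):
    """Raise every eigenvalue below sigma_n**2 up to sigma_n**2; keep the rest."""
    floor = sigma_n ** 2
    return [max(lam, floor) for lam in sample_eigenvalues]


# Illustrative numbers only: two "large" eigenvalues survive, two are raised to the floor.
print(null_eigenvalues([9.0, 4.0, 0.5, 0.1], sigma_n=1.0))
```

When every sample eigenvalue already exceeds $\sigma_N^2$, the function returns its input unchanged, which is the situation described on the next slide.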

SigClust Estimation of Eigenvalues

All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estimated eigenvalues

SigClust Estimation of Eigenvalues

Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes

First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

$X_i = (X_{i,1}, \ldots, X_{i,d})^T$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work (and location invariance)

Then compare data CI with simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
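A minimal sketch of this simulation step, under stated assumptions: the 2-means step is a plain Lloyd iteration from one random start (so it can land in a local minimum), the cluster index is taken as within-class sum of squares over total sum of squares, and all function names are hypothetical rather than taken from any SigClust implementation.

```python
import random


def cluster_index(points, labels):
    """2-means cluster index: within-class sum of squares / total sum of squares."""
    def sum_sq(rows):
        if not rows:
            return 0.0
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        return sum((x - m) ** 2 for r in rows for x, m in zip(r, mean))

    within = sum(sum_sq([p for p, lab in zip(points, labels) if lab == k])
                 for k in (0, 1))
    return within / sum_sq(points)


def two_means(points, rng, iters=50):
    """Lloyd iterations for k = 2, started from two randomly chosen data points."""
    centers = rng.sample(points, 2)
    labels = None
    for _ in range(iters):
        new = [min((0, 1),
                   key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])))
               for p in points]
        if new == labels:          # assignments stopped changing
            break
        labels = new
        for k in (0, 1):
            rows = [p for p, lab in zip(points, labels) if lab == k]
            if rows:               # keep the old center if a class empties out
                centers[k] = [sum(col) / len(rows) for col in zip(*rows)]
    return labels


def null_cis(eigenvalues, n, n_sim, seed=0):
    """Simulate n_sim data sets with X_ij ~ N(0, lambda_j) and return their CIs."""
    rng = random.Random(seed)
    sds = [lam ** 0.5 for lam in eigenvalues]
    cis = []
    for _ in range(n_sim):
        data = [[rng.gauss(0.0, s) for s in sds] for _ in range(n)]
        cis.append(cluster_index(data, two_means(data, rng)))
    return cis


def empirical_p_value(data_ci, sim_cis):
    """Fraction of simulated CIs at or below the observed CI (small CI = significant)."""
    return sum(ci <= data_ci for ci in sim_cis) / len(sim_cis)
```

Because the mean is fixed at 0 and the covariance is diagonal, the sketch relies on the location and rotation invariance of the cluster index noted on the slide.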

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)

II. Test if known cluster can be further split
(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data
(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Directions View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 0.0045

Get p-values from:
• Empirical Quantile, from simulated sample CIs
• Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0

I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters
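The two kinds of p-value mentioned above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the simulated CIs in the usage lines are made-up numbers, and fitting the Gaussian by sample mean and sample standard deviation is an assumption about how the "Gaussian fit" is done.

```python
import statistics


def empirical_p_value(data_ci, sim_cis):
    """Fraction of simulated null CIs at or below the observed CI."""
    return sum(ci <= data_ci for ci in sim_cis) / len(sim_cis)


def gaussian_fit_p_value(data_ci, sim_cis):
    """Lower-tail probability of the observed CI under a Gaussian fitted to the
    simulated CIs (mean and standard deviation estimated from the sample)."""
    fit = statistics.NormalDist(statistics.mean(sim_cis), statistics.stdev(sim_cis))
    return fit.cdf(data_ci)


# Made-up simulated CIs; the observed CI lies below all of them.
sim = [0.80, 0.82, 0.84, 0.86, 0.88]
print(empirical_p_value(0.70, sim))     # no simulated CI is this small
print(gaussian_fit_p_value(0.70, sim))  # small but positive, so still comparable
```

The usage lines show why the fitted value is useful for comparison: the empirical quantile bottoms out at 0 once the observed CI falls below every simulated CI, while the Gaussian fit still separates "small" from "very small".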

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al. 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis

Real Reasons:
Approximation provides insights
Can find simple underlying structure in complex situations

Thus various flavors are fine:

$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is...

Surprising (many times)
[Think I've got it, and then ...]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

Thanks to psycnet.apa.org

Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim \chi_d^2$, so $\|Z\|^2 = d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
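The concentration of the norm around $\sqrt{d}$ is easy to check numerically. The following is a small sketch using only the Python standard library; the function name `norm_summary`, the number of draws and the chosen dimensions are illustrative assumptions.

```python
import math
import random
import statistics


def norm_summary(d, n_draws=200, seed=0):
    """Mean and standard deviation of ||Z|| over n_draws draws of Z ~ N_d(0, I_d)."""
    rng = random.Random(seed)
    norms = [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
             for _ in range(n_draws)]
    return statistics.mean(norms), statistics.stdev(norms)


for d in (2, 20, 200, 2000):
    mean, sd = norm_summary(d)
    print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.2f}  mean norm={mean:7.2f}  sd={sd:.2f}")
```

The mean norm should track $\sqrt{d}$ while the spread stays of order 1 as $d$ grows, which is what the $O_p(1)$ term expresses.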

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Spherical Coordinates
Look At Integrand w.r.t. radius $r$
Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
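One way to make the balance point explicit is the density of the radius $R = \|Z\|$ itself. This is a standard calculation added here for illustration, not something taken from the slides: the volume element contributes a factor $r^{d-1}$ and the Gaussian density contributes $e^{-r^2/2}$.

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d):
% (surface area of the radius-r sphere) x (Gaussian density at radius r)
f_R(r) \;\propto\; \underbrace{r^{d-1}}_{\text{Lebesgue measure pushes out}}
        \cdot \underbrace{e^{-r^2/2}}_{\text{density pulls in}}, \qquad r > 0 .

% Maximize the log of the radial density:
\frac{d}{dr}\left[(d-1)\log r - \frac{r^2}{2}\right] = \frac{d-1}{r} - r = 0
\quad\Longrightarrow\quad r^{*} = \sqrt{d-1} \approx \sqrt{d} .
```

So the most likely radius grows like $\sqrt{d}$ even though the density with respect to Lebesgue measure is largest at the origin.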

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?
(we can only perceive 3 dimensions)
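The factor $\sqrt{2}$ can be traced in one line from the norm result $\|Z\| = \sqrt{d} + O_p(1)$. The short derivation below is a sketch filling in that step; it uses only the independence of $Z_1$ and $Z_2$.

```latex
Z_1 - Z_2 \sim N_d(0,\, 2 I_d)
\;\Longrightarrow\;
\frac{\|Z_1 - Z_2\|^2}{2} \sim \chi_d^2
\;\Longrightarrow\;
\|Z_1 - Z_2\| = \sqrt{2}\left(\sqrt{d} + O_p(1)\right) = \sqrt{2d} + O_p(1) .
```

Each coordinate difference has variance $1 + 1 = 2$, which is the $sd$ identity on the slide applied coordinate by coordinate.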

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?
(we can only perceive 3 dimensions)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dimensional Angles (as $d \to \infty$), As Vectors From Origin

Thanks to members.tripod.com

$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go?
  (again our perceptual limitations)
- Again 1st order structure is non-random
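The near-orthogonality of independent Gaussian vectors can be checked with a short simulation. This is a sketch under illustrative assumptions (function names, number of pairs and dimensions are all hypothetical choices), using only the standard library.

```python
import math
import random
import statistics


def angle_degrees(u, v):
    """Angle between two vectors from the origin, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cos))


def angle_spread(d, n_pairs=200, seed=0):
    """Mean and standard deviation of the angle between independent N_d(0, I_d) pairs."""
    rng = random.Random(seed)

    def draw():
        return [rng.gauss(0.0, 1.0) for _ in range(d)]

    angles = [angle_degrees(draw(), draw()) for _ in range(n_pairs)]
    return statistics.mean(angles), statistics.stdev(angles)


for d in (2, 20, 200, 2000):
    mean, sd = angle_spread(d)
    print(f"d={d:5d}  mean angle={mean:6.2f} deg  sd={sd:5.2f} deg")
```

The mean stays near 90 degrees in every dimension, while the spread shrinks roughly like $d^{-1/2}$, matching the $O_p(d^{-1/2})$ term.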

HDLSS Asymptotics: Geometrical Representation

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices"
(Modulo Rotation)

"Randomness" appears only in rotation of simplex

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist ~ $\sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asymptotics: Geometrical Representation

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
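A numerical stand-in for this simulation view can be sketched without drawing anything. Since pairwise distances do not change under rotation, the sketch skips the rotation to the screen plane and only checks how close each 3-point data set is to an equilateral triangle with side $\sqrt{2d}$. The function names and the summary statistic are hypothetical choices for illustration.

```python
import math
import random


def scaled_sides(d, rng):
    """Side lengths of the triangle formed by 3 draws from N_d(0, I_d),
    each divided by sqrt(2d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    pairs = ((0, 1), (0, 2), (1, 2))
    return [math.dist(pts[i], pts[j]) / math.sqrt(2 * d) for i, j in pairs]


def worst_deviation(d, reps=10, seed=0):
    """Largest |side / sqrt(2d) - 1| over reps independent 3-point data sets."""
    rng = random.Random(seed)
    return max(abs(s - 1.0) for _ in range(reps) for s in scaled_sides(d, rng))


for d in (2, 20, 200, 20000):
    print(f"d={d:6d}  largest |side / sqrt(2d) - 1| over 10 triangles: "
          f"{worst_deviation(d):.3f}")
```

As $d$ grows the deviation should shrink towards 0, so the 10 triangles become nearly identical equilateral triangles, which is the "rigidity after rotation" the slides describe.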

SigClust Estimation of Background Noise

bull Distribution clearly not Gaussianbull Except near the middlebull Q-Q curve is very linear there

(closely follows 45o line)bull Suggests Gaussian approx is good therebull And that MAD scale estimate is good

(Always a good idea to do such diagnostics)

SigClust Gaussian null distributrsquon

Estimation of Biological Covariance

Keep only ldquolargerdquo eigenvalues

Defined as

So for null distribution use eigenvalues

B

d 21 2Nj

)max()max( 221 NdN

SigClust Estimation of Eigenvalues

All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenvalues

Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes

SigClust Estimation of Eigenvalues

Try another data set, with fewer genes
This time:
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$X_i = (X_{i,1}, \ldots, X_{i,d})^T$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI
With simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
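The simulate-and-compare idea can be sketched in a few lines. This is an illustrative toy, not the SigClust implementation: the data are hypothetical (two planted clusters), the null eigenvalues are simply assumed known rather than estimated, the 2-means routine is a basic Lloyd iteration with random restarts, and all function names are made up here.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vec(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def two_means_ci(points, rng, n_starts=3, n_iter=15):
    """2-means cluster index: within-cluster SS / total SS (smaller = more clustered)."""
    overall = mean_vec(points)
    total_ss = sum(sq_dist(p, overall) for p in points)
    best_within = total_ss  # CI = 1 corresponds to "no split"
    for _ in range(n_starts):
        centers = rng.sample(points, 2)  # random restart
        for _ in range(n_iter):  # Lloyd iterations
            groups = [[], []]
            for p in points:
                k = int(sq_dist(p, centers[1]) < sq_dist(p, centers[0]))
                groups[k].append(p)
            if not groups[0] or not groups[1]:
                break  # degenerate split, abandon this start
            centers = [mean_vec(g) for g in groups]
        else:
            within = sum(sq_dist(p, c) for g, c in zip(groups, centers) for p in g)
            best_within = min(best_within, within)
    return best_within / total_ss

def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """Draw n_sim data sets with X_ij ~ N(0, lambda_j) independent; return their CIs."""
    sds = [math.sqrt(lam) for lam in eigenvalues]
    return [
        two_means_ci([[rng.gauss(0.0, s) for s in sds] for _ in range(n)], rng)
        for _ in range(n_sim)
    ]

rng = random.Random(0)
d, n = 10, 40
# Hypothetical "data": two planted clusters, shifted by +/-4 in the first coordinate
data = [[rng.gauss(4.0 if i < n // 2 else -4.0, 1.0)]
        + [rng.gauss(0.0, 1.0) for _ in range(d - 1)] for i in range(n)]
# Assumed null eigenvalues (taken as known here, not estimated from the data)
null_eigs = [17.0] + [1.0] * (d - 1)

data_ci = two_means_ci(data, rng)
null_cis = simulate_null_cis(null_eigs, n, n_sim=40, rng=rng)
p_value = sum(ci <= data_ci for ci in null_cis) / len(null_cis)
print(f"data CI = {data_ci:.3f}, smallest null CI = {min(null_cis):.3f}, "
      f"empirical p = {p_value:.3f}")
```

Because the CI is invariant under shifts and rotations, simulating with mean 0 and a diagonal covariance is enough; a data CI in the lower tail of the simulated CIs is evidence of clustering.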

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map)
   (Use given class labels)

II. Test if known cluster can be further split
   (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Dir'ns View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for: Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:
• Empirical Quantile: from simulated sample CIs
• Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
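The two kinds of p-value can be sketched as below. The null CIs and the data CI are hypothetical numbers, and the function names are made up for illustration. The Gaussian fit simply plugs the mean and standard deviation of the simulated CIs into a normal distribution.

```python
import statistics

def empirical_p_value(data_ci, null_cis):
    """Proportion of simulated null CIs that are at least as small as the data CI."""
    return sum(ci <= data_ci for ci in null_cis) / len(null_cis)

def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fitted to the null CIs."""
    fit = statistics.NormalDist(statistics.mean(null_cis), statistics.stdev(null_cis))
    return fit.cdf(data_ci)

# Hypothetical simulated null CIs and observed data CI
null_cis = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.57, 0.60]
data_ci = 0.50

print("empirical p-value   :", empirical_p_value(data_ci, null_cis))
print("Gaussian fit p-value:", gaussian_fit_p_value(data_ci, null_cis))
```

Here the empirical p-value is exactly 0 because no simulated CI is as small as the data CI; the Gaussian fit still returns a small positive number, which is what makes different strongly significant results comparable.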

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations

Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n,d \to \infty}$, $\lim_{\,\cdot\, \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…
Surprising (many times) [Think I've got it, and then …]
Mathematically Beautiful ()
Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

[Figure: density plot. Thanks to psycnet.apa.org]

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
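A quick numerical check of the distance-to-origin statement, as a sketch only: the dimensions and the number of replications are arbitrary choices, and the helper names are made up here.

```python
import math
import random
import statistics

rng = random.Random(1)

def norm_of_standard_normal(d):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

def norm_summary(d, n_rep=200):
    """Mean and standard deviation of the norm over n_rep independent draws."""
    norms = [norm_of_standard_normal(d) for _ in range(n_rep)]
    return statistics.mean(norms), statistics.stdev(norms)

for d in (2, 20, 200, 2000):
    mean_norm, sd_norm = norm_summary(d)
    print(f"d = {d:5d}: sqrt(d) = {math.sqrt(d):7.3f}, "
          f"mean norm = {mean_norm:7.3f}, sd of norm = {sd_norm:.3f}")
```

The mean of the norm tracks $\sqrt{d}$ while its standard deviation stays bounded as $d$ grows, which is the $O_p(1)$ term.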

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. radius
Can Show: Puts ~All Weight Near surface of sphere

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
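The balance between volume growth and density decay can be made explicit. The following is a sketch using the standard fact that the radius $R = \|Z\|$ of a $N_d(0, I_d)$ vector follows a chi distribution with $d$ degrees of freedom; the symbols $R$, $f_R$ and $c_d$ are introduced here for illustration and do not appear on the slides.

```latex
% Density of the radius R = ||Z||: volume factor r^{d-1} times Gaussian decay
f_R(r) = c_d \, r^{d-1} e^{-r^2/2}, \qquad r > 0, \qquad
c_d = \frac{2^{1 - d/2}}{\Gamma(d/2)}

% Locate the peak of the radial density
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\quad \Longleftrightarrow \quad r = \sqrt{d-1} \approx \sqrt{d}
```

The factor $r^{d-1}$ (Lebesgue measure of thin shells) pushes mass outward, the factor $e^{-r^2/2}$ (the density) pulls it inward, and the two balance at a radius of about $\sqrt{d}$.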

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
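The pairwise-distance statement can be checked numerically in the same spirit. This is a sketch with arbitrary dimensions and replication counts; the helper names are made up here.

```python
import math
import random
import statistics

rng = random.Random(2)

def pairwise_distance(d):
    """Distance between two independent draws from N_d(0, I_d)."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return math.dist(z1, z2)

def distance_summary(d, n_rep=100):
    """Mean and standard deviation of the distance over n_rep independent pairs."""
    dists = [pairwise_distance(d) for _ in range(n_rep)]
    return statistics.mean(dists), statistics.stdev(dists)

for d in (2, 20, 200, 2000):
    mean_dist, sd_dist = distance_summary(d)
    print(f"d = {d:5d}: sqrt(2d) = {math.sqrt(2 * d):7.3f}, "
          f"mean dist = {mean_dist:7.3f}, sd of dist = {sd_dist:.3f}")
```

The mean distance follows $\sqrt{2d}$ and the spread does not grow with $d$, so relative to its size the distance becomes essentially non-random.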

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

[Figure: vectors $Z_1$ and $Z_2$ from the origin. Thanks to members.tripod.com]

$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
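A small simulation of the angle statement, offered as a sketch: dimensions and replication counts are arbitrary, and the helper names are made up here.

```python
import math
import random
import statistics

rng = random.Random(3)

def angle_degrees(d):
    """Angle at the origin between two independent draws from N_d(0, I_d)."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    dot = sum(a * b for a, b in zip(z1, z2))
    cos = dot / (math.hypot(*z1) * math.hypot(*z2))
    # Clamp to guard against rounding slightly outside [-1, 1]
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def angle_summary(d, n_rep=100):
    """Mean and standard deviation of the angle over n_rep independent pairs."""
    angles = [angle_degrees(d) for _ in range(n_rep)]
    return statistics.mean(angles), statistics.stdev(angles)

for d in (2, 20, 200, 2000):
    mean_angle, sd_angle = angle_summary(d)
    print(f"d = {d:5d}: mean angle = {mean_angle:6.2f} deg, sd = {sd_angle:5.2f} deg")
```

The spread of the angle around 90 degrees shrinks roughly like $d^{-1/2}$, matching the $O_p(d^{-1/2})$ term.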

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist ~ $\sqrt{2d}$

Points lie at vertices of $\sqrt{2d}$ × "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
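The simulation view can be approximated numerically without any plotting. A triangle is determined up to rotation by its three side lengths, so this sketch reports the side lengths of each 3 point data set divided by $\sqrt{2d}$. That is an assumption about how to summarize "rigidity after rotation", not the plotting procedure used for the slides, and the helper name is made up here.

```python
import math
import random

def scaled_side_lengths(d, rng):
    """Sorted side lengths of the triangle of 3 N_d(0, I_d) points, over sqrt(2d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    sides = [math.dist(pts[i], pts[j]) for i, j in ((0, 1), (0, 2), (1, 2))]
    return sorted(s / math.sqrt(2 * d) for s in sides)

rng = random.Random(4)
for d in (2, 20, 200, 20000):
    print(f"d = {d}")
    for _ in range(10):  # 10 repetitions, as in the simulation view
        print("   " + "  ".join(f"{s:.3f}" for s in scaled_side_lengths(d, rng)))
```

For small $d$ the triangles vary widely in shape; for large $d$ all three scaled sides are close to 1, so every data set is nearly the same equilateral triangle up to rotation.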


- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo


SigClust Estimation of Eigenval's

All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Anal. Model
Instead end up with estim'd eigenvalues

SigClust Estimation of Eigenval's

Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes

SigClust Estimation of Eigenval's

Try another data set, with fewer genes
This time:
First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$
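The thresholding step on this slide can be sketched as follows. This is a minimal illustration, not the SigClust implementation: the function name `threshold_eigenvalues`, the toy data, and the value passed as `sigma2_n` are assumptions made for the example, and the background noise variance is taken as given rather than estimated.

```python
import numpy as np

def threshold_eigenvalues(data, sigma2_n):
    """Sample covariance eigenvalues, with small ones raised to sigma2_n.

    data     : (n, d) array, rows are cases and columns are variables
    sigma2_n : assumed estimate of the background noise variance
    """
    n, d = data.shape
    centered = data - data.mean(axis=0)
    # Eigenvalues of the d x d sample covariance, obtained from singular values.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = np.zeros(d)
    eigenvalues[: singular_values.size] = singular_values**2 / (n - 1)
    # Hard threshold: eigenvalues below the noise level are set to that level.
    return np.maximum(eigenvalues, sigma2_n)

rng = np.random.default_rng(0)
toy = rng.normal(size=(20, 50))        # hypothetical HDLSS toy data (n = 20 < d = 50)
lam = threshold_eigenvalues(toy, sigma2_n=1.0)
```

The returned vector plays the role of the diagonal of the null covariance; eigenvalues above the threshold are left at their sample values.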

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

$X_i = (X_{i,1}, \ldots, X_{i,d})^T$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep)

Again rotation invariance makes this work

(and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
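The simulate-and-compare step can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the cluster index is taken to be within-cluster sum of squares over total sum of squares, 2-means is a plain Lloyd iteration with a few random restarts, and all function names, eigenvalues, and toy data are hypothetical.

```python
import numpy as np

def cluster_index(x, labels):
    """Within-cluster sum of squares divided by total sum of squares."""
    total_ss = ((x - x.mean(axis=0)) ** 2).sum()
    within_ss = sum(((x[labels == k] - x[labels == k].mean(axis=0)) ** 2).sum()
                    for k in np.unique(labels))
    return within_ss / total_ss

def two_means_labels(x, rng, n_iter=50):
    """Labels from Lloyd's algorithm with k = 2, started at two random points."""
    centers = x[rng.choice(len(x), size=2, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        if labels.min() == labels.max():       # one cluster became empty
            break
        centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """2-means CIs for samples of size n from N(0, diag(eigenvalues))."""
    cis = np.empty(n_sim)
    for s in range(n_sim):
        sample = rng.normal(size=(n, len(eigenvalues))) * np.sqrt(eigenvalues)
        # Keep the best of a few restarts, to reduce local-minimum effects.
        cis[s] = min(cluster_index(sample, two_means_labels(sample, rng))
                     for _ in range(3))
    return cis

def empirical_p_value(data_ci, null_cis):
    """Fraction of null CIs at or below the data CI (small CI is significant)."""
    return float(np.mean(null_cis <= data_ci))

rng = np.random.default_rng(1)
eigenvalues = np.array([4.0, 2.0] + [1.0] * 18)   # hypothetical null eigenvalues
null_cis = simulate_null_cis(eigenvalues, n=30, n_sim=100, rng=rng)

# Toy "data": two well separated groups with known labels.
data = rng.normal(size=(30, 20))
data[:15, 0] += 10.0
data_ci = cluster_index(data, np.repeat([0, 1], 15))
p_emp = empirical_p_value(data_ci, null_cis)
```

Because significance corresponds to small CI, the p-value is a lower-tail proportion, which is the reverse of the usual upper-tail convention.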

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)

II. Test if known cluster can be further split
(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Dir'ns View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:

Empirical Quantile: From simulated sample CIs

Fit Gaussian Quantile: Don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
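The two kinds of p-value can be contrasted in a short sketch. The function name `gaussian_fit_p_value` and the numbers below are made up for illustration; the only assumption about the method is that a Gaussian is fitted to the simulated CIs by their mean and standard deviation, and the lower tail at the data CI is reported.

```python
import math
import numpy as np

def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability at data_ci under a Gaussian fitted to null_cis."""
    mean = float(np.mean(null_cis))
    sd = float(np.std(null_cis, ddof=1))
    z = (data_ci - mean) / sd
    # Standard normal CDF via erfc, which stays accurate far in the tail.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

null_cis = np.array([0.80, 0.82, 0.79, 0.81, 0.83, 0.78, 0.80, 0.82])  # made-up values
data_ci = 0.70

p_emp = float(np.mean(null_cis <= data_ci))         # empirical quantile
p_gauss = gaussian_fit_p_value(data_ci, null_cis)   # Gaussian fit quantile
```

When no simulated CI falls below the data CI the empirical value is exactly 0, so only the Gaussian fit value can rank how strong two such results are relative to each other.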

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al. 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis

Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations

Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is...

Surprising (many times)
[Think I've got it, and then ...]

Mathematically Beautiful

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

(Thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$): (measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
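A quick simulation can make the statement $\|Z\| = \sqrt{d} + O_p(1)$ concrete. This is a hedged sketch: the dimensions, the replication count `n_rep`, and the dictionary `summary` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep = 500                                    # replications per dimension (arbitrary)
summary = {}
for d in (2, 20, 200, 2000):
    z = rng.normal(size=(n_rep, d))            # rows are draws from N_d(0, I_d)
    norms = np.linalg.norm(z, axis=1)
    # Mean of ||Z|| relative to sqrt(d), and the spread of ||Z|| itself.
    summary[d] = (norms.mean() / np.sqrt(d), norms.std())
    print(d, round(summary[d][0], 3), round(summary[d][1], 3))
```

The ratio to $\sqrt{d}$ should approach 1 while the spread of $\|Z\|$ stays bounded (its limiting standard deviation is $1/\sqrt{2}$), which is what "roughly on the surface of a sphere" means here.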

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. radius $r$
Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
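The balance point can be seen in a short worked calculation. This is a standard computation added for illustration, with $c_d$ denoting the normalizing constant; it is not taken from the slides. In spherical coordinates the density of the radius $\|Z\|$ is the Gaussian density times the surface-area factor:

```latex
\[
  f_{\|Z\|}(r) = c_d \, r^{d-1} e^{-r^2/2}, \qquad r > 0,
  \qquad c_d = \frac{2^{1 - d/2}}{\Gamma(d/2)} .
\]
\[
  \frac{d}{dr} \log f_{\|Z\|}(r) = \frac{d-1}{r} - r = 0
  \iff r = \sqrt{d-1} \approx \sqrt{d} .
\]
```

The factor $r^{d-1}$ comes from Lebesgue measure and grows with $r$, the factor $e^{-r^2/2}$ comes from the density and decays, and their product peaks near $\sqrt{d}$.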

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
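The distance result can be checked numerically with a small sketch. The dimensions and the replication count are arbitrary, and the dictionary `ratio` is a name introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep = 500                                    # replications per dimension (arbitrary)
ratio = {}
for d in (2, 20, 200, 2000):
    z1 = rng.normal(size=(n_rep, d))
    z2 = rng.normal(size=(n_rep, d))           # drawn independently of z1
    dist = np.linalg.norm(z1 - z2, axis=1)
    ratio[d] = dist / np.sqrt(2 * d)           # should concentrate near 1 as d grows
    print(d, round(ratio[d].mean(), 3), round(ratio[d].std(), 3))
```

The spread of the ratio shrinks as $d$ grows because the $O_p(1)$ fluctuation is divided by $\sqrt{2d}$.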

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

(Thanks to members.tripod.com)

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal

- Where do they all go? (again our perceptual limitations)

- Again 1st order structure is non-random
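The near-orthogonality of independent Gaussian vectors can be illustrated the same way. This is a sketch with arbitrary dimensions and replication count; `angle_stats` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rep = 500                                    # replications per dimension (arbitrary)
angle_stats = {}
for d in (2, 20, 200, 2000):
    z1 = rng.normal(size=(n_rep, d))
    z2 = rng.normal(size=(n_rep, d))
    cosine = (z1 * z2).sum(axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1))
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    angles = np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))
    angle_stats[d] = (angles.mean(), angles.std())
    print(d, round(angle_stats[d][0], 2), round(angle_stats[d][1], 2))
```

The mean angle stays near 90 degrees in every dimension, and the spread around it shrinks at roughly the rate $d^{-1/2}$.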

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices"

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist ~ $\sqrt{2d}$

Points lie at vertices of $\sqrt{2d}$ × "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data
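The "regular $n$-hedron" statement says that all pairwise distances, divided by $\sqrt{2d}$, are close to 1 when $d$ is large. A minimal check, with the function name `scaled_pairwise_distances` and the choice $n = 5$ being assumptions for the example:

```python
import numpy as np

def scaled_pairwise_distances(n, d, rng):
    """Pairwise distances among n draws from N_d(0, I_d), divided by sqrt(2d)."""
    z = rng.normal(size=(n, d))
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    upper = np.triu_indices(n, k=1)            # count each pair once
    return dist[upper] / np.sqrt(2 * d)

rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    scaled = scaled_pairwise_distances(n=5, d=d, rng=rng)
    print(d, round(scaled.min(), 3), round(scaled.max(), 3))
```

As $d$ grows, the smallest and largest scaled distances both approach 1, so the $n$ points look like the vertices of a regular simplex up to rotation.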

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
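The steps of the simulation view can be sketched as follows. Several choices here are assumptions, since the slides do not specify them: the plane is found by an SVD of the centered points, "comparable" is taken to mean that the first point is rotated onto the positive horizontal axis (with a reflection so the second point is not below it), and the coordinates are divided by $\sqrt{2d}$ so that different dimensions share one scale. The name `planar_view` is hypothetical.

```python
import numpy as np

def planar_view(points):
    """2-d coordinates of 3 points in R^d, within the plane they generate."""
    centered = points - points.mean(axis=0)
    # Rows of vt give an orthonormal basis; the first two span the plane.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T               # "rotate to plane of screen"
    # Rotate within the plane so the first point lies on the positive x axis.
    theta = np.arctan2(coords[0, 1], coords[0, 0])
    c, s = np.cos(theta), np.sin(theta)
    coords = coords @ np.array([[c, -s], [s, c]])
    if coords[1, 1] < 0:                       # reflect so the views are comparable
        coords[:, 1] *= -1
    return coords

rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    for _ in range(10):                        # 10 repetitions per dimension
        view = planar_view(rng.normal(size=(3, d))) / np.sqrt(2 * d)
    print(d, np.round(view, 3).tolist())       # last repetition only
```

Plotting the ten views for each dimension in different colors would reproduce the kind of picture the slide describes: widely varying triangles for d = 2 and nearly identical equilateral triangles for d = 20000.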

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distributrsquon
  • SigClust Gaussian null distributrsquon (2)
  • SigClust Gaussian null distributrsquon (3)
  • SigClust Gaussian null distributrsquon (4)
  • SigClust Gaussian null distributrsquon (5)
  • SigClust Gaussian null distributrsquon (6)
  • SigClust Gaussian null distributrsquon (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distributrsquon (8)
  • SigClust Estimation of Eigenvalrsquos
  • SigClust Estimation of Eigenvalrsquos (2)
  • SigClust Estimation of Eigenvalrsquos (3)
  • SigClust Estimation of Eigenvalrsquos (4)
  • SigClust Estimation of Eigenvalrsquos (5)
  • SigClust Gaussian null distribution - Simulation
  • SigClust Gaussian null distribution - Simulation (2)
  • An example (details to follow)
  • SigClust Modalities
  • SigClust Modalities (2)
  • SigClust Real Data Results
  • Perou 500 PCA View ndash real clusters
  • Perou 500 DWD Dirrsquons View ndash real clusters
  • Perou 500 ndash Fundamental Question
  • SigClust Results for Luminal A vs Luminal B
  • SigClust Results for Luminal A vs Luminal B (2)
  • SigClust Results for Luminal A vs Luminal B (3)
  • SigClust Results for Luminal A vs Luminal B (4)
  • SigClust Results for Luminal A vs Luminal B (5)
  • SigClust Real Data Results (2)
  • SigClust Real Data Results (3)
  • SigClust Real Data Results (4)
  • SigClust Real Data Results (5)
  • SigClust Overview
  • SigClust Overview (2)
  • SigClust Overview (3)
  • SigClust Overview (4)
  • SigClust Open Problems
  • HDLSS Asymptotics
  • HDLSS Asymptotics (2)
  • HDLSS Asymptotics (3)

SigClust Estimation of Eigenvalues

All eigenvalues > $\sigma_N^2$
Suggests biology is very strong here
I.e. very strong signal to noise ratio
Have more structure than can analyze (with only 533 data points)
Data are very far from pure noise
So don't actually use Factor Analysis Model
Instead end up with estimated eigenvalues

SigClust Estimation of Eigenvalues

Do we need the factor model?
Explore this with another data set (with fewer genes)
This time: n = 315 cases, d = 306 genes

First ~110 eigenvalues > $\sigma_N^2$
Rest are negligible
So threshold smaller ones at $\sigma_N^2$
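The thresholding step can be sketched in a few lines of Python. This is only an illustration: the function name `threshold_eigenvalues` and the numbers are made up, and the background noise variance $\sigma_N^2$ is assumed to have been estimated already.

```python
def threshold_eigenvalues(sample_eigenvalues, noise_variance):
    """Raise every eigenvalue that falls below the noise variance up to it."""
    return [max(ev, noise_variance) for ev in sample_eigenvalues]


# Made-up eigenvalues for illustration only (not taken from the data in the talk).
sample_eigs = [9.0, 4.0, 1.5, 0.2, 0.01, 0.0]
sigma2_N = 0.5  # hypothetical background noise variance
null_eigs = threshold_eigenvalues(sample_eigs, sigma2_N)
print(null_eigs)
```

Large eigenvalues pass through unchanged; only the negligible ones are lifted to the noise level.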

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:
$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.), with $\lambda_j$ the estimated eigenvalues
Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI with simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI

An example (details to follow)

P-val = 0.0045
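A minimal Python sketch of this simulation follows. It is an illustration under stated assumptions, not the SigClust implementation. All function names are hypothetical, and the 2-means step is a plain Lloyd's algorithm with random restarts. The eigenvalues fed to the null are simply per-coordinate sample variances, a crude stand-in for the eigenvalue estimation discussed earlier.

```python
import math
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def cluster_index(points, labels):
    """Within-class sum of squares divided by total sum of squares (2 classes)."""
    overall = centroid(points)
    total = sum(sq_dist(p, overall) for p in points)
    within = 0.0
    for k in (0, 1):
        rows = [p for p, lab in zip(points, labels) if lab == k]
        if rows:
            c = centroid(rows)
            within += sum(sq_dist(p, c) for p in rows)
    return within / total


def two_means_ci(points, rng, n_starts=5, n_iter=50):
    """Lloyd's algorithm for k = 2 with random restarts; returns the smallest CI."""
    best = float("inf")
    for _ in range(n_starts):
        centers = rng.sample(points, 2)
        labels = None
        for _ in range(n_iter):
            new = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                   for p in points]
            if new == labels:
                break
            labels = new
            for k in (0, 1):
                rows = [p for p, lab in zip(points, labels) if lab == k]
                if rows:
                    centers[k] = centroid(rows)
        best = min(best, cluster_index(points, labels))
    return best


def simulate_null_cis(n, eigenvalues, n_sim, rng):
    """CIs of 2-means on samples of size n drawn from N(0, diag(eigenvalues))."""
    sds = [math.sqrt(lam) for lam in eigenvalues]
    return [two_means_ci([[rng.gauss(0.0, s) for s in sds] for _ in range(n)], rng)
            for _ in range(n_sim)]


def empirical_p_value(data_ci, null_cis):
    """Fraction of null CIs at or below the data CI (small CI = strong clustering)."""
    return sum(ci <= data_ci for ci in null_cis) / len(null_cis)


rng = random.Random(0)
d, half = 10, 15
# Toy data: two groups separated along the first coordinate.
data = [[rng.gauss(shift if j == 0 else 0.0, 1.0) for j in range(d)]
        for shift in (-5.0, 5.0) for _ in range(half)]
data_ci = two_means_ci(data, rng)
# Assumption: per-coordinate sample variances stand in for the estimated eigenvalues.
center = centroid(data)
eigs = [sum((p[j] - center[j]) ** 2 for p in data) / (len(data) - 1) for j in range(d)]
null_cis = simulate_null_cis(len(data), eigs, n_sim=50, rng=rng)
p_value = empirical_p_value(data_ci, null_cis)
print(f"data CI = {data_ci:.3f}, empirical p-value = {p_value:.3f}")
```

Because the CI is invariant to shifts and rotations, simulating with mean 0 and a diagonal covariance is enough; only the eigenvalues matter.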

SigClust Modalities

Two major applications:
I. Test significance of given clusterings (e.g. for those found in heat map) (Use given class labels)
II. Test if known cluster can be further split (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)
Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Directions View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?
Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

Get p-values from:
• Empirical Quantile: from simulated sample CIs
• Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
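The two kinds of p-value can be compared with a short Python sketch. The function names and the simulated CI values below are made up for illustration; the Gaussian fit here simply matches the mean and standard deviation of the simulated CIs, which is an assumption about how the fit is done.

```python
from statistics import NormalDist, mean, stdev


def empirical_p_value(data_ci, null_cis):
    """Fraction of simulated null CIs that are at or below the data CI."""
    return sum(ci <= data_ci for ci in null_cis) / len(null_cis)


def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability of the data CI under a Gaussian fitted to the null CIs."""
    fit = NormalDist(mean(null_cis), stdev(null_cis))
    return fit.cdf(data_ci)


# Made-up simulated CIs and data CI, for illustration only.
null_cis = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.57, 0.60, 0.62, 0.58]
data_ci = 0.50

p_emp = empirical_p_value(data_ci, null_cis)
p_gauss = gaussian_fit_p_value(data_ci, null_cis)
print(f"empirical p-value    = {p_emp}")
print(f"Gaussian fit p-value = {p_gauss:.2e}")
```

When no simulated CI falls below the data CI, the empirical p-value is exactly 0 and says nothing about how extreme the data are; the Gaussian fit still gives a number that can be compared across tests.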

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used
Sample Eigenvalues Always Valid
But Can be Too Conservative
Above Factor Threshold Anti-Conservative
Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation
More attention to Local Minima in 2-means Clustering
Theoretical Null Distributions
Inference for k > 2 means Clustering
Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations, almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
• Laws of Large Numbers (Consistency)
• Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for "large" samples
• Models phenomenon of "increasing data"
• So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
Real Reasons:
• Approximation provides insights
• Can find simple underlying structure in complex situations
Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, and a limit as a parameter tends to 0
Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is...
• Surprising (many times) [Think I've got it, and then ...]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:
$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?
(Figure: thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
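A quick simulation makes the statement $\|Z\| = \sqrt{d} + O_p(1)$ concrete. This is a small sketch using only the Python standard library; the dimensions, the number of draws, and the function name are arbitrary choices for illustration.

```python
import math
import random


def gaussian_norm(d, rng):
    """Euclidean norm of one draw from the d-dimensional standard normal."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


rng = random.Random(1)
max_deviation = {}
for d in (10, 1000, 10000):
    norms = [gaussian_norm(d, rng) for _ in range(20)]
    # Largest gap between an observed norm and sqrt(d) over the 20 draws.
    max_deviation[d] = max(abs(r - math.sqrt(d)) for r in norms)
    print(f"d = {d:6d}  sqrt(d) = {math.sqrt(d):8.2f}  "
          f"norms in [{min(norms):.2f}, {max(norms):.2f}]")
```

The norms grow like $\sqrt{d}$, while their spread around $\sqrt{d}$ stays of order 1, so relative to the radius the data sit on an increasingly thin shell.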

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Spherical Coordinates
Look At Integrand w.r.t. radius $r$
Can Show Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
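The balance point can be made explicit with a short worked derivation. This is a sketch under the standard normal model above; the symbols $R$, $f_R$ and $c_d$ are introduced here for illustration and do not appear on the slides.

```latex
% R = \|Z\| for Z \sim N_d(0, I_d); c_d is a normalizing constant.
f_R(r) = c_d \, r^{d-1} e^{-r^2/2}, \qquad r > 0
% r^{d-1}:     Lebesgue measure of a thin spherical shell (pushes mass out)
% e^{-r^2/2}:  Gaussian density, largest at the origin (pulls data in)
\frac{\partial}{\partial r} \log f_R(r) = \frac{d-1}{r} - r = 0
\quad \Longrightarrow \quad r = \sqrt{d-1} \approx \sqrt{d}
```

The radial density therefore peaks near $\sqrt{d}$, even though the density of $Z$ itself is largest at the origin.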

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"
Parents' Lament: Why Can't I Have Average Children?
Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Distance Between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$
Where do they all go? (we can only perceive 3 dimensions)
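The $\sqrt{2d}$ behavior of the pairwise distance is easy to check numerically. This is a small standard-library sketch; the dimensions and the seed are arbitrary, and `gaussian_point` is a hypothetical helper name.

```python
import math
import random


def gaussian_point(d, rng):
    """One draw from the d-dimensional standard normal."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


rng = random.Random(2)
ratios = {}
for d in (10, 1000, 10000):
    z1, z2 = gaussian_point(d, rng), gaussian_point(d, rng)
    distance = math.dist(z1, z2)
    # Ratio of the observed distance to the limiting value sqrt(2 d).
    ratios[d] = distance / math.sqrt(2 * d)
    print(f"d = {d:6d}  distance = {distance:8.2f}  ratio to sqrt(2d) = {ratios[d]:.3f}")
```

The ratio approaches 1 as $d$ grows, because the $O_p(1)$ fluctuation becomes negligible next to $\sqrt{2d}$.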

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dimensions)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dimensional Angles (as $d \to \infty$), As Vectors From Origin:
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
(Figure: thanks to members.tripod.com)
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
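The near-orthogonality of independent Gaussian vectors can be seen in a short simulation. This is a sketch with the Python standard library; the function name, seed, and dimensions are illustrative choices.

```python
import math
import random


def angle_degrees(a, b):
    """Angle at the origin between vectors a and b, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard against rounding
    return math.degrees(math.acos(cosine))


rng = random.Random(3)
angles = {}
for d in (10, 1000, 10000):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    angles[d] = angle_degrees(z1, z2)
    print(f"d = {d:6d}  angle = {angles[d]:6.2f} degrees")
```

As $d$ grows the angle settles near 90 degrees, with fluctuations shrinking at the rate $d^{-1/2}$.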

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data:
$n - 1$ dimensional hyperplane
Points are pairwise equidistant, dist $\sim \sqrt{2d}$
Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
Again "randomness in data" is only in rotation
Surprisingly rigid structure in random data

HDLSS Asymptotics: Geometrical Representation

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
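The rigidity can also be checked numerically without drawing anything. The sketch below does not reproduce the plots from the slides. It uses the fact that the side lengths of a triangle do not change under rotation, so it is enough to look at the three pairwise distances of each 3 point data set, scaled by $\sqrt{2d}$. The function name and seed are illustrative choices.

```python
import math
import random
from itertools import combinations


def side_ratios(d, rng):
    """Pairwise distances of 3 standard normal points in dimension d, over sqrt(2 d)."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    return [math.dist(a, b) / math.sqrt(2 * d) for a, b in combinations(points, 2)]


rng = random.Random(4)
spread = {}
for d in (2, 20, 200, 20000):
    # 10 repetitions, as in the simulation view: 30 scaled side lengths per dimension.
    ratios = [r for _ in range(10) for r in side_ratios(d, rng)]
    spread[d] = max(ratios) - min(ratios)
    print(f"d = {d:6d}  scaled side lengths in [{min(ratios):.3f}, {max(ratios):.3f}]")
```

In low dimension the triangles have very different shapes; in high dimension every scaled side length is close to 1, so all 10 triangles are nearly the same equilateral triangle up to rotation.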


SigClust Estimation of Eigenvalrsquos

Do we need the factor model Explore this with another data set

(with fewer genes) This time

n = 315 cases d = 306 genes

SigClust Estimation of Eigenvalrsquos

SigClust Estimation of Eigenvalrsquos

Try another data set with fewer genesThis time

First ~110 eigenvalues gt Rest are negligibleSo threshold smaller ones at

2N

2N

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

where (indep)

Again rotation invariance makes this work

(and location invariance)

jij NX 0~

id

i

X

X

1

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

bull Spirit similar to DiProPermbull But now significance happens for

smaller values of CI

An example (details to follow)

P-val = 00045

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

IITest if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal

Perou 500 PCA View ndash real clusters

Perou 500 DWD Dirrsquons View ndash real clusters

Perou 500 ndash Fundamental Question

Are Luminal A amp Luminal B really distinct clusters

Famous forFar Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 00045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs Luminal B

I Test significance of given clusteringsbull Empirical p-val = 0

ndash Definitely 2 clusters

bull Gaussian fit p-val = 00045ndash same strong evidence

bull Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II Test if known cluster can be further split

bull Empirical p-val = 0ndash definitely 2 clusters

bull Gaussian fit p-val = 10-10

ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)

bull Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19

Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10

Split Luminal A p-val = 10-7

Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005

SigClust Real Data Results

Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition

(insight about signal vs noise)bull How good are others

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

Personal Observations

HDLSS world ishellip

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquond

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Where are Data

Near Peak of Density

Thanks to psycnetapaorg

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

(measure how close to peak)

d

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

)1(pOdZ

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

- Paradox resolved by

density w r t Lebesgue Measure

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt Can Show Puts ~ All Weight Near

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
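A quick numerical look at the pairwise distances, as a sketch only: the dimension $d = 2000$, the sample size $n = 5$ and the seed are arbitrary choices for illustration.

```python
import math
import random


def gaussian_vector(d, rng):
    """One draw from N_d(0, I_d) as a plain list."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(1)  # fixed seed, for reproducibility only
d, n = 2000, 5
points = [gaussian_vector(d, rng) for _ in range(n)]

# All n(n-1)/2 pairwise distances among Z_1, ..., Z_n.
pairwise = [
    euclidean_distance(points[i], points[j])
    for i in range(n)
    for j in range(i + 1, n)
]
target = math.sqrt(2 * d)
print(f"sqrt(2d) = {target:.3f}")
print("pairwise distances:", [round(x, 2) for x in pairwise])
```

Every distance should land within a few units of $\sqrt{2d}$, even though the points themselves are random.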

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random

[Figure: vectors $Z_1$ and $Z_2$ drawn from the origin. Thanks to: members.tripod.com]
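The near-orthogonality can also be seen in a short simulation. This is a sketch with an arbitrary seed and arbitrary dimensions, not code from the talk.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between two vectors, viewed as vectors from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cosine))


rng = random.Random(2)  # fixed seed, for reproducibility only
angles = {}
for d in (2, 20, 200, 20000):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    angles[d] = angle_degrees(z1, z2)
    print(f"d={d:6d}  angle={angles[d]:7.2f} degrees")
```

For small $d$ the angle can be anything between 0 and 180 degrees; for large $d$ it should settle close to 90.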

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d}$ × "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
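A rough numerical version of this simulation view is sketched below. It is an illustration under assumptions of my own: instead of rotating the triangle into the plane of the screen, it looks only at the three side lengths, which are unchanged by rotation, and rescales them by $\sqrt{2d}$. The seed is arbitrary.

```python
import math
import random


def scaled_triangle_sides(d, rng):
    """Side lengths of a random 3 point Gaussian data set, divided by sqrt(2d)."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    scale = math.sqrt(2 * d)
    sides = []
    for i, j in ((0, 1), (0, 2), (1, 2)):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
        sides.append(dist / scale)
    return sides


rng = random.Random(3)  # fixed seed, for reproducibility only
worst = {}
for d in (2, 20, 200, 20000):
    deviations = []
    for _ in range(10):  # repeat 10 times, as on the slide
        deviations.extend(abs(s - 1.0) for s in scaled_triangle_sides(d, rng))
    worst[d] = max(deviations)
    print(f"d={d:6d}  largest deviation of a scaled side from 1: {worst[d]:.4f}")
```

If the scaled sides are all close to 1, each triangle is close to equilateral, so the ten data sets differ essentially only by a rotation.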


SigClust Estimation of Eigenval's

SigClust Estimation of Eigenval's

Try another data set, with fewer genes. This time:

First ~110 eigenvalues $> \sigma_N^2$, Rest are negligible

So threshold smaller ones at $\sigma_N^2$
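The thresholding step can be written in a few lines. This is a minimal sketch of hard thresholding at the background noise level; the function name and the numbers are hypothetical and chosen only for illustration.

```python
def threshold_eigenvalues(sample_eigenvalues, noise_variance):
    """Hard thresholding: raise every eigenvalue below the noise variance up to it."""
    return [max(lam, noise_variance) for lam in sample_eigenvalues]


# Hypothetical sample eigenvalues and background noise variance sigma_N^2.
sample_eigenvalues = [9.0, 4.0, 1.5, 0.3, 0.0, 0.0]
sigma_n_squared = 1.0

null_eigenvalues = threshold_eigenvalues(sample_eigenvalues, sigma_n_squared)
print(null_eigenvalues)  # [9.0, 4.0, 1.5, 1.0, 1.0, 1.0]
```

Eigenvalues above the noise level are kept as they are; the negligible ones are replaced by the noise variance, so the null Gaussian never has less spread than the background noise.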

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using:

$X_i = (X_{i,1}, \dots, X_{i,d})^t$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep.)

Again rotation invariance makes this work (and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI, With simulated null population CIs
• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
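The simulation steps above can be sketched as follows. This is an illustrative toy version, not the SigClust implementation: the 2-means search is a plain Lloyd iteration with a few random starts, the data set is made up, and the eigenvalues are assumed rather than estimated from the data.

```python
import math
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def two_means_cluster_index(data, rng, restarts=3, iterations=10):
    """2-means CI: within-cluster sum of squares over total sum of squares.

    Local search only, so the minimum found is not guaranteed to be global.
    """
    overall = mean_vector(data)
    total_ss = sum(squared_distance(x, overall) for x in data)
    best_ci = 1.0  # CI when no split is made
    for _ in range(restarts):
        centers = rng.sample(data, 2)
        for _ in range(iterations):
            groups = [[], []]
            for x in data:
                d0 = squared_distance(x, centers[0])
                d1 = squared_distance(x, centers[1])
                groups[0 if d0 <= d1 else 1].append(x)
            if not groups[0] or not groups[1]:
                break  # degenerate split, move on to the next restart
            centers = [mean_vector(g) for g in groups]
            within_ss = sum(
                squared_distance(x, c) for g, c in zip(groups, centers) for x in g
            )
            best_ci = min(best_ci, within_ss / total_ss)
    return best_ci


def simulate_null_cis(eigenvalues, n, n_sim, rng):
    """CIs of mean-zero Gaussian samples with independent N(0, lambda_j) coordinates."""
    sds = [math.sqrt(lam) for lam in eigenvalues]
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, sd) for sd in sds] for _ in range(n)]
        cis.append(two_means_cluster_index(sample, rng))
    return cis


rng = random.Random(4)  # fixed seed, for reproducibility only
d, n = 10, 30
# Toy data (hypothetical): two groups of 15 points, centered at +5 and -5 on coordinate 1.
data = [
    [rng.gauss(5.0 if i < n // 2 else -5.0, 1.0)] + [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
    for i in range(n)
]
# Assumed null eigenvalues; in practice these would be estimated from the data.
eigenvalues = [26.0] + [1.0] * (d - 1)

data_ci = two_means_cluster_index(data, rng)
null_cis = simulate_null_cis(eigenvalues, n, n_sim=50, rng=rng)
# Significance is for small CI, so count simulated CIs at or below the data CI.
p_value = sum(ci <= data_ci for ci in null_cis) / len(null_cis)
print(f"data CI = {data_ci:.3f}, mean null CI = {sum(null_cis) / len(null_cis):.3f}, p = {p_value:.3f}")
```

Simulating with mean 0 and a diagonal covariance is enough here because the CI is unchanged by shifts and rotations of the data.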

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings (e.g. for those found in heat map) (Use given class labels)

II. Test if known cluster can be further split (Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data (large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters

Perou 500 DWD Dir'ns View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

SigClust Results for Luminal A vs. Luminal B

Get p-values from:
• Empirical Quantile: From simulated sample CIs
• Fit Gaussian Quantile: Don't "believe these", But useful for comparison, Especially when Empirical Quantile = 0
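The two kinds of p-value can be computed side by side. The sketch below uses made-up null CIs and a made-up data CI; the function names are hypothetical. The Gaussian fit takes the mean and standard deviation of the simulated CIs and reads off a lower-tail normal probability.

```python
import math
import statistics


def empirical_p_value(data_ci, null_cis):
    """Fraction of simulated null CIs at or below the data CI."""
    return sum(ci <= data_ci for ci in null_cis) / len(null_cis)


def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability at data_ci of a Gaussian fitted to the null CIs."""
    mu = statistics.mean(null_cis)
    sd = statistics.stdev(null_cis)
    z = (data_ci - mu) / sd
    # Phi(z) written with erfc, which stays accurate far out in the lower tail.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


# Hypothetical simulated null CIs and data CI, for illustration only.
null_cis = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.57, 0.60]
data_ci = 0.50

p_empirical = empirical_p_value(data_ci, null_cis)
p_gaussian = gaussian_fit_p_value(data_ci, null_cis)
print(f"empirical p-val = {p_empirical}, Gaussian fit p-val = {p_gaussian:.2e}")
```

When no simulated CI falls below the data CI the empirical p-value is exactly 0, while the Gaussian fit still gives a small positive number that can be compared across tests.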

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$
Luminal A vs. B, p-val = 0.0045
Her 2 vs. Basal, p-val = $10^{-10}$
Split Luminal A, p-val = $10^{-7}$
Split Luminal B, p-val = 0.058
Split Her 2, p-val = 0.10
Split Basal, p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar
Smaller data sets: less power
Gene filtering: more power
Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:
Experienced Analysts Impressively Good
SigClust can save them time
SigClust can help them with skeptics
SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid, But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Workhorse Method for Much Insight:
  Laws of Large Numbers (Consistency)
  Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$
Occasional misconceptions:
  Indicates behavior for large samples
  Thus only makes sense for "large" samples
  Models phenomenon of "increasing data"
  So other flavors are useless?

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
Real Reasons:
  Approximation provides insights
  Can find simple underlying structure, In complex situations
Thus various flavors are fine:
  $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times) [Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

[Figure: density plot. Thanks to: psycnet.apa.org]

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$): (measure how close to peak)


SigClust Estimation of Eigenval's

Try another data set, with fewer genes. This time:

First ~110 eigenvalues > $\sigma_N^2$; the rest are negligible. So threshold the smaller ones at $\sigma_N^2$.
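A minimal sketch of this thresholding step, assuming the sample eigenvalues and an estimate of the background noise level are already in hand. The function name `threshold_eigenvalues`, the argument `sigma2_n`, and the toy numbers are illustrative and do not come from the slides.

```python
def threshold_eigenvalues(sample_eigs, sigma2_n):
    """Raise every eigenvalue below the noise level sigma2_n up to sigma2_n."""
    return [max(lam, sigma2_n) for lam in sample_eigs]

# Toy values (made up): two large eigenvalues, the rest below the noise level.
sample_eigs = [9.0, 4.0, 0.3, 0.1, 0.0]
print(threshold_eigenvalues(sample_eigs, sigma2_n=1.0))
# prints [9.0, 4.0, 1.0, 1.0, 1.0]
```

The large eigenvalues are left alone; only the negligible ones are lifted to the noise level.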

SigClust Gaussian null distribution - Simulation

Now simulate from null distribution using

$X_i = (X_{i,1}, \dots, X_{i,d})^T$, where $X_{i,j} \sim N(0, \lambda_j)$ (indep)

Again rotation invariance makes this work

(and location invariance)

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
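The simulate-and-compare step can be sketched as follows. This is only an illustration under stated assumptions: each null coordinate j is drawn independently from a mean-zero Gaussian with variance `lam[j]`, the 2-means CI is taken as within-class sum of squares over total sum of squares, and the clustering uses plain Lloyd iterations with a few random restarts (the slides do not say which 2-means algorithm is used). All function names and the eigenvalues in `lam` are made up.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vec(points):
    return [sum(col) / len(points) for col in zip(*points)]

def cluster_index(points, labels):
    """2-means CI: within-class sum of squares over total sum of squares."""
    overall = mean_vec(points)
    total = sum(sq_dist(p, overall) for p in points)
    within = 0.0
    for k in (0, 1):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            center = mean_vec(members)
            within += sum(sq_dist(p, center) for p in members)
    return within / total

def two_means_ci(points, rng, restarts=5, iters=50):
    """Smallest CI found by Lloyd's algorithm over a few random starts."""
    best = 1.0
    for _ in range(restarts):
        centers = rng.sample(points, 2)
        labels = None
        for _ in range(iters):
            new = [0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
                   for p in points]
            if new == labels:
                break  # assignments stopped changing
            labels = new
            for k in (0, 1):
                members = [p for p, lab in zip(points, labels) if lab == k]
                if members:
                    centers[k] = mean_vec(members)
        best = min(best, cluster_index(points, labels))
    return best

def simulate_null_cis(lam, n, n_sim, rng):
    """CIs of n_sim Gaussian null samples with diagonal covariance diag(lam)."""
    cis = []
    for _ in range(n_sim):
        sample = [[rng.gauss(0.0, l ** 0.5) for l in lam] for _ in range(n)]
        cis.append(two_means_ci(sample, rng))
    return cis

rng = random.Random(0)
lam = [4.0, 2.0, 1.0, 1.0, 1.0]  # hypothetical (thresholded) eigenvalues
null_cis = simulate_null_cis(lam, n=30, n_sim=20, rng=rng)
print(min(null_cis), max(null_cis))
```

A data CI well below the whole range of `null_cis` is what counts as evidence of clustering here, since smaller CI means tighter clusters.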

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications

I. Test significance of given clusterings

(e.g. for those found in heat map)

(Use given class labels)

II. Test if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters

Perou 500 DWD Dir'ns View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 0.0045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile: Don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
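The two kinds of p-value can be sketched as below. This is an illustration only: the function names are hypothetical, the null CIs are made-up numbers, and the Gaussian fit simply uses the sample mean and standard deviation of the simulated CIs.

```python
from statistics import NormalDist, mean, stdev

def empirical_p_value(data_ci, null_cis):
    """Fraction of simulated null CIs at or below the data CI."""
    return sum(c <= data_ci for c in null_cis) / len(null_cis)

def gaussian_fit_p_value(data_ci, null_cis):
    """Lower-tail probability under a normal fitted to the null CIs."""
    fitted = NormalDist(mean(null_cis), stdev(null_cis))
    return fitted.cdf(data_ci)

# Made-up simulated CIs and a data CI far below all of them.
null_cis = [0.70, 0.72, 0.68, 0.71, 0.69, 0.73, 0.67, 0.70]
data_ci = 0.55
print(empirical_p_value(data_ci, null_cis))     # prints 0.0
print(gaussian_fit_p_value(data_ci, null_cis))  # tiny, but strictly positive
```

When the data CI lies below every simulated CI, the empirical p-value is exactly 0 and cannot rank how extreme different splits are; the fitted Gaussian tail still gives a number that can be compared across tests.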

SigClust Results for Luminal A vs Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results
Lum & Norm vs Her2 & Basal: p-val = $10^{-19}$
Luminal A vs B: p-val = 0.0045
Her 2 vs Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:

Approximation provides insights

Can find simple underlying structure in complex situations

Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, and a limit in which another parameter tends to 0

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)

[Think I've got it, and then …]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

Thanks to psycnet.apa.org

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$)

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
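A quick simulation makes the distance-to-origin statement tangible. This is a sketch under simple assumptions (standard library random numbers, 50 replications per dimension); the function name `norm_minus_sqrt_d` is illustrative.

```python
import math
import random

def norm_minus_sqrt_d(d, n_rep, rng):
    """Return ||Z|| - sqrt(d) for n_rep draws of Z ~ N_d(0, I_d)."""
    out = []
    for _ in range(n_rep):
        z_norm = math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        out.append(z_norm - math.sqrt(d))
    return out

rng = random.Random(1)
for d in (10, 100, 1000, 10000):
    gaps = norm_minus_sqrt_d(d, n_rep=50, rng=rng)
    spread = max(abs(g) for g in gaps)
    # Columns: dimension, sqrt(d), largest observed | ||Z|| - sqrt(d) |
    print(d, round(math.sqrt(d), 1), round(spread, 2))
```

The radius column grows like the square root of d, while the last column should stay of order 1 however large d gets, which is what the $O_p(1)$ term says.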

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. the radius

Can Show Puts ~ All Weight Near the surface

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
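One way to make this resolution concrete is the short calculation below. It is a sketch in notation that is not on the slides: $r$ for the radius, $\varepsilon$ for a small shell width, and $f_d$ for the density of the radius $\|Z\|$.

```latex
% Unit ball: in spherical coordinates the radial integrand is r^{d-1},
% so the fraction of the volume within radius 1 - \varepsilon is
\frac{\int_0^{1-\varepsilon} r^{d-1}\,dr}{\int_0^{1} r^{d-1}\,dr}
  = (1-\varepsilon)^d \;\longrightarrow\; 0 \quad (d \to \infty).

% Standard normal: the density of the radius \|Z\| is
f_d(r) \;\propto\; r^{d-1} e^{-r^2/2}, \qquad
\frac{d}{dr}\log f_d(r) = \frac{d-1}{r} - r = 0
\;\Longrightarrow\; r = \sqrt{d-1} \approx \sqrt{d}.
```

The factor $r^{d-1}$ is the Lebesgue measure pushing mass outward, the factor $e^{-r^2/2}$ is the density pulling it back in, and the two balance at a radius of about $\sqrt{d}$.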

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
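The pairwise-distance claim is easy to check numerically. The sketch below assumes four independent standard normal points in d = 5000 dimensions; the helper name `gaussian_point` and these sizes are arbitrary choices for illustration.

```python
import math
import random

def gaussian_point(d, rng):
    """One draw from N_d(0, I_d) as a list of d coordinates."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

rng = random.Random(2)
d = 5000
points = [gaussian_point(d, rng) for _ in range(4)]
target = math.sqrt(2 * d)

# Ratio of every pairwise distance to sqrt(2d); all should be close to 1.
for i in range(4):
    for j in range(i + 1, 4):
        ratio = math.dist(points[i], points[j]) / target
        print(i, j, round(ratio, 3))
```

Because every pair sits at nearly the same distance, the points cannot be drawn faithfully in 3 dimensions once there are more than four of them.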

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

(Figure: vectors $Z_1$ and $Z_2$ from the origin. Thanks to members.tripod.com)

$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
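The near-orthogonality can be checked in the same way. This is a sketch with illustrative names (`angle_degrees`) and arbitrary dimensions, not code from the talk.

```python
import math
import random

def angle_degrees(a, b):
    """Angle at the origin between vectors a and b, in degrees."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # guard rounding
    return math.degrees(math.acos(cos))

rng = random.Random(3)
for d in (2, 20, 200, 20000):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    print(d, round(angle_degrees(z1, z2), 1))
```

For small d the angle can land anywhere between 0 and 180 degrees; as d grows the printed values should settle ever closer to 90, with deviations shrinking like $d^{-1/2}$.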

HDLSS Asy's Geometrical Represent'n

Assume $d \to \infty$, let $Z_1, \dots, Z_n \sim N_d(0, I_d)$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's Geometrical Represent'n

Assume $d \to \infty$, let $Z_1, \dots, Z_n \sim N_d(0, I_d)$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
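A numerical stand-in for this picture, as a sketch: rather than rotating each 3 point data set to the screen, it compares the three side lengths of the triangle, which are unchanged by rotation. The function name `triangle_sides` and the choice of three repetitions per dimension are assumptions made for illustration.

```python
import math
import random

def triangle_sides(d, rng):
    """Sorted side lengths of the triangle formed by 3 draws from N_d(0, I_d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    pairs = ((0, 1), (0, 2), (1, 2))
    return sorted(math.dist(pts[i], pts[j]) for i, j in pairs)

rng = random.Random(4)
for d in (2, 20, 200, 20000):
    for _ in range(3):
        sides = triangle_sides(d, rng)
        # Side lengths in units of sqrt(2d); a rigid simplex gives ~[1, 1, 1].
        print(d, [round(s / math.sqrt(2 * d), 2) for s in sides])
```

In low dimensions the triangles vary widely in shape; in high dimensions every repetition should come out close to the same equilateral triangle, leaving only its orientation random.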



Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

SigClust Gaussian null distribution - Simulation

Then compare data CI

With simulated null population CIs

• Spirit similar to DiProPerm
• But now significance happens for smaller values of CI
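The comparison of the data CI with simulated null CIs can be sketched as follows. This is a minimal illustration under stated assumptions, not the SigClust implementation: `two_means_ci` and `sigclust_sketch` are hypothetical helper names, the 2-means fit is a plain Lloyd iteration with a few random starts, and the null eigenvalues are simply the sample covariance eigenvalues (no background noise estimation or thresholding).

```python
import numpy as np

def two_means_ci(x, n_starts=5, n_iter=50, rng=None):
    """2-means cluster index: within-cluster SS / total SS (smaller = more clustered)."""
    rng = np.random.default_rng() if rng is None else rng
    total_ss = ((x - x.mean(axis=0)) ** 2).sum()
    best = 1.0
    for _ in range(n_starts):
        centers = x[rng.choice(len(x), size=2, replace=False)]
        labels = None
        for _ in range(n_iter):
            d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)
            if new_labels.min() == new_labels.max():
                break  # one cluster became empty; keep the previous labels
            if labels is not None and np.array_equal(new_labels, labels):
                break  # converged
            labels = new_labels
            centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
        if labels is not None:
            within_ss = sum(
                ((x[labels == k] - x[labels == k].mean(axis=0)) ** 2).sum()
                for k in (0, 1)
            )
            best = min(best, within_ss / total_ss)
    return float(best)

def sigclust_sketch(x, eigenvalues, n_sim=100, seed=0):
    """Compare the data CI with CIs simulated from N(0, diag(eigenvalues))."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    ci_data = two_means_ci(x, rng=rng)
    sd = np.sqrt(np.asarray(eigenvalues, dtype=float))
    ci_null = np.array([
        two_means_ci(rng.standard_normal((n, d)) * sd, rng=rng)
        for _ in range(n_sim)
    ])
    # Significance happens for SMALL CI, so count null CIs at or below the data CI.
    p_emp = float(np.mean(ci_null <= ci_data))
    return ci_data, ci_null, p_emp

# Toy data (assumption): two well-separated groups in 5 dimensions.
demo_rng = np.random.default_rng(42)
x_demo = demo_rng.standard_normal((40, 5))
x_demo[20:, 0] += 10.0
eig_demo = np.clip(np.linalg.eigvalsh(np.cov(x_demo, rowvar=False)), 0.0, None)
ci_data, ci_null, p_emp = sigclust_sketch(x_demo, eig_demo)
print("data CI:", round(ci_data, 3), "mean null CI:", round(float(ci_null.mean()), 3))
print("empirical p-value:", p_emp)
```

The null is simulated with mean 0 and a diagonal covariance, which relies on the shift and rotation invariance of the CI noted earlier in the talk.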

An example (details to follow)

P-val = 0.0045

SigClust Modalities

Two major applications:

I. Test significance of given clusterings
(e.g. for those found in heat map)
(Use given class labels)

II. Test if known cluster can be further split
(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View – real clusters

Perou 500 DWD Dir'ns View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 0.0045

SigClust Results for Luminal A vs Luminal B

Get p-values from:
• Empirical Quantile
  From simulated sample CIs
• Fit Gaussian Quantile
  Don't “believe these”
  But useful for comparison
  Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters
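The two kinds of p-value discussed above (empirical quantile and Gaussian fit quantile) can be sketched as below. The function names and the toy null sample are assumptions made for illustration; the Gaussian fit here just plugs the mean and standard deviation of the simulated CIs into a normal CDF.

```python
import math
import numpy as np

def empirical_pvalue(ci_data, ci_null):
    """Share of simulated null CIs that are at or below the data CI."""
    return float(np.mean(np.asarray(ci_null) <= ci_data))

def gaussian_fit_pvalue(ci_data, ci_null):
    """Lower-tail probability of the data CI under a normal fit to the null CIs."""
    ci_null = np.asarray(ci_null, dtype=float)
    z = (ci_data - ci_null.mean()) / ci_null.std(ddof=1)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z

# Toy null sample (assumption): 1000 simulated CIs centred at 0.6.
rng = np.random.default_rng(0)
toy_null = rng.normal(loc=0.6, scale=0.02, size=1000)
toy_data_ci = 0.5

p_empirical = empirical_pvalue(toy_data_ci, toy_null)
p_gaussian = gaussian_fit_pvalue(toy_data_ci, toy_null)
print("empirical p-value:", p_empirical)
print("Gaussian fit p-value:", p_gaussian)
```

When no simulated CI falls below the data CI the empirical p-value is exactly 0, and the Gaussian fit value is then the only way to compare strengths of evidence across tests, which is the use suggested on the slide.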

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs Her2 & Basal: p-val = $10^{-19}$
Luminal A vs B: p-val = 0.0045
Her 2 vs Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. Uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  – Laws of Large Numbers (Consistency)
  – Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. Uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Occasional misconceptions:
  – Indicates behavior for large samples
  – Thus only makes sense for “large” samples
  – Models phenomenon of “increasing data”
  – So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• Real Reasons:
  – Approximation provides insights
  – Can find simple underlying structure in complex situations
• Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{d, n \to \infty}$, $\lim_{\ldots \to 0}$
• Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)
[Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From:
Hall, Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$$Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_d \end{pmatrix} \sim N_d(0, I_d)$$

Where are Data?
Near Peak of Density?
(Image thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak):

$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$$

$$\|Z\| = \sqrt{d} + O_p(1)$$
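A quick numerical check of the distance-to-origin statement, as a sketch: the dimensions and the sample size of 200 draws per dimension are arbitrary choices made here, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
norm_summary = {}
for d in (2, 20, 200, 20000):
    z = rng.standard_normal((200, d))      # 200 draws from N_d(0, I_d)
    norms = np.linalg.norm(z, axis=1)      # Euclidean distance to origin
    norm_summary[d] = (float(norms.mean()), float(norms.std()))
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.2f}  "
          f"mean norm={norms.mean():8.2f}  sd of norm={norms.std():.2f}")
```

The mean norm tracks $\sqrt{d}$ while the spread stays of order 1, so relative to the radius the data sit on an ever thinner shell.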

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
  • Consider Volume of Unit Sphere in $\mathbb{R}^d$
  • Find As Integral In Sph'l Coordinates
  • Look At Integrand w.r.t. radius $r$
  • Can Show Puts ~ All Weight Near $r = 1$

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
  • Lebesgue Measure Pushes Mass Out
  • Density Pulls Data In
  • $\sqrt{d}$ Is The Balance Point
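The balance between volume and density can be illustrated numerically. The sketch below assumes the standard facts that the radial density of $\|Z\|$ is proportional to $r^{d-1} e^{-r^2/2}$ and that the share of unit-ball volume inside radius $r$ is $r^d$; the grid search and the radius 0.9 are choices made here for illustration.

```python
import numpy as np

def radial_log_density(r, d):
    # log of r^(d-1) * exp(-r^2 / 2), up to an additive constant:
    # the r^(d-1) factor is the volume element, exp(-r^2/2) the Gaussian density.
    return (d - 1) * np.log(r) - 0.5 * r ** 2

radial_modes = {}
for d in (2, 20, 200, 20000):
    r = np.linspace(0.01, 2 * np.sqrt(d) + 5, 200001)
    radial_modes[d] = float(r[np.argmax(radial_log_density(r, d))])
    inner_share = 0.9 ** d   # share of unit-ball volume within radius 0.9
    print(f"d={d:6d}  radial mode={radial_modes[d]:8.3f}  "
          f"sqrt(d)={np.sqrt(d):8.3f}  volume inside r=0.9: {inner_share:.3g}")
```

The volume factor pushes mass outward, the density factor pulls it inward, and the maximum of their product sits at $\sqrt{d-1}$, essentially $\sqrt{d}$.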

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

“Average People”

Parents Lament:
Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since
$$\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?
(we can only perceive 3 dim'ns)
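The pairwise distance statement, extended to several points, can be checked with a short simulation. This is a sketch; the choice of n = 5 points and the list of dimensions are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
distance_ratios = {}
for d in (2, 20, 200, 20000):
    z = rng.standard_normal((n, d))               # Z_1, ..., Z_n ~ N_d(0, I_d)
    gram = z @ z.T
    sq_norms = np.diag(gram)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
    dist = np.sqrt(np.maximum(sq_dist, 0.0))      # guard against tiny negatives
    pairwise = dist[np.triu_indices(n, k=1)]      # the n(n-1)/2 distinct pairs
    ratios = pairwise / np.sqrt(2 * d)
    distance_ratios[d] = (float(ratios.min()), float(ratios.max()))
    print(f"d={d:6d}  distance / sqrt(2d): min={ratios.min():.3f}  max={ratios.max():.3f}")
```

As $d$ grows, every pairwise distance divided by $\sqrt{2d}$ approaches 1, so all points become nearly equidistant from one another.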

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why?
(we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

(Image thanks to members.tripod.com)

- Everything is orthogonal
- Where do they all go?
  (again our perceptual limitations)
- Again 1st order structure is non-random
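The angle statement can be checked the same way. A minimal sketch, with 200 independent pairs per dimension as an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(2)
angle_summary = {}
for d in (2, 20, 200, 20000):
    z1 = rng.standard_normal((200, d))
    z2 = rng.standard_normal((200, d))
    cosine = (z1 * z2).sum(axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    angles = np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))
    angle_summary[d] = (float(angles.mean()), float(angles.std()))
    print(f"d={d:6d}  mean angle={angles.mean():7.2f} deg  sd={angles.std():6.2f} deg")
```

The spread of the angle around 90 degrees shrinks at the rate $d^{-1/2}$, which is the sense in which independent high dimensional vectors are nearly orthogonal.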

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are “nearly equidistant to 0”, & dist $\sqrt{d}$

Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”

All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)

“Randomness” appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ “regular $n$-hedron”

Again “randomness in data” is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's Geometrical Represen'tion

Simulation View: study “rigidity after rotation”
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make “comparable”
• Repeat 10 times, use different colors

HDLSS Asy's Geometrical Represen'tion

Simulation View: Shows “Rigidity after Rotation”
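The simulation described above can be sketched without plotting. The helper `triangle_in_plane` is a hypothetical name: it moves the first point to the origin and puts the second on the positive x-axis, which is one way (assumed here, possibly including a reflection) to make the triangles “comparable”. Coordinates are rescaled by $\sqrt{2d}$ so that the limiting shape is the unit equilateral triangle.

```python
import numpy as np

def triangle_in_plane(points):
    """2-d coordinates of a 3 point data set within the plane it generates."""
    a, b, c = points
    u = b - a
    len_ab = np.linalg.norm(u)
    e1 = u / len_ab                    # first in-plane axis: direction a -> b
    v = c - a
    x = v @ e1                         # coordinate of c along e1
    y = np.linalg.norm(v - x * e1)     # distance of c from that axis
    return np.array([[0.0, 0.0], [len_ab, 0.0], [x, y]])

equilateral = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

rng = np.random.default_rng(3)
max_dev = {}
for d in (2, 20, 200, 20000):
    devs = []
    for _ in range(10):                # repeat 10 times, as on the slide
        tri = triangle_in_plane(rng.standard_normal((3, d))) / np.sqrt(2 * d)
        devs.append(float(np.abs(tri - equilateral).max()))
    max_dev[d] = max(devs)
    print(f"d={d:6d}  largest deviation from equilateral triangle: {max_dev[d]:.3f}")
```

For small $d$ the ten triangles differ widely; for $d = 20000$ they nearly coincide with the equilateral triangle, which is the rigidity the slide refers to.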

An example (details to follow)

P-val = 00045

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

SigClust Modalities

Two major applications

I Test significance of given clusterings

(eg for those found in heat map)

(Use given class labels)

IITest if known cluster can be further split

(Use 2-means class labels)

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Dir'ns View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 0.0045

SigClust Results for Luminal A vs Luminal B

Get p-values from:

Empirical Quantile
From simulated sample CIs

Fit Gaussian Quantile
Don't "believe these"
But useful for comparison
Especially when Empirical Quantile = 0
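The two kinds of p-value can be sketched as follows. This is a minimal illustration, not the SigClust software: the helper names, the toy data, the single-start Lloyd iteration, and the use of raw sample eigenvalues for the null covariance are all assumptions made for the example.

```python
import math
import numpy as np

def cluster_index(x, labels):
    """2-means cluster index (CI): within-class sum of squares / total sum of squares."""
    total = ((x - x.mean(axis=0)) ** 2).sum()
    within = 0.0
    for k in (0, 1):
        part = x[labels == k]
        if len(part) > 0:
            within += ((part - part.mean(axis=0)) ** 2).sum()
    return within / total

def two_means_labels(x, rng, n_iter=50):
    """Lloyd iterations for k = 2 from one random start (hypothetical helper)."""
    centers = x[rng.choice(len(x), size=2, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        if labels.min() == labels.max():  # one cluster emptied out; stop
            break
        centers = np.array([x[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

def null_cluster_indices(eigvals, n, n_sim, rng):
    """Simulate CIs under a mean-0 Gaussian null with diagonal covariance."""
    scale = np.sqrt(eigvals)
    cis = np.empty(n_sim)
    for s in range(n_sim):
        sim = rng.standard_normal((n, len(eigvals))) * scale
        cis[s] = cluster_index(sim, two_means_labels(sim, rng))
    return cis

def sigclust_pvalues(ci_obs, null_cis):
    """Empirical quantile p-value and Gaussian-fit p-value (small CI = strong clustering)."""
    p_emp = float(np.mean(null_cis <= ci_obs))
    z = (ci_obs - null_cis.mean()) / null_cis.std(ddof=1)
    p_gauss = 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z
    return p_emp, p_gauss

# Toy data (assumed): two well separated classes, n = 40 points in d = 50 dimensions.
rng = np.random.default_rng(0)
n, d = 40, 50
labels = np.repeat([0, 1], n // 2)
x = rng.standard_normal((n, d)) + np.where(labels[:, None] == 0, -3.0, 3.0)

ci_obs = cluster_index(x, labels)  # modality I: use the given class labels
eigvals = np.clip(np.linalg.eigvalsh(np.cov(x, rowvar=False)), 0.0, None)
null_cis = null_cluster_indices(eigvals, n, n_sim=100, rng=rng)
p_emp, p_gauss = sigclust_pvalues(ci_obs, null_cis)
print(ci_obs, p_emp, p_gauss)
```

When no simulated CI falls below the observed one, the empirical p-value is exactly 0, and the Gaussian-fit value is what allows a comparison of strength of evidence.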

SigClust Results for Luminal A vs Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid
But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding
(Huang et al. 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis

Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations

Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)
[Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

[Figure; thanks to psycnet.apa.org]

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$):
(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure
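A quick numerical check of the sphere statement, as a sketch: the sample size, seed, and the name `norm_summary` are arbitrary choices for illustration. It draws standard normal vectors in several dimensions and compares the norms with $\sqrt{d}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # number of simulated vectors per dimension (arbitrary choice)
norm_summary = {}
for d in (2, 20, 200, 20000):
    z = rng.standard_normal((n, d))       # n draws from N_d(0, I_d)
    norms = np.linalg.norm(z, axis=1)     # Euclidean distance to origin
    norm_summary[d] = (norms.mean(), norms.std())
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.3f}  "
          f"mean={norms.mean():8.3f}  sd={norms.std():.3f}")
```

The mean norm tracks $\sqrt{d}$ while the spread stays of order 1, so relative to the radius the data concentrate on the sphere as $d$ grows.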

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. $r$
Can Show: Puts ~ All Weight Near $r = 1$

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
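A worked version of the balance-point argument, as a sketch: it assumes the standard chi density for the radius $R = \|Z\|$ (a textbook fact, not taken from the slides) and writes $r$ for the radial coordinate.

```latex
f_R(r) = \frac{2^{1 - d/2}}{\Gamma(d/2)} \, r^{d-1} \, e^{-r^2/2}, \qquad r > 0
% r^{d-1}: surface area of the sphere of radius r (Lebesgue measure pushes mass out)
% e^{-r^2/2}: Gaussian density (pulls data in)
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\quad \Longrightarrow \quad r = \sqrt{d-1} \approx \sqrt{d}
```

The mode of the radial density sits at $\sqrt{d-1}$, even though the density of $Z$ itself is highest at the origin.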

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence

"Average People"

Parents' Lament:
Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go?
(we can only perceive 3 dim'ns)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why?
(we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):
As Vectors From Origin

[Figure: vectors $Z_1$ and $Z_2$ from the origin; thanks to members.tripod.com]

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go?
  (again our perceptual limitations)
- Again 1st order structure is non-random
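The distance and angle statements for independent standard normal vectors can be checked numerically. This is a sketch only: the number of pairs, the seed, and the name `pair_summary` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 200  # independent (Z1, Z2) pairs per dimension (arbitrary choice)
pair_summary = {}
for d in (2, 20, 200, 20000):
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    # Pairwise Euclidean distance, to be compared with sqrt(2 d)
    dist = np.linalg.norm(z1 - z2, axis=1)
    # Angle between Z1 and Z2 as vectors from the origin, in degrees
    cosang = np.einsum("ij,ij->i", z1, z2) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1))
    angle = np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))
    pair_summary[d] = (dist.mean() / np.sqrt(2 * d), angle.mean(), angle.std())
    print(d, pair_summary[d])
```

As $d$ grows, the scaled distance settles near 1 and the angles pile up at 90 degrees, with spread shrinking like $d^{-1/2}$.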

HDLSS Asy's Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $d \gg n$, let $Z_1, \dots, Z_n \sim N_d(0, I_d)$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices"
(Modulo Rotation)

"Randomness" appears only in rotation of simplex

HDLSS Asy's Geometrical Represent'n

Assume $d \gg n$, let $Z_1, \dots, Z_n \sim N_d(0, I_d)$

Study Hyperplane Generated by Data:

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
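A rotation-free way to see the same rigidity, as a rough sketch rather than the simulation behind the slide: side lengths of the 3 point triangle do not change under rotation, so it is enough to compare them after scaling by $\sqrt{2d}$. The function name, seed, and the max/min side ratio as a summary are assumptions for illustration.

```python
import numpy as np

def scaled_triangle_sides(d, rng):
    """Side lengths of a 3 point N_d(0, I_d) data set, divided by sqrt(2 d)."""
    z = rng.standard_normal((3, d))
    i, j = np.triu_indices(3, k=1)  # the three point pairs (0,1), (0,2), (1,2)
    return np.linalg.norm(z[i] - z[j], axis=1) / np.sqrt(2 * d)

rng = np.random.default_rng(2)
worst_ratio = {}
for d in (2, 20, 200, 20000):
    ratios = []
    for _ in range(10):  # repeat 10 times, as in the simulation view
        sides = scaled_triangle_sides(d, rng)
        ratios.append(sides.max() / sides.min())
    worst_ratio[d] = max(ratios)  # 1.0 would be a perfectly equilateral triangle
    print(d, round(worst_ratio[d], 3))
```

For d = 2 the triangles vary widely in shape; by d = 20000 every replicate is nearly equilateral, so only the rotation is random.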



Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

SigClust Modalities

Two major applications:

I. Test significance of given clusterings

(e.g. for those found in heat map)

(Use given class labels)

II. Test if known cluster can be further split

(Use 2-means class labels)
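The two modalities differ only in which labels are fed to the 2-means cluster index (CI). A minimal Python sketch, assuming CI is the within-class sum of squares divided by the total sum of squares; the names `cluster_index` and `two_means_labels` and the toy data are illustrative assumptions, not taken from the talk or from any SigClust software.

```python
import numpy as np

def cluster_index(X, labels):
    """Within-class sum of squares divided by total sum of squares (assumed CI definition)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = np.sum((X - X.mean(axis=0)) ** 2)
    within = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        within += np.sum((Xk - Xk.mean(axis=0)) ** 2)
    return within / total

def two_means_labels(X, n_iter=100, n_starts=10, seed=0):
    """Hypothetical helper: plain Lloyd iterations with random restarts, keeping the smallest CI."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    best_labels, best_ci = None, np.inf
    for _ in range(n_starts):
        centers = X[rng.choice(len(X), size=2, replace=False)]
        for _ in range(n_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            if len(np.unique(labels)) < 2:
                break  # degenerate split, abandon this start
            new_centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        if len(np.unique(labels)) == 2:
            ci = cluster_index(X, labels)
            if ci < best_ci:
                best_labels, best_ci = labels, ci
    return best_labels

# Toy data (an assumption): two well separated Gaussian classes, n = 40, d = 50
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 50)), rng.normal(3, 1, (20, 50))])
given = np.repeat([0, 1], 20)

ci_given = cluster_index(X, given)                # modality I: given class labels
ci_split = cluster_index(X, two_means_labels(X))  # modality II: 2-means class labels
print(f"CI (given labels) = {ci_given:.3f}, CI (2-means labels) = {ci_split:.3f}")
```

With classes this well separated, 2-means should recover the given labels, so the two CIs should agree; in general the 2-means CI is the smaller of the two.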

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross-study combined data set)

Current folklore: 5 classes: Luminal A, Luminal B, Normal, Her 2, Basal

Perou 500 PCA View - real clusters

Perou 500 DWD Dir'ns View - real clusters

Perou 500 - Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 0.0045

SigClust Results for Luminal A vs Luminal B

Get p-values from:

• Empirical Quantile
  - From simulated sample CIs
• Fit Gaussian Quantile
  - Don't "believe these"
  - But useful for comparison
  - Especially when Empirical Quantile = 0
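The two kinds of p-value can be sketched as follows. This is a minimal Python illustration, assuming that small CI indicates strong clustering, so the p-value is the null probability of a CI at or below the observed one. The function name `sigclust_pvalues` and the stand-in null sample are assumptions made for the example; a real null sample would come from the Gaussian null simulation.

```python
import numpy as np
from math import erfc, sqrt

def sigclust_pvalues(ci_data, ci_null):
    """Return (empirical p-value, Gaussian-fit p-value) for an observed cluster index."""
    ci_null = np.asarray(ci_null, dtype=float)
    # Empirical quantile: share of simulated null CIs at or below the data CI
    p_emp = float(np.mean(ci_null <= ci_data))
    # Gaussian fit: normal CDF at the data CI, using the null sample mean and sd
    z = (ci_data - ci_null.mean()) / ci_null.std(ddof=1)
    p_gauss = 0.5 * erfc(-z / sqrt(2.0))
    return p_emp, p_gauss

# Stand-in for simulated null CIs (assumption, for illustration only)
rng = np.random.default_rng(0)
ci_null = rng.normal(loc=0.80, scale=0.01, size=1000)

p_emp, p_gauss = sigclust_pvalues(0.74, ci_null)
print(f"empirical p-val = {p_emp}, Gaussian fit p-val = {p_gauss:.2e}")
```

When the data CI falls below every simulated CI, the empirical p-value is exactly 0 and carries no information about how extreme the result is; the Gaussian fit still gives a number that can be compared across tests.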

SigClust Results for Luminal A vs Luminal B

I. Test significance of given clusterings

• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II. Test if known cluster can be further split

• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her 2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:

• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:

• Laws of Large Numbers (Consistency)
• Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics:

• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:

• Based on asymptotic analysis
• Real Reasons:
  - Approximation provides insights
  - Can find simple underlying structure
  - In complex situations

Thus various flavors are fine:

$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\cdots \to 0}$

Even desirable! (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)

[Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall, Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

Thanks to psycnet.apa.org

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$):

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
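The distance-to-origin statement can be checked numerically. The following Python sketch, assuming NumPy and arbitrarily chosen dimensions and sample size, draws standard normal vectors and compares $\|Z\|$ with $\sqrt{d}$; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 1000  # number of simulated vectors per dimension (arbitrary choice)
results = {}

for d in (2, 20, 200, 2000):
    Z = rng.standard_normal((n_draws, d))  # each row is one draw from N_d(0, I_d)
    norms = np.linalg.norm(Z, axis=1)      # Euclidean distance to the origin
    results[d] = (norms.mean() / np.sqrt(d), norms.std())
    print(f"d={d:5d}  mean ||Z|| / sqrt(d) = {results[d][0]:.3f}  sd ||Z|| = {results[d][1]:.3f}")
```

The ratio of the mean norm to $\sqrt{d}$ approaches 1, while the standard deviation of the norm stays bounded as $d$ grows, which is what the $O_p(1)$ remainder expresses.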

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density?

- Paradox resolved by:

density w.r.t. Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by:

density w.r.t. Lebesgue Measure

• Consider Volume of Unit Sphere in $\mathbb{R}^d$
• Find As Integral In Sph'l Coordinates
• Look At Integrand w.r.t. $r$ (the radius)
• Can Show: Puts ~ All Weight Near $r = 1$

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by:

density w.r.t. Lebesgue Measure

• Lebesgue Measure Pushes Mass Out
• Density Pulls Data In
• $\sqrt{d}$ Is The Balance Point
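The balance between measure and density can be written out in one line. This is a short derivation sketch, not taken from the slides, using the introduced notation $r = \|Z\|$ and a normalizing constant $c_d$:

```latex
% Density of r = ||Z|| for Z ~ N_d(0, I_d), after integrating out the angles:
f_d(r) = c_d \, r^{d-1} e^{-r^2/2}, \qquad r > 0 .
% r^{d-1}   : surface area of the sphere of radius r (Lebesgue measure pushes mass out)
% e^{-r^2/2}: Gaussian density along the radius (density pulls data in)
\frac{d}{dr} \log f_d(r) = \frac{d-1}{r} - r = 0
\quad \Longrightarrow \quad
r = \sqrt{d-1} \approx \sqrt{d} .
```

The same factor $r^{d-1}$ is the integrand for the volume of the unit sphere, which is why nearly all of that volume sits near $r = 1$ when $d$ is large.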

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
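The extension to several points can be illustrated by simulation. A minimal Python sketch, assuming NumPy, with $n = 5$ points chosen arbitrarily, computes every pairwise distance and rescales it by $\sqrt{2d}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # number of points (arbitrary choice for illustration)
ratios = {}

for d in (2, 20, 200, 20000):
    Z = rng.standard_normal((n, d))        # rows are Z_1, ..., Z_n ~ N_d(0, I_d)
    diffs = Z[:, None, :] - Z[None, :, :]  # all pairwise differences
    dist = np.linalg.norm(diffs, axis=2)   # n x n matrix of Euclidean distances
    pair = dist[np.triu_indices(n, k=1)] / np.sqrt(2 * d)
    ratios[d] = (pair.min(), pair.max())
    print(f"d={d:6d}  distance / sqrt(2d) ranges over [{pair.min():.3f}, {pair.max():.3f}]")
```

For small $d$ the rescaled distances vary widely; for large $d$ all of them are close to 1, so every pair of points is nearly the same distance apart.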

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why?

(we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):

As Vectors From Origin

Thanks to members.tripod.com

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal?

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
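The angle result can be checked in the same way. A minimal Python sketch, assuming NumPy and an arbitrary number of simulated pairs, computes the angle at the origin between independent standard normal vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 500  # number of simulated pairs per dimension (arbitrary choice)
spread = {}

for d in (2, 20, 200, 2000):
    Z1 = rng.standard_normal((n_pairs, d))
    Z2 = rng.standard_normal((n_pairs, d))
    # Cosine of the angle between each pair, viewed as vectors from the origin
    cosines = np.sum(Z1 * Z2, axis=1) / (
        np.linalg.norm(Z1, axis=1) * np.linalg.norm(Z2, axis=1)
    )
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    spread[d] = (angles.mean(), angles.std())
    print(f"d={d:5d}  mean angle = {angles.mean():6.2f} deg  sd = {angles.std():5.2f} deg")
```

The mean angle sits at 90 degrees in every dimension, while the spread around 90 degrees shrinks at roughly the rate $d^{-1/2}$.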

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data?

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
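The simulation steps listed above can be sketched numerically, leaving out the plotting. This is a minimal Python version under stated assumptions: the plane through the 3 points is found with an SVD of the centered data, the in-plane alignment puts the first point on the positive x-axis and reflects if needed so the second point lies above it, and coordinates are rescaled by $\sqrt{2d}$. The function name `triangle_in_screen_plane` and these alignment choices are assumptions, not the original simulation code.

```python
import numpy as np

def triangle_in_screen_plane(Z):
    """Map 3 points in R^d to 2-D coordinates by a distance-preserving motion."""
    centered = Z - Z.mean(axis=0)
    # First two right singular vectors: orthonormal basis of the plane through the points
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ Vt[:2].T
    # Rotate within the plane so the first point lies on the positive x-axis
    theta = np.arctan2(coords[0, 1], coords[0, 0])
    c, s = np.cos(theta), np.sin(theta)
    coords = coords @ np.array([[c, -s], [s, c]])
    # Reflect if needed so the second point lies above the x-axis ("comparable" views)
    if coords[1, 1] < 0:
        coords[:, 1] *= -1
    return coords

rng = np.random.default_rng(0)
max_dev = {}
for d in (2, 20, 200, 20000):
    devs = []
    for _ in range(10):  # repeat 10 times, as in the simulation description
        tri = triangle_in_screen_plane(rng.standard_normal((3, d))) / np.sqrt(2 * d)
        sides = [np.linalg.norm(tri[i] - tri[j]) for i, j in ((0, 1), (0, 2), (1, 2))]
        devs.append(max(abs(side - 1.0) for side in sides))
    max_dev[d] = max(devs)
    print(f"d={d:6d}  largest deviation of a rescaled side length from 1: {max_dev[d]:.3f}")
```

As $d$ grows the ten rescaled triangles approach the same equilateral triangle with unit sides, which is the rigidity the simulation view illustrates; plotting each `tri` in a different color would reproduce the kind of picture the slide describes.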

SigClust Real Data Results

Analyze Perou 500 breast cancer data

(large cross study combined data set)

Current folklore 5 classes Luminal A Luminal B Normal Her 2 Basal

Perou 500 PCA View ndash real clusters

Perou 500 DWD Dirrsquons View ndash real clusters

Perou 500 ndash Fundamental Question

Are Luminal A amp Luminal B really distinct clusters

Famous forFar Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 00045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs Luminal B

I Test significance of given clusteringsbull Empirical p-val = 0

ndash Definitely 2 clusters

bull Gaussian fit p-val = 00045ndash same strong evidence

bull Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II Test if known cluster can be further split

bull Empirical p-val = 0ndash definitely 2 clusters

bull Gaussian fit p-val = 10-10

ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)

bull Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19

Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10

Split Luminal A p-val = 10-7

Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005

SigClust Real Data Results

Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition

(insight about signal vs noise)bull How good are others

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

I.e., uses limiting operations, almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
• Laws of Large Numbers (Consistency)
• Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for “large” samples
• Models phenomenon of “increasing data”
• So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:
• Approximation provides insights
• Can find simple underlying structure in complex situations

Thus various flavors are fine, e.g. $\lim_{d \to \infty}$, $\lim_{n \to \infty}$, $\lim_{d, n \to \infty}$, and limits in which a quantity tends to 0

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations

HDLSS world is…

Surprising (many times)

[Think I’ve got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From Hall, Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim’al Standard Normal dist’n: $Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

[Figure; thanks to psycnet.apa.org]

HDLSS Asymptotics Simple Paradoxes

For $d$-dim’al Standard Normal dist’n: $Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$)

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
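The concentration of $\|Z\|$ around $\sqrt{d}$ can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `norm_summary`, the number of draws and the list of dimensions are illustrative assumptions.

```python
import numpy as np

def norm_summary(d, n_draws=2000, seed=0):
    """Return (mean, sd) of ||Z|| over n_draws independent Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # each row is one draw of Z
    norms = np.linalg.norm(z, axis=1)       # Euclidean distance to the origin
    return norms.mean(), norms.std()

# Illustrative dimensions (an assumption, chosen to keep memory use small).
for d in (2, 20, 200, 2000):
    mean, sd = norm_summary(d)
    print(f"d={d:5d}  sqrt(d)={np.sqrt(d):8.3f}  mean ||Z||={mean:8.3f}  sd ||Z||={sd:.3f}")
```

The mean distance should track $\sqrt{d}$ while the spread stays bounded, which is what the $O_p(1)$ term expresses.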

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph’l Coordinates

Look At Integrand w.r.t. the radius

Can Show: Puts ~ All Weight Near the surface (radius 1)

Lebesgue Measure Pushes Mass Out, Density Pulls Data In, $\sqrt{d}$ Is The Balance Point
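The balance point can be made explicit with a short calculation. This is a standard derivation added for illustration, not taken from the slides; the symbols $R$, $f_R$ and $r^*$ are introduced here. In spherical coordinates the volume element contributes a factor $r^{d-1}$ (mass pushed out), while the Gaussian density contributes $e^{-r^2/2}$ (data pulled in):

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d)
f_R(r) \;\propto\; r^{d-1}\, e^{-r^2/2}, \qquad r > 0 .

% Maximize the log of the radial density
\frac{d}{dr}\Bigl[(d-1)\log r - \tfrac{r^2}{2}\Bigr]
  \;=\; \frac{d-1}{r} - r \;=\; 0
\quad\Longrightarrow\quad
r^{*} \;=\; \sqrt{d-1} \;\approx\; \sqrt{d} .
```

So although the density of $Z$ itself is highest at the origin, the radial density of $\|Z\|$ peaks near $\sqrt{d}$.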

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence

“Average People”

Parents’ Lament:

Why Can’t I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim’al Standard Normal dist’n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim’ns)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim’ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World
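The pairwise-distance claim, including its extension to $Z_1, \ldots, Z_n$, can be illustrated with a small simulation. This is a sketch under assumed settings: the helper name `pairwise_distances`, the sample size `n=5` and the dimensions are illustrative choices, not from the slides.

```python
import numpy as np

def pairwise_distances(d, n=5, seed=0):
    """Return the n*(n-1)/2 pairwise Euclidean distances among n draws from N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    diffs = z[:, None, :] - z[None, :, :]     # n x n x d array of differences
    dist = np.linalg.norm(diffs, axis=2)      # n x n distance matrix
    upper = np.triu_indices(n, k=1)           # each unordered pair once
    return dist[upper]

for d in (2, 20, 200, 20000):
    ratio = pairwise_distances(d) / np.sqrt(2 * d)
    print(f"d={d:6d}  distance / sqrt(2d): min={ratio.min():.3f}  max={ratio.max():.3f}")
```

As $d$ grows, every ratio should approach 1, so all pairs of points end up at nearly the same distance $\sqrt{2d}$.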

HDLSS Asymptotics Simple Paradoxes

For $d$-dim’al Standard Normal dist’n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim’al Angles (as $d \to \infty$), As Vectors From Origin

[Figure: vectors $Z_1$ and $Z_2$ from the origin; thanks to members.tripod.com]

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal?

- Where do they all go? (again our perceptual limitations)

- Again 1st order structure is non-random
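The near-orthogonality of independent Gaussian vectors can be seen in a quick simulation. This is an illustrative sketch; the function name `random_angles_deg`, the number of pairs and the dimensions are assumptions made here.

```python
import numpy as np

def random_angles_deg(d, n_pairs=1000, seed=0):
    """Angles (in degrees) between n_pairs independent pairs Z1, Z2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

for d in (2, 20, 200, 2000):
    angles = random_angles_deg(d)
    print(f"d={d:5d}  mean angle={angles.mean():7.2f}  sd={angles.std():6.2f}")
```

The mean angle stays near 90 degrees for every $d$, while the spread should shrink at roughly the $d^{-1/2}$ rate stated above.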

HDLSS Asy’s Geometrical Represent’n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$

Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”

All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)

“Randomness” appears only in rotation of simplex

Hall, Marron & Neeman (2005)
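One way to see the "unit simplex" statement numerically is through the scaled Gram matrix $Z Z^T / d$, where the rows of $Z$ are the data vectors. If it is close to $I_n$, the vectors $Z_i / \sqrt{d}$ are nearly orthonormal, so some rotation carries them close to the standard basis vectors, which are the vertices of the unit simplex. The sketch below is an illustration added here, not the authors' code; `scaled_gram` and the chosen values of `n` and `d` are assumptions.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Gram matrix Z Z^T / d for n independent rows Z_i ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return (z @ z.T) / d

for d in (20, 2000, 200000):
    g = scaled_gram(3, d)
    err = np.abs(g - np.eye(3)).max()
    print(f"d={d:6d}  max |Gram/d - I| = {err:.4f}")
```

Diagonal entries near 1 correspond to "nearly equidistant to 0" with distance about $\sqrt{d}$, and off-diagonal entries near 0 correspond to near-orthogonality.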

HDLSS Asy’s Geometrical Represent’n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ “regular $n$-hedron”

Again “randomness in data” is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy’s Geometrical Represen’tion

Simulation View: study “rigidity after rotation”
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make “comparable”
• Repeat 10 times, use different colors

HDLSS Asy’s Geometrical Represen’tion

Simulation View: Shows “Rigidity after Rotation”
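The simulation steps listed above can be sketched as follows. This is a hedged reconstruction, not the code behind the slides: the function names are hypothetical, the SVD is one of several ways to find the plane through three points, and the alignment convention (first point on the positive x-axis, second point above it) is an assumption about what "comparable" means. Plotting is omitted; side lengths are reported instead.

```python
import numpy as np

def triangle_in_plane(points):
    """Map 3 points in R^d to 2-d coordinates, preserving pairwise distances."""
    centered = points - points.mean(axis=0)            # centroid moved to the origin
    # Rows of vt give an orthonormal basis; the first two span the plane of the points.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T                       # 3 x 2 coordinates ("plane of screen")
    # Assumed convention: rotate within the plane so point 0 lies on the positive x-axis.
    theta = np.arctan2(coords[0, 1], coords[0, 0])
    c, s = np.cos(-theta), np.sin(-theta)
    rot = np.array([[c, -s], [s, c]])
    coords = coords @ rot.T
    # Assumed convention: reflect if needed so point 1 lies above the x-axis.
    if coords[1, 1] < 0:
        coords[:, 1] *= -1
    return coords

def side_lengths(coords):
    """Lengths of the three sides of the triangle."""
    pairs = ((0, 1), (0, 2), (1, 2))
    return np.array([np.linalg.norm(coords[i] - coords[j]) for i, j in pairs])

rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    ratios = []
    for _ in range(10):                                # 10 repetitions per dimension
        tri = triangle_in_plane(rng.standard_normal((3, d)))
        ratios.append(side_lengths(tri) / np.sqrt(2 * d))
    ratios = np.array(ratios)
    print(f"d={d:6d}  side / sqrt(2d): min={ratios.min():.3f}  max={ratios.max():.3f}")
```

For small $d$ the triangles vary widely between repetitions; for large $d$ the ratios should cluster near 1, so the aligned triangles are nearly the same equilateral triangle. Passing each `tri` to a plotting library with a different color per repetition would give a picture of the kind the slide describes.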


Perou 500 PCA View – real clusters

Perou 500 DWD Dir’ns View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for Far Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 0.0045

SigClust Results for Luminal A vs Luminal B

Get p-values from:
• Empirical Quantile, from simulated sample CIs
• Fit Gaussian Quantile: don’t “believe these”, but useful for comparison, especially when Empirical Quantile = 0
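The two kinds of p-value described above can be sketched in a few lines. This is a minimal illustration, not the SigClust implementation: it assumes that a smaller cluster index (CI) indicates stronger clustering, so the p-value is the lower-tail probability under the simulated null. The function name and all numbers are hypothetical placeholders.

```python
import numpy as np
from statistics import NormalDist

def sigclust_style_pvalues(observed_ci, simulated_cis):
    """Return (empirical, Gaussian-fit) lower-tail p-values for an observed cluster index."""
    simulated_cis = np.asarray(simulated_cis, dtype=float)
    # Empirical quantile: fraction of simulated null CIs at or below the observed CI.
    empirical = float(np.mean(simulated_cis <= observed_ci))
    # Gaussian fit: normal distribution with the mean and sd of the simulated CIs.
    fit = NormalDist(mu=float(simulated_cis.mean()),
                     sigma=float(simulated_cis.std(ddof=1)))
    gaussian = fit.cdf(observed_ci)
    return empirical, gaussian

# Hypothetical placeholder values, standing in for CIs simulated under the Gaussian null.
rng = np.random.default_rng(0)
simulated = rng.normal(loc=0.80, scale=0.02, size=1000)
emp, gauss = sigclust_style_pvalues(0.70, simulated)
print(f"empirical p-val = {emp}, Gaussian fit p-val = {gauss:.2e}")
```

When the observed CI lies below every simulated value, the empirical p-value is exactly 0 and cannot rank results; the Gaussian-fit value remains positive, which is why it is useful for comparison.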

SigClust Results for Luminal A vs Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = 10^-10
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters


Page 39: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

Perou 500 DWD Dir'ns View – real clusters

Perou 500 – Fundamental Question

Are Luminal A & Luminal B really distinct clusters?

Famous for: Far Different Survivability

SigClust Results for Luminal A vs. Luminal B

P-val = 0.0045

SigClust Results for Luminal A vs. Luminal B

Get p-values from:

Empirical Quantile, from simulated sample CIs

Fit Gaussian Quantile: Don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
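The two kinds of p-value on this slide can be sketched in a few lines. This is a minimal illustration, assuming that a smaller cluster index (CI) means stronger clustering, so both p-values are lower-tail probabilities. The function names and the simulated numbers are hypothetical and are not the Perou 500 values.

```python
import numpy as np
from statistics import NormalDist


def empirical_p_value(observed_ci, simulated_cis):
    """Fraction of simulated null CIs that are at or below the observed CI."""
    simulated_cis = np.asarray(simulated_cis, dtype=float)
    return float(np.mean(simulated_cis <= observed_ci))


def gaussian_fit_p_value(observed_ci, simulated_cis):
    """Lower-tail probability of the observed CI under a Gaussian
    fitted to the simulated CIs (sample mean and sample sd)."""
    simulated_cis = np.asarray(simulated_cis, dtype=float)
    mu = float(simulated_cis.mean())
    sigma = float(simulated_cis.std(ddof=1))
    return NormalDist(mu, sigma).cdf(observed_ci)


# Hypothetical null CIs and observed CI, for illustration only.
rng = np.random.default_rng(0)
simulated = rng.normal(loc=0.80, scale=0.02, size=1000)
observed = 0.68

p_emp = empirical_p_value(observed, simulated)
p_gauss = gaussian_fit_p_value(observed, simulated)
print(f"empirical p-val = {p_emp}, Gaussian fit p-val = {p_gauss:.2e}")
```

When the observed CI lies below every simulated CI, the empirical p-value is exactly 0, while the Gaussian fit still gives a positive number that can be compared across tests. That is the use the slide describes for the Gaussian fit.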

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal, p-val = $10^{-19}$

Luminal A vs. B, p-val = 0.0045

Her 2 vs. Basal, p-val = $10^{-10}$

Split Luminal A, p-val = $10^{-7}$

Split Luminal B, p-val = 0.058

Split Her 2, p-val = 0.10

Split Basal, p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

I.e. Uses limiting operations

Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:

Laws of Large Numbers (Consistency)

Central Limit Theorems (Quantify Errors)

Occasional misconceptions:

Indicates behavior for large samples

Thus only makes sense for "large" samples

Models phenomenon of "increasing data"

So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:

Approximation provides insights

Can find simple underlying structure

In complex situations

Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\cdot \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)

[Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

Thanks to: psycnet.apa.org

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$):

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
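The concentration of $\|Z\|$ around $\sqrt{d}$ is easy to check numerically. The following is a small sketch, not taken from the slides; the function name, the number of draws and the seed are arbitrary choices made for illustration.

```python
import numpy as np


def norm_summary(d, n_draws=200, seed=0):
    """Mean and standard deviation of ||Z|| over n_draws samples of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))
    norms = np.linalg.norm(z, axis=1)
    return float(norms.mean()), float(norms.std())


for d in (2, 20, 200, 20000):
    mean_norm, sd_norm = norm_summary(d)
    print(f"d={d:>6}  sqrt(d)={np.sqrt(d):8.3f}  "
          f"mean ||Z||={mean_norm:8.3f}  sd ||Z||={sd_norm:.3f}")
```

The mean of $\|Z\|$ grows like $\sqrt{d}$, while its standard deviation stays bounded, which is the $O_p(1)$ term in the display above.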

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral, In Sph'l Coordinates

Look At Integrand w.r.t. $r$

Can Show: Puts ~ All Weight Near $r = 1$

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
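The balance point can be made explicit with a short calculation. This is a sketch using the standard density of the radius $R = \|Z\|$ (the chi distribution with $d$ degrees of freedom); the symbols $R$, $f_R$ and $r^{*}$ are notation introduced here, not taken from the slides.

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d)
f_R(r) \;=\; \frac{2^{1 - d/2}}{\Gamma(d/2)} \; r^{d-1} \, e^{-r^2/2}, \qquad r > 0 .

% r^{d-1}   : surface-area factor from Lebesgue measure (pushes mass out)
% e^{-r^2/2}: Gaussian density, largest at the origin (pulls data in)
\frac{d}{dr} \log f_R(r) \;=\; \frac{d-1}{r} \;-\; r \;=\; 0
\quad\Longrightarrow\quad
r^{*} \;=\; \sqrt{d-1} \;\approx\; \sqrt{d} .
```

The same factor $r^{d-1}$ explains the unit sphere statement: the fraction of the unit ball's volume inside radius $1 - \varepsilon$ is $(1 - \varepsilon)^d$, which tends to 0 as $d \to \infty$.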

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

(we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):

As Vectors From Origin

[Figure: vectors $Z_1$ and $Z_2$ drawn from the origin. Thanks to: members.tripod.com]

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
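Both the distance and the angle statements can be checked with one pair of simulated vectors. This is an illustrative sketch only; the function name and the seed are assumptions, not part of the slides.

```python
import numpy as np


def distance_and_angle(d, seed=0):
    """Euclidean distance and angle (in degrees) between two independent N_d(0, I_d) vectors."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(d)
    dist = float(np.linalg.norm(z1 - z2))
    cos_angle = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    angle_deg = float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
    return dist, angle_deg


for d in (2, 20, 200, 20000):
    dist, angle = distance_and_angle(d)
    print(f"d={d:>6}  dist/sqrt(2d)={dist / np.sqrt(2 * d):.3f}  angle={angle:7.2f} deg")
```

As $d$ grows, the ratio of the distance to $\sqrt{2d}$ approaches 1 and the angle approaches 90 degrees, matching the two displays above.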

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
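The simulation steps listed above can be sketched as follows. This is one possible reading of those steps, not the code behind the slides: the plane through the 3 points is found by centering and an SVD, the in-plane rotation puts the first point on the positive vertical axis, and the division by $\sqrt{2d}$ is an added normalization so that different dimensions share one scale. All function names are hypothetical.

```python
import numpy as np


def triangle_in_screen_plane(d, rng):
    """Draw 3 points from N_d(0, I_d) and return their 2-d coordinates
    in the plane they span, rotated to a comparable orientation."""
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)            # plane through the 3 points
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    coords = u[:, :2] * s[:2]                # rotate that plane to the "screen"
    # Rotate within the plane so the first point lies on the positive vertical axis.
    theta = np.arctan2(coords[0, 0], coords[0, 1])
    c, sn = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -sn], [sn, c]])
    return (coords @ rotation.T) / np.sqrt(2 * d)


def side_lengths(triangle):
    """The three pairwise distances between the triangle's vertices."""
    pairs = ((0, 1), (0, 2), (1, 2))
    return [float(np.linalg.norm(triangle[i] - triangle[j])) for i, j in pairs]


rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    for _ in range(10):                      # repeat 10 times per dimension
        sides = side_lengths(triangle_in_screen_plane(d, rng))
    print(f"d={d:>6}  last triangle sides / sqrt(2d): "
          + ", ".join(f"{x:.3f}" for x in sides))
```

For small $d$ the side lengths vary widely from draw to draw. For $d = 20000$ all three are close to 1, so every triangle is nearly the same equilateral triangle. A reflection ambiguity remains in this sketch, since the second and third points may swap sides.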

Page 40: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

Perou 500 ndash Fundamental Question

Are Luminal A amp Luminal B really distinct clusters

Famous forFar Different Survivability

SigClust Results for Luminal A vs Luminal B

P-val = 00045

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

SigClust Results for Luminal A vs Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile Donrsquot ldquobelieve theserdquo But useful for comparison Especially when Empirical Quantile = 0

SigClust Results for Luminal A vs Luminal B

I Test significance of given clusteringsbull Empirical p-val = 0

ndash Definitely 2 clusters

bull Gaussian fit p-val = 00045ndash same strong evidence

bull Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II Test if known cluster can be further split

bull Empirical p-val = 0ndash definitely 2 clusters

bull Gaussian fit p-val = 10-10

ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)

bull Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19

Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10

Split Luminal A p-val = 10-7

Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005

SigClust Real Data Results

Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition

(insight about signal vs noise)bull How good are others

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

Personal Observations

HDLSS world ishellip

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquond

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Where are Data

Near Peak of Density

Thanks to psycnetapaorg

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

(measure how close to peak)

d

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

)1(pOdZ

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:

- Data lie roughly on surface of sphere with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find as Integral in Sph'l Coordinates

Look at Integrand w.r.t. $r$

Can Show: Puts ~ All Weight Near $r = 1$

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Lebesgue Measure Pushes Mass Out, Density Pulls Data In, $\sqrt{d}$ Is The Balance Point
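The balance between Lebesgue measure and density can be written out explicitly. The short derivation below is a sketch; the symbols $V_d$ (volume of the unit ball), $A_{d-1}$ (surface area of the unit sphere), $\varepsilon$ and $f_R$ (density of $R = \|Z\|$) are notation assumed here, not taken from the slides.

```latex
% Volume of the unit ball in R^d, integrated in spherical coordinates:
V_d = \int_0^1 A_{d-1}\, r^{d-1}\, dr = \frac{A_{d-1}}{d},
\qquad
\frac{\text{volume within radius } 1-\varepsilon}{V_d} = (1-\varepsilon)^d \to 0
\quad (d \to \infty).

% Radial density of R = \|Z\| for Z \sim N_d(0, I_d):
f_R(r) \propto r^{d-1} e^{-r^2/2},
\qquad
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\;\Longrightarrow\; r = \sqrt{d-1} \approx \sqrt{d}.
```

The factor $r^{d-1}$ is the Lebesgue measure pushing mass outward, the factor $e^{-r^2/2}$ is the density pulling data inward, and their product peaks near $\sqrt{d}$.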

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$)

Distance tends to non-random constant:

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
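A quick numerical check of the pairwise-distance claim, extended to $Z_1, \dots, Z_n$: the sketch below draws $n$ independent standard normal vectors and reports every pairwise distance divided by $\sqrt{2d}$. It uses only the Python standard library; the function names, $n = 5$ and the seed are assumptions chosen for illustration.

```python
import math
import random

def gaussian_vector(d, rng):
    """One draw from N_d(0, I_d) as a list of d independent N(0, 1) values."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def euclidean_distance(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pairwise_distance_ratios(d, n=5, seed=0):
    """All pairwise distances among n draws from N_d(0, I_d), divided by sqrt(2d)."""
    rng = random.Random(seed)
    points = [gaussian_vector(d, rng) for _ in range(n)]
    return [euclidean_distance(points[i], points[j]) / math.sqrt(2 * d)
            for i in range(n) for j in range(i + 1, n)]

if __name__ == "__main__":
    # As d grows, every ratio should move towards 1.
    for d in (2, 20, 200, 20000):
        ratios = pairwise_distance_ratios(d)
        print(f"d={d:6d}  min ratio={min(ratios):.3f}  max ratio={max(ratios):.3f}")
```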

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), with $Z_1$ and $Z_2$ as Vectors From Origin

Thanks to members.tripod.com

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

- Everything is orthogonal

- Where do they all go? (again our perceptual limitations)

- Again 1st order structure is non-random
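The near-orthogonality follows from the norm result in a few lines. The derivation below is a sketch; $Z_{1j}$ and $Z_{2j}$ denote the $j$-th entries of $Z_1$ and $Z_2$ (notation assumed here), and angles are written in radians, so $\pi/2$ stands for $90^\circ$.

```latex
\cos\bigl(\mathrm{Angle}(Z_1, Z_2)\bigr)
  = \frac{\langle Z_1, Z_2 \rangle}{\|Z_1\|\,\|Z_2\|},
\qquad
\langle Z_1, Z_2 \rangle = \sum_{j=1}^{d} Z_{1j} Z_{2j} = O_p(d^{1/2}),
\qquad
\|Z_1\|\,\|Z_2\| = d + O_p(d^{1/2}).

% The inner product has mean 0 and variance d, so the cosine is O_p(d^{-1/2}).
% Since \arccos(x) = \pi/2 - x + O(x^3) near x = 0:
\mathrm{Angle}(Z_1, Z_2) = \frac{\pi}{2} + O_p(d^{-1/2}).
```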

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist. $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d}\ \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist. $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d}\ \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represent'n

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represent'n

Simulation View: Shows "Rigidity after Rotation"
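The simulation steps listed above can be sketched without plotting. The code below is an illustration under stated assumptions, not the code behind the original figures: the helper names, the rescaling by $1/\sqrt{2d}$ and the seed are choices made here. The 3 points are "rotated to the plane of the screen" by placing the first at the origin and the second on the positive x-axis, which fixes the third through the three pairwise distances.

```python
import math
import random

def gaussian_vector(d, rng):
    """One draw from N_d(0, I_d) as a list of d independent N(0, 1) values."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def dist(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def triangle_in_plane(d, rng):
    """Planar coordinates of a 3 point data set from N_d(0, I_d), scaled by 1/sqrt(2d).

    First point goes to the origin, second onto the positive x-axis,
    third into the upper half plane (this makes repeated draws comparable).
    """
    a, b, c = (gaussian_vector(d, rng) for _ in range(3))
    scale = math.sqrt(2 * d)
    ab, ac, bc = dist(a, b) / scale, dist(a, c) / scale, dist(b, c) / scale
    # Law of cosines gives the x-coordinate of the third point.
    x = (ab ** 2 + ac ** 2 - bc ** 2) / (2 * ab)
    y = math.sqrt(max(ac ** 2 - x ** 2, 0.0))
    return [(0.0, 0.0), (ab, 0.0), (x, y)]

if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        print(f"d = {d}")
        for _ in range(10):  # 10 repetitions, as on the slide
            _, (bx, _by), (cx, cy) = triangle_in_plane(d, rng)
            print(f"  second vertex x={bx:.3f}, third vertex=({cx:.3f}, {cy:.3f})")
```

For large $d$ the printed triangles should all be close to the equilateral triangle with vertices $(0, 0)$, $(1, 0)$ and $(1/2, \sqrt{3}/2)$, which is the "rigidity after rotation".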


SigClust Results for Luminal A vs. Luminal B

p-val = 0.0045

SigClust Results for Luminal A vs. Luminal B

Get p-values from:

Empirical Quantile (from simulated sample CIs)

Fit Gaussian Quantile: don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
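The two kinds of p-value can be sketched as follows. This is a minimal illustration, not the SigClust implementation: it assumes that a smaller cluster index (CI) means stronger clustering, so both p-values are lower-tail quantities, and the function names and the example numbers are hypothetical.

```python
import math
import random
import statistics

def empirical_p_value(observed_ci, simulated_cis):
    """Fraction of simulated null CIs that are at or below the observed CI."""
    return sum(ci <= observed_ci for ci in simulated_cis) / len(simulated_cis)

def gaussian_fit_p_value(observed_ci, simulated_cis):
    """Lower-tail probability of the observed CI under a normal fitted to the simulated CIs."""
    mu = statistics.mean(simulated_cis)
    sigma = statistics.stdev(simulated_cis)
    z = (observed_ci - mu) / sigma
    # Standard normal CDF written with erfc, which stays accurate far in the lower tail.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

if __name__ == "__main__":
    # Hypothetical null CIs; in practice these come from 2-means on simulated Gaussian data.
    rng = random.Random(0)
    simulated = [rng.gauss(0.80, 0.02) for _ in range(1000)]
    observed = 0.70  # hypothetical observed CI
    print("empirical p-value:   ", empirical_p_value(observed, simulated))
    print("Gaussian fit p-value:", gaussian_fit_p_value(observed, simulated))
```

When the observed CI falls below every simulated CI, the empirical p-value is exactly 0 and carries no information about how extreme the result is, while the Gaussian fit still gives a small positive number that can be compared across tests.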

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split
• Empirical p-val = 0
  – definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results:

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations:

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold: Anti-Conservative

Problem Fixed by Soft Thresholding (Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis

Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations

Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n,d \to \infty}$, $\lim_{(\cdot) \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)

[Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant


SigClust Results for Luminal A vs. Luminal B

Get p-values from Empirical Quantile

From simulated sample CIs

Fit Gaussian Quantile: Don't "believe these", but useful for comparison, especially when Empirical Quantile = 0
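The two kinds of p-value can be sketched as follows. This is a minimal illustration, not the SigClust implementation: the function name `sigclust_pvalues`, the observed CI of 0.65, and the simulated null CIs are all hypothetical stand-ins for the output of the null simulation. It assumes that a smaller CI means stronger clustering, so both p-values are lower-tail quantities.

```python
import random
from statistics import NormalDist, mean, stdev

def sigclust_pvalues(observed_ci, simulated_cis):
    """Return (empirical, Gaussian-fit) p-values for an observed cluster index.

    Smaller CI means tighter clusters, so both p-values measure how often
    the null simulation gives a CI at least as small as the observed one.
    """
    # Empirical quantile: fraction of simulated null CIs <= observed CI.
    n_below = sum(1 for ci in simulated_cis if ci <= observed_ci)
    empirical = n_below / len(simulated_cis)
    # Gaussian fit: fit a normal to the simulated CIs, read off its lower tail.
    fitted = NormalDist(mean(simulated_cis), stdev(simulated_cis))
    gaussian = fitted.cdf(observed_ci)
    return empirical, gaussian

# Hypothetical null CIs standing in for the simulated sample CIs.
rng = random.Random(0)
null_cis = [rng.gauss(0.80, 0.02) for _ in range(1000)]
emp_p, gauss_p = sigclust_pvalues(0.65, null_cis)
print(emp_p, gauss_p)
```

When the observed CI lies below every simulated CI, the empirical p-value is exactly 0 and carries no information about how extreme the result is; the Gaussian-fit value is still positive and can be compared across tests.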

SigClust Results for Luminal A vs. Luminal B

I. Test significance of given clusterings

• Empirical p-val = 0
  - Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  - same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs. Luminal B

II. Test if known cluster can be further split

• Empirical p-val = 0
  - definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  - Stronger evidence than above
  - Such comparison is value of Gaussian fit
  - Makes sense (since CI is min possible)
• Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust Results

Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her 2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her 2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  - Laws of Large Numbers (Consistency)
  - Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:

• Approximation provides insights
• Can find simple underlying structure in complex situations

Thus various flavors are fine:

$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n,d \to \infty}$, $\lim_{\,\cdot\, \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations

HDLSS world is…

Surprising (many times)

[Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For d-dim'al Standard Normal dist'n:

$$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$$

Where are Data?

Near Peak of Density?

(Image thanks to psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For d-dim'al Standard Normal dist'n:

$$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$$

Euclidean Distance to Origin (as $d \to \infty$)

(measure how close to peak):

$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$$

$$\|Z\| = \sqrt{d} + O_p(1)$$
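The distance-to-origin behavior can be checked numerically. The sketch below is an illustration, not part of the original slides; the helper name `gaussian_norms`, the sample size of 20, and the seed are arbitrary choices. It draws standard normal vectors in several dimensions and records the mean and standard deviation of their Euclidean norms.

```python
import math
import random
from statistics import mean, stdev

def gaussian_norms(d, n_samples, rng):
    """Euclidean norms of n_samples draws from N_d(0, I_d)."""
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        for _ in range(n_samples)
    ]

rng = random.Random(1)
results = {}
for d in (2, 20, 200, 20000):
    norms = gaussian_norms(d, 20, rng)
    # The mean norm grows like sqrt(d); the spread stays bounded.
    results[d] = (mean(norms), stdev(norms))
    print(d, round(math.sqrt(d), 2), results[d])
```

The mean norm should track $\sqrt{d}$ while the standard deviation stays of order 1, which is the $O_p(1)$ term.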

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. radius $r$

Can Show: Puts ~ All Weight Near $r = 1$

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
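The balance can be made explicit with a standard calculation, given here as a sketch rather than as the derivation used in the talk. The symbols $r$ (radius) and $f_{\|Z\|}$ (density of the norm) are notation introduced for this illustration. In spherical coordinates the volume element contributes a factor $r^{d-1}$ (Lebesgue measure pushing mass out), while the Gaussian density contributes $e^{-r^2/2}$ (pulling data in):

```latex
% Density of the radius r = \|Z\| for Z \sim N_d(0, I_d)
f_{\|Z\|}(r) \;\propto\; r^{d-1} \, e^{-r^2/2}, \qquad r > 0 .

% Maximize the log-density to find where the two effects balance
\frac{d}{dr}\Bigl[(d-1)\log r - \tfrac{r^2}{2}\Bigr]
  \;=\; \frac{d-1}{r} - r \;=\; 0
\quad\Longrightarrow\quad
r \;=\; \sqrt{d-1} \;\approx\; \sqrt{d} .
```

So the most likely radius is about $\sqrt{d}$ even though the density of $Z$ itself is highest at the origin.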

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For d-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant:

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
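A quick numerical check of the pairwise distance claim, offered as an illustrative sketch (the helper name `pairwise_distances`, the choice of n = 5 points, and the seed are assumptions made here). It compares every pairwise distance in a small Gaussian sample to $\sqrt{2d}$.

```python
import math
import random
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distances between all pairs of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

rng = random.Random(2)
d, n = 20000, 5
sample = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
distances = pairwise_distances(sample)
target = math.sqrt(2 * d)
# Each ratio should be close to 1 when d is large.
ratios = [dist / target for dist in distances]
print(ratios)
```

All n(n-1)/2 ratios being near 1 at once is the "extend to $Z_1, \ldots, Z_n$" statement: every pair is roughly the same distance apart.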

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For d-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$)

As Vectors From Origin

(Figure: vectors $Z_1$ and $Z_2$ from the origin; image thanks to members.tripod.com)

HDLSS Asymptotics: Simple Paradoxes

For d-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
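The near-orthogonality can also be seen in a small simulation. This is a sketch for illustration only; `angle_degrees` is a helper defined here, and the dimensions and seed are arbitrary. It computes the angle at the origin between two independent standard normal vectors.

```python
import math
import random

def angle_degrees(u, v):
    """Angle between vectors u and v (from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_theta = dot / (norm_u * norm_v)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

rng = random.Random(3)
angles = {}
for d in (2, 20, 200, 20000):
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    angles[d] = angle_degrees(z1, z2)
    print(d, round(angles[d], 2))
```

In low dimensions the angle can be anywhere between 0 and 180 degrees; as d grows it should settle close to 90, with fluctuations shrinking like $d^{-1/2}$.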

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist ~ $\sqrt{2d}$

Points lie at vertices of $\sqrt{2d}$ × "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
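A rough stand-in for this simulation is sketched below. It is not the code behind the slides: instead of rotating each 3-point data set into the plane of the screen, it compares the triangle's side lengths, which do not change under rotation. The helper `scaled_triangle_sides`, the scaling by $\sqrt{2d}$, and the seed are choices made for this illustration.

```python
import math
import random
from itertools import combinations

def scaled_triangle_sides(d, rng):
    """Sorted side lengths of a 3-point N_d(0, I_d) data set, over sqrt(2d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    scale = math.sqrt(2 * d)
    return sorted(math.dist(p, q) / scale for p, q in combinations(pts, 2))

rng = random.Random(4)
spread = {}
for d in (2, 20, 200, 20000):
    # Repeat 10 times, as in the simulation described on the slide.
    triangles = [scaled_triangle_sides(d, rng) for _ in range(10)]
    sides = [s for tri in triangles for s in tri]
    # Max deviation of any scaled side from 1, i.e. from the equilateral limit.
    spread[d] = max(abs(s - 1.0) for s in sides)
    print(d, round(spread[d], 3))
```

A spread near 0 means all ten triangles are nearly the same equilateral triangle after rescaling, so the only remaining randomness is in how each triangle is rotated.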


HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

SigClust Results for Luminal A vs Luminal B

I. Test significance of given clusterings:
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = 0.0045
  – Same strong evidence
• Conclude these really are two clusters

SigClust Results for Luminal A vs Luminal B

II. Test if known cluster can be further split:
• Empirical p-val = 0
  – Definitely 2 clusters
• Gaussian fit p-val = $10^{-10}$
  – Stronger evidence than above
  – Such comparison is value of Gaussian fit
  – Makes sense (since CI is min possible)
• Conclude these really are two clusters
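
The two kinds of p-value quoted on these slides can be sketched as follows. This is only an illustration, assuming the empirical p-value is the fraction of simulated null cluster indices (CIs) at or below the observed CI, and the Gaussian fit p-value is the lower tail of a normal distribution fitted to the simulated CIs. The function name `sigclust_pvalues` and the toy null values are hypothetical, not taken from the talk.

```python
import math
import numpy as np

def sigclust_pvalues(observed_ci, null_cis):
    """Return (empirical, Gaussian-fit) p-values from simulated null CIs."""
    null_cis = np.asarray(null_cis, dtype=float)
    # Empirical p-value: share of simulated CIs at least as small as observed.
    p_emp = float(np.mean(null_cis <= observed_ci))
    # Gaussian fit: standardize against the simulated mean and sd, then take
    # the lower normal tail, Phi(z) = 0.5 * erfc(-z / sqrt(2)).
    z = (observed_ci - null_cis.mean()) / null_cis.std(ddof=1)
    p_gauss = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return p_emp, p_gauss

# Hypothetical simulated null CIs, with an observed CI below all of them.
null_cis = np.linspace(0.6, 0.8, 101)
p_emp, p_gauss = sigclust_pvalues(0.5, null_cis)
print(p_emp, p_gauss)
```

When the observed CI lies below every simulated value, the empirical p-value is exactly 0, while the Gaussian fit still gives a small positive number. That is what makes p-values such as $10^{-10}$ and 0.0045 comparable in strength.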

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
Luminal A vs. B: p-val = 0.0045
Her2 vs. Basal: p-val = $10^{-10}$
Split Luminal A: p-val = $10^{-7}$
Split Luminal B: p-val = 0.058
Split Her2: p-val = 0.10
Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Workhorse Method for Much Insight:
  – Laws of Large Numbers (Consistency)
  – Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• I.e. uses limiting operations
• Almost always $\lim_{n \to \infty}$
• Occasional misconceptions:
  – Indicates behavior for large samples
  – Thus only makes sense for “large” samples
  – Models phenomenon of “increasing data”
  – So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
• Based on asymptotic analysis
• Real Reasons:
  – Approximation provides insights
  – Can find simple underlying structure in complex situations
• Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{(\cdot) \to 0}$
• Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…
• Surprising (many times)
  [Think I’ve got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim’al Standard Normal dist’n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

(Image thanks to psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim’al Standard Normal dist’n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$), a measure of how close to peak:

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
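
The statement $\|Z\| = \sqrt{d} + O_p(1)$ can be checked numerically. The sketch below is not from the talk: the seed, the number of draws and the helper name `norm_deviation` are arbitrary choices made for illustration. It looks at $\|Z\| - \sqrt{d}$ for growing $d$ and should show a spread that stays bounded while $\sqrt{d}$ itself grows.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

def norm_deviation(d, n_draws=200):
    """Return ||Z|| - sqrt(d) for n_draws independent N_d(0, I_d) vectors."""
    z = rng.standard_normal((n_draws, d))
    return np.linalg.norm(z, axis=1) - np.sqrt(d)

for d in (2, 20, 200, 20000):
    dev = norm_deviation(d)
    # The radius sqrt(d) grows, but the deviation from it does not.
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.2f}  "
          f"mean dev={dev.mean():+.3f}  sd dev={dev.std():.3f}")
```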

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
  Consider Volume of Unit Sphere in $\mathbb{R}^d$
  Find As Integral In Sph’l Coordinates
  Look At Integrand w.r.t. the radius
  Can Show: Puts ~ All Weight Near the Boundary (radius 1)

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure
  Lebesgue Measure Pushes Mass Out
  Density Pulls Data In
  $\sqrt{d}$ Is The Balance Point
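
The balance can be written out as a short worked calculation. The notation here is an assumption added for illustration and does not appear on the slides: $r$ is the radial coordinate, $\omega_{d-1}$ the surface area of the unit sphere in $\mathbb{R}^d$, and $R = \|Z\|$.

```latex
% Volume of the unit ball in spherical coordinates: the integrand in r is r^{d-1}.
\mathrm{Vol}\{x \in \mathbb{R}^d : \|x\| \le 1\}
  = \omega_{d-1} \int_0^1 r^{d-1}\,\mathrm{d}r
  = \frac{\omega_{d-1}}{d},
\qquad
\frac{\mathrm{Vol}\{\|x\| \le 1-\varepsilon\}}{\mathrm{Vol}\{\|x\| \le 1\}}
  = (1-\varepsilon)^{d} \longrightarrow 0 \quad (d \to \infty).

% Same change of variables for Z ~ N_d(0, I_d): the radius R = ||Z|| has density
f_R(r) \propto r^{d-1} e^{-r^{2}/2},
\qquad
\partial_r \log f_R(r) = \frac{d-1}{r} - r = 0
\iff r = \sqrt{d-1} \approx \sqrt{d}.
```

The factor $r^{d-1}$ is the Lebesgue measure term pushing mass outward, $e^{-r^2/2}$ is the density pulling it in, and the two balance at a radius of about $\sqrt{d}$.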

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: “Average People”

Parents Lament: Why Can’t I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim’al Standard Normal dist’n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim’ns)
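
A small numerical check of the pairwise-distance claim, extended to $n$ points. This is a sketch under assumed settings: $n = 5$, $d = 20000$ and the seed are arbitrary, and `pairwise_distances` is a hypothetical helper. Every pairwise distance, divided by $\sqrt{2d}$, should come out close to 1.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, chosen arbitrarily

def pairwise_distances(z):
    """Euclidean distances between all distinct pairs of rows of z."""
    i, j = np.triu_indices(z.shape[0], k=1)
    return np.linalg.norm(z[i] - z[j], axis=1)

d, n = 20000, 5
z = rng.standard_normal((n, d))   # rows are Z_1, ..., Z_n ~ N_d(0, I_d)
dist = pairwise_distances(z)      # n * (n - 1) / 2 distances
ratio = dist / np.sqrt(2 * d)     # compare with the limiting value sqrt(2d)
print(np.round(ratio, 4))
```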

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim’ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim’al Standard Normal dist’n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim’al Angles (as $d \to \infty$), As Vectors From Origin:

(Image of vectors $Z_1$ and $Z_2$ from the origin; thanks to members.tripod.com)

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
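
The angle result can be illustrated the same way. In this sketch the helper names, the seed and the number of pairs are assumptions made for illustration. The angles between independent Gaussian vectors should cluster ever more tightly around 90 degrees as $d$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, chosen arbitrarily

def angle_degrees(u, v):
    """Angle between vectors u and v (taken from the origin), in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def angle_sample(d, n_pairs=100):
    """Angles for n_pairs independent pairs of N_d(0, I_d) vectors."""
    return np.array([
        angle_degrees(rng.standard_normal(d), rng.standard_normal(d))
        for _ in range(n_pairs)
    ])

for d in (2, 20, 200, 20000):
    a = angle_sample(d)
    print(f"d={d:6d}  mean angle={a.mean():7.2f}  sd={a.std():6.2f}")
```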

HDLSS Asy’s: Geometrical Represent’n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are “nearly equidistant to 0”, & dist $\sqrt{d}$
Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)
“Randomness” appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy’s: Geometrical Represent’n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:
$(n-1)$-dimensional hyperplane
Points are pairwise equidistant, dist $\sim \sqrt{2d}$
Points lie at vertices of $\sqrt{2d} \times$ “regular $n$-hedron”
Again “randomness in data” is only in rotation
Surprisingly rigid structure in random data

HDLSS Asy’s: Geometrical Represent’n

Simulation View: study “rigidity after rotation”
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make “comparable”
• Repeat 10 times, use different colors

HDLSS Asy’s: Geometrical Represent’n

Simulation View: Shows “Rigidity after Rotation”
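
The simulation steps listed above could be reproduced along these lines. This is a minimal sketch, not the code behind the slides. It assumes the plane is found from an SVD of the centered points, and that “comparable” means rotating each triangle so its first point lies on the positive x-axis. The function names and the seed are hypothetical, and plotting is left out.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, chosen arbitrarily

def triangle_on_screen(d):
    """Draw 3 points from N_d(0, I_d); return their 3 x 2 in-plane coordinates."""
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)
    # Rows of vt[:2] are an orthonormal basis of the 2-d plane through the points.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    xy = centered @ vt[:2].T          # rotate that plane to the "screen"
    # Rotate within the plane so the first point sits on the positive x-axis.
    theta = np.arctan2(xy[0, 1], xy[0, 0])
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, s], [-s, c]])  # rotation by -theta
    return xy @ rot.T

def side_lengths(xy):
    """Lengths of the three sides of the triangle with vertices xy."""
    pairs = ((0, 1), (0, 2), (1, 2))
    return np.array([np.linalg.norm(xy[a] - xy[b]) for a, b in pairs])

for d in (2, 20, 200, 20000):
    # Repeat 10 times; sides are scaled by sqrt(2d) so a rigid triangle gives ~1.
    scaled = np.array([side_lengths(triangle_on_screen(d)) / np.sqrt(2 * d)
                       for _ in range(10)])
    print(f"d={d:6d}  scaled sides: min={scaled.min():.3f}  max={scaled.max():.3f}")
```

For small $d$ the scaled side lengths vary widely from draw to draw. For $d = 20000$ they should all be close to 1, so the ten triangles nearly coincide after rotation, which is the rigidity described on the slide.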

SigClust Results for Luminal A vs Luminal B

II Test if known cluster can be further split

bull Empirical p-val = 0ndash definitely 2 clusters

bull Gaussian fit p-val = 10-10

ndash Stronger evidence than abovendash Such comparison is value of Gaussian fitndash Makes sense (since CI is min possible)

bull Conclude these really are two clusters

SigClust Real Data Results

Summary of Perou 500 SigClust ResultsLum amp Norm vs Her2 amp Basal p-val = 10-19

Luminal A vs B p-val = 00045Her 2 vs Basal p-val = 10-10

Split Luminal A p-val = 10-7

Split Luminal B p-val = 0058Split Her 2 p-val = 010Split Basal p-val = 0005

SigClust Real Data Results

Summary of Perou 500 SigClust Resultsbull All previous splits were realbull Most not able to split furtherbull Exception is Basal already knownbull Chuck Perou has good intuition

(insight about signal vs noise)bull How good are others

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

Personal Observations

HDLSS world ishellip

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics

HDLSS Asymptotics: Simple Paradoxes

Study ideas from Hall, Marron and Neeman (2005)

For the $d$-dimensional standard normal distribution:

$$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$$

Where are the data? Near the peak of the density? (Image: thanks to psycnet.apa.org)

Euclidean distance to origin, as $d \to \infty$ (measures how close to the peak):

$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$$

$$\|Z\| = \sqrt{d} + O_p(1)$$
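The concentration of the norm can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names, the seed, and the number of draws are illustrative assumptions. It draws vectors from $N_d(0, I_d)$ and compares the average norm with $\sqrt{d}$, while the spread of the norm stays of order 1.

```python
import math
import random


def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def norm_summary(d, n_draws=200, seed=0):
    """Sample mean and standard deviation of ||Z|| over n_draws draws.

    n_draws and seed are illustrative choices, not values from the talk.
    """
    rng = random.Random(seed)
    norms = [norm_of_standard_normal(d, rng) for _ in range(n_draws)]
    mean = sum(norms) / n_draws
    sd = math.sqrt(sum((x - mean) ** 2 for x in norms) / (n_draws - 1))
    return mean, sd


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = norm_summary(d)
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.2f}  mean={mean:7.2f}  sd={sd:5.2f}")
```

As $d$ grows, the mean should track $\sqrt{d}$ while the standard deviation stays bounded, which is the $O_p(1)$ statement in numerical form.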

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:

- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density is w.r.t. Lebesgue measure

Consider the volume of the unit sphere in $\mathbb{R}^d$:

- Find it as an integral in spherical coordinates
- Look at the integrand w.r.t. the radius
- Can show it puts ~ all weight near the surface

Lebesgue measure pushes mass out, density pulls data in, and $\sqrt{d}$ is the balance point
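The balance point can be made explicit with a short calculation. This is a worked sketch added for illustration, not taken from the slides; $f_d$ and $r$ are notation introduced here. In spherical coordinates the density of the radius $r = \|Z\|$ is the product of a surface-area factor (from Lebesgue measure) and the Gaussian density factor:

```latex
f_d(r) \;\propto\; \underbrace{r^{\,d-1}}_{\text{Lebesgue measure pushes mass out}}
\;\times\; \underbrace{e^{-r^2/2}}_{\text{density pulls data in}}, \qquad r > 0 .

\frac{d}{dr} \log f_d(r) \;=\; \frac{d-1}{r} - r \;=\; 0
\quad\Longrightarrow\quad r \;=\; \sqrt{d-1} \;\approx\; \sqrt{d} .
```

So the radial density peaks at approximately $\sqrt{d}$, even though the density of $Z$ itself is highest at the origin.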

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important philosophical consequence: "Average People"

Parents' lament: Why can't I have average children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For the $d$-dimensional standard normal distribution: $Z_1 \sim N_d(0, I_d)$, independent of $Z_2 \sim N_d(0, I_d)$

Euclidean distance between $Z_1$ and $Z_2$, as $d \to \infty$:

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$ (for independent $X_1$, $X_2$)

Can extend to $Z_1, \dots, Z_n$

Where do they all go? (we can only perceive 3 dimensions)
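The pairwise-distance claim, including its extension to $n$ points, can be illustrated with a small simulation. This is a minimal sketch under assumed names and settings (the helper functions and the seed are hypothetical, not from the talk): every pairwise distance among $n$ Gaussian draws is divided by $\sqrt{2d}$, so values close to 1 indicate near-equidistant points.

```python
import math
import random


def gaussian_vector(d, rng):
    """One draw from N_d(0, I_d) as a list of floats."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def distance(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise_distance_ratios(n, d, seed=0):
    """All pairwise distances among n draws, each divided by sqrt(2d)."""
    rng = random.Random(seed)
    pts = [gaussian_vector(d, rng) for _ in range(n)]
    scale = math.sqrt(2 * d)
    return [
        distance(pts[i], pts[j]) / scale
        for i in range(n)
        for j in range(i + 1, n)
    ]


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        ratios = pairwise_distance_ratios(5, d)
        print(f"d={d:6d}  min={min(ratios):.3f}  max={max(ratios):.3f}")
```

For small $d$ the ratios vary widely. For large $d$ they should all be close to 1.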

HDLSS Asymptotics: Simple Paradoxes

Ever wonder why (we can only perceive 3 dimensions)?

o Perceptual system from ancestors
o They needed to find food
o Food exists in 3-d world

HDLSS Asymptotics: Simple Paradoxes

For the $d$-dimensional standard normal distribution: $Z_1 \sim N_d(0, I_d)$, independent of $Z_2 \sim N_d(0, I_d)$

High dimensional angles (as vectors from the origin), as $d \to \infty$:

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

(Image: thanks to members.tripod.com)

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
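The near-orthogonality of independent Gaussian vectors is also easy to check numerically. The sketch below is an illustration only; the function names and seed are assumptions, not from the slides. It computes the angle between two independent draws from $N_d(0, I_d)$ for increasing $d$.

```python
import math
import random


def angle_degrees(x, y):
    """Angle in degrees between vectors x and y, viewed from the origin."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against rounding just outside the acos domain.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.degrees(math.acos(cosine))


def random_angle(d, rng):
    """Angle between two independent draws from N_d(0, I_d)."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return angle_degrees(z1, z2)


if __name__ == "__main__":
    rng = random.Random(0)  # illustrative seed
    for d in (2, 20, 200, 20000):
        angles = [random_angle(d, rng) for _ in range(5)]
        print(f"d={d:6d}  angles={[round(a, 1) for a in angles]}")
```

The deviations from 90 degrees should shrink at roughly the rate $d^{-1/2}$.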

HDLSS Asymptotics: Geometrical Representation

Hall, Marron & Neeman (2005)

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study subspace generated by data:

- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", at distance $\sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex vertices" (modulo rotation)
- "Randomness" appears only in rotation of simplex

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study hyperplane generated by data:

- $n - 1$ dimensional hyperplane
- Points are pairwise equidistant, dist $\sim \sqrt{2d}$
- Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
- Again "randomness in data" is only in rotation
- Surprisingly rigid structure in random data

HDLSS Asymptotics: Geometrical Representation

Simulation view: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation view: shows "Rigidity after Rotation"
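The simulation steps listed above can be sketched in code. This is a hedged illustration, not the author's implementation: the function names, the seed, and the specific way of making triangles "comparable" (first point at the origin, second point on the positive horizontal axis, third point in the upper half plane, everything scaled by $\sqrt{2d}$) are assumptions made here. Under the geometric representation, large $d$ should give triangles close to the equilateral triangle with vertices $(0, 0)$, $(1, 0)$, $(1/2, \sqrt{3}/2)$.

```python
import math
import random


def sub(x, y):
    return [a - b for a, b in zip(x, y)]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def triangle_in_screen_plane(d, rng):
    """Draw 3 points from N_d(0, I_d); return 2-d coordinates of the triangle.

    The plane through the 3 points is mapped to the screen by Gram-Schmidt:
    point 0 goes to the origin and point 1 to the positive x-axis. Point 2 is
    placed in the upper half plane (a reflection may be used, an assumption
    made here to make triangles comparable). Coordinates are divided by
    sqrt(2d).
    """
    p0, p1, p2 = ([rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3))
    u = sub(p1, p0)
    w = sub(p2, p0)
    len_u = math.sqrt(dot(u, u))
    e1 = [a / len_u for a in u]                  # first in-plane axis
    x2 = dot(w, e1)                              # coordinate of p2 along e1
    resid = [a - x2 * b for a, b in zip(w, e1)]  # part of w orthogonal to e1
    y2 = math.sqrt(dot(resid, resid))            # coordinate along second axis
    scale = math.sqrt(2 * d)
    return [(0.0, 0.0), (len_u / scale, 0.0), (x2 / scale, y2 / scale)]


if __name__ == "__main__":
    rng = random.Random(0)  # illustrative seed
    for d in (2, 20, 200, 20000):
        print(f"d = {d}")
        for _ in range(10):  # 10 repetitions, as in the slide
            _, (x1, _), (x2, y2) = triangle_in_screen_plane(d, rng)
            print(f"  vertex 1 = ({x1:.3f}, 0.000)   vertex 2 = ({x2:.3f}, {y2:.3f})")
```

Plotting the ten triangles per dimension in different colors would reproduce the kind of view the slide describes. The printed coordinates already show the triangles varying widely at $d = 2$ and nearly coinciding at $d = 20000$.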


SigClust Real Data Results

Summary of Perou 500 SigClust results:

• Lum & Norm vs. Her2 & Basal: p-val = $10^{-19}$
• Luminal A vs. B: p-val = 0.0045
• Her 2 vs. Basal: p-val = $10^{-10}$
• Split Luminal A: p-val = $10^{-7}$
• Split Luminal B: p-val = 0.058
• Split Her 2: p-val = 0.10
• Split Basal: p-val = 0.005

SigClust Real Data Results

Summary of Perou 500 SigClust results:

• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with other data sets: similar

- Smaller data sets: less power
- Gene filtering: more power
- Lung Cancer: more distinct clusters

SigClust Real Data Results

Some personal observations:

- Experienced analysts impressively good
- SigClust can save them time
- SigClust can help them with skeptics
- SigClust essential for non-experts

SigClust Overview

- Works well when factor part not used
- Sample eigenvalues always valid, but can be too conservative
- Above factor threshold: anti-conservative
- Problem fixed by soft thresholding (Huang et al. 2014)

SigClust Open Problems

- Improved eigenvalue estimation
- More attention to local minima in 2-means clustering
- Theoretical null distributions
- Inference for k > 2 means clustering
- Multiple comparison issues

HDLSS Asymptotics

Modern Mathematical Statistics: based on asymptotic analysis

- I.e. uses limiting operations
- Almost always $\lim_{n \to \infty}$
- Workhorse method for much insight: Laws of Large Numbers (consistency), Central Limit Theorems (quantify errors)

Occasional misconceptions:

- Indicates behavior for large samples
- Thus only makes sense for "large" samples
- Models phenomenon of "increasing data"
- So other flavors are useless



SigClust Real Data Results

Summary of Perou 500 SigClust Results:
• All previous splits were real
• Most not able to split further
• Exception is Basal, already known
• Chuck Perou has good intuition (insight about signal vs. noise)
• How good are others?

SigClust Real Data Results

Experience with Other Data Sets: Similar

Smaller data sets: less power

Gene filtering: more power

Lung Cancer: more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations, almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
• Laws of Large Numbers (Consistency)
• Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations, almost always $\lim_{n \to \infty}$

Occasional misconceptions:
• Indicates behavior for large samples
• Thus only makes sense for “large” samples
• Models phenomenon of “increasing data”
• So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:
• Approximation provides insights
• Can find simple underlying structure in complex situations
• Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n,d \to \infty}$, $\lim_{\sigma \to 0}$
• Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…

• Surprising (many times) [Think I’ve got it, and then …]

• Mathematically Beautiful (?)

• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

For $d$-dim’al Standard Normal dist’n:

$$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$$

Where are Data? Near Peak of Density?

(Image: thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$), i.e. measure how close to peak:

$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$$

$$\|Z\| = \sqrt{d} + O_p(1)$$
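The concentration of the distance to the origin around $\sqrt{d}$ can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `norm_summary`, the number of draws, and the chosen dimensions are illustrative assumptions.

```python
import numpy as np

def norm_summary(d, n_draws=200, seed=0):
    """Return (mean, sd) of ||Z|| over n_draws samples of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # each row is one draw of Z
    norms = np.linalg.norm(z, axis=1)       # Euclidean distance to the origin
    return norms.mean(), norms.std()

# The mean grows like sqrt(d) while the spread stays of order 1.
for d in (2, 20, 200, 20000):
    mean, sd = norm_summary(d)
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.2f}  mean ||Z||={mean:8.2f}  sd={sd:.2f}")
```

The spread of $\|Z\|$ staying bounded while its center moves out like $\sqrt{d}$ is what the $O_p(1)$ term expresses.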

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:

- Data lie roughly on surface of sphere with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$:
• Find As Integral In Sph’l Coordinates
• Look At Integrand w.r.t. the radius $r$
• Can Show Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out, Density Pulls Data In, $\sqrt{d}$ Is The Balance Point
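The balance between volume and density can be made concrete with the radial density of a standard normal vector, which is proportional to $r^{d-1} e^{-r^2/2}$: the factor $r^{d-1}$ comes from the volume element in spherical coordinates and pushes mass out, while $e^{-r^2/2}$ pulls it in. The sketch below is an illustration added here, not taken from the slides; `radial_mode` and the grid settings are assumptions.

```python
import numpy as np

def radial_mode(d, grid_size=200001):
    """Numerically locate the mode of the radial density r^(d-1) * exp(-r^2 / 2)."""
    r = np.linspace(1e-6, 2.0 * np.sqrt(d) + 5.0, grid_size)
    # Work on the log scale, since r^(d-1) overflows for large d.
    # (d - 1) * log(r): volume term, increasing in r.
    # -r^2 / 2: Gaussian density term, decreasing in r.
    log_density = (d - 1) * np.log(r) - 0.5 * r**2
    return r[np.argmax(log_density)]

for d in (2, 20, 200, 20000):
    print(f"d={d:6d}  mode={radial_mode(d):8.3f}  sqrt(d-1)={np.sqrt(d - 1):8.3f}")
```

Setting the derivative of the log density to zero gives the mode $r = \sqrt{d-1}$, which is essentially $\sqrt{d}$ for large $d$.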

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: “Average People”

Parents’ Lament: Why Can’t I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim’al Standard Normal dist’n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $\operatorname{sd}(X_1 - X_2) = \sqrt{\operatorname{sd}(X_1)^2 + \operatorname{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim’ns)
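A quick numerical check of the pairwise distance result: dividing the distance between two independent standard normal vectors by $\sqrt{2d}$ should give a ratio close to 1 for large $d$. This is a sketch under the stated Gaussian assumptions; `scaled_distance` and the seeds are illustrative choices, not from the slides.

```python
import numpy as np

def scaled_distance(d, seed=0):
    """Draw independent Z1, Z2 ~ N_d(0, I_d) and return ||Z1 - Z2|| / sqrt(2d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(d)
    return np.linalg.norm(z1 - z2) / np.sqrt(2 * d)

# Five independent pairs per dimension; the ratios tighten around 1 as d grows.
for d in (2, 20, 200, 20000):
    ratios = [scaled_distance(d, seed=s) for s in range(5)]
    print(f"d={d:6d}  ratios={np.round(ratios, 3)}")
```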

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim’ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim’al Standard Normal dist’n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim’al Angles (as $d \to \infty$), As Vectors From Origin:

(Image: vectors $Z_1$ and $Z_2$ from the origin; thanks to members.tripod.com)

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

- Everything is orthogonal

- Where do they all go? (again our perceptual limitations)

- Again 1st order structure is non-random
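The near-orthogonality of independent Gaussian vectors can be seen directly by computing the angle between two draws. The following sketch is illustrative only; `angle_degrees` is a name introduced here, and the dimensions mirror those used elsewhere in the slides.

```python
import numpy as np

def angle_degrees(d, seed=0):
    """Angle (in degrees) between independent Z1, Z2 ~ N_d(0, I_d), as vectors from the origin."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(d)
    cosine = z1 @ z2 / (np.linalg.norm(z1) * np.linalg.norm(z2))
    # Clip guards against tiny floating-point excursions outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

# Angles scatter widely for small d and settle near 90 degrees for large d.
for d in (2, 20, 200, 20000):
    angles = [angle_degrees(d, seed=s) for s in range(5)]
    print(f"d={d:6d}  angles={np.round(angles, 1)}")
```

The cosine of the angle is an average-like quantity with standard deviation about $d^{-1/2}$, which is where the $O_p(d^{-1/2})$ rate comes from.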

HDLSS Asy’s: Geometrical Represent’n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

• Hyperplane through 0, of dimension $n$

• Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$

• Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”

• All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)

• “Randomness” appears only in rotation of simplex

HDLSS Asy’s: Geometrical Represent’n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

• $n - 1$ dimensional hyperplane

• Points are pairwise equidistant, dist $\sim \sqrt{2d}$

• Points lie at vertices of $\sqrt{2d} \times$ “regular $n$-hedron”

• Again “randomness in data” is only in rotation

• Surprisingly rigid structure in random data
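One way to see the rigid simplex structure is through the scaled Gram matrix $Z Z^t / d$ of $n$ Gaussian points: it is rotation invariant, and if it is close to the identity then all points have length near $\sqrt{d}$, are nearly orthogonal, and are pairwise about $\sqrt{2d}$ apart. The sketch below is an added illustration under those assumptions; `scaled_gram` and `pairwise_ratios` are hypothetical helper names.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Return Z Z^t / d for n points Z_i ~ N_d(0, I_d), stored as the rows of Z."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return z @ z.T / d

def pairwise_ratios(n, d, seed=0):
    """Pairwise distances divided by sqrt(2d), recovered from the Gram matrix."""
    g = scaled_gram(n, d, seed)
    diag = np.diag(g)
    # ||Zi - Zj||^2 / d = g_ii + g_jj - 2 g_ij
    sq = np.maximum(diag[:, None] + diag[None, :] - 2.0 * g, 0.0)
    upper = np.triu_indices(n, k=1)
    return np.sqrt(sq[upper] / 2.0)

for d in (2, 20, 200, 20000):
    ratios = pairwise_ratios(4, d)
    print(f"d={d:6d}  min ratio={ratios.min():.3f}  max ratio={ratios.max():.3f}")
```

Because only the Gram matrix is used, the check ignores the random rotation and looks only at the shape of the point cloud.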

HDLSS Asy’s: Geometrical Representation

Simulation View: study “rigidity after rotation”
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make “comparable”
• Repeat 10 times, use different colors

Simulation View: Shows “Rigidity after Rotation”
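The simulation steps listed above can be sketched as follows. This is one possible reading of the recipe, not the code behind the original figures: the centering step, the QR-based choice of plane coordinates, the alignment convention, and the names `triangle_on_screen` and `side_lengths` are all assumptions, and plotting in different colors is left out.

```python
import numpy as np

def triangle_on_screen(d, rng):
    """3 Gaussian points in R^d, expressed in 2-d coordinates of their own plane."""
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)          # the plane now passes through the origin
    # Orthonormal basis of the 2-dim'al plane containing the centered points.
    q, _ = np.linalg.qr(centered[:2].T)    # q has shape (d, 2)
    xy = centered @ q                      # rotate the plane to the "screen"
    # Rotate within the plane so the first point lies on the positive x-axis.
    theta = np.arctan2(xy[0, 1], xy[0, 0])
    c, s = np.cos(theta), np.sin(theta)
    xy = xy @ np.array([[c, s], [-s, c]]).T
    # Reflect if needed so the second point has non-negative y ("comparable").
    if xy[1, 1] < 0:
        xy[:, 1] *= -1
    return xy / np.sqrt(2 * d)             # limiting side length becomes 1

def side_lengths(xy):
    """The three side lengths of the triangle with vertices in the rows of xy."""
    return np.array([np.linalg.norm(xy[i] - xy[j]) for i, j in ((0, 1), (0, 2), (1, 2))])

rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    # Repeat 10 times per dimension, as in the slide recipe.
    sides = np.array([side_lengths(triangle_on_screen(d, rng)) for _ in range(10)])
    print(f"d={d:6d}  side / sqrt(2d): min={sides.min():.3f}  max={sides.max():.3f}")
```

For small $d$ the ten triangles differ widely in shape; for large $d$ they are all close to the same equilateral triangle, which is the rigidity the slide refers to.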

Page 48: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

SigClust Real Data Results

Experience with Other Data Sets Similar

Smaller data sets less power

Gene filtering more power

Lung Cancer more distinct clusters

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

Personal Observations

HDLSS world ishellip

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquond

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Where are Data

Near Peak of Density

Thanks to psycnetapaorg

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

(measure how close to peak)

d

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

)1(pOdZ

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

- Paradox resolved by

density w r t Lebesgue Measure

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt Can Show Puts ~ All Weight Near

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

Theorem Impossible (over many factors)

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

d dd INZ 0~21Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

(as )

Distance tends to non-random constant

d

d

dd INZ 0~2

)1(221 pOdZZ

1Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

)1(221 pOdZZ

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

As Vectors From Origin

Thanks to memberstripodcom

d

d

dd INZ 0~21Z

1198851

119885 2

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distribut'n
  • SigClust Gaussian null distribut'n (2)
  • SigClust Gaussian null distribut'n (3)
  • SigClust Gaussian null distribut'n (4)
  • SigClust Gaussian null distribut'n (5)
  • SigClust Gaussian null distribut'n (6)
  • SigClust Gaussian null distribut'n (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distribut'n (8)
  • SigClust Estimation of Eigenval's
  • SigClust Estimation of Eigenval's (2)
  • SigClust Estimation of Eigenval's (3)
  • SigClust Estimation of Eigenval's (4)
  • SigClust Estimation of Eigenval's (5)
  • SigClust Gaussian null distribution - Simulation
  • SigClust Gaussian null distribution - Simulation (2)
  • An example (details to follow)
  • SigClust Modalities
  • SigClust Modalities (2)
  • SigClust Real Data Results
  • Perou 500 PCA View – real clusters
  • Perou 500 DWD Dir'ns View – real clusters
  • Perou 500 – Fundamental Question
  • SigClust Results for Luminal A vs Luminal B
  • SigClust Results for Luminal A vs Luminal B (2)
  • SigClust Results for Luminal A vs Luminal B (3)
  • SigClust Results for Luminal A vs Luminal B (4)
  • SigClust Results for Luminal A vs Luminal B (5)
  • SigClust Real Data Results (2)
  • SigClust Real Data Results (3)
  • SigClust Real Data Results (4)
  • SigClust Real Data Results (5)
  • SigClust Overview
  • SigClust Overview (2)
  • SigClust Overview (3)
  • SigClust Overview (4)
  • SigClust Open Problems
  • HDLSS Asymptotics
  • HDLSS Asymptotics (2)
  • HDLSS Asymptotics (3)
  • HDLSS Asymptotics (4)
  • HDLSS Asymptotics (5)
  • HDLSS Asymptotics (6)
  • HDLSS Asymptotics (7)
  • HDLSS Asymptotics (8)
  • HDLSS Asymptotics (9)
  • HDLSS Asymptotics (10)
  • HDLSS Asymptotics (11)
  • HDLSS Asymptotics (12)
  • HDLSS Asymptotics (13)
  • HDLSS Asymptotics (14)
  • HDLSS Asymptotics (15)
  • HDLSS Asymptotics (16)
  • HDLSS Asymptotics Simple Paradoxes
  • HDLSS Asymptotics Simple Paradoxes (2)
  • HDLSS Asymptotics Simple Paradoxes (3)
  • HDLSS Asymptotics Simple Paradoxes (4)
  • HDLSS Asymptotics Simple Paradoxes (5)
  • HDLSS Asymptotics Simple Paradoxes (6)
  • HDLSS Asymptotics Simple Paradoxes (7)
  • HDLSS Asymptotics Simple Paradoxes (8)
  • HDLSS Asymptotics Simple Paradoxes (9)
  • HDLSS Asymptotics Simple Paradoxes (10)
  • HDLSS Asymptotics Simple Paradoxes (11)
  • HDLSS Asymptotics Simple Paradoxes (12)
  • HDLSS Asymptotics Simple Paradoxes (13)
  • HDLSS Asymptotics Simple Paradoxes (14)
  • HDLSS Asymptotics Simple Paradoxes (15)
  • HDLSS Asymptotics Simple Paradoxes (16)
  • HDLSS Asymptotics Simple Paradoxes (17)
  • HDLSS Asymptotics Simple Paradoxes (18)
  • HDLSS Asymptotics Simple Paradoxes (19)
  • HDLSS Asymptotics Simple Paradoxes (20)
  • HDLSS Asymptotics Simple Paradoxes (21)
  • HDLSS Asymptotics Simple Paradoxes (22)
  • HDLSS Asymptotics Simple Paradoxes (23)
  • HDLSS Asymptotics Simple Paradoxes (24)
  • HDLSS Asymptotics Simple Paradoxes (25)
  • HDLSS Asymptotics Simple Paradoxes (26)
  • HDLSS Asymptotics Simple Paradoxes (27)
  • HDLSS Asymptotics Simple Paradoxes (28)
  • HDLSS Asymptotics Simple Paradoxes (29)
  • HDLSS Asymptotics Simple Paradoxes (30)
  • HDLSS Asymptotics Simple Paradoxes (31)
  • HDLSS Asy's Geometrical Represent'n
  • HDLSS Asy's Geometrical Represent'n (2)
  • HDLSS Asy's Geometrical Represent'n (3)
  • HDLSS Asy's Geometrical Represent'n (4)
  • HDLSS Asy's Geometrical Represent'n (5)
  • HDLSS Asy's Geometrical Represent'n (6)
  • HDLSS Asy's Geometrical Represent'n (7)
  • HDLSS Asy's Geometrical Represent'n (8)
  • HDLSS Asy's Geometrical Represent'n (9)
  • HDLSS Asy's Geometrical Represent'n (10)
  • HDLSS Asy's Geometrical Represent'n (11)
  • HDLSS Asy's Geometrical Represen'tion
  • HDLSS Asy's Geometrical Represen'tion (2)
  • HDLSS Asy's Geometrical Represen'tion (3)
  • HDLSS Asy's Geometrical Represen'tion (4)
  • HDLSS Asy's Geometrical Represen'tion (5)
  • HDLSS Asy's Geometrical Represen'tion (6)

SigClust Real Data Results

Some Personal Observations

Experienced Analysts Impressively Good

SigClust can save them time

SigClust can help them with skeptics

SigClust essential for non-experts

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al. 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

I.e. Uses limiting operations, Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:

• Laws of Large Numbers (Consistency)

• Central Limit Theorems (Quantify Errors)

Occasional misconceptions:

• Indicates behavior for large samples

• Thus only makes sense for "large" samples

• Models phenomenon of "increasing data"

• So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:

• Approximation provides insights

• Can find simple underlying structure

• In complex situations

Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\cdot \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)

[Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

(Image thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$)

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
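
A quick numerical check of the distance-to-origin claim, as a minimal sketch in Python with NumPy. The function name `norm_summary`, the number of draws and the seed are illustrative assumptions, not part of the slides. It draws standard normal vectors in several dimensions and compares the average length with $\sqrt{d}$; the spread of the lengths should stay bounded as $d$ grows.

```python
import numpy as np


def norm_summary(d, n_draws=1000, seed=0):
    """Mean and standard deviation of ||Z|| over draws of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # each row is one d-dim'al draw
    norms = np.linalg.norm(z, axis=1)       # Euclidean distance to the origin
    return norms.mean(), norms.std()


for d in (2, 20, 200, 2000):
    mean, sd = norm_summary(d)
    print(f"d={d:5d}  sqrt(d)={np.sqrt(d):8.3f}  mean ||Z||={mean:8.3f}  sd={sd:.3f}")
```

The mean length tracks $\sqrt{d}$ while the standard deviation does not grow with $d$, which is the "surface of a sphere" picture.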

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. Radius $r$

Can Show Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
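
The balance between volume and density can be written out as a short derivation. This is a sketch under the assumption that $R = \|Z\|$ for $Z \sim N_d(0, I_d)$; the symbols $R$, $r$, $f_R$, $B_d$ and $\omega_{d-1}$ are introduced here for illustration and do not appear on the slides.

```latex
% Volume of the unit ball B_d in R^d, in spherical coordinates
% (\omega_{d-1} denotes the surface area of the unit sphere):
\[
\mathrm{Vol}(B_d) = \int_0^1 \omega_{d-1}\, r^{d-1}\, \mathrm{d}r,
\qquad
\frac{\mathrm{Vol}(r B_d)}{\mathrm{Vol}(B_d)} = r^{d} \to 0
\quad (d \to \infty,\ 0 \le r < 1).
\]
% So the integrand r^{d-1} puts almost all of its weight near r = 1.

% The same factor r^{d-1} enters the density of the radius R = \|Z\|:
\[
f_R(r) \propto r^{d-1} e^{-r^2/2}, \qquad r > 0.
\]
% Volume pushes mass out (r^{d-1}), the Gaussian density pulls it in (e^{-r^2/2}).
% Maximising the log density gives the balance point:
\[
\frac{\mathrm{d}}{\mathrm{d}r}\Big[(d-1)\log r - \tfrac{r^2}{2}\Big]
= \frac{d-1}{r} - r = 0
\;\Longrightarrow\;
r = \sqrt{d-1} \approx \sqrt{d}.
\]
```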

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
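
The pairwise-distance statement, including its extension to $n$ points, can be checked numerically. The following is a minimal sketch in Python with NumPy; the function name `pairwise_distances` and the choices $n = 10$ and the seed are assumptions made for illustration. It computes all pairwise Euclidean distances among $n$ independent standard normal vectors and divides by $\sqrt{2d}$.

```python
import numpy as np


def pairwise_distances(n, d, seed=0):
    """All n(n-1)/2 pairwise Euclidean distances among n draws from N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    sq_norms = np.sum(z ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for every pair at once
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (z @ z.T)
    upper = np.triu_indices(n, k=1)          # each unordered pair once
    return np.sqrt(np.maximum(sq_dist[upper], 0.0))


for d in (2, 20, 200, 20000):
    ratio = pairwise_distances(10, d) / np.sqrt(2 * d)
    print(f"d={d:6d}  dist / sqrt(2d): min={ratio.min():.3f}  max={ratio.max():.3f}")
```

For small $d$ the ratios vary widely; for large $d$ every pair sits close to 1, so all points are nearly equidistant.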

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors

o They Needed to Find Food

o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin

[Figure: vectors $Z_1$ and $Z_2$ from the origin; image thanks to members.tripod.com]

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
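
A small simulation of the high-dimensional angle claim, as a sketch in Python with NumPy. The function name `angle_summary`, the number of draws and the seed are illustrative assumptions. It measures the angle at the origin between two independent standard normal vectors and reports how tightly it concentrates around 90 degrees.

```python
import numpy as np


def angle_summary(d, n_draws=1000, seed=0):
    """Mean and sd (in degrees) of the angle between independent Z1, Z2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_draws, d))
    z2 = rng.standard_normal((n_draws, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # clip guards against rounding slightly outside [-1, 1]
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angles.mean(), angles.std()


for d in (2, 20, 200, 2000):
    mean, sd = angle_summary(d)
    print(f"d={d:5d}  mean angle={mean:7.2f} deg  sd={sd:6.2f} deg")
```

The standard deviation shrinks roughly like $d^{-1/2}$, in line with the $O_p(d^{-1/2})$ rate.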

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
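
One way to see the "near Unit Simplex Vertices" statement numerically is through the scaled Gram matrix of the data. The sketch below, in Python with NumPy, uses a hypothetical helper `scaled_gram`; the sample size $n = 5$ and the seed are assumptions for illustration. If $Z Z^T / d$ is close to the identity, then the $n$ points have length close to $\sqrt{d}$ and are nearly orthogonal, which is exactly the configuration of $\sqrt{d}$ times the unit simplex vertices $e_1, \ldots, e_n$, up to rotation.

```python
import numpy as np


def scaled_gram(n, d, seed=0):
    """Gram matrix Z Z^T / d for n draws from N_d(0, I_d), stored as rows of Z."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return z @ z.T / d


for d in (2, 20, 200, 20000):
    g = scaled_gram(5, d)
    err = np.max(np.abs(g - np.eye(5)))   # largest deviation from the identity
    print(f"d={d:6d}  max |ZZ'/d - I| = {err:.3f}")
```

The diagonal entries carry the squared lengths divided by $d$, and the off-diagonal entries carry the inner products, so a single matrix summarises both the equidistance to 0 and the near-orthogonality.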

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets

• In dimensions d = 2, 20, 200, 20000

• Generate hyperplane of dimension 2

• Rotate that to plane of screen

• Rotate within plane, to make "comparable"

• Repeat 10 times, use different colors

HDLSS Asy's Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
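
The simulation steps listed above can be sketched as follows in Python with NumPy. This is not the code behind the slides: the function names `triangle_in_plane` and `side_ratios`, the alignment rule (first point on the positive horizontal axis, second point in the upper half plane) and the seed are all assumptions, and plotting in different colors is left out. Coordinates within the 2-dimensional hyperplane are obtained from the Gram matrix of the centred points, which preserves all pairwise distances.

```python
import numpy as np


def triangle_in_plane(d, rng):
    """Draw 3 points from N_d(0, I_d); return their 3 x 2 coordinates in the
    plane they generate, rotated to a comparable orientation."""
    x = rng.standard_normal((3, d))
    centered = x - x.mean(axis=0)
    # Gram matrix of centred points has rank <= 2; its top two eigenpairs
    # give exact planar coordinates (the "rotate to plane of screen" step).
    gram = centered @ centered.T
    vals, vecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    coords = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0.0))
    # Rotate within the plane so the first point lies on the positive x-axis.
    theta = np.arctan2(coords[0, 1], coords[0, 0])
    c, s = np.cos(-theta), np.sin(-theta)
    rot = np.array([[c, -s], [s, c]])
    out = coords @ rot.T
    # Reflect if needed so the second point is in the upper half plane.
    if out[1, 1] < 0:
        out[:, 1] *= -1.0
    return out


def side_ratios(d, n_rep=10, seed=0):
    """Side lengths of n_rep simulated triangles, divided by sqrt(2d)."""
    rng = np.random.default_rng(seed)
    pairs = ((0, 1), (0, 2), (1, 2))
    ratios = []
    for _ in range(n_rep):
        tri = triangle_in_plane(d, rng)
        sides = [np.linalg.norm(tri[i] - tri[j]) for i, j in pairs]
        ratios.append(np.array(sides) / np.sqrt(2 * d))
    return np.array(ratios)


for d in (2, 20, 200, 20000):
    r = side_ratios(d)
    print(f"d={d:6d}  side / sqrt(2d): min={r.min():.3f}  max={r.max():.3f}")
```

For $d = 2$ the ten triangles differ a great deal; for $d = 20000$ every side ratio is close to 1, so after rotation the triangles are nearly the same equilateral triangle.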

Page 50: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

SigClust Overview

Works Well When Factor Part Not Used

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

Personal Observations

HDLSS world ishellip

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquond

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Where are Data

Near Peak of Density

Thanks to psycnetapaorg

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

(measure how close to peak)

d

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

)1(pOdZ

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

- Paradox resolved by

density w r t Lebesgue Measure

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt Can Show Puts ~ All Weight Near

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

Theorem Impossible (over many factors)

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

d dd INZ 0~21Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

(as )

Distance tends to non-random constant

d

d

dd INZ 0~2

)1(221 pOdZZ

1Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

)1(221 pOdZZ

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

As Vectors From Origin

Thanks to memberstripodcom

d

d

dd INZ 0~21Z

1198851

119885 2

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$n - 1$ dimensional hyperplane
Points are pairwise equidistant, dist $\sim \sqrt{2d}$
Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
Again "randomness in data" is only in rotation
Surprisingly rigid structure in random data

HDLSS Asy's Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
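A numerical stand-in for the simulation view can be sketched as follows. This is an assumption-laden illustration rather than the code behind the slides: it skips the rotation to the plane of the screen and only checks that the triangle formed by 3 Gaussian points becomes nearly equilateral, with side length close to $\sqrt{2d}$. The names `scaled_side_lengths` and `rigidity_gap` and the seed are hypothetical.

```python
import itertools
import math
import random


def scaled_side_lengths(d, rng, n=3):
    """Pairwise distances of n draws from N_d(0, I_d), divided by sqrt(2d)."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(2.0 * d)
    return [math.dist(p, q) / scale
            for p, q in itertools.combinations(points, 2)]


def rigidity_gap(d, repeats=10, seed=1):
    """Largest |side / sqrt(2d) - 1| over `repeats` simulated 3 point data sets."""
    rng = random.Random(seed)
    return max(abs(side - 1.0)
               for _ in range(repeats)
               for side in scaled_side_lengths(d, rng))


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        print(d, round(rigidity_gap(d), 3))
```

A gap near 0 means every simulated triangle is close to the same equilateral triangle, so the only remaining randomness is the rotation that the slides remove before plotting.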


SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for k > 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis

Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations

Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\,\cdot\, \to 0}$
Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…

Surprising (many times)
[Think I've got it, and then …]

Mathematically Beautiful

Practically Relevant

HDLSS Asymptotics Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$

Where are Data? Near Peak of Density?

(Figure: standard normal density. Thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$)
(measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
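A quick way to see $\|Z\| = \sqrt{d} + O_p(1)$ is to simulate it. The sketch below is illustrative only (the names `norm_deviation` and `norm_deviations`, the seed and the repetition count are assumptions, not from the slides): it reports how far $\|Z\|$ falls from $\sqrt{d}$ as $d$ grows.

```python
import math
import random


def norm_deviation(d, rng):
    """Return ||Z|| - sqrt(d) for one draw Z ~ N_d(0, I_d)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return math.sqrt(sum(x * x for x in z)) - math.sqrt(d)


def norm_deviations(d, reps=20, seed=2):
    """Deviations ||Z|| - sqrt(d) for `reps` independent draws."""
    rng = random.Random(seed)
    return [norm_deviation(d, rng) for _ in range(reps)]


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        worst = max(abs(x) for x in norm_deviations(d))
        print(f"d={d:6d}  sqrt(d)={math.sqrt(d):8.2f}  "
              f"max |norm - sqrt(d)| = {worst:.2f}")
```

The absolute deviation should stay of order 1 while $\sqrt{d}$ grows, so in relative terms the data sit ever closer to the sphere of radius $\sqrt{d}$.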

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. $r$ (the radius)
Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
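The balance point can be made explicit with a short worked derivation. This is a standard calculation added for illustration, not taken from the slides; $R = \|Z\|$ and $f_R$ are notation introduced here.

```latex
% Radial density of R = ||Z|| for Z ~ N_d(0, I_d):
% the surface area of the sphere of radius r contributes r^{d-1}
% (Lebesgue measure pushes mass out), the Gaussian density
% contributes e^{-r^2/2} (density pulls data in).
\[
  f_R(r) \;\propto\; r^{\,d-1}\, e^{-r^2/2}, \qquad r > 0 .
\]
% Maximize the log of the radial density:
\[
  \frac{d}{dr}\Bigl[(d-1)\log r - \tfrac{1}{2} r^2\Bigr]
  \;=\; \frac{d-1}{r} - r \;=\; 0
  \quad\Longrightarrow\quad
  r \;=\; \sqrt{d-1} \;\approx\; \sqrt{d} .
\]
```

So although the density of $Z$ is highest at the origin, the density of the radius peaks near $\sqrt{d}$, which matches $\|Z\| = \sqrt{d} + O_p(1)$.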

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
• Can extend to $Z_1, \dots, Z_n$
• Where do they all go? (we can only perceive 3 dim'ns)
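The $\sqrt{2d}$ constant follows from a moment calculation. The steps below are a worked sketch added for illustration, using only the independence and unit variance of the coordinates; $Z_{1j}$ and $Z_{2j}$ denote the $j$-th coordinates of $Z_1$ and $Z_2$.

```latex
% Each coordinate difference is N(0, 2), so its square is 2 * chi^2_1.
\[
  E\,\|Z_1 - Z_2\|^2
  \;=\; \sum_{j=1}^{d} E\,(Z_{1j} - Z_{2j})^2
  \;=\; 2d ,
  \qquad
  \operatorname{Var}\,\|Z_1 - Z_2\|^2
  \;=\; d \cdot \operatorname{Var}\bigl(2\chi^2_1\bigr)
  \;=\; 8d .
\]
% Hence the squared distance concentrates at rate d^{1/2}:
\[
  \|Z_1 - Z_2\|^2 \;=\; 2d + O_p(d^{1/2}) .
\]
% Taking square roots with sqrt(1 + x) = 1 + O(x):
\[
  \|Z_1 - Z_2\|
  \;=\; \sqrt{2d}\,\sqrt{1 + O_p(d^{-1/2})}
  \;=\; \sqrt{2d} + O_p(1) .
\]
```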



SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

Personal Observations

HDLSS world ishellip

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquond

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Where are Data

Near Peak of Density

Thanks to psycnetapaorg

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

(measure how close to peak)

d

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

)1(pOdZ

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

- Paradox resolved by

density w r t Lebesgue Measure

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt Can Show Puts ~ All Weight Near

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
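The claim that all pairwise distances settle near $\sqrt{2d}$ can be illustrated with a small simulation. This is a sketch with the standard library only; the helper names, sample sizes, and seed are assumptions made for the example, not taken from the slides.

```python
import itertools
import math
import random


def standard_normal_sample(n, d, seed=0):
    """n independent draws from N_d(0, I_d), each as a list of floats."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def scaled_pairwise_distances(points):
    """All pairwise Euclidean distances, each divided by sqrt(2 d)."""
    d = len(points[0])
    scale = math.sqrt(2 * d)
    return [math.dist(a, b) / scale for a, b in itertools.combinations(points, 2)]


if __name__ == "__main__":
    for d in (2, 200, 20000):
        ratios = scaled_pairwise_distances(standard_normal_sample(4, d))
        print(f"d={d:6d}  min ratio={min(ratios):.3f}  max ratio={max(ratios):.3f}")
```

As $d$ grows, the smallest and largest ratios should both approach 1, so every pair of points is at nearly the same distance.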

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), with $Z_1$ and $Z_2$ As Vectors From Origin:

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

[Figure: vectors $Z_1$ and $Z_2$ from the origin. Thanks to members.tripod.com]

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
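A quick numerical look at the angle result, as a sketch only: the function names, the number of repetitions, and the seed below are illustrative assumptions. The code draws pairs of independent standard normal vectors and reports the angle between them in degrees.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between vectors u and v (both viewed from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cosine))


def random_angle(d, rng):
    """Angle between two independent N_d(0, I_d) draws."""
    z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return angle_degrees(z1, z2)


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        angles = [random_angle(d, rng) for _ in range(5)]
        print(f"d={d:6d}  " + "  ".join(f"{a:6.1f}" for a in angles))
```

In low dimension the angles are spread widely; in high dimension they should cluster tightly around 90 degrees, with spread shrinking like $d^{-1/2}$.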

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
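One way to see the "near Unit Simplex Vertices (Modulo Rotation)" statement numerically: after dividing each point by $\sqrt{d}$, the matrix of inner products (Gram matrix) should be close to the identity, which means the scaled points are nearly orthonormal, i.e. a rotation of the unit simplex vertices. The sketch below is illustrative; the function names, sizes, and seed are assumptions made for the example.

```python
import random


def scaled_gram_matrix(n, d, seed=0):
    """Gram matrix of n draws from N_d(0, I_d), with each point divided by sqrt(d)."""
    rng = random.Random(seed)
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    # <Z_i / sqrt(d), Z_j / sqrt(d)> = <Z_i, Z_j> / d
    return [[sum(a * b for a, b in zip(p, q)) / d for q in pts] for p in pts]


def max_deviation_from_identity(gram):
    """Largest absolute entry of (gram - identity)."""
    n = len(gram)
    return max(
        abs(gram[i][j] - (1.0 if i == j else 0.0))
        for i in range(n)
        for j in range(n)
    )


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        dev = max_deviation_from_identity(scaled_gram_matrix(4, d))
        print(f"d={d:6d}  max |Gram - I| = {dev:.3f}")
```

The Gram matrix does not depend on how the data are rotated, so a value near the identity captures exactly the part of the structure that is non-random.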

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
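The simulation described above can be approximated without plotting. The sketch below swaps the rotate-and-overlay picture for a numerical summary: a triangle is determined up to rotation by its three side lengths, so it reports how far the side lengths of random 3 point data sets, divided by $\sqrt{2d}$, stray from 1 (an equilateral triangle). The function names, the seed, and this choice of summary are assumptions for illustration, not the slides' own procedure.

```python
import math
import random


def scaled_side_lengths(d, rng):
    """Side lengths of a random 3 point data set in R^d, divided by sqrt(2 d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    scale = math.sqrt(2 * d)
    return sorted(
        math.dist(pts[i], pts[j]) / scale for i, j in ((0, 1), (0, 2), (1, 2))
    )


def worst_deviation(d, repeats=10, seed=0):
    """Largest |scaled side length - 1| over `repeats` independent triangles."""
    rng = random.Random(seed)
    return max(
        abs(s - 1.0) for _ in range(repeats) for s in scaled_side_lengths(d, rng)
    )


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        print(f"d={d:6d}  worst deviation from equilateral = {worst_deviation(d):.3f}")
```

With 10 repeats per dimension, as on the slide, the deviation should shrink steadily as $d$ increases, which is the "rigidity after rotation" effect.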


SigClust Overview

Works Well When Factor Part Not Used

Sample Eigenvalues Always Valid

But Can be Too Conservative

Above Factor Threshold Anti-Conservative

Problem Fixed by Soft Thresholding

(Huang et al., 2014)

SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-means Clustering

Theoretical Null Distributions

Inference for $k > 2$ means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$

Workhorse Method for Much Insight:
Laws of Large Numbers (Consistency)
Central Limit Theorems (Quantify Errors)

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis
I.e. Uses limiting operations
Almost always $\lim_{n \to \infty}$

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:
Approximation provides insights
Can find simple underlying structure
In complex situations

Thus various flavors are fine:
$\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, and limits as a parameter $\to 0$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)

[Think I've got it, and then …]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

[Figure. Thanks to psycnet.apa.org]

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$)

(measure how close to peak)



SigClust Open Problems

Improved Eigenvalue Estimation

More attention to Local Minima in 2-

means Clustering

Theoretical Null Distributions

Inference for k gt 2 means Clustering

Multiple Comparison Issues

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

HDLSS Asymptotics

Personal Observations

HDLSS world is…
- Surprising (many times) [Think I've got it, and then …]
- Mathematically Beautiful
- Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study ideas from Hall, Marron and Neeman (2005)

For the $d$-dimensional standard normal distribution:

$$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$$

Where are the data? Near the peak of the density?
(Image thanks to psycnet.apa.org)

Euclidean distance to origin (as $d \to \infty$), which measures how close the data are to the peak:

$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$$

$$\|Z\| = \sqrt{d} + O_p(1)$$
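The distance-to-origin claim can be checked numerically. The following is a minimal sketch, not taken from the slides: the function name `norm_summary`, the number of draws, and the seed are illustrative assumptions. It draws standard normal vectors and compares the typical length with $\sqrt{d}$; the spread should stay of order 1 while $\sqrt{d}$ grows.

```python
import numpy as np

def norm_summary(d, n_draws=200, seed=0):
    """Mean and standard deviation of ||Z|| over n_draws draws of Z ~ N_d(0, I_d).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # each row is one draw of Z
    norms = np.linalg.norm(z, axis=1)       # Euclidean distance to the origin
    return norms.mean(), norms.std()

for d in (2, 20, 200, 20000):
    mean_norm, sd_norm = norm_summary(d)
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.3f}  "
          f"mean ||Z||={mean_norm:8.3f}  sd ||Z||={sd_norm:.3f}")
```

If the approximation holds, the mean column should track $\sqrt{d}$ while the standard deviation column stays bounded, which is the "surface of a sphere" picture.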

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density is w.r.t. Lebesgue measure

Consider the volume of the unit sphere in $\mathbb{R}^d$:
- Find it as an integral in spherical coordinates
- Look at the integrand w.r.t. the radius $r$
- Can show it puts ~ all weight near the surface, $r = 1$

Lebesgue measure pushes mass out, density pulls data in: $\sqrt{d}$ is the balance point
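The balance between volume growth and density decay can be made explicit with a short derivation. This is a sketch added for illustration, not a formula from the slides; the symbol $R = \|Z\|$ is introduced here. In spherical coordinates the volume element contributes a factor $r^{d-1}$, while the Gaussian density contributes $e^{-r^2/2}$:

```latex
% Density of the radius R = ||Z|| for Z ~ N_d(0, I_d)
f_R(r) \;\propto\; r^{d-1} \, e^{-r^2/2}, \qquad r > 0 .

% Locate the mode by differentiating the log-density:
\frac{d}{dr} \log f_R(r) \;=\; \frac{d-1}{r} - r \;=\; 0
\quad\Longrightarrow\quad
r \;=\; \sqrt{d-1} \;\approx\; \sqrt{d} .
```

The factor $r^{d-1}$ is the "push out" from Lebesgue measure and $e^{-r^2/2}$ is the "pull in" from the density; they balance at a radius close to $\sqrt{d}$.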

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, with $\|Z\| = \sqrt{d} + O_p(1)$:

Important philosophical consequence: "Average People"
- Parents' lament: Why can't I have average children?
- Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For the $d$-dimensional standard normal distribution: $Z_1 \sim N_d(0, I_d)$, independent of $Z_2 \sim N_d(0, I_d)$

Euclidean distance between $Z_1$ and $Z_2$ (as $d \to \infty$):

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

Distance tends to non-random constant
• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$ for independent $X_1$, $X_2$
- Can extend to $Z_1, \ldots, Z_n$
- Where do they all go? (we can only perceive 3 dimensions)
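A quick numerical check of the pairwise-distance statement, as a sketch under stated assumptions: the helper name `pairwise_distance_ratio`, the number of pairs, and the seed are hypothetical choices, not part of the slides. It reports the ratio of the observed distance to $\sqrt{2d}$, which should concentrate near 1 as $d$ grows.

```python
import numpy as np

def pairwise_distance_ratio(d, n_pairs=200, seed=0):
    """Ratio ||Z1 - Z2|| / sqrt(2 d) for n_pairs independent pairs in R^d.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    dist = np.linalg.norm(z1 - z2, axis=1)   # Euclidean distance per pair
    return dist / np.sqrt(2 * d)

for d in (2, 20, 200, 20000):
    ratio = pairwise_distance_ratio(d)
    print(f"d={d:6d}  min ratio={ratio.min():.3f}  max ratio={ratio.max():.3f}")
```

For small $d$ the ratios vary widely; for large $d$ the minimum and maximum should both be close to 1.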

HDLSS Asymptotics: Simple Paradoxes

Ever wonder why we can only perceive 3 dimensions?
- Perceptual system comes from ancestors
- They needed to find food
- Food exists in a 3-d world

HDLSS Asymptotics: Simple Paradoxes

For the $d$-dimensional standard normal distribution: $Z_1 \sim N_d(0, I_d)$, independent of $Z_2 \sim N_d(0, I_d)$

High dimensional angles (as $d \to \infty$), with $Z_1$ and $Z_2$ viewed as vectors from the origin:
(Image thanks to members.tripod.com)

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
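The near-orthogonality of independent Gaussian vectors can also be seen by simulation. This is a minimal sketch, with the function name `angles_deg`, the number of pairs, and the seed chosen here for illustration. It computes the angle between independent pairs and shows the spread around $90^\circ$ shrinking as $d$ grows.

```python
import numpy as np

def angles_deg(d, n_pairs=200, seed=0):
    """Angles (in degrees) between n_pairs independent pairs Z1, Z2 ~ N_d(0, I_d).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

for d in (2, 20, 200, 20000):
    a = angles_deg(d)
    print(f"d={d:6d}  mean angle={a.mean():7.2f}  sd={a.std():6.2f}")
```

The standard deviation column should shrink roughly like $d^{-1/2}$, consistent with the $O_p(d^{-1/2})$ term.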

HDLSS Asy's: Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study subspace generated by data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\approx \sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (modulo rotation)
- "Randomness" appears only in rotation of simplex

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study hyperplane generated by data:
- $(n-1)$-dimensional hyperplane
- Points are pairwise equidistant, dist $\sim \sqrt{2d}$
- Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
- Again "randomness in data" is only in rotation
- Surprisingly rigid structure in random data
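One compact way to see this rigid structure is through the Gram matrix of the data. The sketch below is an illustration under assumptions (the helper name `scaled_gram`, the choice $n = 5$, and the seed are not from the slides). The entries of $Z Z^T / d$ hold the scaled squared lengths on the diagonal and the scaled inner products off the diagonal, so closeness to the identity matrix reflects both "length $\approx \sqrt{d}$" and "pairwise orthogonal", which is the scaled unit simplex configuration up to rotation.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Return Z Z^T / d for an n x d matrix Z with i.i.d. N(0, 1) entries.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))   # rows are the data points Z_1, ..., Z_n
    return z @ z.T / d

n = 5
for d in (2, 20, 200, 20000):
    g = scaled_gram(n, d)
    dev = np.max(np.abs(g - np.eye(n)))   # largest entrywise gap to the identity
    print(f"d={d:6d}  max |ZZ^T/d - I| = {dev:.3f}")
```

Since the Gram matrix is unchanged by rotating the data, it captures exactly the part of the configuration that is not random in the limit.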

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
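The simulation steps listed above can be sketched as follows. This is one possible reading of the recipe, not the code behind the original figure: the SVD-based projection, the alignment convention (first vertex pointing up, second vertex on the left), and the names `triangle_in_screen_plane` and `side_lengths` are all assumptions made for illustration. Plotting is omitted; the sketch prints side lengths relative to $\sqrt{2d}$ instead.

```python
import numpy as np

def triangle_in_screen_plane(d, rng):
    """Draw 3 points from N_d(0, I_d); return 2-d coordinates in the plane
    they generate, rotated to a common orientation (hypothetical helper)."""
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)     # 3 points span a 2-d plane through their mean
    # Rows of vt[:2] are an orthonormal basis of that plane ("plane of screen").
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    xy = centered @ vt[:2].T          # shape (3, 2); pairwise distances preserved
    # Rotate within the plane so the first vertex points straight up.
    theta = np.pi / 2 - np.arctan2(xy[0, 1], xy[0, 0])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    xy = xy @ rot.T
    # Reflect if needed so the second vertex lies on the left.
    if xy[1, 0] > 0:
        xy[:, 0] *= -1
    return xy

def side_lengths(xy):
    """Lengths of the three sides of the triangle with vertices xy."""
    pairs = ((0, 1), (1, 2), (0, 2))
    return np.array([np.linalg.norm(xy[i] - xy[j]) for i, j in pairs])

rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    # Repeat 10 times, as in the slide; each repeat is one triangle.
    ratios = np.array([side_lengths(triangle_in_screen_plane(d, rng))
                       for _ in range(10)]) / np.sqrt(2 * d)
    print(f"d={d:6d}  side / sqrt(2d): min={ratios.min():.3f}  max={ratios.max():.3f}")
```

In low dimensions the ten triangles differ widely; in high dimensions they should all be close to the same equilateral triangle, which is the "rigidity after rotation" the slide refers to.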


HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always

Workhorse Method for Much Insight Laws of Large Numbers (Consistency) Central Limit Theorems (Quantify Errors)

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Ie Uses limiting operations Almost always Occasional misconceptions

Indicates behavior for large samples Thus only makes sense for ldquolargerdquo samples Models phenomenon of ldquoincreasing datardquo So other flavors are useless

nlim

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insights

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

0limlimlimlim

dndn

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

Personal Observations

HDLSS world ishellip

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquond

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Where are Data

Near Peak of Density

Thanks to psycnetapaorg

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

(measure how close to peak)

d

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

)1(pOdZ

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

- Paradox resolved by

density w r t Lebesgue Measure

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt Can Show Puts ~ All Weight Near

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

Theorem Impossible (over many factors)

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

d dd INZ 0~21Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

(as )

Distance tends to non-random constant

d

d

dd INZ 0~2

)1(221 pOdZZ

1Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

)1(221 pOdZZ

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

As Vectors From Origin

Thanks to memberstripodcom

d

d

dd INZ 0~21Z

1198851

119885 2

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distributrsquon
  • SigClust Gaussian null distributrsquon (2)
  • SigClust Gaussian null distributrsquon (3)
  • SigClust Gaussian null distributrsquon (4)
  • SigClust Gaussian null distributrsquon (5)
  • SigClust Gaussian null distributrsquon (6)
  • SigClust Gaussian null distributrsquon (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distributrsquon (8)
  • SigClust Estimation of Eigenvalrsquos
  • SigClust Estimation of Eigenvalrsquos (2)
  • SigClust Estimation of Eigenvalrsquos (3)
  • SigClust Estimation of Eigenvalrsquos (4)
  • SigClust Estimation of Eigenvalrsquos (5)
  • SigClust Gaussian null distribution - Simulation
  • SigClust Gaussian null distribution - Simulation (2)
  • An example (details to follow)
  • SigClust Modalities
  • SigClust Modalities (2)
  • SigClust Real Data Results
  • Perou 500 PCA View ndash real clusters
  • Perou 500 DWD Dirrsquons View ndash real clusters
  • Perou 500 ndash Fundamental Question
  • SigClust Results for Luminal A vs Luminal B
  • SigClust Results for Luminal A vs Luminal B (2)
  • SigClust Results for Luminal A vs Luminal B (3)
  • SigClust Results for Luminal A vs Luminal B (4)
  • SigClust Results for Luminal A vs Luminal B (5)
  • SigClust Real Data Results (2)
  • SigClust Real Data Results (3)
  • SigClust Real Data Results (4)
  • SigClust Real Data Results (5)
  • SigClust Overview
  • SigClust Overview (2)
  • SigClust Overview (3)
  • SigClust Overview (4)
  • SigClust Open Problems
  • HDLSS Asymptotics
  • HDLSS Asymptotics (2)
  • HDLSS Asymptotics (3)
  • HDLSS Asymptotics (4)
  • HDLSS Asymptotics (5)
  • HDLSS Asymptotics (6)
  • HDLSS Asymptotics (7)
  • HDLSS Asymptotics (8)
  • HDLSS Asymptotics (9)
  • HDLSS Asymptotics (10)
  • HDLSS Asymptotics (11)
  • HDLSS Asymptotics (12)
  • HDLSS Asymptotics (13)
  • HDLSS Asymptotics (14)
  • HDLSS Asymptotics (15)
  • HDLSS Asymptotics (16)
  • HDLSS Asymptotics Simple Paradoxes
  • HDLSS Asymptotics Simple Paradoxes (2)
  • HDLSS Asymptotics Simple Paradoxes (3)
  • HDLSS Asymptotics Simple Paradoxes (4)
  • HDLSS Asymptotics Simple Paradoxes (5)
  • HDLSS Asymptotics Simple Paradoxes (6)
  • HDLSS Asymptotics Simple Paradoxes (7)
  • HDLSS Asymptotics Simple Paradoxes (8)
  • HDLSS Asymptotics Simple Paradoxes (9)
  • HDLSS Asymptotics Simple Paradoxes (10)
  • HDLSS Asymptotics Simple Paradoxes (11)
  • HDLSS Asymptotics Simple Paradoxes (12)
  • HDLSS Asymptotics Simple Paradoxes (13)
  • HDLSS Asymptotics Simple Paradoxes (14)
  • HDLSS Asymptotics Simple Paradoxes (15)
  • HDLSS Asymptotics Simple Paradoxes (16)
  • HDLSS Asymptotics Simple Paradoxes (17)
  • HDLSS Asymptotics Simple Paradoxes (18)
  • HDLSS Asymptotics Simple Paradoxes (19)
  • HDLSS Asymptotics Simple Paradoxes (20)
  • HDLSS Asymptotics Simple Paradoxes (21)
  • HDLSS Asymptotics Simple Paradoxes (22)
  • HDLSS Asymptotics Simple Paradoxes (23)
  • HDLSS Asymptotics Simple Paradoxes (24)
  • HDLSS Asymptotics Simple Paradoxes (25)
  • HDLSS Asymptotics Simple Paradoxes (26)
  • HDLSS Asymptotics Simple Paradoxes (27)
  • HDLSS Asymptotics Simple Paradoxes (28)
  • HDLSS Asymptotics Simple Paradoxes (29)
  • HDLSS Asymptotics Simple Paradoxes (30)
  • HDLSS Asymptotics Simple Paradoxes (31)
  • HDLSS Asyrsquos Geometrical Representrsquon
  • HDLSS Asyrsquos Geometrical Representrsquon (2)
  • HDLSS Asyrsquos Geometrical Representrsquon (3)
  • HDLSS Asyrsquos Geometrical Representrsquon (4)
  • HDLSS Asyrsquos Geometrical Representrsquon (5)
  • HDLSS Asyrsquos Geometrical Representrsquon (6)
  • HDLSS Asyrsquos Geometrical Representrsquon (7)
  • HDLSS Asyrsquos Geometrical Representrsquon (8)
  • HDLSS Asyrsquos Geometrical Representrsquon (9)
  • HDLSS Asyrsquos Geometrical Representrsquon (10)
  • HDLSS Asyrsquos Geometrical Representrsquon (11)
  • HDLSS Asyrsquos Geometrical Represenrsquotion
  • HDLSS Asyrsquos Geometrical Represenrsquotion (2)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (3)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (4)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (5)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (6)
Page 56: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asymptotics

Modern Mathematical Statistics:
- Based on asymptotic analysis
- I.e. uses limiting operations
- Almost always $\lim_{n \to \infty}$
- Workhorse method for much insight:
  - Laws of Large Numbers (consistency)
  - Central Limit Theorems (quantify errors)
- Occasional misconceptions:
  - Indicates behavior for large samples
  - Thus only makes sense for "large" samples
  - Models phenomenon of "increasing data"
  - So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
- Based on asymptotic analysis
- Real reasons:
  - Approximation provides insights
  - Can find simple underlying structure in complex situations
- Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n, d \to \infty}$, $\lim_{\sigma \to 0}$
- Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is...
- Surprising (many times)
  [Think I've got it, and then ...]
- Mathematically Beautiful (?)
- Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study ideas from Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:

$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$

Where are the data? Near the peak of the density?

(Image: thanks to psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution: $Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$

Euclidean distance to origin (as $d \to \infty$)
(measure how close to peak):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
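The distance-to-origin claim can be checked numerically. The following is a minimal sketch, not taken from the slides: the function names, the seed, and the choice of 200 draws per dimension are illustrative assumptions. It draws $Z \sim N_d(0, I_d)$ repeatedly and summarizes $\|Z\| - \sqrt{d}$, which should stay bounded (not grow with $d$) if $\|Z\| = \sqrt{d} + O_p(1)$.

```python
import math
import random
import statistics


def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def norm_summary(d, n_draws=200, seed=0):
    """Mean and standard deviation of ||Z|| - sqrt(d) over n_draws draws.

    n_draws and seed are illustrative choices, not values from the slides.
    """
    rng = random.Random(seed)
    gaps = [norm_of_standard_normal(d, rng) - math.sqrt(d) for _ in range(n_draws)]
    return statistics.mean(gaps), statistics.stdev(gaps)


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean_gap, sd_gap = norm_summary(d)
        print(f"d={d:5d}  mean(||Z|| - sqrt(d)) = {mean_gap:+.3f}  sd = {sd_gap:.3f}")
```

If the approximation holds, the printed spread should stay of order 1 while $\sqrt{d}$ itself grows, so the relative fluctuation of the norm shrinks.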

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue measure

HDLSS Asymptotics: Simple Paradoxes

Paradox resolved by: density w.r.t. Lebesgue measure
- Consider volume of unit sphere in $\mathbb{R}^d$
- Find as integral in spherical coordinates
- Look at integrand w.r.t. the radius $r$
- Can show it puts ~all weight near the boundary, $r = 1$
- Lebesgue measure pushes mass out
- Density pulls data in
- $\sqrt{d}$ is the balance point
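The volume argument can be written out as a short worked calculation. This is a sketch of the standard computation rather than a formula taken from the slides; the symbols $V_d$, $A_{d-1}$, $r$ and $\varepsilon$ are notation introduced here for illustration.

```latex
% Volume of the unit ball in R^d, in spherical coordinates
% (A_{d-1} denotes the surface area of the unit sphere S^{d-1}):
V_d = \int_{S^{d-1}} \int_0^1 r^{d-1} \, dr \, d\sigma = \frac{A_{d-1}}{d}

% Fraction of that volume within radius 1 - \varepsilon of the center:
\frac{\int_0^{1-\varepsilon} r^{d-1} \, dr}{\int_0^1 r^{d-1} \, dr}
  = (1 - \varepsilon)^d \longrightarrow 0 \quad (d \to \infty)

% Radial density of \|Z\| for Z \sim N_d(0, I_d):
f_{\|Z\|}(r) \propto r^{d-1} e^{-r^2/2}, \qquad
\frac{d}{dr}\left[(d-1)\log r - \frac{r^2}{2}\right] = \frac{d-1}{r} - r = 0
\;\Longrightarrow\; r = \sqrt{d-1} \approx \sqrt{d}
```

The factor $r^{d-1}$ is the Lebesgue-measure term that pushes mass outward, $e^{-r^2/2}$ is the density term that pulls it inward, and their product peaks at a radius close to $\sqrt{d}$.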

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"
- Parents' lament: Why can't I have average children?
- Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution: $Z_1 \sim N_d(0, I_d)$, independent of $Z_2 \sim N_d(0, I_d)$

Euclidean distance between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
- Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$ (for independent $X_1$, $X_2$)
- Can extend to $Z_1, \dots, Z_n$
- Where do they all go? (we can only perceive 3 dimensions)
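The pairwise-distance claim, extended to $n$ points, can be illustrated with a small simulation. This is a sketch under assumptions made here (the helper names, $n = 5$, the seed and the dimensions tried are all illustrative): it draws $n$ independent standard normal vectors and reports the smallest and largest pairwise distance divided by $\sqrt{2d}$.

```python
import itertools
import math
import random


def standard_normal_points(n, d, rng):
    """n independent draws from N_d(0, I_d), as lists of floats."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def pairwise_distances(points):
    """Euclidean distances over all unordered pairs of points."""
    return [math.dist(a, b) for a, b in itertools.combinations(points, 2)]


def scaled_distance_range(n, d, seed=0):
    """Min and max of ||Z_i - Z_j|| / sqrt(2 d) over all pairs (illustrative helper)."""
    rng = random.Random(seed)
    points = standard_normal_points(n, d, rng)
    ratios = [dist / math.sqrt(2 * d) for dist in pairwise_distances(points)]
    return min(ratios), max(ratios)


if __name__ == "__main__":
    for d in (2, 20, 200, 5000):
        lo, hi = scaled_distance_range(n=5, d=d)
        print(f"d={d:5d}  min ratio = {lo:.3f}  max ratio = {hi:.3f}")
```

As $d$ grows, both ratios should approach 1, which is the sense in which all pairs become nearly equidistant.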

HDLSS Asymptotics: Simple Paradoxes

Ever wonder why? (we can only perceive 3 dimensions)
- Perceptual system from ancestors
- They needed to find food
- Food exists in 3-d world

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution: $Z_1 \sim N_d(0, I_d)$, independent of $Z_2 \sim N_d(0, I_d)$

High-dimensional angles (as $d \to \infty$), with $Z_1$, $Z_2$ as vectors from origin:

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

(Image: thanks to members.tripod.com)

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
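A quick numerical check of the angle result is sketched below. The function names, the number of pairs and the seed are assumptions made for illustration. It measures how far the angle between two independent standard normal vectors falls from 90 degrees; since the cosine of the angle is roughly a standard normal divided by $\sqrt{d}$, the deviations should shrink at about that rate.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between vectors u and v, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def angle_deviations(d, n_pairs=20, seed=0):
    """Deviations (in degrees) from 90 for n_pairs independent pairs from N_d(0, I_d)."""
    rng = random.Random(seed)
    deviations = []
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        deviations.append(angle_degrees(z1, z2) - 90.0)
    return deviations


if __name__ == "__main__":
    for d in (2, 20, 200, 5000):
        worst = max(abs(x) for x in angle_deviations(d))
        print(f"d={d:5d}  largest |angle - 90| over 20 pairs = {worst:.2f} degrees")
```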

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study subspace generated by data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (modulo rotation)
- "Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
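One way to see the "near unit simplex vertices, modulo rotation" statement is through the Gram matrix of the data. The sketch below is an illustration under assumptions made here, not the construction used by Hall, Marron and Neeman: the function names, $n = 4$ and the seed are hypothetical choices. If the matrix of inner products $Z_i^T Z_j / d$ is close to the identity, then the points $Z_i / \sqrt{d}$ are nearly orthonormal, so some rotation carries them close to the coordinate vectors $e_1, \dots, e_n$, which are the vertices of the unit simplex.

```python
import random


def gram_over_d(n, d, seed=0):
    """n x n matrix of inner products Z_i . Z_j / d for n draws from N_d(0, I_d)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [
        [sum(a * b for a, b in zip(points[i], points[j])) / d for j in range(n)]
        for i in range(n)
    ]


def max_deviation_from_identity(gram):
    """Largest absolute entrywise difference between gram and the identity matrix."""
    n = len(gram)
    return max(
        abs(gram[i][j] - (1.0 if i == j else 0.0))
        for i in range(n)
        for j in range(n)
    )


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        deviation = max_deviation_from_identity(gram_over_d(n=4, d=d))
        print(f"d={d:6d}  max |Gram/d - I| = {deviation:.4f}")
```

The diagonal entries track the squared norms (near $d$ before scaling) and the off-diagonal entries track the angles (near 90 degrees), so this single matrix summarizes both earlier paradoxes.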

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study hyperplane generated by data:
- $n - 1$ dimensional hyperplane
- Points are pairwise equidistant, dist $\sim \sqrt{2d}$
- Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
- Again "randomness in data" is only in rotation
- Surprisingly rigid structure in random data

HDLSS Asymptotics: Geometrical Representation

Simulation view: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asymptotics: Geometrical Representation

Simulation view: shows "rigidity after rotation"
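The simulation steps listed above can be sketched in code. This is a reconstruction under stated assumptions, not the code behind the slides: the helper names are hypothetical, "rotate to plane of screen" is implemented by Gram-Schmidt on two edge vectors, "comparable" is taken to mean the first point is placed on the positive vertical axis (with a reflection so the second point lies to the right), and the division by $\sqrt{2d}$ is an added rescaling so that triangles from different dimensions share one scale. Plotting is left out; the side lengths are printed instead.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    return [a - b for a, b in zip(u, v)]


def scale(u, c):
    return [c * a for a in u]


def three_points_in_plane(d, rng):
    """Draw 3 points from N_d(0, I_d); return their 2-d coordinates after rotation."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    centroid = [sum(col) / 3.0 for col in zip(*pts)]
    centered = [sub(p, centroid) for p in pts]
    # Gram-Schmidt on two edge vectors gives an orthonormal basis (e1, e2) of the plane.
    v1 = sub(pts[1], pts[0])
    e1 = scale(v1, 1.0 / math.sqrt(dot(v1, v1)))
    v2 = sub(pts[2], pts[0])
    w2 = sub(v2, scale(e1, dot(v2, e1)))
    e2 = scale(w2, 1.0 / math.sqrt(dot(w2, w2)))
    coords = [(dot(c, e1), dot(c, e2)) for c in centered]
    # Rotate within the plane so the first point sits on the positive vertical axis.
    theta = math.atan2(coords[0][0], coords[0][1])
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotated = [(cos_t * x - sin_t * y, sin_t * x + cos_t * y) for x, y in coords]
    # Reflect if needed so the second point lies to the right (assumed convention).
    if rotated[1][0] < 0:
        rotated = [(-x, y) for x, y in rotated]
    # Assumed rescaling by sqrt(2 d) so different dimensions are on one scale.
    s = math.sqrt(2.0 * d)
    return [(x / s, y / s) for x, y in rotated]


def side_lengths(triangle):
    """The three side lengths of a triangle given as 2-d points."""
    return [math.dist(triangle[i], triangle[j]) for i, j in ((0, 1), (0, 2), (1, 2))]


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        print(f"d = {d}")
        for _ in range(10):  # repeat 10 times, as in the slide
            sides = side_lengths(three_points_in_plane(d, rng))
            print("  sides / sqrt(2d):", " ".join(f"{s:.3f}" for s in sides))
```

For small $d$ the ten triangles should vary widely in shape, while for large $d$ they should all be close to an equilateral triangle with side 1 on this scale, which is the rigidity the slide refers to.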


Page 58: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asymptotics

Modern Mathematical Statistics:
- Based on asymptotic analysis
- I.e. uses limiting operations
- Almost always $\lim_{n \to \infty}$

Occasional misconceptions:
- Indicates behavior for large samples
- Thus only makes sense for "large" samples
- Models phenomenon of "increasing data"
- So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
- Based on asymptotic analysis

Real Reasons:
- Approximation provides insights
- Can find simple underlying structure in complex situations

Thus various flavors are fine:
$\lim_{d \to \infty}$, $\lim_{n \to \infty}$, $\lim_{d, n \to \infty}$, $\lim_{\,\cdot\, \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is…
- Surprising (many times) [Think I've got it, and then …]
- Mathematically Beautiful (?)
- Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study ideas from Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$$

Where are the data? Near the peak of the density? (Thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$), which measures how close to the peak:

$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$$

$$\|Z\| = \sqrt{d} + O_p(1)$$
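The distance-to-origin result can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `norm_summary` and the choices of `n_draws`, seed, and dimensions are illustrative and not from the slides. It draws many vectors from $N_d(0, I_d)$ and compares the average length with $\sqrt{d}$; the spread of the lengths should stay bounded as $d$ grows, in line with the $O_p(1)$ term.

```python
import numpy as np

def norm_summary(d, n_draws=2000, seed=0):
    """Return (mean, sd) of ||Z|| over n_draws samples of Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # each row is one draw of Z
    norms = np.linalg.norm(z, axis=1)       # Euclidean distance to the origin
    return norms.mean(), norms.std()

# Dimensions chosen for illustration only.
for d in (2, 20, 200, 2000):
    mean, sd = norm_summary(d)
    print(f"d={d:5d}  sqrt(d)={np.sqrt(d):8.3f}  mean||Z||={mean:8.3f}  sd||Z||={sd:.3f}")
```

The mean tracks $\sqrt{d}$ while the standard deviation does not grow with $d$, so relative to the radius the data sit in an ever thinner shell.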

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

Paradox resolved by: density w.r.t. Lebesgue Measure
- Consider volume of unit sphere in $\mathbb{R}^d$
- Find as integral in sph'l coordinates
- Look at integrand w.r.t. the radius $r$
- Can show it puts ~ all weight near $r = 1$
- Lebesgue Measure pushes mass out
- Density pulls data in
- $\sqrt{d}$ is the balance point
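Both halves of this argument can be made concrete with a small computation. The sketch below rests on two standard facts that are not spelled out on the slide: the volume of a ball of radius $r$ in $d$ dimensions scales as $r^d$, and the density of the radius $\|Z\|$ is proportional to $r^{d-1} e^{-r^2/2}$. The function names and the grid are illustrative assumptions.

```python
import numpy as np

def shell_fraction(d, eps):
    """Fraction of the unit ball's volume in d dimensions with radius in [1 - eps, 1]."""
    # Volume scales as r**d, so the inner ball of radius 1 - eps holds (1 - eps)**d of it.
    return 1.0 - (1.0 - eps) ** d

def radial_mode(d):
    """Numerically locate the peak of r**(d - 1) * exp(-r**2 / 2) for d >= 2."""
    grid = np.linspace(1e-6, 3.0 * np.sqrt(d) + 3.0, 200001)
    # Work on the log scale to avoid overflow for large d.
    log_density = (d - 1) * np.log(grid) - 0.5 * grid ** 2
    return grid[np.argmax(log_density)]

for d in (2, 20, 200, 2000):
    print(f"d={d:5d}  outer 5% shell holds {shell_fraction(d, 0.05):.4f} of the volume, "
          f"radial peak at {radial_mode(d):.3f} vs sqrt(d)={np.sqrt(d):.3f}")
```

The factor $r^{d-1}$ is the volume term pushing mass outward and $e^{-r^2/2}$ is the density pulling it in; setting the derivative of the log to zero gives a peak at $r = \sqrt{d-1}$, which is close to $\sqrt{d}$ for large $d$.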

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"
- Parents' Lament: Why can't I have average children?
- Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

Distance tends to non-random constant:
• Factor $\sqrt{2}$, since for independent $X_1$, $X_2$: $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
• Can extend to $Z_1, \dots, Z_n$
• Where do they all go? (we can only perceive 3 dim'ns)
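A quick simulation illustrates the pairwise-distance claim. This is a sketch under the slide's assumptions (independent standard normal vectors); the function name `pairwise_distance_summary` and the simulation sizes are hypothetical choices made here for illustration.

```python
import numpy as np

def pairwise_distance_summary(d, n_draws=2000, seed=0):
    """Return (mean, sd) of ||Z1 - Z2|| for independent Z1, Z2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_draws, d))
    z2 = rng.standard_normal((n_draws, d))
    dist = np.linalg.norm(z1 - z2, axis=1)   # one distance per simulated pair
    return dist.mean(), dist.std()

for d in (2, 20, 200, 2000):
    mean, sd = pairwise_distance_summary(d)
    print(f"d={d:5d}  sqrt(2d)={np.sqrt(2 * d):8.3f}  mean dist={mean:8.3f}  sd dist={sd:.3f}")
```

Each coordinate of $Z_1 - Z_2$ has variance 2, which is where the factor $\sqrt{2}$ comes from; the spread of the distance stays of order 1 while its center grows like $\sqrt{2d}$.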

HDLSS Asymptotics: Simple Paradoxes

Ever wonder why? (we can only perceive 3 dim'ns)
o Perceptual system from ancestors
o They needed to find food
o Food exists in 3-d world

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), viewing $Z_1$ and $Z_2$ as vectors from origin (thanks to members.tripod.com):

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
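The near-orthogonality of independent Gaussian vectors can also be seen by simulation. The sketch below is illustrative only; `angle_summary` and its defaults are names assumed here, not part of the slides. It reports the mean and spread of the angle, in degrees, between two independent draws.

```python
import numpy as np

def angle_summary(d, n_draws=2000, seed=0):
    """Return (mean, sd) in degrees of the angle between independent Z1, Z2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_draws, d))
    z2 = rng.standard_normal((n_draws, d))
    cosines = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against rounding slightly outside [-1, 1] before arccos.
    angles = np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))
    return angles.mean(), angles.std()

for d in (2, 20, 200, 2000):
    mean, sd = angle_summary(d)
    print(f"d={d:5d}  mean angle={mean:7.2f} deg  sd={sd:6.2f} deg  sd*sqrt(d)={sd * np.sqrt(d):6.1f}")
```

The last column should settle near a constant for larger $d$, consistent with the $O_p(d^{-1/2})$ rate for the deviation from $90^\circ$.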

HDLSS Asy's: Geometrical Represent'n

Assume $d \to \infty$, let $Z_1, \dots, Z_n \sim N_d(0, I_d)$

Study Subspace Generated by Data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist ~ $\sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
- "Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
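One way to see the simplex picture numerically is through the Gram matrix of the data. If the $n$ points have length about $\sqrt{d}$ and are nearly orthogonal, then $Z Z^T / d$ should be close to the $n \times n$ identity, which is exactly the Gram matrix of the unit simplex vertices $e_1, \dots, e_n$. The sketch below is an illustration under that reading; the function names and the choices of $n$ and $d$ are assumptions made here.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Gram matrix of n draws from N_d(0, I_d), divided by d."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))   # rows are the data points Z_1, ..., Z_n
    return z @ z.T / d

def max_deviation_from_identity(n, d, seed=0):
    """Largest absolute entry of scaled_gram(n, d) - I_n."""
    return np.max(np.abs(scaled_gram(n, d, seed) - np.eye(n)))

for d in (20, 2000, 200000):
    print(f"n=5  d={d:6d}  max |Z Z^T / d - I| = {max_deviation_from_identity(5, d):.4f}")
```

Since a Gram matrix determines a point configuration up to rotation (and reflection), a Gram matrix near the identity means the data set is close to a rotated copy of $\sqrt{d}$ times the unit simplex vertices.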

HDLSS Asy's: Geometrical Represent'n

Assume $d \to \infty$, let $Z_1, \dots, Z_n \sim N_d(0, I_d)$

Study Hyperplane Generated by Data:
- $n - 1$ dimensional hyperplane
- Points are pairwise equidistant, dist ~ $\sqrt{2d}$
- Points lie at vertices of $\sqrt{2d}$ × "regular $n$-hedron"
- Again "randomness in data" is only in rotation
- Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represent'n

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: shows "Rigidity after Rotation"
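The simulation steps listed above can be sketched as follows. This is not the original code behind the slides; it is one possible reading, with several assumptions: the plane is found by an SVD of the centered points, coordinates are rescaled by $\sqrt{2d}$, "comparable" is taken to mean placing the first vertex on the positive vertical axis, and a mirror flip is applied so that all triangles have the same orientation. Plotting is replaced by printing how far the side lengths are from 1.

```python
import numpy as np

def triangle_in_screen_plane(d, rng):
    """Draw 3 points from N_d(0, I_d) and return their 3 x 2 coordinates in their own plane."""
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)           # shift the plane through the 3 points to contain 0
    # Rows of vt[:2] form an orthonormal basis of that 2-d plane ("rotate to plane of screen").
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T / np.sqrt(2 * d)   # rescale by the limiting distance sqrt(2d)
    # Rotate within the plane so that vertex 0 sits on the positive vertical axis.
    alpha = np.pi / 2 - np.arctan2(coords[0, 1], coords[0, 0])
    rot = np.array([[np.cos(alpha), -np.sin(alpha)],
                    [np.sin(alpha),  np.cos(alpha)]])
    coords = coords @ rot.T
    # Fix the remaining mirror-image ambiguity (an assumption, not stated on the slide).
    if coords[1, 0] < 0:
        coords[:, 0] *= -1
    return coords

def side_lengths(coords):
    """Lengths of the three sides of the triangle with the given 3 x 2 vertex coordinates."""
    i, j = np.triu_indices(3, k=1)
    return np.linalg.norm(coords[i] - coords[j], axis=1)

rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    worst = max(np.max(np.abs(side_lengths(triangle_in_screen_plane(d, rng)) - 1.0))
                for _ in range(10))
    print(f"d={d:6d}  largest |side length - 1| over 10 repeats: {worst:.4f}")
```

For small $d$ the ten triangles differ widely in shape; for large $d$ they should all be close to the same equilateral triangle with unit sides, which is the "rigidity after rotation" the slide refers to. Overlaying the ten `coords` arrays in different colors for each $d$ would give a picture in the spirit of the slide.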




HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis
I.e. uses limiting operations
Almost always $\lim_{n \to \infty}$

Occasional misconceptions:
Indicates behavior for large samples
Thus only makes sense for "large" samples
Models phenomenon of "increasing data"
So other flavors are useless

HDLSS Asymptotics

Modern Mathematical Statistics:
Based on asymptotic analysis

Real Reasons:
Approximation provides insights
Can find simple underlying structure in complex situations

Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n,d \to \infty}$, and a limit toward 0
Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations:
HDLSS world is…
Surprising (many times)
[Think I've got it, and then …]
Mathematically Beautiful
Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study ideas from Hall, Marron and Neeman (2005)

For $d$-dimensional Standard Normal distribution:
$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are the data? Near the peak of the density?
(Thanks to psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $Z \sim N_d(0, I_d)$, the $d$-dimensional Standard Normal distribution:

Euclidean Distance to Origin (as $d \to \infty$), a measure of how close to the peak:
$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$
$\|Z\| = \sqrt{d} + O_p(1)$
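As a rough numerical check of $\|Z\| = \sqrt{d} + O_p(1)$, the following Python sketch simulates standard normal vectors and compares the average norm with $\sqrt{d}$. It is not from the slides; the function names, the seed and the choice of 200 draws are assumptions made for illustration.

```python
import math
import random


def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def norm_summary(d, n_draws=200, seed=0):
    """Sample mean and standard deviation of ||Z|| over n_draws vectors."""
    rng = random.Random(seed)
    norms = [norm_of_standard_normal(d, rng) for _ in range(n_draws)]
    mean = sum(norms) / n_draws
    sd = math.sqrt(sum((x - mean) ** 2 for x in norms) / (n_draws - 1))
    return mean, sd


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = norm_summary(d)
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.2f}  "
              f"mean ||Z||={mean:7.2f}  sd={sd:.2f}")
```

If the approximation holds, the mean should track $\sqrt{d}$ while the standard deviation stays bounded as $d$ grows, which is what the $O_p(1)$ term expresses.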

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider volume of unit sphere in $\mathbb{R}^d$
Find as integral in spherical coordinates
Look at integrand w.r.t. the radius $r$
Can show it puts ~ all weight near the surface, $r = 1$

Lebesgue Measure pushes mass out
Density pulls data in
$\sqrt{d}$ is the balance point
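The statement that almost all of the volume sits near the surface can be checked with simple arithmetic: volume in $\mathbb{R}^d$ scales as $r^d$, so the fraction of the unit ball lying within radius $1 - \varepsilon$ is $(1 - \varepsilon)^d$. The Python sketch below tabulates this; the function name and the choice $\varepsilon = 0.01$ are illustrative assumptions, not taken from the slides.

```python
def inner_volume_fraction(d, eps):
    """Fraction of the unit ball's volume within radius 1 - eps.

    Volume in d dimensions scales as r**d, so the fraction is (1 - eps)**d.
    """
    return (1.0 - eps) ** d


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        frac = inner_volume_fraction(d, eps=0.01)
        print(f"d={d:6d}  volume fraction with r <= 0.99: {frac:.3e}")
```

For small $d$ the inner ball holds nearly everything; for large $d$ it holds almost nothing, which is the sense in which Lebesgue measure pushes mass outward.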

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"
Parents' Lament: Why can't I have average children?
Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Distance between $Z_1$ and $Z_2$ (as $d \to \infty$):
$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
Distance tends to non-random constant
• Factor $\sqrt{2}$, since for independent $X_1$, $X_2$: $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go? (we can only perceive 3 dimensions)
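The extension to $Z_1, \ldots, Z_n$ can be illustrated numerically: every pairwise distance, divided by $\sqrt{2d}$, should be close to 1 when $d$ is large. The Python sketch below is an illustration under assumed settings (the function names, seed, $n = 5$ and the dimensions tried are not from the slides).

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def scaled_pairwise_distances(n, d, seed=0):
    """All pairwise distances among n draws from N_d(0, I_d), over sqrt(2d)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(2 * d)
    return [distance(points[i], points[j]) / scale
            for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    for d in (2, 20, 200, 5000):
        ratios = scaled_pairwise_distances(n=5, d=d)
        print(f"d={d:5d}  min ratio={min(ratios):.3f}  max ratio={max(ratios):.3f}")
```

As $d$ increases the minimum and maximum ratios should squeeze toward 1, so the $n$ points become nearly pairwise equidistant.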

HDLSS Asymptotics: Simple Paradoxes

Ever wonder why? (we can only perceive 3 dimensions)
o Perceptual system from ancestors
o They needed to find food
o Food exists in 3-d world

HDLSS Asymptotics: Simple Paradoxes

For $d$-dimensional Standard Normal distribution:
$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High-dimensional Angles (as $d \to \infty$), with $Z_1$, $Z_2$ as vectors from origin:
$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
(Thanks to members.tripod.com)
- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
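A quick way to see the near-orthogonality is to simulate pairs of independent standard normal vectors and measure how far their angle is from $90^\circ$. The following Python sketch does that; the function names, seed and number of pairs are assumptions chosen for illustration rather than anything specified in the slides.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between vectors u and v, viewed from the origin, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def mean_abs_deviation_from_90(d, n_pairs=50, seed=0):
    """Average |Angle(Z1, Z2) - 90| over n_pairs independent pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        total += abs(angle_degrees(z1, z2) - 90.0)
    return total / n_pairs


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        dev = mean_abs_deviation_from_90(d)
        print(f"d={d:5d}  mean |angle - 90| = {dev:6.2f} degrees")
```

The deviation should shrink roughly like $d^{-1/2}$, in line with the $O_p(d^{-1/2})$ term.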

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (modulo rotation)
"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
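One way to read the "near Unit Simplex Vertices (modulo rotation)" statement is through the Gram matrix: if $Z_i \cdot Z_j / d$ is close to 1 for $i = j$ and close to 0 otherwise, then the data are close to a rotated copy of $\sqrt{d}$ times the coordinate unit vectors $e_1, \ldots, e_n$. The Python sketch below checks this numerically. It reflects this interpretation rather than code from the slides, and the function names and settings are assumptions.

```python
import random


def scaled_gram_matrix(n, d, seed=0):
    """Matrix of inner products Z_i . Z_j / d for n draws from N_d(0, I_d)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(points[i], points[j])) / d
             for j in range(n)]
            for i in range(n)]


def max_deviation_from_identity(gram):
    """Largest entrywise gap between the matrix and the identity."""
    n = len(gram)
    return max(abs(gram[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    for d in (2, 20, 200, 5000):
        gap = max_deviation_from_identity(scaled_gram_matrix(n=4, d=d))
        print(f"d={d:5d}  max |Gram/d - I| = {gap:.3f}")
```

Since the Gram matrix is unchanged by rotations, a gap near zero says that all remaining randomness is in the rotation.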

HDLSS Asymptotics: Geometrical Representation

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:
$(n-1)$-dimensional hyperplane
Points are pairwise equidistant, dist $\sim \sqrt{2d}$
Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
Again "randomness in data" is only in rotation
Surprisingly rigid structure in random data

HDLSS Asymptotics: Geometrical Representation

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

Simulation View: shows "Rigidity after Rotation"
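The simulation steps above can be sketched in Python. This is only one possible reading of the procedure: the three points are mapped isometrically into the plane using their pairwise distances, with the first vertex at the origin, the second on the positive x-axis and the third in the upper half plane. That normalization, the scaling by $\sqrt{2d}$, and all function names are assumptions; the slides do not say exactly how the rotation "to make comparable" was chosen.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def planar_triangle(points, scale):
    """Isometric planar copy of a 3-point data set, divided by `scale`.

    Vertex A is placed at the origin, B on the positive x-axis,
    and C in the upper half plane.
    """
    a, b, c = points
    ab = distance(a, b) / scale
    ac = distance(a, c) / scale
    bc = distance(b, c) / scale
    x = (ab ** 2 + ac ** 2 - bc ** 2) / (2 * ab)  # law of cosines
    y = math.sqrt(max(ac ** 2 - x ** 2, 0.0))
    return [(0.0, 0.0), (ab, 0.0), (x, y)]


def simulate_triangles(d, repeats=10, seed=0):
    """Planar triangles for `repeats` independent 3-point Gaussian data sets."""
    rng = random.Random(seed)
    scale = math.sqrt(2 * d)
    triangles = []
    for _ in range(repeats):
        points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
        triangles.append(planar_triangle(points, scale))
    return triangles


if __name__ == "__main__":
    target = (0.5, math.sqrt(3) / 2)  # third vertex of a unit equilateral triangle
    for d in (2, 20, 200, 20000):
        spread = max(math.hypot(t[2][0] - target[0], t[2][1] - target[1])
                     for t in simulate_triangles(d))
        print(f"d={d:6d}  max distance of third vertex from target: {spread:.3f}")
```

Plotting the ten triangles per dimension in different colors would give a picture in the spirit of the slide; here the printed spread of the third vertex serves as a numeric stand-in for that picture.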


HDLSS Asymptotics

Modern Mathematical Statistics: Based on asymptotic analysis

Real Reasons:
Approximation provides insights
Can find simple underlying structure in complex situations

Thus various flavors are fine: $\lim_{d \to \infty}$, $\lim_{n \to \infty}$, $\lim_{d, n \to \infty}$, $\lim_{\ldots \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is...

Surprising (many times) [Think I've got it, and then ...]
Mathematically Beautiful (?)
Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$$

Where are Data? Near Peak of Density?

(Thanks to psycnet.apa.org)

Euclidean Distance to Origin (as $d \to \infty$) (measure how close to peak):

$$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$$

$$\|Z\| = \sqrt{d} + O_p(1)$$
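A quick numerical check of the distance-to-origin claim can be sketched as follows. This is only an illustrative simulation, not part of the original slides: the function name `norm_summary`, the number of draws, and the seed are assumptions chosen for the example. It draws standard normal vectors in several dimensions and compares the average length $\|Z\|$ with $\sqrt{d}$, while the spread of $\|Z\|$ stays of order 1.

```python
import numpy as np

def norm_summary(d, n_draws=100, seed=0):
    """Mean and standard deviation of ||Z|| for Z ~ N_d(0, I_d).

    n_draws and seed are illustrative choices, not values from the slides.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # each row is one draw of Z
    norms = np.linalg.norm(z, axis=1)       # Euclidean distance to origin
    return norms.mean(), norms.std()

if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        mean, sd = norm_summary(d)
        print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.3f}  "
              f"mean ||Z||={mean:8.3f}  sd ||Z||={sd:.3f}")
```

As $d$ grows, the mean length tracks $\sqrt{d}$ while the standard deviation does not grow, which is the $O_p(1)$ statement in numerical form.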

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w.r.t. $r$
Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
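The balance-point argument can be written out as a short worked derivation. The notation here is introduced only for this sketch and is not from the slides: $R = \|Z\|$ is the radius, $f_R$ its density, $c_d$ a normalizing constant, $S_{d-1}$ the surface area of the unit sphere, $B_d$ the unit ball, and $\varepsilon \in (0, 1)$ a fixed shell width.

```latex
% Density of Z w.r.t. Lebesgue measure: maximized at the origin.
\varphi_d(z) = (2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\|z\|^2\right)

% Density of the radius R = ||Z||: the volume element contributes r^{d-1}.
f_R(r) = c_d \, r^{d-1} e^{-r^2/2}, \qquad c_d = \frac{2^{1-d/2}}{\Gamma(d/2)}

% Balance between r^{d-1} (pushes mass out) and e^{-r^2/2} (pulls data in).
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0
\iff r = \sqrt{d-1} \approx \sqrt{d}

% Same effect for the unit ball: almost all volume sits near r = 1.
\mathrm{Vol}(B_d) = S_{d-1} \int_0^1 r^{d-1} \, dr = \frac{S_{d-1}}{d},
\qquad
\frac{\mathrm{Vol}\big((1-\varepsilon) B_d\big)}{\mathrm{Vol}(B_d)}
  = (1-\varepsilon)^d \to 0 \quad (d \to \infty)
```

So the origin has the highest density per unit volume, but there is so little volume near it that the radius concentrates near $\sqrt{d}$.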

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
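The pairwise-distance claim can be checked with a small simulation. This is a sketch under assumed settings: the function name `pairwise_distance_ratio`, the number of pairs, and the seed are hypothetical choices for illustration. It reports $\|Z_1 - Z_2\| / \sqrt{2d}$, which should sit close to 1 for large $d$.

```python
import numpy as np

def pairwise_distance_ratio(d, n_pairs=50, seed=0):
    """Return ||Z1 - Z2|| / sqrt(2 d) for independent Z1, Z2 ~ N_d(0, I_d).

    n_pairs and seed are illustrative choices, not values from the slides.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    dist = np.linalg.norm(z1 - z2, axis=1)  # one distance per pair
    return dist / np.sqrt(2 * d)

if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        ratio = pairwise_distance_ratio(d)
        print(f"d={d:6d}  min ratio={ratio.min():.3f}  max ratio={ratio.max():.3f}")
```

The range of the ratio narrows around 1 as $d$ increases, which is the sense in which the distance tends to a non-random constant.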

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

(Thanks to members.tripod.com)

$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
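The near-orthogonality of independent Gaussian vectors can be illustrated numerically. The sketch below is an assumption-laden example rather than anything from the slides: `angles_in_degrees`, the number of pairs, and the seed are hypothetical names and settings. It computes the angle between independent standard normal vectors, viewed as vectors from the origin.

```python
import numpy as np

def angles_in_degrees(d, n_pairs=50, seed=0):
    """Angles (in degrees) between independent Z1, Z2 ~ N_d(0, I_d).

    n_pairs and seed are illustrative choices, not values from the slides.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cosines = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        ang = angles_in_degrees(d)
        print(f"d={d:6d}  min angle={ang.min():7.2f}  max angle={ang.max():7.2f}")
```

For small $d$ the angles are spread widely; for large $d$ they cluster tightly around 90 degrees, with spread shrinking like $d^{-1/2}$.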

HDLSS Asy's: Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data
Hyperplane through 0, of dimension $n$
Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)
"Randomness" appears only in rotation of simplex
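One way to see the "near Unit Simplex Vertices, modulo rotation" statement is through the scaled Gram matrix $Z Z^T / d$, whose entries are unchanged by any rotation of the data. The sketch below is an illustration under stated assumptions, not the authors' computation: `scaled_gram`, the choice $n = 5$, and the seed are hypothetical. If the matrix is close to the identity, the rows of $Z / \sqrt{d}$ are nearly orthonormal, so some rotation carries them close to the standard basis vectors, which are the vertices of the unit simplex.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Gram matrix of n draws from N_d(0, I_d), divided by d.

    Diagonal entries are ||Z_i||^2 / d; off-diagonal entries are the
    scaled inner products <Z_i, Z_j> / d. The seed is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return z @ z.T / d

if __name__ == "__main__":
    n = 5
    for d in (20, 200, 20000):
        gram = scaled_gram(n, d)
        err = np.max(np.abs(gram - np.eye(n)))
        print(f"d={d:6d}  max |ZZ'/d - I| = {err:.4f}")
```

The maximum deviation from the identity shrinks as $d$ grows with $n$ fixed, so the only remaining randomness is in the orientation of the simplex.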

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data
$n - 1$ dimensional hyperplane
Points are pairwise equidistant, dist $\sim \sqrt{2d}$
Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
Again "randomness in data" is only in rotation
Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
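The simulation steps listed above can be sketched roughly as follows. This is a minimal reconstruction under assumptions, not the code behind the original figure: the names `triangle_in_plane` and `rigidity_spread`, the use of a QR factorization to rotate the data plane into 2-d coordinates, and the seed are all choices made for this example, and plotting (the different colors) is omitted. Each 3-point data set is mapped rigidly into the plane it spans, and the side lengths are compared with $\sqrt{2d}$.

```python
import numpy as np

def triangle_in_plane(points):
    """Rigidly map 3 points in R^d to 2-d coordinates in the plane they span.

    The first vertex goes to the origin. QR gives an orthonormal basis of
    the plane whose first axis lies along the first edge (up to sign),
    which makes repeated draws comparable.
    """
    base = points[0]
    edges = (points[1:] - base).T      # d x 2 matrix of edge vectors
    q, _ = np.linalg.qr(edges)         # d x 2 orthonormal basis of the plane
    return (points - base) @ q         # 3 x 2 coordinates, distances preserved

def rigidity_spread(d, repeats=10, seed=0):
    """Largest deviation of side length / sqrt(2 d) from 1 over the repeats."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(repeats):
        coords = triangle_in_plane(rng.standard_normal((3, d)))
        for i, j in ((0, 1), (0, 2), (1, 2)):
            side = np.linalg.norm(coords[i] - coords[j]) / np.sqrt(2 * d)
            worst = max(worst, abs(side - 1.0))
    return worst

if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        print(f"d={d:6d}  max |side / sqrt(2d) - 1| = {rigidity_spread(d):.4f}")
```

For small $d$ the triangles vary a great deal from draw to draw; for large $d$ every draw is close to the same equilateral triangle, which is the rigidity the slide refers to.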


HDLSS Asymptotics

Modern Mathematical Statistics: based on asymptotic analysis

Real reasons:
- Approximation provides insights
- Can find simple underlying structure in complex situations

Thus various flavors are fine: $\lim_{n \to \infty}$, $\lim_{d \to \infty}$, $\lim_{n,d \to \infty}$, $\lim_{\sigma \to 0}$

Even desirable (find additional insights)

HDLSS Asymptotics

Personal Observations: HDLSS world is...
- Surprising (many times) [Think I've got it, and then ...]
- Mathematically Beautiful (?)
- Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study ideas from: Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^T \sim N_d(0, I_d)$

Where are the data? Near the peak of the density?

(Image thanks to: psycnet.apa.org)

Euclidean distance to origin (as $d \to \infty$), a measure of how close to the peak:

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
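The distance-to-origin claim can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `norm_summary`, the seed, and the number of draws are illustrative choices, not part of the original slides.

```python
import numpy as np

def norm_summary(d, n_draws=200, seed=0):
    """Sample mean and standard deviation of ||Z|| for Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # each row is one draw of Z
    norms = np.linalg.norm(z, axis=1)       # Euclidean distance to the origin
    return norms.mean(), norms.std()

for d in (2, 20, 200, 20000):
    mean, sd = norm_summary(d)
    print(f"d={d:6d}  sqrt(d)={np.sqrt(d):8.3f}  mean||Z||={mean:8.3f}  sd||Z||={sd:.3f}")
```

If the approximation holds, the mean should track $\sqrt{d}$ while the standard deviation stays bounded as $d$ grows, which is the $O_p(1)$ term.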

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$:
- Data lie roughly on surface of sphere with radius $\sqrt{d}$
- Yet origin is point of highest density
- Paradox resolved by: density w.r.t. Lebesgue measure

HDLSS Asymptotics: Simple Paradoxes

Paradox resolved by: density w.r.t. Lebesgue measure
- Consider volume of unit sphere in $\mathbb{R}^d$
- Find as integral in sph'l coordinates
- Look at integrand w.r.t. the radius $r$
- Can show it puts ~ all weight near $r = 1$

Lebesgue measure pushes mass out, density pulls data in; $\sqrt{d}$ is the balance point.
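The balance between Lebesgue measure and density can be made explicit with a short calculation. This is a sketch in standard notation that does not appear on the slides: $B_d$ is the unit ball in $\mathbb{R}^d$, $S^{d-1}$ its surface, and $f_{\|Z\|}$ the density of the radius $\|Z\|$.

```latex
% Volume of the unit ball in spherical coordinates: the radial integrand is r^{d-1},
% so the fraction of volume inside radius r is r^d, which vanishes for any fixed r < 1.
\[
  \mathrm{Vol}(B_d) = \int_{S^{d-1}} \int_0^1 r^{d-1}\, dr\, d\omega ,
  \qquad
  \frac{\mathrm{Vol}(\{x : \|x\| \le r\})}{\mathrm{Vol}(B_d)} = r^{d}, \quad 0 \le r \le 1 .
\]
% For Z ~ N_d(0, I_d) the same factor r^{d-1} multiplies the Gaussian density,
% and the product is maximized near sqrt(d).
\[
  f_{\|Z\|}(r) \;\propto\; r^{d-1} e^{-r^2/2},
  \qquad
  \frac{d}{dr}\Bigl[(d-1)\log r - \tfrac{r^2}{2}\Bigr] = \frac{d-1}{r} - r = 0
  \;\Longrightarrow\; r = \sqrt{d-1} \approx \sqrt{d} .
\]
```

The factor $r^{d-1}$ is the "push out" from Lebesgue measure and $e^{-r^2/2}$ is the "pull in" from the density; their product peaks at $\sqrt{d-1}$, even though the density alone peaks at the origin.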

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why can't I have average children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean dist. between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant
• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
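The pairwise-distance claim, including the extension to $Z_1, \dots, Z_n$, can be illustrated with a small simulation. This is a sketch assuming NumPy; the helper name `pairwise_distance_ratio` and the choice $n = 5$ are hypothetical and not from the slides.

```python
import numpy as np

def pairwise_distance_ratio(d, n=5, seed=0):
    """All pairwise distances among n draws from N_d(0, I_d), divided by sqrt(2d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    diffs = z[:, None, :] - z[None, :, :]   # shape (n, n, d): every difference Z_i - Z_j
    dist = np.linalg.norm(diffs, axis=2)    # shape (n, n): Euclidean distances
    upper = np.triu_indices(n, k=1)         # keep each unordered pair once
    return dist[upper] / np.sqrt(2 * d)

for d in (2, 20, 200, 20000):
    ratios = pairwise_distance_ratio(d)
    print(f"d={d:6d}  min ratio={ratios.min():.3f}  max ratio={ratios.max():.3f}")
```

As $d$ grows, every ratio should approach 1, so all $n(n-1)/2$ distances are nearly equal to $\sqrt{2d}$.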

HDLSS Asymptotics: Simple Paradoxes

Ever wonder why (we can only perceive 3 dim'ns)?
o Perceptual system from ancestors
o They needed to find food
o Food exists in 3-d world

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al angles (as $d \to \infty$), with $Z_1$ and $Z_2$ as vectors from the origin:

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

(Image thanks to: members.tripod.com)

- Everything is orthogonal
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random
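The near-orthogonality of independent Gaussian vectors can also be seen in simulation. This is a minimal sketch assuming NumPy; `angle_degrees`, the seed, and the number of pairs are illustrative assumptions.

```python
import numpy as np

def angle_degrees(d, n_pairs=100, seed=0):
    """Angles (in degrees) between independent pairs Z1, Z2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against tiny floating-point excursions outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

for d in (2, 20, 200, 20000):
    a = angle_degrees(d)
    print(f"d={d:6d}  mean angle={a.mean():7.2f}  sd={a.std():6.2f}")
```

The spread of the angles around 90 degrees should shrink roughly like $d^{-1/2}$.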

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study subspace generated by data:
- Hyperplane through 0, of dimension $n$
- Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$
- Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
- All Gaussian data sets are "near Unit Simplex Vertices" (modulo rotation)
- "Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
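One way to see the "unit simplex modulo rotation" picture is through the scaled Gram matrix of the data. The sketch below assumes NumPy, and the name `scaled_gram` is hypothetical. If $Z Z^T / d$ is close to the identity $I_n$, then the rescaled points $Z_i / \sqrt{d}$ are nearly orthonormal, so some rotation carries them close to the coordinate vectors $e_1, \dots, e_n$, the vertices of the unit simplex.

```python
import numpy as np

def scaled_gram(d, n=4, seed=0):
    """Gram matrix Z Z^T / d for n independent draws from N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))   # rows are the data points Z_1, ..., Z_n
    return z @ z.T / d                # diagonal: ||Z_i||^2 / d, off-diagonal: <Z_i, Z_j> / d

for d in (2, 20, 200, 20000):
    g = scaled_gram(d)
    err = np.abs(g - np.eye(g.shape[0])).max()
    print(f"d={d:6d}  max |G - I| = {err:.3f}")
```

The Gram matrix is unchanged by rotations of the data, so its convergence to $I_n$ is exactly the statement that the only remaining randomness is in the rotation.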

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study hyperplane generated by data:
- $n - 1$ dimensional hyperplane
- Points are pairwise equidistant, dist $\sim \sqrt{2d}$
- Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
- Again "randomness in data" is only in rotation
- Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represent'n

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represent'n

Simulation View: shows "Rigidity after Rotation"
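A numerical stand-in for this simulation view is sketched below, assuming NumPy. Instead of plotting rotated triangles, it compares the sorted side lengths of each 3-point data set, which determine the triangle up to rotation and reflection. The function name `triangle_sides` is hypothetical; the dimensions and the 10 repetitions follow the list above.

```python
import numpy as np

def triangle_sides(d, n_rep=10, seed=0):
    """Sorted side lengths, scaled by sqrt(2d), of n_rep random 3-point data sets in R^d."""
    rng = np.random.default_rng(seed)
    sides = np.empty((n_rep, 3))
    for r in range(n_rep):
        z = rng.standard_normal((3, d))   # one 3-point data set
        lengths = [np.linalg.norm(z[i] - z[j]) for i, j in ((0, 1), (0, 2), (1, 2))]
        sides[r] = np.sort(lengths) / np.sqrt(2 * d)
    return sides

for d in (2, 20, 200, 20000):
    s = triangle_sides(d)
    print(f"d={d:6d}  shortest side={s.min():.3f}  longest side={s.max():.3f}")
```

In low dimensions the triangles vary widely in shape across repetitions; as $d$ grows, all scaled sides approach 1, so every repetition is close to the same equilateral triangle.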

Page 66: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asymptotics

Modern Mathematical Statistics Based on asymptotic analysis Real Reasons

Approximation provides insightsCan find simple underlying structureIn complex situations

Thus various flavors are fine

Even desirable (find additional insights)

0limlimlimlim

dndn

HDLSS Asymptotics

Personal Observations:

HDLSS world is…

Surprising (many times)

[Think I've got it, and then …]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

(Figure credit: psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n: $Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$):

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
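The distance-to-origin statement above can be checked with a small simulation. The sketch below is an illustration only, not part of the original slides: the function names, the number of draws and the seed are all assumptions. It draws standard normal vectors in several dimensions and summarises the deviation of the norm from the square root of d, which should stay bounded as d grows.

```python
import math
import random


def norm_of_standard_normal(d, rng):
    """Draw Z ~ N_d(0, I_d) and return its Euclidean norm ||Z||."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def radius_summary(d, n_draws=200, seed=0):
    """Return (mean, sd) of ||Z|| - sqrt(d) over n_draws simulated vectors.

    n_draws and seed are illustrative choices, not values from the slides.
    """
    rng = random.Random(seed)
    devs = [norm_of_standard_normal(d, rng) - math.sqrt(d) for _ in range(n_draws)]
    mean = sum(devs) / n_draws
    sd = math.sqrt(sum((x - mean) ** 2 for x in devs) / (n_draws - 1))
    return mean, sd


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = radius_summary(d)
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.2f}  "
              f"mean dev={mean:+.3f}  sd dev={sd:.3f}")
```

The radius itself grows like the square root of d, while the spread around it does not grow, which is what the O_p(1) remainder expresses.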

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. the radius $r$

Can Show: Puts ~ All Weight Near the surface, $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
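The balance between volume and density can be made concrete. In spherical coordinates the radius of a d-dimensional standard normal vector has density proportional to r^(d-1) exp(-r^2/2): the first factor is the Lebesgue (volume) term and the second is the Gaussian density term. The sketch below is an illustration under that standard fact, not a calculation from the slides; the grid search and the function names are assumptions. It finds the maximiser numerically so it can be compared with sqrt(d - 1), which is close to sqrt(d) for large d.

```python
import math


def log_radial_density(r, d):
    """Unnormalised log density of R = ||Z|| for Z ~ N_d(0, I_d).

    (d - 1) * log(r) is the volume factor from spherical coordinates;
    -r^2 / 2 is the Gaussian density factor.
    """
    return (d - 1) * math.log(r) - 0.5 * r * r


def radial_mode(d, grid_step=0.001):
    """Locate the maximiser of the radial density by a simple grid search.

    grid_step and the search range are illustrative choices.
    """
    r_max = 2.0 * math.sqrt(d) + 5.0
    n_points = int(r_max / grid_step)
    grid = [grid_step * (i + 1) for i in range(n_points)]
    return max(grid, key=lambda r: log_radial_density(r, d))


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        print(f"d={d:5d}  numerical mode={radial_mode(d):8.3f}  "
              f"sqrt(d-1)={math.sqrt(d - 1):8.3f}  sqrt(d)={math.sqrt(d):8.3f}")
```

Setting the derivative of the log density to zero gives (d - 1)/r = r, which is where the outward push of the volume term and the inward pull of the density term cancel.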

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant (relative to its $\sqrt{2d}$ scale)

• Factor $\sqrt{2}$, since for independent $X_1$, $X_2$: $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
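The pairwise-distance claim can be illustrated the same way. The sketch below is an assumption-laden illustration (hypothetical function names, arbitrary seed and sample size): it simulates independent pairs of standard normal vectors and summarises the distance divided by sqrt(2d). If the claim holds, the scaled distance should concentrate near 1 as d grows.

```python
import math
import random


def scaled_pair_distance(d, rng):
    """Return ||Z1 - Z2|| / sqrt(2d) for independent Z1, Z2 ~ N_d(0, I_d)."""
    sq = sum((rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)) ** 2 for _ in range(d))
    return math.sqrt(sq) / math.sqrt(2.0 * d)


def scaled_distance_summary(d, n_pairs=200, seed=0):
    """Return (mean, sd) of the scaled distance over n_pairs simulated pairs.

    n_pairs and seed are illustrative choices, not values from the slides.
    """
    rng = random.Random(seed)
    vals = [scaled_pair_distance(d, rng) for _ in range(n_pairs)]
    mean = sum(vals) / n_pairs
    sd = math.sqrt(sum((x - mean) ** 2 for x in vals) / (n_pairs - 1))
    return mean, sd


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = scaled_distance_summary(d)
        print(f"d={d:5d}  mean scaled dist={mean:.4f}  sd={sd:.4f}")
```

Each coordinate difference has standard deviation sqrt(2), which is where the factor sqrt(2) in front of sqrt(d) comes from.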

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

(Figure: vectors $Z_1$ and $Z_2$ from the origin; credit: members.tripod.com)

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
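A short simulation makes the near-orthogonality visible. The sketch below is illustrative only: the helper names, the number of pairs and the seed are assumptions. It computes the angle, in degrees, between independent standard normal vectors viewed as vectors from the origin, and reports how tightly the angles cluster around 90 degrees in each dimension.

```python
import math
import random


def angle_degrees(u, v):
    """Angle in degrees between vectors u and v, taken from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cosine))


def angle_summary(d, n_pairs=200, seed=0):
    """Return (mean, sd) of the angle between independent N_d(0, I_d) pairs.

    n_pairs and seed are illustrative choices, not values from the slides.
    """
    rng = random.Random(seed)
    angles = []
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    mean = sum(angles) / n_pairs
    sd = math.sqrt(sum((a - mean) ** 2 for a in angles) / (n_pairs - 1))
    return mean, sd


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = angle_summary(d)
        print(f"d={d:5d}  mean angle={mean:7.2f} deg  sd={sd:6.2f} deg")
```

The mean stays near 90 degrees in every dimension; what changes with d is the spread, which shrinks at the d^(-1/2) rate in the displayed formula.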

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
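One way to see the simplex picture numerically is through the matrix of inner products. The vertices of the unit simplex are the coordinate vectors, whose inner-product (Gram) matrix is the identity, and inner products are unchanged by rotation. So if the data are close to sqrt(d) times a rotated unit simplex, the Gram matrix divided by d should be close to the identity. The sketch below is an illustration under that reading; the function names, n, d and the seed are assumptions, not values from Hall, Marron and Neeman (2005).

```python
import random


def scaled_gram_matrix(n, d, seed=0):
    """Simulate Z_1..Z_n ~ N_d(0, I_d) and return the n x n matrix <Z_i, Z_j> / d."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(zi, zj)) / d for zj in data] for zi in data]


def max_deviation_from_identity(gram):
    """Largest absolute entrywise difference between gram and the identity matrix."""
    n = len(gram)
    return max(
        abs(gram[i][j] - (1.0 if i == j else 0.0))
        for i in range(n)
        for j in range(n)
    )


if __name__ == "__main__":
    for d in (2, 20, 200, 5000):
        gram = scaled_gram_matrix(n=4, d=d)
        print(f"d={d:5d}  max |G/d - I| = {max_deviation_from_identity(gram):.3f}")
```

Diagonal entries near 1 correspond to "nearly equidistant to 0" at distance about sqrt(d); off-diagonal entries near 0 correspond to near-orthogonality of the points as vectors from the origin.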

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represent'n

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represent'n

Simulation View: Shows "Rigidity after Rotation"
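The simulation steps listed above can be sketched in a few lines. This is a hedged reconstruction, not the code behind the slides: the function names are hypothetical, the rigid motion is carried out implicitly by rebuilding 2-d coordinates from the three pairwise distances (point 1 at the origin, point 2 on the positive x-axis, point 3 in the upper half plane), and the division by sqrt(2d) is an added assumption that puts different dimensions on one scale. Plotting with different colors is left out.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two points given as coordinate lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triangle_in_plane(points, scale=1.0):
    """Map 3 points in R^d to 2-d coordinates with the same pairwise distances.

    Point 0 goes to the origin, point 1 to the positive x-axis, and point 2
    to the upper half plane; all coordinates are then divided by scale.
    """
    a = distance(points[0], points[1])
    b = distance(points[0], points[2])
    c = distance(points[1], points[2])
    x = (a * a + b * b - c * c) / (2.0 * a)  # law of cosines
    y = math.sqrt(max(b * b - x * x, 0.0))
    return [(0.0, 0.0), (a / scale, 0.0), (x / scale, y / scale)]


def simulate_triangles(d, n_reps=10, seed=0):
    """Generate n_reps three-point Gaussian data sets in R^d, aligned in the plane."""
    rng = random.Random(seed)
    triangles = []
    for _ in range(n_reps):
        points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
        triangles.append(triangle_in_plane(points, scale=math.sqrt(2.0 * d)))
    return triangles


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        print(f"d = {d}")
        for tri in simulate_triangles(d, n_reps=3):
            print("   ", [(round(px, 3), round(py, 3)) for px, py in tri])
```

With this alignment an exactly equilateral triangle of unit side would sit at (0, 0), (1, 0) and (0.5, sqrt(3)/2), so the printed coordinates show directly how close each simulated data set is to the rigid limiting shape.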

Page 68: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

HDLSS Asymptotics

Personal Observations

HDLSS world ishellip

Surprising (many times)

[Think Irsquove got it and then hellip]

Mathematically Beautiful ()

Practically Relevant

HDLSS Asymptotics

HDLSS Asymptotics Simple Paradoxes

Study Ideas From

Hall Marron and Neeman (2005)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquond

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Where are Data

Near Peak of Density

Thanks to psycnetapaorg

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

(measure how close to peak)

d

d

dd

d

IN

Z

Z

Z 0~1

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

Euclidean Distance to Origin (as )

d

d

dd

d

IN

Z

Z

Z 0~1

)1(pOdZ

212

1

2 ~ dOdZZ pd

d

j j

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find as Integral in Sph'l Coordinates

Look at Integrand w.r.t. $r$

Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
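The balance between Lebesgue measure and density can be made explicit with a short worked calculation. This is a sketch using the standard chi distribution for the radius $R = \|Z\|$ with $Z \sim N_d(0, I_d)$; the slides do not write it out. The factor $r^{d-1}$ comes from the surface area of the sphere of radius $r$ (Lebesgue measure pushing mass out), and the factor $e^{-r^2/2}$ is the Gaussian density (pulling data in).

$$
f_R(r) = \frac{2^{1 - d/2}}{\Gamma(d/2)} \, r^{d-1} e^{-r^2/2}, \qquad r > 0
$$

$$
\frac{d}{dr} \log f_R(r) = \frac{d-1}{r} - r = 0 \iff r = \sqrt{d-1} \approx \sqrt{d}
$$

So the density of the radius peaks near $\sqrt{d}$, even though the density of $Z$ itself peaks at the origin.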

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since (for independent $X_1$, $X_2$) $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \dots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
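The factor $\sqrt{2}$ in the pairwise distance follows directly from the single-vector result. The short derivation below is a sketch that uses only the independence of $Z_1, Z_2 \sim N_d(0, I_d)$; the auxiliary vector $W$ is introduced here for illustration.

$$
Z_1 - Z_2 \sim N_d(0, 2 I_d) \quad\Longrightarrow\quad Z_1 - Z_2 \stackrel{\mathcal{D}}{=} \sqrt{2}\, W, \qquad W \sim N_d(0, I_d)
$$

$$
\|Z_1 - Z_2\| \stackrel{\mathcal{D}}{=} \sqrt{2}\, \|W\| = \sqrt{2} \left( \sqrt{d} + O_p(1) \right) = \sqrt{2d} + O_p(1)
$$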

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

(we can only perceive 3 dim'ns)

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

(Thanks to members.tripod.com)

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
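The near-orthogonality claim, $\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$, can also be checked by simulation. The following is a minimal sketch, not taken from the slides; the helper names, number of pairs, and seed are assumptions chosen for illustration.

```python
import math
import random


def angle_degrees(u, v):
    """Angle in degrees between vectors u and v, viewed as vectors from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside the acos domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def simulate_angles(d, n_pairs=100, seed=0):
    """Angles between n_pairs independent pairs of N_d(0, I_d) vectors."""
    rng = random.Random(seed)
    angles = []
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(u, v))
    return angles


if __name__ == "__main__":
    # The range of observed angles should tighten around 90 degrees as d grows.
    for d in (2, 20, 200, 2000):
        angles = simulate_angles(d)
        print(f"d={d:5d}  min={min(angles):7.2f}  max={max(angles):7.2f}")
```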

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d}$ × "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data
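The "regular $n$-hedron" picture says that all pairwise distances among $Z_1, \dots, Z_n \sim N_d(0, I_d)$ are close to $\sqrt{2d}$. A minimal sketch of a check follows; the function name and the choice of $n$, $d$, and seed are illustrative assumptions, not from the slides. It reports each pairwise distance divided by $\sqrt{2d}$, so values near 1 indicate the rigid structure.

```python
import itertools
import math
import random


def scaled_pairwise_distances(n, d, seed=0):
    """Pairwise distances among n draws from N_d(0, I_d), divided by sqrt(2d)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(2 * d)
    return [
        math.dist(points[i], points[j]) / scale
        for i, j in itertools.combinations(range(n), 2)
    ]


if __name__ == "__main__":
    # Ratios scatter widely in low dimension and settle near 1 in high dimension.
    for d in (2, 20, 200, 20000):
        ratios = scaled_pairwise_distances(4, d)
        print(f"d={d:6d}  " + "  ".join(f"{r:.3f}" for r in ratios))
```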

HDLSS Asy's: Geometrical Represent'n

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represent'n

Simulation View: Shows "Rigidity after Rotation"
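The simulation steps above can be sketched in code. This is one possible way to carry out the comparison, not necessarily the procedure used for the original figure: the 3 points are moved by a rigid motion (translation, rotation, and possibly a reflection) so that the first point is at the origin, the second lies on the positive x-axis, and the third lies in the upper half plane. Dividing by $\sqrt{2d}$ is an added assumption that makes different dimensions comparable; under it, the limiting shape is an equilateral triangle with side length 1.

```python
import math
import random


def triangle_in_plane(d, seed=0):
    """2-d coordinates of the triangle formed by 3 draws from N_d(0, I_d).

    The first point is placed at the origin and the second on the positive
    x-axis; all coordinates are divided by sqrt(2d).
    """
    rng = random.Random(seed)
    p0, p1, p2 = ([rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3))
    a = [x - y for x, y in zip(p1, p0)]  # edge from p0 to p1
    b = [x - y for x, y in zip(p2, p0)]  # edge from p0 to p2
    len_a = math.sqrt(sum(x * x for x in a))
    # Component of b along a, then the length of the part orthogonal to a.
    b_x = sum(x * y for x, y in zip(a, b)) / len_a
    b_y = math.sqrt(max(0.0, sum(y * y for y in b) - b_x ** 2))
    scale = math.sqrt(2 * d)
    return [(0.0, 0.0), (len_a / scale, 0.0), (b_x / scale, b_y / scale)]


if __name__ == "__main__":
    # Repeat with different seeds (the "different colors" in the slides).
    for d in (2, 20, 200, 20000):
        for seed in range(3):
            _, (x1, _), (x2, y2) = triangle_in_plane(d, seed)
            print(f"d={d:6d}  p1=({x1:.3f}, 0.000)  p2=({x2:.3f}, {y2:.3f})")
```

For large $d$ the printed vertices should approach $(1, 0)$ and $(1/2, \sqrt{3}/2)$ regardless of the seed, which is the "rigidity after rotation" described above.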


HDLSS Asymptotics

Personal Observations:

HDLSS world is...

Surprising (many times)

[Think I've got it, and then ...]

Mathematically Beautiful (?)

Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \dots, Z_d)^t \sim N_d(0, I_d)$




dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo


HDLSS Asymptotics: Simple Paradoxes

Study Ideas From:

Hall, Marron and Neeman (2005)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are the Data?

Near Peak of Density?

(Thanks to psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n, $Z \sim N_d(0, I_d)$:

Euclidean Distance to Origin (as $d \to \infty$)

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 = d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
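The concentration of $\|Z\|$ around $\sqrt{d}$ can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names, the seed and the number of repetitions are illustrative assumptions, and it uses only the Python standard library.

```python
import math
import random

def gaussian_norm(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))

def norm_deviations(d, reps=200, seed=0):
    """Deviations ||Z|| - sqrt(d) over `reps` independent draws (hypothetical helper)."""
    rng = random.Random(seed)
    root_d = math.sqrt(d)
    return [gaussian_norm(d, rng) - root_d for _ in range(reps)]

if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        devs = norm_deviations(d)
        rms = math.sqrt(sum(x * x for x in devs) / len(devs))
        # sqrt(d) grows with d, while the RMS deviation should stay bounded.
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.2f}  RMS deviation={rms:.3f}")
```

If the $O_p(1)$ claim holds, the printed RMS deviation should stay of order one while $\sqrt{d}$ keeps growing.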

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. radius $r$

Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
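The claim about the volume of the unit sphere can be illustrated without any integration. A minimal sketch, assuming only that the volume of a radius-$r$ ball in $\mathbb{R}^d$ scales like $r^d$; the function name and the shell width `eps` are illustrative choices, not taken from the slides.

```python
def outer_shell_fraction(d, eps):
    """Fraction of the unit ball's volume within distance eps of its surface.

    Volume scales like r**d, so the inner ball of radius (1 - eps)
    holds a fraction (1 - eps)**d of the total volume.
    """
    return 1.0 - (1.0 - eps) ** d

if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        # Share of the volume lying in the outermost 1% of the radius.
        print(d, round(outer_shell_fraction(d, 0.01), 4))
```

As $d$ grows, the outermost thin shell takes up nearly all of the volume, which is the sense in which Lebesgue measure "pushes mass out".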

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2$, both $\sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$ (for independent $X_1$, $X_2$)

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
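The pairwise-distance statement can be checked the same way. A minimal sketch under the slide's assumptions (independent standard normal vectors); the helper names, seed and repetition count are hypothetical.

```python
import math
import random

def draw(d, rng):
    """One draw from N_d(0, I_d) as a plain list."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def pairwise_distance(d, rng):
    """Euclidean distance between two independent N_d(0, I_d) draws."""
    z1, z2 = draw(d, rng), draw(d, rng)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))

def scaled_distances(d, reps=100, seed=1):
    """Ratios ||Z1 - Z2|| / sqrt(2d) for `reps` independent pairs."""
    rng = random.Random(seed)
    return [pairwise_distance(d, rng) / math.sqrt(2 * d) for _ in range(reps)]

if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        ratios = scaled_distances(d)
        print(f"d={d:5d}  min ratio={min(ratios):.3f}  max ratio={max(ratios):.3f}")
```

The printed range of ratios should tighten around 1 as $d$ grows, in line with the distance tending to the non-random constant $\sqrt{2d}$.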

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors

o They Needed to Find Food

o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2$, both $\sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

(Thanks to members.tripod.com)

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
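The near-orthogonality of independent Gaussian vectors can also be seen in a quick simulation. This is a sketch only: `angle_degrees` and `random_angles` are illustrative names, and the seed and repetition count are arbitrary.

```python
import math
import random

def angle_degrees(u, v):
    """Angle between vectors u and v (viewed from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))

def random_angles(d, reps=100, seed=4):
    """Angles between `reps` independent pairs of N_d(0, I_d) draws."""
    rng = random.Random(seed)
    angles = []
    for _ in range(reps):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    return angles

if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        angles = random_angles(d)
        print(f"d={d:5d}  min={min(angles):6.1f}  max={max(angles):6.1f}")
```

For small $d$ the angles spread over a wide range; for large $d$ they should cluster near 90 degrees, with a spread shrinking roughly like $d^{-1/2}$.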

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data:

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d}$ × Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d}$ × "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets

• In dimensions d = 2, 20, 200, 20000

• Generate hyperplane of dimension 2

• Rotate that to plane of screen

• Rotate within plane, to make "comparable"

• Repeat 10 times, use different colors
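The simulation steps listed above can be sketched as follows. This is not the code behind the original figures. It assumes that "rotate to plane of screen" and "rotate within plane" can be replaced by an equivalent rigid placement: first point at the origin, second on the positive x-axis, third in the upper half plane. The rescaling by $\sqrt{2d}$ is an added assumption so that different dimensions are comparable, and plotting is omitted.

```python
import math
import random

def distance(u, v):
    """Euclidean distance between two points given as lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triangle_on_screen(points):
    """Rigidly map 3 points in R^d to 2-d coordinates with the same side lengths."""
    a, b, c = points
    ab, ac, bc = distance(a, b), distance(a, c), distance(b, c)
    # Law of cosines gives the x-coordinate of the third vertex.
    cx = (ab ** 2 + ac ** 2 - bc ** 2) / (2.0 * ab)
    cy = math.sqrt(max(ac ** 2 - cx ** 2, 0.0))
    return [(0.0, 0.0), (ab, 0.0), (cx, cy)]

def simulate(d, reps=10, seed=2):
    """`reps` Gaussian 3-point data sets in R^d, placed in 2-d, scaled by sqrt(2d)."""
    rng = random.Random(seed)
    scale = math.sqrt(2 * d)
    triangles = []
    for _ in range(reps):
        pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
        tri = triangle_on_screen(pts)
        triangles.append([(x / scale, y / scale) for x, y in tri])
    return triangles

if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        third = [tri[2] for tri in simulate(d)]
        xs = [round(x, 2) for x, _ in third]
        print(f"d={d:6d}  third-vertex x-coordinates: {xs}")
```

With this placement, a rigid (equilateral) structure shows up as the third vertex settling near $(0.5, \sqrt{3}/2)$ after scaling, while for small $d$ the triangles vary widely from one repetition to the next.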

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"




HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Where are Data?

Near Peak of Density?

(Thanks to psycnet.apa.org)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^T \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$):

(measure how close to peak)

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
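The distance-to-origin claim can be checked by simulation. The following is a minimal sketch using only the Python standard library; the function names, seeds, and replication counts are illustrative assumptions, not part of the slides. It draws vectors from $N_d(0, I_d)$ and reports how far $\|Z\|$ strays from $\sqrt{d}$.

```python
import math
import random


def norm_of_standard_normal(d, rng):
    """Euclidean norm ||Z|| of one draw Z ~ N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def norm_deviations(d, reps=100, seed=0):
    """Return ||Z|| - sqrt(d) for `reps` independent draws."""
    rng = random.Random(seed)
    root_d = math.sqrt(d)
    return [norm_of_standard_normal(d, rng) - root_d for _ in range(reps)]


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        devs = norm_deviations(d)
        worst = max(abs(x) for x in devs)
        # The radius sqrt(d) grows with d, while the deviation stays of order 1.
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):6.2f}  max|dev|={worst:.2f}")
```

If the sketch behaves as the slide predicts, the printed deviation should stay bounded while $\sqrt{d}$ grows, so the relative spread of the data around the sphere of radius $\sqrt{d}$ shrinks.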

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere with radius $\sqrt{d}$

- Yet origin is point of highest density?

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. the radius

Can Show: Puts ~ All Weight Near the surface

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
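The volume argument can be made concrete with a short calculation. This is a sketch under a stated assumption: the integration variable elided on the slide is taken to be the radial coordinate $r$. In spherical coordinates the radial integrand is proportional to $r^{d-1}$, so the fraction of the unit ball's volume inside radius $r$ is $r^d$. The function names are illustrative and do not come from the slides.

```python
def inner_volume_fraction(d, r):
    """Fraction of the unit ball's volume in R^d lying within radius r (0 <= r <= 1)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    return r ** d


def outer_shell_fraction(d, eps):
    """Fraction of the volume in the thin shell 1 - eps < radius <= 1."""
    return 1.0 - inner_volume_fraction(d, 1.0 - eps)


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        share = outer_shell_fraction(d, 0.05)
        # As d grows, nearly all of the volume sits in the outer 5% shell.
        print(f"d={d:5d}  outer 5% shell holds {share:.6f} of the volume")
```

This is the sense in which Lebesgue measure "pushes mass out": even a uniform density on the ball places almost all of its mass next to the surface when $d$ is large.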

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
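The pairwise-distance claim can be illustrated the same way. The sketch below is an assumption-laden illustration (hypothetical function names, arbitrary seeds and replication counts): it draws independent pairs from $N_d(0, I_d)$ and reports the ratio $\|Z_1 - Z_2\| / \sqrt{2d}$, which should concentrate near 1 as $d$ grows.

```python
import math
import random


def gaussian_vector(d, rng):
    """One draw from N_d(0, I_d) as a plain list."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def scaled_pair_distances(d, reps=50, seed=0):
    """Return ||Z1 - Z2|| / sqrt(2 d) for `reps` independent pairs."""
    rng = random.Random(seed)
    scale = math.sqrt(2.0 * d)
    return [
        euclidean_distance(gaussian_vector(d, rng), gaussian_vector(d, rng)) / scale
        for _ in range(reps)
    ]


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        ratios = scaled_pair_distances(d)
        # The range of the ratio narrows around 1 as d increases.
        print(f"d={d:5d}  min={min(ratios):.3f}  max={max(ratios):.3f}")
```

The factor $\sqrt{2}$ enters because each coordinate of $Z_1 - Z_2$ has variance 2 when the two vectors are independent.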

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

(we can only perceive 3 dim'ns)

o Perceptual System from Ancestors

o They Needed to Find Food

o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), with $Z_1$ and $Z_2$ As Vectors From Origin

(Thanks to members.tripod.com)

$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal?

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
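The near-orthogonality of independent Gaussian vectors is also easy to see numerically. A minimal sketch follows, with hypothetical function names and arbitrary seeds; it measures the angle at the origin between independent draws from $N_d(0, I_d)$.

```python
import math
import random


def angle_degrees(u, v):
    """Angle at the origin between vectors u and v, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def sample_angles(d, reps=50, seed=0):
    """Angles between `reps` independent pairs of N_d(0, I_d) vectors."""
    rng = random.Random(seed)
    angles = []
    for _ in range(reps):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    return angles


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        angles = sample_angles(d)
        # The spread around 90 degrees shrinks roughly like d ** -0.5.
        print(f"d={d:5d}  min={min(angles):6.1f}  max={max(angles):6.1f}")
```

In low dimensions the angles cover most of the range from 0 to 180 degrees; in high dimensions they should cluster tightly around 90 degrees.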

HDLSS Asy's: Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex
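One way to see the simplex structure without choosing a rotation is through the Gram matrix of the data, which is rotation invariant. The sketch below is an illustration under assumptions (hypothetical function names, arbitrary $n$, $d$, and seed): it computes $G_{ij} = Z_i^T Z_j / d$ for $n$ draws from $N_d(0, I_d)$. If $G$ is close to the identity, the points have squared length near $d$ and are nearly orthogonal, which is exactly the Gram matrix of $\sqrt{d}$ times the unit simplex vertices $e_1, \ldots, e_n$.

```python
import random


def scaled_gram_matrix(n, d, seed=0):
    """Gram matrix G[i][j] = <Z_i, Z_j> / d for n draws from N_d(0, I_d)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [
        [sum(a * b for a, b in zip(zi, zj)) / d for zj in points]
        for zi in points
    ]


def max_deviation_from_identity(gram):
    """Largest absolute entry of (gram - identity)."""
    n = len(gram)
    return max(
        abs(gram[i][j] - (1.0 if i == j else 0.0))
        for i in range(n)
        for j in range(n)
    )


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        dev = max_deviation_from_identity(scaled_gram_matrix(4, d))
        # The deviation from the simplex Gram matrix shrinks as d grows.
        print(f"d={d:5d}  max |G - I| = {dev:.3f}")
```

Because the Gram matrix determines a point configuration up to rotation (and reflection), a Gram matrix near the identity means the only remaining randomness is in how the simplex is oriented.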

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data?

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets

• In dimensions d = 2, 20, 200, 20000

• Generate hyperplane of dimension 2

• Rotate that to plane of screen

• Rotate within plane, to make "comparable"

• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
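The simulation steps listed above can be sketched as follows. This is not the code behind the slides; it is a minimal illustration with assumed conventions. In particular, "comparable" is interpreted here as: place the first point at the origin of the screen, the second on the positive horizontal axis, and the third in the upper half plane, then divide by $\sqrt{2d}$. The function name and seed are hypothetical.

```python
import math
import random


def triangle_in_plane(d, rng):
    """Draw 3 points from N_d(0, I_d) and return their 2-d screen coordinates.

    The plane through the 3 points is given an orthonormal basis by
    Gram-Schmidt, and all lengths are divided by sqrt(2 d).
    """
    x1, x2, x3 = ([rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3))
    u = [b - a for a, b in zip(x1, x2)]  # edge from point 1 to point 2
    v = [c - a for a, c in zip(x1, x3)]  # edge from point 1 to point 3
    len_u = math.sqrt(sum(t * t for t in u))
    # Coordinate of v along the unit vector u / |u|.
    along = sum(s * t for s, t in zip(u, v)) / len_u
    # Length of the part of v orthogonal to u (non-negative by construction).
    across = math.sqrt(max(sum(t * t for t in v) - along * along, 0.0))
    scale = math.sqrt(2.0 * d)
    return [(0.0, 0.0), (len_u / scale, 0.0), (along / scale, across / scale)]


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        print(f"d = {d}")
        for _ in range(10):  # 10 repetitions, as on the slide
            tri = triangle_in_plane(d, rng)
            print("   ", [(round(x, 2), round(y, 2)) for x, y in tri])
```

Under these conventions a perfectly rigid configuration would be the equilateral triangle with vertices $(0, 0)$, $(1, 0)$, and $(1/2, \sqrt{3}/2)$. The repetitions should scatter widely for small $d$ and settle close to that triangle for large $d$.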




HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z = (Z_1, \ldots, Z_d)^t \sim N_d(0, I_d)$

Euclidean Distance to Origin (as $d \to \infty$):

$\|Z\|^2 = \sum_{j=1}^{d} Z_j^2 \sim d + O_p(d^{1/2})$

$\|Z\| = \sqrt{d} + O_p(1)$
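The distance-to-origin statement can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names, sample size, seed and the chosen dimensions are illustrative assumptions. It draws standard normal vectors and compares the mean of $\|Z\|$ with $\sqrt{d}$, while the spread stays of order 1.

```python
import math
import random


def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def summarize_norms(d, n_draws=200, seed=0):
    """Return (mean, standard deviation) of ||Z|| over n_draws samples.

    n_draws and seed are illustrative choices, not values from the slides.
    """
    rng = random.Random(seed)
    norms = [norm_of_standard_normal(d, rng) for _ in range(n_draws)]
    mean = sum(norms) / n_draws
    var = sum((x - mean) ** 2 for x in norms) / (n_draws - 1)
    return mean, math.sqrt(var)


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = summarize_norms(d)
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.3f}  "
              f"mean ||Z||={mean:7.3f}  sd={sd:.3f}")
```

As $d$ grows, the mean should track $\sqrt{d}$ while the standard deviation does not grow with it, which is the "surface of a sphere" picture used on the following slides.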

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density???

- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral, In Sph'l Coordinates

Look At Integrand w.r.t. radius $r$

Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
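The balance between Lebesgue measure and density can be made explicit with the radial density of $R = \|Z\|$. The worked equation below is a sketch added for illustration and is not taken from the slides; it uses the standard chi distribution with $d$ degrees of freedom.

```latex
% Radial density of R = \|Z\| for Z \sim N_d(0, I_d)
f_R(r) = \frac{1}{2^{d/2-1}\,\Gamma(d/2)}\; r^{d-1}\, e^{-r^2/2}, \qquad r > 0
% r^{d-1}    : surface-area factor (Lebesgue measure pushes mass out)
% e^{-r^2/2} : Gaussian density (pulls data in)
\frac{d}{dr}\log f_R(r) = \frac{d-1}{r} - r = 0
\quad\Longrightarrow\quad r = \sqrt{d-1} \approx \sqrt{d}
```

The mode of the radial density sits at $\sqrt{d-1}$, so the two competing factors balance at a radius of roughly $\sqrt{d}$ even though the density of $Z$ itself is highest at the origin.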

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $sd(X_1 - X_2) = \sqrt{sd(X_1)^2 + sd(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go???

(we can only perceive 3 dim'ns)
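The pairwise-distance claim, and its extension to $n$ points, can be illustrated with a small simulation. This is a hedged sketch: the helper names, the seed and the choice of $n$ and $d$ are assumptions made for the example, not values from the slides.

```python
import math
import random


def gaussian_vector(d, rng):
    """One draw from N_d(0, I_d) as a plain list."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distance_ratios(n, d, seed=0):
    """Ratios ||Z_i - Z_j|| / sqrt(2d) over all pairs i < j.

    Values close to 1 indicate nearly equidistant points.
    """
    rng = random.Random(seed)
    points = [gaussian_vector(d, rng) for _ in range(n)]
    scale = math.sqrt(2 * d)
    return [distance(points[i], points[j]) / scale
            for i in range(n) for j in range(i + 1, n)]


if __name__ == "__main__":
    for d in (2, 20, 200, 5000):
        ratios = pairwise_distance_ratios(n=5, d=d)
        print(f"d={d:5d}  min ratio={min(ratios):.3f}  max ratio={max(ratios):.3f}")
```

For small $d$ the ratios vary widely; for large $d$ all of them should crowd around 1, meaning every pair of points is separated by about $\sqrt{2d}$.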

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

[Figure: vectors $Z_1$ and $Z_2$ from the origin. Thanks to: members.tripod.com]

$Angle(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal???

- Where do they all go???

(again our perceptual limitations)

- Again 1st order structure is non-random
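The near-orthogonality of independent Gaussian vectors can also be seen numerically. The sketch below is an illustration under assumed names and parameters (number of pairs, seed, dimensions); it measures the angle between independent draws from $N_d(0, I_d)$ and reports how far the angles stray from 90 degrees.

```python
import math
import random


def angle_degrees(u, v):
    """Angle in degrees between two nonzero vectors, viewed from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def sample_angles(d, n_pairs=100, seed=0):
    """Angles between n_pairs independent pairs of N_d(0, I_d) vectors."""
    rng = random.Random(seed)
    angles = []
    for _ in range(n_pairs):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    return angles


if __name__ == "__main__":
    for d in (2, 20, 200, 5000):
        angles = sample_angles(d, n_pairs=50)
        worst = max(abs(a - 90.0) for a in angles)
        print(f"d={d:5d}  largest deviation from 90 degrees: {worst:6.2f}")
```

The largest deviation should shrink roughly like $d^{-1/2}$, consistent with the $O_p(d^{-1/2})$ term on the slide.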

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices"!!! (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data!

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
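The simulation steps listed above can be sketched as follows. This is an illustrative reconstruction, not the code behind the original figure: the function names, the seed, and the convention of scaling by $\sqrt{2d}$ and placing the first point at the origin with the second on the positive x-axis are all assumptions. Because a triangle's shape is fixed by its side lengths, laying it out from the three pairwise distances plays the role of rotating the data's plane to the screen and then rotating within the plane.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triangle_in_plane(d, rng):
    """Draw 3 points from N_d(0, I_d) and return 2-d coordinates of the triangle.

    Side lengths are divided by sqrt(2d); vertex A goes to the origin,
    vertex B to the positive x-axis, vertex C to the upper half plane.
    """
    a, b, c = ([rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3))
    scale = math.sqrt(2 * d)
    ab = distance(a, b) / scale
    ac = distance(a, c) / scale
    bc = distance(b, c) / scale
    # Law of cosines gives the x-coordinate of C; y follows from |AC|.
    x = (ab ** 2 + ac ** 2 - bc ** 2) / (2 * ab)
    y = math.sqrt(max(ac ** 2 - x ** 2, 0.0))
    return [(0.0, 0.0), (ab, 0.0), (x, y)]


def simulate(d, repeats=10, seed=0):
    """Repeat the 3-point experiment `repeats` times in dimension d."""
    rng = random.Random(seed)
    return [triangle_in_plane(d, rng) for _ in range(repeats)]


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        print(f"d = {d}")
        for tri in simulate(d, repeats=3):
            print("   ", [(round(px, 3), round(py, 3)) for px, py in tri])
```

For large $d$ the printed triangles should all be close to the equilateral triangle with vertices $(0, 0)$, $(1, 0)$ and $(1/2, \sqrt{3}/2)$, while for $d = 2$ they differ markedly from one repeat to the next.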




HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

As

-Data lie roughly on surface of sphere

with radius

- Yet origin is point of highest density

- Paradox resolved by

density w r t Lebesgue Measure

d

)1(pOdZ

d

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt Can Show Puts ~ All Weight Near

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

Theorem Impossible (over many factors)

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

d dd INZ 0~21Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

(as )

Distance tends to non-random constant

d

d

dd INZ 0~2

)1(221 pOdZZ

1Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

)1(221 pOdZZ

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

As Vectors From Origin

Thanks to memberstripodcom

d

d

dd INZ 0~21Z

1198851

119885 2

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asy's: Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\approx \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex
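One way to see the "near Unit Simplex Vertices" statement numerically is through the scaled Gram matrix $G_{ij} = Z_i^\top Z_j / d$: if the points are nearly equidistant to 0 at distance $\sqrt{d}$ and nearly orthogonal, then $G$ is close to the identity, which is the Gram matrix of the unit simplex vertices $e_1, \ldots, e_n$. The sketch below is an illustration under that reading; the function names and the choice $n = 4$ are assumptions, not from the slides.

```python
import random


def scaled_gram_matrix(n, d, seed=0):
    """Gram matrix of n standard normal points in R^d, divided by d."""
    rng = random.Random(seed)
    z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    return [[sum(a * b for a, b in zip(zi, zj)) / d for zj in z] for zi in z]


def max_deviation_from_identity(gram):
    """Largest absolute entrywise difference between `gram` and the identity."""
    n = len(gram)
    return max(
        abs(gram[i][j] - (1.0 if i == j else 0.0))
        for i in range(n)
        for j in range(n)
    )


for d in (2, 20, 200, 20000):
    deviation = max_deviation_from_identity(scaled_gram_matrix(4, d))
    print(f"d={d:6d}  max |G - I| = {deviation:.4f}")
```

Since any two point sets with the same Gram matrix differ only by a rotation (or reflection), a Gram matrix near the identity means the data are near $\sqrt{d}$ times the unit simplex vertices, modulo rotation.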

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$(n-1)$-dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets

• In dimensions d = 2, 20, 200, 20000

• Generate hyperplane of dimension 2

• Rotate that to plane of screen

• Rotate within plane, to make "comparable"

• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
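The simulation steps listed above can be sketched as follows. This is a hedged reconstruction rather than the code behind the slides: the Gram-Schmidt construction of the plane, the rescaling by $\sqrt{2d}$, the convention of placing the first vertex on the horizontal axis, and all function names are assumptions made for illustration. Plotting is left out; the sketch reports how far the triangle's side lengths are from 1.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def triangle_on_screen(d, rng):
    """Draw 3 standard normal points in R^d and return their 2-d coordinates
    within the plane they generate, rescaled by sqrt(2d)."""
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    centroid = [sum(coords) / 3.0 for coords in zip(*points)]
    centered = [[a - c for a, c in zip(p, centroid)] for p in points]
    # Gram-Schmidt: orthonormal basis (e1, e2) of the plane through the points.
    norm1 = math.sqrt(dot(centered[0], centered[0]))
    e1 = [a / norm1 for a in centered[0]]
    along = dot(centered[1], e1)
    residual = [a - along * b for a, b in zip(centered[1], e1)]
    norm2 = math.sqrt(dot(residual, residual))
    e2 = [a / norm2 for a in residual]
    # Using e1 along the first vertex fixes the in-plane rotation ("comparable").
    scale = math.sqrt(2.0 * d)
    return [(dot(p, e1) / scale, dot(p, e2) / scale) for p in centered]


def side_lengths(triangle):
    pairs = ((0, 1), (0, 2), (1, 2))
    return [
        math.hypot(triangle[i][0] - triangle[j][0], triangle[i][1] - triangle[j][1])
        for i, j in pairs
    ]


rng = random.Random(0)
for d in (2, 20, 200, 20000):
    worst = 0.0
    for _ in range(10):  # 10 repetitions, as on the slide
        sides = side_lengths(triangle_on_screen(d, rng))
        worst = max(worst, max(abs(s - 1.0) for s in sides))
    print(f"d={d:6d}  largest |side - 1| over 10 triangles = {worst:.4f}")
```

For small d the triangles vary widely in shape; for large d all ten should be close to the same equilateral triangle with unit sides, which is the "rigidity after rotation".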


HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

- Data lie roughly on surface of sphere, with radius $\sqrt{d}$

- Yet origin is point of highest density

- Paradox resolved by: density w.r.t. Lebesgue Measure
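A quick simulation makes the first bullet concrete. This is an illustrative sketch with assumed function names and sample sizes: it draws standard normal vectors and reports the smallest and largest norm seen, next to $\sqrt{d}$.

```python
import math
import random


def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def norm_range(d, reps=100, seed=0):
    """Smallest and largest norm among `reps` independent draws."""
    rng = random.Random(seed)
    norms = [norm_of_standard_normal(d, rng) for _ in range(reps)]
    return min(norms), max(norms)


for d in (1, 10, 100, 1000):
    smallest, largest = norm_range(d)
    print(
        f"d={d:5d}  sqrt(d)={math.sqrt(d):7.3f}  "
        f"min norm={smallest:7.3f}  max norm={largest:7.3f}"
    )
```

For d = 1 some draws land near the origin, but for large d even the smallest norm stays within a few units of $\sqrt{d}$, although the density is still highest at 0.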

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. $r$

Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
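The balance between Lebesgue measure and density can be written out explicitly. The derivation below is a sketch under the standard normal assumption; the symbols $R$, $f_R$, $S_{d-1}$, $V_d$ and $c_d$ are notation introduced here, not taken from the slides.

```latex
\begin{align*}
V_d &= S_{d-1} \int_0^1 r^{d-1}\, dr
  && \text{volume of the unit ball; } S_{d-1} = \text{surface area of the unit sphere}, \\
\frac{\operatorname{Vol}\{\|x\| \le 1 - \varepsilon\}}{V_d} &= (1 - \varepsilon)^d \to 0
  && \text{so the integrand } r^{d-1} \text{ puts almost all weight near } r = 1, \\
f_R(r) &= c_d\, r^{d-1} e^{-r^2/2}
  && \text{density of } R = \|Z\|,\ Z \sim N_d(0, I_d), \\
\frac{d}{dr} \log f_R(r) &= \frac{d-1}{r} - r = 0
  \iff r = \sqrt{d-1} \approx \sqrt{d}.
\end{align*}
```

The factor $r^{d-1}$ (from Lebesgue measure) grows with $r$, the factor $e^{-r^2/2}$ (from the Gaussian density) decays, and their product peaks at $r = \sqrt{d-1}$.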

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)
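One hedged way to put numbers on the "Average People" remark: suppose a person is described by d independent standardized traits, and call them "average" if every trait lies within c standard deviations of the mean. Both this definition and the function name below are assumptions made for illustration; the slides do not specify them.

```python
import math


def prob_all_within(c, d):
    """P(|Z_j| < c for all j = 1..d), for independent standard normal Z_j."""
    single_trait = math.erf(c / math.sqrt(2.0))  # P(|Z| < c) for one trait
    return single_trait ** d


for d in (1, 10, 100, 1000):
    print(f"d={d:5d}  P(all traits within 1 sd) = {prob_all_within(1.0, d):.3e}")
```

Under this reading the probability of being average on every trait decays geometrically in d, which matches the slide's point that nobody is average over many factors.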

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n: $Z_1, Z_2 \sim N_d(0, I_d)$, $Z_1$ indep. of $Z_2$

Euclidean Dist Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant
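The statement that the distance tends to a non-random constant can be illustrated numerically. A minimal sketch follows, with assumed function names, seeds, and repetition counts; it compares the average simulated distance with $\sqrt{2d}$ and reports the spread.

```python
import math
import random


def gaussian_vector(d, rng):
    """One draw from the d-dimensional standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_summary(d, reps=100, seed=0):
    """Mean and standard deviation of ||Z1 - Z2|| over `reps` independent pairs."""
    rng = random.Random(seed)
    dists = [
        euclidean_distance(gaussian_vector(d, rng), gaussian_vector(d, rng))
        for _ in range(reps)
    ]
    mean = sum(dists) / reps
    sd = math.sqrt(sum((x - mean) ** 2 for x in dists) / (reps - 1))
    return mean, sd


for d in (2, 20, 200, 2000):
    mean, sd = distance_summary(d)
    print(
        f"d={d:5d}  sqrt(2d)={math.sqrt(2 * d):8.3f}  "
        f"mean dist={mean:8.3f}  sd={sd:.3f}"
    )
```

The mean tracks $\sqrt{2d}$ while the standard deviation stays of order 1, so relative to its size the distance becomes essentially deterministic as d grows.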

HDLSS Asymptotics: Simple Paradoxes

- Paradox resolved by:

density w. r. t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$
Find As Integral In Sph'l Coordinates
Look At Integrand w. r. t. $r$
Can Show: Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out
Density Pulls Data In
$\sqrt{d}$ Is The Balance Point
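
The balance point can be made concrete with a short worked equation. This is a sketch that goes slightly beyond the slide: it assumes $Z \sim N_d(0, I_d)$ and writes $f_R$ (a symbol introduced here only for illustration) for the density of the radius $R = \|Z\|$ in spherical coordinates.

```latex
\begin{align*}
f_R(r) &\propto
  \underbrace{r^{\,d-1}}_{\text{Lebesgue measure pushes mass out}}
  \cdot
  \underbrace{e^{-r^2/2}}_{\text{density pulls data in}},
  \qquad r > 0, \\
\frac{d}{dr} \log f_R(r) &= \frac{d-1}{r} - r = 0
  \iff r = \sqrt{d-1} \approx \sqrt{d}.
\end{align*}
```

So the radial density peaks at $\sqrt{d-1}$, which is where the two opposing effects named on the slide cancel.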

HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)
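
A quick numerical check of the norm statement can be sketched as follows. This is a minimal illustration using only the Python standard library, not code from the talk; the function names `norm_of_standard_normal` and `norm_deviations` are hypothetical, and the dimensions and repetition count are arbitrary choices.

```python
import math
import random


def norm_of_standard_normal(d, rng):
    """Euclidean norm of one draw from N_d(0, I_d)."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))


def norm_deviations(d, reps=200, seed=0):
    """Values of ||Z|| - sqrt(d) over `reps` independent draws."""
    rng = random.Random(seed)
    root_d = math.sqrt(d)
    return [norm_of_standard_normal(d, rng) - root_d for _ in range(reps)]


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        devs = norm_deviations(d)
        mean = sum(devs) / len(devs)
        spread = math.sqrt(sum((x - mean) ** 2 for x in devs) / len(devs))
        # sqrt(d) grows with d, while the spread of the deviation stays bounded.
        print(f"d={d:5d}  sqrt(d)={math.sqrt(d):7.2f}  sd of deviation={spread:.3f}")
```

The point of printing the spread next to $\sqrt{d}$ is that the first stays of constant order while the second grows, so no sampled point ends up near the mean vector 0: there are no "average people".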

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, $Z_2$ indep. of $Z_1$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
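
The pairwise distance claim can be checked the same way. The sketch below is an illustration under stated assumptions (standard library only, hypothetical function name `pairwise_distance_ratios`, arbitrary choices of `n` and `d`); it reports each pairwise distance divided by $\sqrt{2d}$, so values near 1 indicate agreement with the slide.

```python
import math
import random


def pairwise_distance_ratios(n, d, seed=0):
    """All pairwise distances among n draws from N_d(0, I_d), divided by sqrt(2d)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(2 * d)
    ratios = []
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
            ratios.append(dist / scale)
    return ratios


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        ratios = pairwise_distance_ratios(5, d)
        print(f"d={d:5d}  min ratio={min(ratios):.3f}  max ratio={max(ratios):.3f}")
```

As `d` grows, the minimum and maximum ratios should both approach 1, which is the sense in which all points become pairwise equidistant.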

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, $Z_2$ indep. of $Z_1$

High dim'al Angles (as $d \to \infty$):

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

[Figure: $Z_1$ and $Z_2$ As Vectors From Origin. Thanks to members.tripod.com]

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
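
The angle statement can also be examined numerically. This is a minimal sketch, assuming independent standard normal vectors and using only the standard library; `angle_degrees` and `sample_angles` are hypothetical helper names chosen for this illustration.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between vectors u and v (as vectors from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def sample_angles(d, reps=100, seed=0):
    """Angles between `reps` independent pairs of N_d(0, I_d) vectors."""
    rng = random.Random(seed)
    angles = []
    for _ in range(reps):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    return angles


if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        angles = sample_angles(d)
        print(f"d={d:5d}  min angle={min(angles):6.1f}  max angle={max(angles):6.1f}")
```

The range of sampled angles should tighten around 90 degrees as `d` increases, in line with the $O_p(d^{-1/2})$ error term.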

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's: Geometrical Represent'n

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represent'n

Simulation View: Shows "Rigidity after Rotation"
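
A numerical stand-in for the simulation view can be sketched without any plotting. Rotations preserve pairwise distances, so the shape of each 3 point data set is fully described by its three side lengths; the sketch below therefore skips the rotation to the screen and only measures how close each triangle is to equilateral. It is an illustration under these assumptions, not the code behind the slides, and `triangle_sides` and `worst_side_ratio` are hypothetical names.

```python
import math
import random


def triangle_sides(d, rng):
    """Sorted side lengths of the triangle formed by 3 draws from N_d(0, I_d)."""
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]

    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    return sorted([dist(pts[0], pts[1]), dist(pts[0], pts[2]), dist(pts[1], pts[2])])


def worst_side_ratio(d, reps=10, seed=0):
    """Largest longest/shortest side ratio over `reps` random triangles."""
    rng = random.Random(seed)
    worst = 1.0
    for _ in range(reps):
        sides = triangle_sides(d, rng)
        worst = max(worst, sides[2] / sides[0])
    return worst


if __name__ == "__main__":
    # Same dimensions and repetition count as listed on the slide.
    for d in (2, 20, 200, 20000):
        print(f"d={d:6d}  worst longest/shortest side ratio={worst_side_ratio(d):.3f}")
```

A ratio of exactly 1 means an equilateral triangle. In low dimensions the ratio should vary widely across the 10 repetitions, while for d = 20000 it should sit close to 1, which is the "rigidity after rotation" the slides describe.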

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distributrsquon
  • SigClust Gaussian null distributrsquon (2)
  • SigClust Gaussian null distributrsquon (3)
  • SigClust Gaussian null distributrsquon (4)
  • SigClust Gaussian null distributrsquon (5)
  • SigClust Gaussian null distributrsquon (6)
  • SigClust Gaussian null distributrsquon (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distributrsquon (8)
  • SigClust Estimation of Eigenvalrsquos
  • SigClust Estimation of Eigenvalrsquos (2)
  • SigClust Estimation of Eigenvalrsquos (3)
  • SigClust Estimation of Eigenvalrsquos (4)
  • SigClust Estimation of Eigenvalrsquos (5)
  • SigClust Gaussian null distribution - Simulation
  • SigClust Gaussian null distribution - Simulation (2)
  • An example (details to follow)
  • SigClust Modalities
  • SigClust Modalities (2)
  • SigClust Real Data Results
  • Perou 500 PCA View ndash real clusters
  • Perou 500 DWD Dirrsquons View ndash real clusters
  • Perou 500 ndash Fundamental Question
  • SigClust Results for Luminal A vs Luminal B
  • SigClust Results for Luminal A vs Luminal B (2)
  • SigClust Results for Luminal A vs Luminal B (3)
  • SigClust Results for Luminal A vs Luminal B (4)
  • SigClust Results for Luminal A vs Luminal B (5)
  • SigClust Real Data Results (2)
  • SigClust Real Data Results (3)
  • SigClust Real Data Results (4)
  • SigClust Real Data Results (5)
  • SigClust Overview
  • SigClust Overview (2)
  • SigClust Overview (3)
  • SigClust Overview (4)
  • SigClust Open Problems
  • HDLSS Asymptotics
  • HDLSS Asymptotics (2)
  • HDLSS Asymptotics (3)
  • HDLSS Asymptotics (4)
  • HDLSS Asymptotics (5)
  • HDLSS Asymptotics (6)
  • HDLSS Asymptotics (7)
  • HDLSS Asymptotics (8)
  • HDLSS Asymptotics (9)
  • HDLSS Asymptotics (10)
  • HDLSS Asymptotics (11)
  • HDLSS Asymptotics (12)
  • HDLSS Asymptotics (13)
  • HDLSS Asymptotics (14)
  • HDLSS Asymptotics (15)
  • HDLSS Asymptotics (16)
  • HDLSS Asymptotics Simple Paradoxes
  • HDLSS Asymptotics Simple Paradoxes (2)
  • HDLSS Asymptotics Simple Paradoxes (3)
  • HDLSS Asymptotics Simple Paradoxes (4)
  • HDLSS Asymptotics Simple Paradoxes (5)
  • HDLSS Asymptotics Simple Paradoxes (6)
  • HDLSS Asymptotics Simple Paradoxes (7)
  • HDLSS Asymptotics Simple Paradoxes (8)
  • HDLSS Asymptotics Simple Paradoxes (9)
  • HDLSS Asymptotics Simple Paradoxes (10)
  • HDLSS Asymptotics Simple Paradoxes (11)
  • HDLSS Asymptotics Simple Paradoxes (12)
  • HDLSS Asymptotics Simple Paradoxes (13)
  • HDLSS Asymptotics Simple Paradoxes (14)
  • HDLSS Asymptotics Simple Paradoxes (15)
  • HDLSS Asymptotics Simple Paradoxes (16)
  • HDLSS Asymptotics Simple Paradoxes (17)
  • HDLSS Asymptotics Simple Paradoxes (18)
  • HDLSS Asymptotics Simple Paradoxes (19)
  • HDLSS Asymptotics Simple Paradoxes (20)
  • HDLSS Asymptotics Simple Paradoxes (21)
  • HDLSS Asymptotics Simple Paradoxes (22)
  • HDLSS Asymptotics Simple Paradoxes (23)
  • HDLSS Asymptotics Simple Paradoxes (24)
  • HDLSS Asymptotics Simple Paradoxes (25)
  • HDLSS Asymptotics Simple Paradoxes (26)
  • HDLSS Asymptotics Simple Paradoxes (27)
  • HDLSS Asymptotics Simple Paradoxes (28)
  • HDLSS Asymptotics Simple Paradoxes (29)
  • HDLSS Asymptotics Simple Paradoxes (30)
  • HDLSS Asymptotics Simple Paradoxes (31)
  • HDLSS Asyrsquos Geometrical Representrsquon
  • HDLSS Asyrsquos Geometrical Representrsquon (2)
  • HDLSS Asyrsquos Geometrical Representrsquon (3)
  • HDLSS Asyrsquos Geometrical Representrsquon (4)
  • HDLSS Asyrsquos Geometrical Representrsquon (5)
  • HDLSS Asyrsquos Geometrical Representrsquon (6)
  • HDLSS Asyrsquos Geometrical Representrsquon (7)
  • HDLSS Asyrsquos Geometrical Representrsquon (8)
  • HDLSS Asyrsquos Geometrical Representrsquon (9)
  • HDLSS Asyrsquos Geometrical Representrsquon (10)
  • HDLSS Asyrsquos Geometrical Representrsquon (11)
  • HDLSS Asyrsquos Geometrical Represenrsquotion
  • HDLSS Asyrsquos Geometrical Represenrsquotion (2)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (3)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (4)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (5)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (6)
Page 82: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt Can Show Puts ~ All Weight Near

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

Theorem Impossible (over many factors)

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

d dd INZ 0~21Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

(as )

Distance tends to non-random constant

d

d

dd INZ 0~2

)1(221 pOdZZ

1Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

)1(221 pOdZZ

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, $Z_2$ indep. of $Z_1$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random

(Figure: vectors $Z_1$ and $Z_2$ from the origin. Thanks to members.tripod.com)
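The near-orthogonality of independent Gaussian vectors can be illustrated numerically. The sketch below assumes NumPy; `angle_sample_degrees` is a hypothetical helper name introduced here for illustration only. It computes the angle between independent standard normal vectors and shows the spread around 90 degrees shrinking as $d$ grows.

```python
import numpy as np

def angle_sample_degrees(d, reps=100, seed=0):
    """Angles (in degrees) between `reps` independent pairs of
    N_d(0, I_d) vectors, viewed as vectors from the origin."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((reps, d))
    z2 = rng.standard_normal((reps, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1))
    # Clip guards against round-off pushing |cos| slightly above 1.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

for d in (2, 20, 200, 20000):
    ang = angle_sample_degrees(d)
    print(f"d={d:6d}  mean angle={ang.mean():7.2f}  sd={ang.std():6.2f}")
```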

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d}\,\times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
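One way to see the "unit simplex" statement is through the scaled Gram matrix $Z Z^t / d$: if it is close to $I_n$, the vectors $Z_i / \sqrt{d}$ are nearly orthonormal, so a rotation carries them close to the coordinate unit vectors, which are the vertices of the unit simplex. The following is a minimal sketch under that reading, assuming NumPy; `scaled_gram` is a hypothetical name, not taken from the slides or the cited paper.

```python
import numpy as np

def scaled_gram(n, d, seed=0):
    """Return Z Z^t / d for n independent N_d(0, I_d) rows Z_1..Z_n."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return z @ z.T / d

for d in (2, 20, 200, 20000):
    g = scaled_gram(3, d)
    # Largest entrywise deviation from the identity matrix.
    print(f"d={d:6d}  max|G - I| = {np.max(np.abs(g - np.eye(3))):.3f}")
```

The diagonal entries measure squared lengths relative to $d$ and the off-diagonal entries measure cosines of pairwise angles, so this single matrix summarizes both the distance and the angle paradoxes.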

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d}\,\times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data

HDLSS Asy's Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets

• In dimensions d = 2, 20, 200, 20000

• Generate hyperplane of dimension 2

• Rotate that to plane of screen

• Rotate within plane to make "comparable"

• Repeat 10 times, use different colors

HDLSS Asy's Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
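The simulation steps listed above can be sketched as follows. This is an illustrative reconstruction, assuming NumPy, and is not the code behind the original figures: it uses an SVD of the centered data to rotate the 2-dimensional hyperplane to the "plane of the screen", rotates within the plane so the first vertex points along the positive vertical axis, and rescales by $\sqrt{2d}$. The names `triangle_in_plane` and `side_lengths` are hypothetical, and reflections within the plane are not aligned.

```python
import numpy as np

def triangle_in_plane(d, rng):
    """Three N_d(0, I_d) points, expressed in 2-d coordinates of the
    hyperplane they generate, rotated so vertex 0 lies on the +y axis,
    and rescaled by sqrt(2d)."""
    x = rng.standard_normal((3, d))
    xc = x - x.mean(axis=0)                    # center: plane through 0
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    coords = u[:, :2] * s[:2]                  # rotate to plane of screen
    theta = np.pi / 2 - np.arctan2(coords[0, 1], coords[0, 0])
    c, sn = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -sn], [sn, c]])        # rotation within the plane
    return coords @ rot.T / np.sqrt(2 * d)

def side_lengths(tri):
    """Lengths of the three sides of a triangle given as a 3 x 2 array."""
    pairs = ((0, 1), (0, 2), (1, 2))
    return np.array([np.linalg.norm(tri[i] - tri[j]) for i, j in pairs])

rng = np.random.default_rng(0)
for d in (2, 20, 200, 20000):
    for _ in range(3):                         # a few repeats per dimension
        print(d, np.round(side_lengths(triangle_in_plane(d, rng)), 3))
```

As $d$ grows the rescaled side lengths should all approach 1, so the repeated triangles nearly coincide with one equilateral triangle, which is the "rigidity after rotation" effect.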


HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by: density w.r.t. Lebesgue Measure

Consider Volume of Unit Sphere in $\mathbb{R}^d$

Find As Integral In Sph'l Coordinates

Look At Integrand w.r.t. $r$: Can Show Puts ~ All Weight Near $r = 1$

Lebesgue Measure Pushes Mass Out

Density Pulls Data In

$\sqrt{d}$ Is The Balance Point
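A short worked calculation may help here. It is a sketch using standard facts rather than a transcription of the slide, and the symbols $\varepsilon$ and $f_{\|Z\|}$ are notation introduced for this illustration. The first line shows that almost all of the unit ball's volume lies in a thin shell near the boundary; the second shows that the radial density of a standard normal vector peaks near $\sqrt{d}$, where the volume factor $r^{d-1}$ and the Gaussian factor $e^{-r^2/2}$ balance.

```latex
\[
\frac{\operatorname{Vol}\{x \in \mathbb{R}^d : 1-\varepsilon \le \|x\| \le 1\}}
     {\operatorname{Vol}\{x \in \mathbb{R}^d : \|x\| \le 1\}}
  = 1 - (1-\varepsilon)^d \;\longrightarrow\; 1
  \qquad (d \to \infty,\ 0 < \varepsilon < 1),
\]
\[
f_{\|Z\|}(r) \;\propto\; r^{d-1} e^{-r^2/2},
\qquad
\frac{d}{dr}\Bigl[(d-1)\log r - \tfrac{r^2}{2}\Bigr]
  = \frac{d-1}{r} - r = 0
\;\Longrightarrow\;
r = \sqrt{d-1} \approx \sqrt{d}.
\]
```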

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)


HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure Consider Volume of Unit Sphere in Find As Integral In Sphrsquol Coordinates

Look At Integrand wrt Can Show Puts ~ All Weight Near

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by

density w r t Lebesgue Measure

Lebesgue Measure Pushes Mass Out Density Pulls Data In Is The Balance Point

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

Theorem Impossible (over many factors)

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

d dd INZ 0~21Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

(as )

Distance tends to non-random constant

d

d

dd INZ 0~2

)1(221 pOdZZ

1Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

)1(221 pOdZZ

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

As Vectors From Origin

Thanks to memberstripodcom

d

d

dd INZ 0~21Z

1198851

119885 2

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distributrsquon
  • SigClust Gaussian null distributrsquon (2)
  • SigClust Gaussian null distributrsquon (3)
  • SigClust Gaussian null distributrsquon (4)
  • SigClust Gaussian null distributrsquon (5)
  • SigClust Gaussian null distributrsquon (6)
  • SigClust Gaussian null distributrsquon (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distributrsquon (8)
  • SigClust Estimation of Eigenvalrsquos
  • SigClust Estimation of Eigenvalrsquos (2)
  • SigClust Estimation of Eigenvalrsquos (3)
  • SigClust Estimation of Eigenvalrsquos (4)
  • SigClust Estimation of Eigenvalrsquos (5)
  • SigClust Gaussian null distribution - Simulation
Page 85: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asymptotics Simple Paradoxes

- Paradox resolved by:

density w.r.t. Lebesgue Measure

Lebesgue Measure Pushes Mass Out; Density Pulls Data In; $\sqrt{d}$ Is The Balance Point
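The balance can be made explicit with a short calculation. This is a sketch using the standard density of the radius $R = \|Z\|$ for $Z \sim N_d(0, I_d)$ (the chi distribution); it is not taken from the slides. The shell volume factor $r^{d-1}$ is the part that pushes mass out, and the Gaussian factor is the part that pulls data in.

```latex
\[
f_R(r) \;\propto\;
\underbrace{r^{\,d-1}}_{\text{volume of shell (Lebesgue measure)}}
\cdot
\underbrace{e^{-r^{2}/2}}_{\text{Gaussian density}},
\qquad r > 0,
\]
\[
\frac{d}{dr}\,\log f_R(r) \;=\; \frac{d-1}{r} - r \;=\; 0
\quad\Longrightarrow\quad
r \;=\; \sqrt{d-1} \;\approx\; \sqrt{d}.
\]
```

So the most likely radius sits near $\sqrt{d}$, even though the density of $Z$ itself is highest at the origin.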

HDLSS Asymptotics Simple Paradoxes

As $d \to \infty$: $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence:

"Average People"

Parents' Lament:

Why Can't I Have Average Children?

Theorem: Impossible (over many factors)
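A small simulation can illustrate the relation $\|Z\| = \sqrt{d} + O_p(1)$ behind this slide. This is a minimal sketch, assuming NumPy is available; the function name `norm_summary`, the number of draws, and the seed are illustrative choices, not part of the original talk.

```python
import numpy as np


def norm_summary(d, n_draws=2000, seed=0):
    """Summarize the Euclidean norms of n_draws standard normal vectors in R^d."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))   # each row is one draw Z ~ N_d(0, I_d)
    r = np.linalg.norm(z, axis=1)           # distance of each draw to the origin
    return {
        "d": d,
        "mean_minus_sqrt_d": r.mean() - np.sqrt(d),  # stays bounded as d grows
        "sd": r.std(),                               # stays bounded as d grows
        "min": r.min(),                              # smallest distance to the origin
    }


if __name__ == "__main__":
    for d in (10, 100, 1000):
        print(norm_summary(d))
```

The spread of the norms does not grow with $d$ while the typical norm does, so no simulated point lands near the origin. That is one reading of "average people" being impossible over many factors.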

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$ (for independent $X_1$, $X_2$)

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
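The extension to $Z_1, \ldots, Z_n$ can be checked numerically: every pairwise distance, divided by $\sqrt{2d}$, should be close to 1 when $d$ is large. The following is a sketch under that reading; `scaled_pairwise_distances` is a hypothetical helper name, and the sample sizes are arbitrary.

```python
import numpy as np


def scaled_pairwise_distances(n, d, seed=0):
    """Return all n*(n-1)/2 pairwise distances among n standard normal
    points in R^d, each divided by sqrt(2d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    sq = np.sum(z ** 2, axis=1)
    # Squared distances from the Gram matrix: |a-b|^2 = |a|^2 + |b|^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (z @ z.T)
    iu = np.triu_indices(n, k=1)                 # each unordered pair once
    dist = np.sqrt(np.maximum(d2[iu], 0.0))      # guard against tiny negatives
    return dist / np.sqrt(2.0 * d)


if __name__ == "__main__":
    for d in (2, 200, 20000):
        r = scaled_pairwise_distances(10, d)
        print(d, r.min(), r.max())
```

In low dimension the ratios vary widely; in high dimension the minimum and maximum both approach 1, which is the "non-random constant" of the slide.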

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

[Figure: vectors $Z_1$ and $Z_2$ from the origin. Thanks to members.tripod.com]

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal?

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
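The angle statement can be illustrated the same way. The sketch below draws independent pairs of standard normal vectors and reports their angles in degrees; the name `angle_degrees` and the default of 500 pairs are assumptions made for this example only.

```python
import numpy as np


def angle_degrees(d, n_pairs=500, seed=0):
    """Angles (in degrees) between n_pairs independent pairs of
    standard normal vectors in R^d, viewed as vectors from the origin."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip to [-1, 1] so rounding error cannot push arccos out of its domain
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


if __name__ == "__main__":
    for d in (10, 100, 10000):
        a = angle_degrees(d)
        print(d, a.mean(), a.std())
```

The mean stays near 90 degrees for every $d$, while the spread shrinks roughly like $d^{-1/2}$, consistent with the $O_p(d^{-1/2})$ term.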

HDLSS Asy's Geometrical Represent'n

Hall, Marron & Neeman (2005)

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data
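One compact way to see the simplex structure is through the Gram matrix of the data. If $Z Z^{T} / d$ is close to the identity $I_n$, then the points scaled by $1/\sqrt{d}$ are nearly orthonormal, that is, nearly a rotation of the standard basis vectors $e_1, \ldots, e_n$ (the unit simplex vertices). The sketch below checks this; `scaled_gram` is an illustrative name and the check is not the argument used by Hall, Marron & Neeman.

```python
import numpy as np


def scaled_gram(n, d, seed=0):
    """Gram matrix Z Z^T / d for n standard normal points in R^d.
    Diagonal entries are squared norms / d; off-diagonal entries are
    inner products / d."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    return (z @ z.T) / d


if __name__ == "__main__":
    n = 5
    for d in (5, 500, 50000):
        g = scaled_gram(n, d)
        # Largest entrywise deviation from the identity matrix
        print(d, np.max(np.abs(g - np.eye(n))))
```

The deviation from the identity shrinks as $d$ grows with $n$ fixed, so what remains random is only the orientation of the configuration.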

HDLSS Asy's Geometrical Represent'n

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's Geometrical Represent'n

Simulation View: Shows "Rigidity after Rotation"
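The simulation steps can be sketched as follows. The talk does not say how the rotations were chosen, so the choices here are assumptions: the plane through the three points is found by an SVD of the centered data, the in-plane rotation puts the first vertex on the positive vertical axis, and coordinates are divided by $\sqrt{2d}$ so that an equilateral triangle has side length 1. The names `rotated_triangle` and `side_lengths` are hypothetical.

```python
import numpy as np


def rotated_triangle(d, rng):
    """Draw 3 standard normal points in R^d and return their 2-d coordinates
    in the plane they generate, aligned and scaled by 1/sqrt(2d)."""
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)            # centered points lie in a 2-d plane
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    xy = centered @ vt[:2].T                 # "rotate that to plane of screen"
    # Rotate within the plane so vertex 0 lies on the positive y axis
    phi = np.arctan2(xy[0, 1], xy[0, 0])
    alpha = np.pi / 2 - phi
    rot = np.array([[np.cos(alpha), -np.sin(alpha)],
                    [np.sin(alpha),  np.cos(alpha)]])
    xy = xy @ rot.T
    # Fix the arbitrary reflection: put vertex 1 on the left
    if xy[1, 0] > 0:
        xy[:, 0] *= -1.0
    return xy / np.sqrt(2.0 * d)


def side_lengths(xy):
    """Lengths of the three sides of the triangle with vertex rows xy."""
    pairs = ((0, 1), (0, 2), (1, 2))
    return np.array([np.linalg.norm(xy[i] - xy[j]) for i, j in pairs])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (2, 20, 200, 20000):
        # Repeat 10 times and measure how much the side lengths vary
        sides = [side_lengths(rotated_triangle(d, rng)) for _ in range(10)]
        print(d, np.std(sides))
```

Plotting the ten aligned triangles for each $d$ in different colors would reproduce the kind of picture described: scattered shapes for small $d$, and nearly coincident equilateral triangles for large $d$.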

Page 87: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

As

Important Philosophical Consequence

ldquoAverage Peoplerdquo

Parents Lament

Why Canrsquot I Have Average Children

Theorem Impossible (over many factors)

d )1(pOdZ

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

d dd INZ 0~21Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

(as )

Distance tends to non-random constant

d

d

dd INZ 0~2

)1(221 pOdZZ

1Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

)1(221 pOdZZ

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

As Vectors From Origin

Thanks to memberstripodcom

d

d

dd INZ 0~21Z

1198851

119885 2

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asy's: Geometrical Representation

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Representation

Simulation View: Shows "Rigidity after Rotation"
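
The simulation steps listed on the slide can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the original figure: the function names `triangle_in_plane` and `side_lengths` are hypothetical, the plane is found with an SVD of the centered points, "comparable" is taken to mean that the first point is rotated onto the positive vertical axis, and a reflection is added (our choice, not on the slide) to remove the mirror-image ambiguity. Plotting is left out; drawing each returned triangle in its own color would give a view in the spirit of the slide.

```python
import numpy as np

def triangle_in_plane(d, rng):
    """Draw 3 points from N_d(0, I_d) and return 2-d coordinates of the
    triangle they form, expressed within the plane through the points."""
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)          # move the centroid to the origin
    # The first two right singular vectors span the 2-d plane of the triangle.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    xy = centered @ vt[:2].T               # "rotate that to plane of screen"
    # Rotate within the plane so the first point sits on the positive y-axis.
    theta = np.arctan2(xy[0, 0], xy[0, 1])
    c, s = np.cos(theta), np.sin(theta)
    xy = xy @ np.array([[c, -s], [s, c]]).T
    # Assumption: reflect if needed so the second point is on the right.
    if xy[1, 0] < 0:
        xy[:, 0] *= -1.0
    return xy

def side_lengths(xy):
    """Lengths of the three sides of the triangle with vertex rows xy."""
    pairs = ((0, 1), (0, 2), (1, 2))
    return np.array([np.linalg.norm(xy[i] - xy[j]) for i, j in pairs])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for d in (2, 20, 200, 20000):
        ratios = []
        for _ in range(10):                # "repeat 10 times"
            sides = side_lengths(triangle_in_plane(d, rng))
            ratios.append(sides.max() / sides.min())
        print(f"d={d:6d}  mean longest/shortest side = {np.mean(ratios):.3f}")
```

As d grows, the ratio of longest to shortest side should approach 1, which is the "rigidity after rotation" the slide refers to: the triangles become nearly equilateral with side length close to $\sqrt{2d}$.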


HDLSS Asymptotics: Simple Paradoxes

As $d \to \infty$, $\|Z\| = \sqrt{d} + O_p(1)$

Important Philosophical Consequence: "Average People"

Parents' Lament: Why Can't I Have Average Children?

Theorem: Impossible (over many factors)
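
A small numerical check of the norm statement, offered as a sketch rather than part of the original slides. It assumes $Z \sim N_d(0, I_d)$; the helper name `norm_summary`, the number of draws, and the seed are arbitrary illustrative choices.

```python
import numpy as np

def norm_summary(d, n_draws=2000, seed=0):
    """Sample mean and standard deviation of ||Z|| for Z ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_draws, d))
    norms = np.linalg.norm(z, axis=1)
    return norms.mean(), norms.std()

if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = norm_summary(d)
        print(f"d={d:5d}  sqrt(d)={np.sqrt(d):7.3f}  mean={mean:7.3f}  sd={sd:.3f}")
```

The mean of the norm should track $\sqrt{d}$ while the spread stays bounded, so in high dimension essentially no draw lies near the "average" point at the origin.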

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

Distance tends to non-random constant
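
The distance statement can be checked the same way. The following is a minimal sketch assuming independent standard normal vectors; `pairwise_distance_summary` is a hypothetical helper name and the sample sizes are arbitrary.

```python
import numpy as np

def pairwise_distance_summary(d, n_pairs=1000, seed=0):
    """Sample mean and sd of ||Z1 - Z2|| for independent Z1, Z2 ~ N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    dist = np.linalg.norm(z1 - z2, axis=1)
    return dist.mean(), dist.std()

if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = pairwise_distance_summary(d)
        print(f"d={d:5d}  sqrt(2d)={np.sqrt(2 * d):7.3f}  mean={mean:7.3f}  sd={sd:.3f}")
```

Relative to $\sqrt{2d}$ the spread becomes negligible as d grows, which is the sense in which the distance "tends to a non-random constant".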

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$ (for independent $X_1$, $X_2$)

Can extend to $Z_1, \dots, Z_n$

Where do they all go? (we can only perceive 3 dim'ns)
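
The factor $\sqrt{2}$ can be traced coordinate by coordinate. The short derivation below is a sketch that uses only independence and unit variances of the entries $Z_{i,j}$ (the double-index notation is introduced here for illustration).

```latex
\begin{aligned}
\operatorname{Var}(Z_{1,j} - Z_{2,j})
  &= \operatorname{Var}(Z_{1,j}) + \operatorname{Var}(Z_{2,j}) = 2, \\
\mathbb{E}\,\lVert Z_1 - Z_2 \rVert^2
  &= \sum_{j=1}^{d} \mathbb{E}\,(Z_{1,j} - Z_{2,j})^2 = 2d, \\
\operatorname{Var}\,\lVert Z_1 - Z_2 \rVert^2
  &= 8d
  \;\Longrightarrow\;
  \lVert Z_1 - Z_2 \rVert^2 = 2d + O_p(\sqrt{d}), \\
\lVert Z_1 - Z_2 \rVert
  &= \sqrt{2d}\,\sqrt{1 + O_p(d^{-1/2})} = \sqrt{2d} + O_p(1)
  \quad \text{as } d \to \infty .
\end{aligned}
```

The same argument applies to every pair among $Z_1, \dots, Z_n$, which is why the statement extends to all pairwise distances for fixed $n$.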

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? (we can only perceive 3 dim'ns)
o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), with $Z_1$, $Z_2$ As Vectors From Origin:

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal?
- Where do they all go? (again our perceptual limitations)
- Again 1st order structure is non-random

Thanks to members.tripod.com
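
A quick way to see the near-orthogonality numerically is sketched below, again assuming independent $N_d(0, I_d)$ vectors. The helper name `angle_summary` and the sample sizes are illustrative assumptions.

```python
import numpy as np

def angle_summary(d, n_pairs=1000, seed=0):
    """Sample mean and sd (in degrees) of the angle between independent
    Z1, Z2 ~ N_d(0, I_d), viewed as vectors from the origin."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal((n_pairs, d))
    z2 = rng.standard_normal((n_pairs, d))
    cos = np.sum(z1 * z2, axis=1) / (
        np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1)
    )
    # Clip guards against rounding slightly outside [-1, 1].
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return angles.mean(), angles.std()

if __name__ == "__main__":
    for d in (2, 20, 200, 2000):
        mean, sd = angle_summary(d)
        print(f"d={d:5d}  mean angle={mean:7.2f} deg  sd={sd:6.2f} deg")
```

The spread of the angle around 90 degrees should shrink roughly like $d^{-1/2}$, matching the $O_p(d^{-1/2})$ term on the slide.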

HDLSS Asy's: Geometrical Represent'n

Assume $d \gg n$, let $Z_1, \dots, Z_n \sim N_d(0, I_d)$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
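
One way to make the "unit simplex" picture concrete is through the scaled Gram matrix $Z Z^{t} / d$: if it is close to the identity $I_n$, the scaled points $Z_i / \sqrt{d}$ are nearly orthonormal, so some rotation carries them close to the coordinate vectors $e_1, \dots, e_n$, the vertices of the unit simplex. The sketch below checks this numerically; it is an illustration under that reading, and `gram_deviation` is a hypothetical helper name.

```python
import numpy as np

def gram_deviation(n, d, seed=0):
    """Largest absolute entry of Z Z^t / d - I_n, where the n rows of Z
    are independent draws from N_d(0, I_d)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    gram = z @ z.T / d
    return np.max(np.abs(gram - np.eye(n)))

if __name__ == "__main__":
    n = 5
    for d in (10, 100, 1000, 100000):
        print(f"n={n}  d={d:6d}  max |ZZ^t/d - I| = {gram_deviation(n, d):.4f}")
```

The deviation should shrink roughly like $d^{-1/2}$ for fixed $n$, leaving only the rotation of the simplex as the visible randomness.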





HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

d dd INZ 0~21Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

Euclidean Dist Between and

(as )

Distance tends to non-random constant

d

d

dd INZ 0~2

)1(221 pOdZZ

1Z

1Z 2Z

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

)1(221 pOdZZ

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

As Vectors From Origin

Thanks to memberstripodcom

d

d

dd INZ 0~21Z

1198851

119885 2

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

- $(n-1)$-dimensional hyperplane

- Points are pairwise equidistant, dist $\sim \sqrt{2d}$

- Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

- Again "randomness in data" is only in rotation

- Surprisingly rigid structure in random data
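
The pairwise-equidistance claim can be illustrated by simulating a small Gaussian data set and comparing the smallest and largest pairwise distances with $\sqrt{2d}$. This is a minimal sketch under the i.i.d. $N_d(0, I_d)$ assumption; `pairwise_distance_range` is a hypothetical helper name chosen for this example.

```python
import math
import random
from itertools import combinations


def pairwise_distance_range(n, d, seed=0):
    """Return (min, max) of all pairwise distances, each divided by sqrt(2d)."""
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(2 * d)
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v))) / scale
        for u, v in combinations(data, 2)
    ]
    return min(dists), max(dists)


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        lo, hi = pairwise_distance_range(5, d)
        print(f"d = {d:6d}   min/sqrt(2d) = {lo:.3f}   max/sqrt(2d) = {hi:.3f}")
```

When both ratios are close to 1, all edges of the point configuration have nearly the same length, which is the "regular $n$-hedron" picture up to rotation.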

HDLSS Asy's Geometrical Represent'n

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets

• In dimensions d = 2, 20, 200, 20000

• Generate hyperplane of dimension 2

• Rotate that to plane of screen

• Rotate within plane, to make "comparable"

• Repeat 10 times, use different colors

Simulation View: Shows "Rigidity after Rotation"
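
The simulation steps listed above can be sketched as follows. This is an assumed reading of the recipe, not the original code: the plane through the three points is mapped to 2-d coordinates by Gram-Schmidt, the first point is placed at the origin and the second on the positive x-axis (a reflection is allowed so the third point has non-negative height), and everything is divided by $\sqrt{2d}$ so the limiting shape is a unit equilateral triangle. The function names are hypothetical.

```python
import math
import random


def triangle_in_screen_plane(points):
    """Map three points in R^d to 2-d coordinates, preserving all distances."""
    p0, p1, p2 = points
    u = [b - a for a, b in zip(p0, p1)]          # edge p0 -> p1
    v = [c - a for a, c in zip(p0, p2)]          # edge p0 -> p2
    len_u = math.sqrt(sum(x * x for x in u))
    e1 = [x / len_u for x in u]                  # first in-plane axis
    proj = sum(x * y for x, y in zip(v, e1))     # x-coordinate of p2
    w = [y - proj * x for x, y in zip(e1, v)]    # part of v orthogonal to e1
    height = math.sqrt(sum(x * x for x in w))    # y-coordinate of p2 (>= 0)
    return [(0.0, 0.0), (len_u, 0.0), (proj, height)]


def scaled_triangle(d, seed=0):
    """Simulate 3 points from N_d(0, I_d); return the triangle scaled by 1/sqrt(2d)."""
    rng = random.Random(seed)
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    scale = math.sqrt(2 * d)
    return [(x / scale, y / scale) for x, y in triangle_in_screen_plane(points)]


if __name__ == "__main__":
    target = (0.5, math.sqrt(3) / 2)             # apex of a unit equilateral triangle
    for d in (2, 20, 200, 20000):
        worst = 0.0
        for rep in range(10):                    # "repeat 10 times"
            _, _, (x, y) = scaled_triangle(d, seed=rep)
            worst = max(worst, math.hypot(x - target[0], y - target[1]))
        print(f"d = {d:6d}   worst apex deviation over 10 reps = {worst:.3f}")
```

For small d the ten triangles differ widely; for large d they should nearly coincide, which is the "rigidity after rotation" the slide describes.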

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$

Euclidean Dist. Between $Z_1$ and $Z_2$ (as $d \to \infty$):

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
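
A quick simulation makes the "non-random constant" concrete. The sketch below draws one pair of independent standard normal vectors per dimension and reports the distance divided by $\sqrt{2d}$; the helper names are hypothetical and the dimensions are chosen only for illustration.

```python
import math
import random


def gaussian_vector(d, rng):
    """Draw one point from N_d(0, I_d) as a list of d independent N(0, 1) values."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_ratios(dims, seed=0):
    """For each d in dims, return ||Z1 - Z2|| / sqrt(2d) for one simulated pair."""
    rng = random.Random(seed)
    ratios = {}
    for d in dims:
        z1 = gaussian_vector(d, rng)
        z2 = gaussian_vector(d, rng)
        ratios[d] = euclidean_distance(z1, z2) / math.sqrt(2 * d)
    return ratios


if __name__ == "__main__":
    for d, ratio in distance_ratios([2, 20, 200, 20000]).items():
        print(f"d = {d:6d}   ||Z1 - Z2|| / sqrt(2d) = {ratio:.3f}")
```

Because the error term is $O_p(1)$ while the distance itself grows like $\sqrt{2d}$, the ratio should settle near 1 as d increases.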

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors

o They Needed to Find Food

o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$), As Vectors From Origin:

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

(Figure: $Z_1$ and $Z_2$ as vectors from the origin. Thanks to: members.tripod.com)

- Everything is orthogonal

- Where do they all go?

(again, our perceptual limitations)

- Again, 1st-order structure is non-random
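
The near-orthogonality of independent Gaussian vectors is easy to check by simulation. The following sketch, assuming i.i.d. $N_d(0, I_d)$ vectors and using hypothetical helper names, computes the angle between simulated pairs and reports how far it strays from 90 degrees.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between vectors u and v (viewed from the origin), in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def simulated_angles(d, reps, seed=0):
    """Angles between `reps` independent pairs of N_d(0, I_d) vectors."""
    rng = random.Random(seed)
    angles = []
    for _ in range(reps):
        z1 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        z2 = [rng.gauss(0.0, 1.0) for _ in range(d)]
        angles.append(angle_degrees(z1, z2))
    return angles


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        angles = simulated_angles(d, reps=10)
        worst = max(abs(a - 90.0) for a in angles)
        print(f"d = {d:6d}   max |angle - 90| over 10 pairs = {worst:.2f} degrees")
```

The spread around 90 degrees should shrink roughly like $d^{-1/2}$ (in radians), matching the $O_p(d^{-1/2})$ term.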

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Distance tends to non-random constant

bullFactor since

Can extend to Where do they all go

(we can only perceive 3 dimrsquons)

)1(221 pOdZZ

nZZ

1

22

2121 XsdXsdXXsd

2

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

As Vectors From Origin

Thanks to memberstripodcom

d

d

dd INZ 0~21Z

1198851

119885 2

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices"

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
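
One way to see the simplex picture is through the Gram matrix of inner products. If $Z_i \cdot Z_j / d$ is close to 1 for $i = j$ and close to 0 otherwise, then the points $Z_i / \sqrt{d}$ are nearly orthonormal, i.e. nearly a rotation of the unit simplex vertices $e_1, \ldots, e_n$. The sketch below checks this under assumed values of n, d and seed; the helper names are illustrative and do not come from Hall, Marron & Neeman (2005).

```python
import random


def gaussian_sample(n, d, seed=0):
    """n independent d-dimensional standard normal vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def scaled_gram(sample):
    """Matrix of inner products Z_i . Z_j, divided by the dimension d."""
    d = len(sample[0])
    return [[sum(a * b for a, b in zip(u, v)) / d for v in sample]
            for u in sample]


def max_deviation_from_identity(gram):
    """Largest entrywise gap between the scaled Gram matrix and I_n."""
    n = len(gram)
    return max(abs(gram[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    for d in (20, 200, 2000, 20000):
        gram = scaled_gram(gaussian_sample(4, d))
        print(d, round(max_deviation_from_identity(gram), 4))
```

The diagonal entries capture "dist $\sim \sqrt{d}$" and the off-diagonal entries capture near-orthogonality; both gaps should shrink as d grows with n held fixed.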

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data
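
The pairwise-equidistance claim can be illustrated with a short sketch. The sample size, dimensions and seed below are assumptions chosen for illustration only. It computes every pairwise distance among n Gaussian points and divides by $\sqrt{2d}$, so values near 1 indicate a nearly regular n-hedron.

```python
import itertools
import math
import random


def pairwise_distance_ratios(n, d, seed=0):
    """Pairwise Euclidean distances of n Gaussian points over sqrt(2d)."""
    rng = random.Random(seed)
    pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(2.0 * d)
    return [math.dist(p, q) / scale
            for p, q in itertools.combinations(pts, 2)]


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        ratios = pairwise_distance_ratios(5, d)
        print(d, round(min(ratios), 3), round(max(ratios), 3))
```

As d grows, the minimum and maximum ratios should both approach 1, which is the "surprisingly rigid structure" of the slide.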

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
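
The simulation steps listed above can be sketched as follows. This is a hedged reconstruction, not the code behind the original figure. In particular, the choice of rotation (first point to the origin, second point onto the positive x-axis, third point into the upper half-plane) and the rescaling by $\sqrt{2d}$ are assumptions made here so that different data sets become comparable.

```python
import math
import random


def triangle_on_screen(d, rng):
    """Three Gaussian points in R^d, mapped isometrically to 2-d coordinates.

    Point 0 goes to the origin, point 1 onto the positive x-axis and
    point 2 into the upper half-plane; all coordinates are divided by
    sqrt(2d) so that triangles from different dimensions are comparable.
    """
    p = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    u = [b - a for a, b in zip(p[0], p[1])]  # edge from point 0 to point 1
    v = [b - a for a, b in zip(p[0], p[2])]  # edge from point 0 to point 2
    len_u = math.sqrt(sum(x * x for x in u))
    # Coordinate of v along u, then the length of the part orthogonal to u.
    x2 = sum(a * b for a, b in zip(u, v)) / len_u
    y2 = math.sqrt(max(sum(b * b for b in v) - x2 * x2, 0.0))
    s = math.sqrt(2.0 * d)
    return [(0.0, 0.0), (len_u / s, 0.0), (x2 / s, y2 / s)]


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (2, 20, 200, 20000):
        print("d =", d)
        for _ in range(10):  # 10 repetitions, as in the slide
            tri = triangle_on_screen(d, rng)
            print("  ", [(round(x, 2), round(y, 2)) for x, y in tri])
```

Under this convention an exactly equilateral triangle would have vertices (0, 0), (1, 0) and (0.5, $\sqrt{3}/2$), so rigidity shows up as the ten printed triangles clustering around those coordinates when d is large.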

HDLSS Asymptotics: Simple Paradoxes

Distance tends to non-random constant:

$\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$

• Factor $\sqrt{2}$, since $\mathrm{sd}(X_1 - X_2) = \sqrt{\mathrm{sd}(X_1)^2 + \mathrm{sd}(X_2)^2}$

Can extend to $Z_1, \ldots, Z_n$

Where do they all go?

(we can only perceive 3 dim'ns)
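
The $O_p(1)$ term in the distance statement can be made concrete with a small experiment. The sketch below uses assumed values for the repetitions and seed. It records $\|Z_1 - Z_2\| - \sqrt{2d}$ over many independent pairs. Each coordinate difference has standard deviation $\sqrt{2}$, which is where the factor $\sqrt{2}$ comes from.

```python
import math
import random
import statistics


def distance_errors(d, reps=200, seed=0):
    """Values of ||Z1 - Z2|| - sqrt(2d) over independent Gaussian pairs."""
    rng = random.Random(seed)
    errors = []
    for _ in range(reps):
        # Sum of squared coordinate differences of two independent vectors.
        sq = sum((rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)) ** 2
                 for _ in range(d))
        errors.append(math.sqrt(sq) - math.sqrt(2.0 * d))
    return errors


if __name__ == "__main__":
    for d in (10, 100, 1000):
        err = distance_errors(d)
        print(d, round(statistics.mean(err), 3),
              round(statistics.pstdev(err), 3))
```

The spread of the error should stay roughly constant while $\sqrt{2d}$ itself grows, so the relative fluctuation of the distance vanishes as d increases.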

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why?

o Perceptual System from Ancestors
o They Needed to Find Food
o Food Exists in 3-d World

(we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes

For d-dim'al Standard Normal dist'n:

$Z_1 \sim N_d(0, I_d)$, indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):

As Vectors From Origin

[Figure: $Z_1$ and $Z_2$ drawn as vectors from the origin. Thanks to members.tripod.com]

HDLSS Asymptotics Simple Paradoxes

Ever Wonder Why

o Perceptual System from Ancestorso They Needed to Find Foodo Food Exists in 3-d World

(we can only perceive 3 dimrsquons)

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

As Vectors From Origin

Thanks to memberstripodcom

d

d

dd INZ 0~21Z

1198851

119885 2

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asy's Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data
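The pairwise-equidistance statement can be illustrated with a small simulation. This is a minimal standard-library sketch under assumed settings (sample size $n = 5$, an arbitrary seed, and the hypothetical helper names `gaussian_sample` and `scaled_pairwise_distances`); it divides every pairwise distance by $\sqrt{2d}$, so values near 1 indicate agreement with the slide.

```python
import itertools
import math
import random


def gaussian_sample(n, d, rng):
    """n independent draws from N_d(0, I_d), each stored as a list of length d."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


def scaled_pairwise_distances(points):
    """All pairwise Euclidean distances, each divided by sqrt(2 d)."""
    d = len(points[0])
    scale = math.sqrt(2.0 * d)
    # math.dist requires Python 3.8 or later.
    return [math.dist(p, q) / scale for p, q in itertools.combinations(points, 2)]


rng = random.Random(1)  # arbitrary fixed seed for reproducibility
spread = {}
for d in (2, 20, 200, 20000):
    ratios = scaled_pairwise_distances(gaussian_sample(n=5, d=d, rng=rng))
    spread[d] = (min(ratios), max(ratios))
    print(f"d = {d:6d}   dist / sqrt(2d) ranges over "
          f"[{spread[d][0]:.3f}, {spread[d][1]:.3f}]")
```

As $d$ grows, the printed interval should tighten around 1, which is the numerical counterpart of the points sitting at the vertices of a regular $n$-hedron with edge length $\sqrt{2d}$.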

HDLSS Asy's Geometrical Represen'tion

Simulation View: study "rigidity after rotation"

• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
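The simulation steps listed above can be sketched as follows, using only the Python standard library. Several choices here are assumptions made for illustration and are not taken from the slides: the triangle is carried to the plane through its side lengths (a rigid motion, possibly including a reflection), "comparable" is interpreted as centring at the centroid and turning the first vertex onto the positive vertical axis, coordinates are divided by $\sqrt{2d}$ so different dimensions share one scale, and a numeric spread replaces the colored plot. The names `planar_triangle`, `simulate`, and `vertex_spread` are hypothetical.

```python
import math
import random


def planar_triangle(points):
    """Distance-preserving 2-d copy of a 3-point data set in R^d."""
    p0, p1, p2 = points
    a, b, c = math.dist(p1, p2), math.dist(p0, p2), math.dist(p0, p1)
    # Carry the data plane to the screen: p0 -> (0, 0), p1 -> (c, 0), p2 -> (x, y), y >= 0.
    x = (b * b + c * c - a * a) / (2.0 * c)
    y = math.sqrt(max(b * b - x * x, 0.0))
    verts = [(0.0, 0.0), (c, 0.0), (x, y)]
    # Centre the triangle at its centroid.
    cx = sum(v[0] for v in verts) / 3.0
    cy = sum(v[1] for v in verts) / 3.0
    centred = [(vx - cx, vy - cy) for vx, vy in verts]
    # Rotate within the plane so the first vertex lies on the positive vertical axis.
    theta = math.pi / 2.0 - math.atan2(centred[0][1], centred[0][0])
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(cos_t * vx - sin_t * vy, sin_t * vx + cos_t * vy) for vx, vy in centred]


def simulate(d, reps, rng):
    """`reps` planar triangles from 3-point N_d(0, I_d) data sets, scaled by 1/sqrt(2d)."""
    scale = math.sqrt(2.0 * d)
    triangles = []
    for _ in range(reps):
        pts = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
        triangles.append([(vx / scale, vy / scale) for vx, vy in planar_triangle(pts)])
    return triangles


def vertex_spread(triangles):
    """Largest distance from any replicate's vertex to the mean position of that vertex."""
    worst = 0.0
    for k in range(3):
        mx = sum(t[k][0] for t in triangles) / len(triangles)
        my = sum(t[k][1] for t in triangles) / len(triangles)
        worst = max(worst, max(math.hypot(t[k][0] - mx, t[k][1] - my) for t in triangles))
    return worst


rng = random.Random(2)  # arbitrary fixed seed for reproducibility
spread = {d: vertex_spread(simulate(d, reps=10, rng=rng)) for d in (2, 20, 200, 20000)}
for d, s in spread.items():
    print(f"d = {d:6d}   vertex spread over 10 replicates = {s:.4f}")
```

A small spread means the ten triangles nearly coincide after rotation, which is the "rigidity" the slide refers to; plotting each replicate in its own color would reproduce the visual version.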


HDLSS Asymptotics Simple Paradoxes

For $d$-dim'al Standard Normal dist'n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim'al Angles (as $d \to \infty$):

As Vectors From Origin

[Figure: $Z_1$ and $Z_2$ drawn as vectors from the origin. Thanks to members.tripod.com]

Page 99: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asymptotics Simple Paradoxes

For dimrsquoal Standard Normal distrsquon

indep of

High dimrsquoal Angles (as )

- Everything is orthogonal

- Where do they all go

(again our perceptual limitations)

- Again 1st order structure is non-random

d

d

dd INZ 0~2

)(90 2121

dOZZAngle p

1Z

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension n

d ddn INZZ 0~1

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

n

d ddn INZZ 0~1

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

(Modulo Rotation)

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Subspace Generated by Data

Hyperplane through 0

of dimension

Points are ldquonearly equidistant to 0rdquo

amp dist

Within plane can

ldquorotate towards Unit Simplexrdquo

All Gaussian data sets are

ldquonear Unit Simplex Verticesrdquo

ldquoRandomnessrdquo appears

only in rotation of simplex

n

d ddn INZZ 0~1

d

d

Hall Marron amp Neeman (2005)

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asy’s: Geometrical Represen’tion

Simulation View: Shows “Rigidity after Rotation”
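The simulation steps listed for this view can be sketched as follows. This is an illustrative reconstruction, assuming NumPy, and not the code behind the original figure: the function names are hypothetical, the canonical pose (point 0 on the positive vertical axis, point 1 on the right) is one possible way to make the triangles “comparable”, and the division by $\sqrt{2d}$ is an added assumption so that the limiting side length is 1.

```python
import numpy as np


def canonical_triangle(d, rng):
    """Draw 3 points from N_d(0, I_d) and return 2-d coordinates in a canonical pose."""
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)  # the plane through the 3 points, moved to the origin

    # Rows of vt[:2] form an orthonormal basis of that plane ("rotate to plane of screen").
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T

    # Rotate within the plane so that point 0 sits on the positive vertical axis.
    alpha = np.pi / 2 - np.arctan2(coords[0, 1], coords[0, 0])
    rot = np.array([[np.cos(alpha), -np.sin(alpha)],
                    [np.sin(alpha), np.cos(alpha)]])
    coords = coords @ rot.T

    # Reflect if needed so that point 1 is always on the right-hand side.
    if coords[1, 0] < 0:
        coords[:, 0] *= -1

    # Assumed scaling: divide by sqrt(2d) so that the limiting side length is 1.
    return coords / np.sqrt(2 * d)


def side_lengths(tri):
    """Lengths of the three sides of a triangle given as a (3, 2) array."""
    return np.array([np.linalg.norm(tri[a] - tri[b]) for a, b in ((0, 1), (1, 2), (2, 0))])


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # arbitrary seed
    for d in (2, 20, 200, 20000):
        sides = np.array([side_lengths(canonical_triangle(d, rng)) for _ in range(10)])
        print(f"d = {d:>5}: side lengths range over [{sides.min():.3f}, {sides.max():.3f}]")
```

Plotting the ten triangles per dimension in different colors would reproduce the spirit of the slide: for small d the triangles vary widely, while for large d they should nearly coincide with one equilateral triangle.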

HDLSS Asymptotics: Simple Paradoxes

For $d$ dim’al Standard Normal dist’n:

$Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$

High dim’al Angles (as $d \to \infty$):

$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$

- Everything is orthogonal

- Where do they all go?

(again our perceptual limitations)

- Again 1st order structure is non-random
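The near-orthogonality of independent standard normal vectors is easy to see in simulation. A minimal sketch, assuming NumPy; the function name, seed, and number of repetitions are illustrative choices, not taken from the slides.

```python
import numpy as np


def angles_in_degrees(d, reps=100, seed=0):
    """Angles between `reps` independent pairs (Z1, Z2) of N_d(0, I_d) vectors."""
    rng = np.random.default_rng(seed)  # seed and reps are arbitrary illustrative choices
    z1 = rng.standard_normal((reps, d))
    z2 = rng.standard_normal((reps, d))
    cos = np.sum(z1 * z2, axis=1) / (np.linalg.norm(z1, axis=1) * np.linalg.norm(z2, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        ang = angles_in_degrees(d)
        print(f"d = {d:>5}: mean angle = {ang.mean():6.2f}, "
              f"std = {ang.std():6.2f}, std * sqrt(d) = {ang.std() * np.sqrt(d):6.1f}")
```

The mean angle should sit near 90 degrees in every dimension, while the spread shrinks as d grows; if the $O_p(d^{-1/2})$ rate holds, the last printed column should roughly stabilize for large d.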

HDLSS Asy’s: Geometrical Represent’n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$

Within plane, can “rotate towards $\sqrt{d}\,\times$ Unit Simplex”

All Gaussian data sets are “near Unit Simplex Vertices” (Modulo Rotation)

“Randomness” appears only in rotation of simplex

Hall, Marron & Neeman (2005)
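The $\sqrt{d}$ and $\sqrt{2d}$ scalings follow from a short chi-square calculation. The steps below are a standard argument added as a reading aid under the slide's assumption $Z_i \sim N_d(0, I_d)$ independent; they are not copied from the slides.

```latex
\begin{gather*}
\|Z_i\|^2 = \sum_{j=1}^{d} Z_{ij}^2 \sim \chi^2_d,
  \qquad E\,\|Z_i\|^2 = d, \qquad \operatorname{Var}\,\|Z_i\|^2 = 2d, \\
\|Z_i\|^2 = d + O_p(d^{1/2})
  \;\Longrightarrow\;
  \|Z_i\| = \sqrt{d}\,\bigl(1 + O_p(d^{-1/2})\bigr)^{1/2} = \sqrt{d} + O_p(1), \\
Z_i - Z_k \sim N_d(0,\, 2 I_d) \ (i \neq k)
  \;\Longrightarrow\;
  \|Z_i - Z_k\| = \sqrt{2d} + O_p(1), \\
Z_i^{\top} Z_k = O_p(d^{1/2}) \ (i \neq k)
  \;\Longrightarrow\;
  \cos \mathrm{Angle}(Z_i, Z_k) = \frac{Z_i^{\top} Z_k}{\|Z_i\|\,\|Z_k\|} = O_p(d^{-1/2}).
\end{gather*}
```

After dividing by $\sqrt{d}$, the points therefore have length close to 1 and are close to mutually orthogonal, which is the configuration of the unit simplex vertices up to rotation.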

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distributrsquon
  • SigClust Gaussian null distributrsquon (2)
  • SigClust Gaussian null distributrsquon (3)
  • SigClust Gaussian null distributrsquon (4)
  • SigClust Gaussian null distributrsquon (5)
  • SigClust Gaussian null distributrsquon (6)
  • SigClust Gaussian null distributrsquon (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distributrsquon (8)
  • SigClust Estimation of Eigenvalrsquos
  • SigClust Estimation of Eigenvalrsquos (2)
  • SigClust Estimation of Eigenvalrsquos (3)
  • SigClust Estimation of Eigenvalrsquos (4)
  • SigClust Estimation of Eigenvalrsquos (5)
  • SigClust Gaussian null distribution - Simulation
  • SigClust Gaussian null distribution - Simulation (2)
  • An example (details to follow)
  • SigClust Modalities
  • SigClust Modalities (2)
  • SigClust Real Data Results
  • Perou 500 PCA View ndash real clusters
  • Perou 500 DWD Dirrsquons View ndash real clusters
  • Perou 500 ndash Fundamental Question
  • SigClust Results for Luminal A vs Luminal B
  • SigClust Results for Luminal A vs Luminal B (2)
  • SigClust Results for Luminal A vs Luminal B (3)
  • SigClust Results for Luminal A vs Luminal B (4)
  • SigClust Results for Luminal A vs Luminal B (5)
  • SigClust Real Data Results (2)
  • SigClust Real Data Results (3)
  • SigClust Real Data Results (4)
  • SigClust Real Data Results (5)
  • SigClust Overview
  • SigClust Overview (2)
  • SigClust Overview (3)
  • SigClust Overview (4)
  • SigClust Open Problems
  • HDLSS Asymptotics
HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices" (Modulo Rotation)

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
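The claim that standard Gaussian points are "nearly equidistant to 0" at distance about $\sqrt{d}$, and sit near the vertices of a scaled unit simplex after rotation, can be checked numerically. The following is a minimal sketch, not taken from the slides: the function name `scaled_norms_and_cosines`, the seed, and the choice n = 3 are illustrative assumptions. It reports each norm divided by $\sqrt{d}$ (expected near 1) and the cosines of the angles between pairs of points (expected near 0, i.e. nearly orthogonal, as simplex vertices are).

```python
import numpy as np


def scaled_norms_and_cosines(n, d, seed=0):
    """Draw n points from N_d(0, I_d); return norms / sqrt(d) and pairwise cosines.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))          # rows are Z_1, ..., Z_n
    norms = np.linalg.norm(z, axis=1)        # distance of each point to 0
    cosines = (z @ z.T) / np.outer(norms, norms)
    off_diagonal = cosines[~np.eye(n, dtype=bool)]  # drop the trivial cos = 1 entries
    return norms / np.sqrt(d), off_diagonal


for d in (2, 20, 200, 20000):
    ratios, cosines = scaled_norms_and_cosines(n=3, d=d)
    print(f"d={d:6d}  norm/sqrt(d)={np.round(ratios, 3)}  "
          f"max |cos|={np.max(np.abs(cosines)):.3f}")
```

As d grows, the printed ratios should settle near 1 and the largest absolute cosine should shrink, which is the sense in which only the rotation of the simplex remains random.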

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data
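The pairwise-distance statement can be checked the same way. Since $Z_i - Z_j \sim N_d(0, 2 I_d)$, each pairwise distance should be close to $\sqrt{2d}$ for large $d$. The sketch below is an illustration under stated assumptions: the name `scaled_pairwise_distances`, the seed, and n = 4 are not from the slides.

```python
from itertools import combinations

import numpy as np


def scaled_pairwise_distances(n, d, seed=0):
    """Return all n*(n-1)/2 pairwise distances divided by sqrt(2d).

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))          # rows are Z_1, ..., Z_n
    dists = np.array([np.linalg.norm(z[i] - z[j])
                      for i, j in combinations(range(n), 2)])
    return dists / np.sqrt(2 * d)


for d in (2, 20, 200, 20000):
    ratios = scaled_pairwise_distances(n=4, d=d)
    print(f"d={d:6d}  min={ratios.min():.3f}  max={ratios.max():.3f}")
```

When the minimum and maximum ratios both approach 1, all edges have nearly the same length, so the n points are close to the vertices of a regular figure with edge length $\sqrt{2d}$.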

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
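The simulation steps can be sketched as follows. This is one possible reading of the recipe, not the code behind the slides: the plane through the 3 points is found with an SVD of the centered data, and "comparable" is taken to mean rotating the first point onto the positive x-axis and reflecting so the second point lies above it. The names `planar_view` and `side_lengths`, the seeds, and the scaling by $\sqrt{2d}$ are assumptions made for illustration; plotting in different colors is left out.

```python
import numpy as np


def planar_view(d, seed=0):
    """Return 2-d coordinates of 3 Gaussian points in R^d after rotation.

    Hypothetical helper: one reading of "rotate to plane of screen".
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((3, d))
    centered = z - z.mean(axis=0)            # plane through the 3 points, via centroid
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T             # coordinates within that 2-d plane
    # Rotate within the plane so the first point lies on the positive x-axis.
    theta = np.arctan2(coords[0, 1], coords[0, 0])
    c, s = np.cos(-theta), np.sin(-theta)
    coords = coords @ np.array([[c, -s], [s, c]]).T
    # Reflect if needed so the second point has non-negative y.
    if coords[1, 1] < 0:
        coords[:, 1] *= -1
    return coords


def side_lengths(coords):
    """Lengths of the three sides of the triangle with the given vertices."""
    pairs = ((0, 1), (0, 2), (1, 2))
    return np.array([np.linalg.norm(coords[i] - coords[j]) for i, j in pairs])


for d in (2, 20, 200, 20000):
    # Repeat 10 times and summarize side lengths relative to sqrt(2d).
    sides = np.array([side_lengths(planar_view(d, seed=rep)) for rep in range(10)])
    scaled = sides / np.sqrt(2 * d)
    print(f"d={d:6d}  scaled side lengths: min={scaled.min():.3f}  max={scaled.max():.3f}")
```

If the rigidity claim holds, the 10 triangles should vary widely for d = 2 and become nearly identical equilateral triangles with scaled side length close to 1 for d = 20000.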

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Subspace Generated by Data

Hyperplane through 0, of dimension $n$

Points are "nearly equidistant to 0", & dist $\sim \sqrt{d}$

Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"

All Gaussian data sets are "near Unit Simplex Vertices"

"Randomness" appears only in rotation of simplex

Hall, Marron & Neeman (2005)
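
A small numerical check of the "near Unit Simplex Vertices" picture, as a sketch only: if the points sit near sqrt(d) times the unit simplex vertices after rotation, their norms divided by sqrt(d) should be close to 1 and the angles between them close to 90 degrees. The function name `simplex_diagnostics` and the choices of n, d and seed are illustrative assumptions, not taken from the slides.

```python
import math
import random


def simplex_diagnostics(n, d, seed=0):
    """Return (norms / sqrt(d), pairwise cosines) for n draws from N_d(0, I_d)."""
    rng = random.Random(seed)
    z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    norms = [math.sqrt(sum(x * x for x in row)) for row in z]
    scaled_norms = [r / math.sqrt(d) for r in norms]
    cosines = []
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(z[i], z[j]))
            cosines.append(dot / (norms[i] * norms[j]))
    return scaled_norms, cosines


if __name__ == "__main__":
    for d in (2, 200, 20000):
        scaled_norms, cosines = simplex_diagnostics(n=3, d=d)
        print(d,
              [round(r, 3) for r in scaled_norms],
              [round(c, 3) for c in cosines])
```

Scaled norms near 1 correspond to "nearly equidistant to 0"; cosines near 0 mean the points are nearly orthogonal, as the simplex vertices are.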

HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \dots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data

$n - 1$ dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data
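
The pairwise-equidistance claim can be checked numerically with a short sketch. The function name `distance_spread` and the values of n, d and seed are hypothetical choices for illustration. It reports the smallest and largest pairwise distance after dividing by sqrt(2d); for a regular n-hedron both would equal 1.

```python
import itertools
import math
import random


def distance_spread(n, d, seed=0):
    """Min and max pairwise distance, divided by sqrt(2d), for n draws from N_d(0, I_d)."""
    rng = random.Random(seed)
    z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    scale = math.sqrt(2 * d)
    dists = [math.dist(p, q) / scale for p, q in itertools.combinations(z, 2)]
    return min(dists), max(dists)


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        lo, hi = distance_spread(n=5, d=d)
        print(d, round(lo, 3), round(hi, 3))
```

As d grows, the two values should both approach 1, so only the orientation of the point configuration remains random.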

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distributrsquon
  • SigClust Gaussian null distributrsquon (2)
  • SigClust Gaussian null distributrsquon (3)
  • SigClust Gaussian null distributrsquon (4)
  • SigClust Gaussian null distributrsquon (5)
  • SigClust Gaussian null distributrsquon (6)
  • SigClust Gaussian null distributrsquon (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distributrsquon (8)
  • SigClust Estimation of Eigenvalrsquos
  • SigClust Estimation of Eigenvalrsquos (2)
  • SigClust Estimation of Eigenvalrsquos (3)
  • SigClust Estimation of Eigenvalrsquos (4)
  • SigClust Estimation of Eigenvalrsquos (5)
  • SigClust Gaussian null distribution - Simulation
  • SigClust Gaussian null distribution - Simulation (2)
  • An example (details to follow)
  • SigClust Modalities
  • SigClust Modalities (2)
  • SigClust Real Data Results
  • Perou 500 PCA View ndash real clusters
  • Perou 500 DWD Dirrsquons View ndash real clusters
  • Perou 500 ndash Fundamental Question
  • SigClust Results for Luminal A vs Luminal B
  • SigClust Results for Luminal A vs Luminal B (2)
  • SigClust Results for Luminal A vs Luminal B (3)
  • SigClust Results for Luminal A vs Luminal B (4)
  • SigClust Results for Luminal A vs Luminal B (5)
  • SigClust Real Data Results (2)
  • SigClust Real Data Results (3)
  • SigClust Real Data Results (4)
  • SigClust Real Data Results (5)
  • SigClust Overview
  • SigClust Overview (2)
  • SigClust Overview (3)
  • SigClust Overview (4)
  • SigClust Open Problems
  • HDLSS Asymptotics
  • HDLSS Asymptotics (2)
  • HDLSS Asymptotics (3)
  • HDLSS Asymptotics (4)
  • HDLSS Asymptotics (5)
  • HDLSS Asymptotics (6)
  • HDLSS Asymptotics (7)
  • HDLSS Asymptotics (8)
  • HDLSS Asymptotics (9)
  • HDLSS Asymptotics (10)
  • HDLSS Asymptotics (11)
  • HDLSS Asymptotics (12)
  • HDLSS Asymptotics (13)
  • HDLSS Asymptotics (14)
  • HDLSS Asymptotics (15)
  • HDLSS Asymptotics (16)
  • HDLSS Asymptotics Simple Paradoxes
  • HDLSS Asymptotics Simple Paradoxes (2)
  • HDLSS Asymptotics Simple Paradoxes (3)
  • HDLSS Asymptotics Simple Paradoxes (4)
  • HDLSS Asymptotics Simple Paradoxes (5)
  • HDLSS Asymptotics Simple Paradoxes (6)
  • HDLSS Asymptotics Simple Paradoxes (7)
  • HDLSS Asymptotics Simple Paradoxes (8)
  • HDLSS Asymptotics Simple Paradoxes (9)
  • HDLSS Asymptotics Simple Paradoxes (10)
  • HDLSS Asymptotics Simple Paradoxes (11)
  • HDLSS Asymptotics Simple Paradoxes (12)
  • HDLSS Asymptotics Simple Paradoxes (13)
  • HDLSS Asymptotics Simple Paradoxes (14)
  • HDLSS Asymptotics Simple Paradoxes (15)
  • HDLSS Asymptotics Simple Paradoxes (16)
  • HDLSS Asymptotics Simple Paradoxes (17)
  • HDLSS Asymptotics Simple Paradoxes (18)
  • HDLSS Asymptotics Simple Paradoxes (19)
  • HDLSS Asymptotics Simple Paradoxes (20)
  • HDLSS Asymptotics Simple Paradoxes (21)
  • HDLSS Asymptotics Simple Paradoxes (22)
  • HDLSS Asymptotics Simple Paradoxes (23)
  • HDLSS Asymptotics Simple Paradoxes (24)
  • HDLSS Asymptotics Simple Paradoxes (25)
  • HDLSS Asymptotics Simple Paradoxes (26)
  • HDLSS Asymptotics Simple Paradoxes (27)
  • HDLSS Asymptotics Simple Paradoxes (28)
  • HDLSS Asymptotics Simple Paradoxes (29)
  • HDLSS Asymptotics Simple Paradoxes (30)
  • HDLSS Asymptotics Simple Paradoxes (31)
  • HDLSS Asyrsquos Geometrical Representrsquon
  • HDLSS Asyrsquos Geometrical Representrsquon (2)
  • HDLSS Asyrsquos Geometrical Representrsquon (3)
  • HDLSS Asyrsquos Geometrical Representrsquon (4)
  • HDLSS Asyrsquos Geometrical Representrsquon (5)
  • HDLSS Asyrsquos Geometrical Representrsquon (6)
  • HDLSS Asyrsquos Geometrical Representrsquon (7)
  • HDLSS Asyrsquos Geometrical Representrsquon (8)
  • HDLSS Asyrsquos Geometrical Representrsquon (9)
  • HDLSS Asyrsquos Geometrical Representrsquon (10)
  • HDLSS Asyrsquos Geometrical Representrsquon (11)
  • HDLSS Asyrsquos Geometrical Represenrsquotion
  • HDLSS Asyrsquos Geometrical Represenrsquotion (2)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (3)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (4)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (5)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (6)
Page 108: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane1n

d ddn INZZ 0~1

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distributrsquon
  • SigClust Gaussian null distributrsquon (2)
  • SigClust Gaussian null distributrsquon (3)
  • SigClust Gaussian null distributrsquon (4)
  • SigClust Gaussian null distributrsquon (5)
  • SigClust Gaussian null distributrsquon (6)
  • SigClust Gaussian null distributrsquon (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distributrsquon (8)
  • SigClust Estimation of Eigenvalrsquos
  • SigClust Estimation of Eigenvalrsquos (2)
  • SigClust Estimation of Eigenvalrsquos (3)
  • SigClust Estimation of Eigenvalrsquos (4)
  • SigClust Estimation of Eigenvalrsquos (5)
  • SigClust Gaussian null distribution - Simulation
  • SigClust Gaussian null distribution - Simulation (2)
  • An example (details to follow)
  • SigClust Modalities
  • SigClust Modalities (2)
  • SigClust Real Data Results
  • Perou 500 PCA View ndash real clusters
  • Perou 500 DWD Dirrsquons View ndash real clusters
  • Perou 500 ndash Fundamental Question
  • SigClust Results for Luminal A vs Luminal B
  • SigClust Results for Luminal A vs Luminal B (2)
  • SigClust Results for Luminal A vs Luminal B (3)
  • SigClust Results for Luminal A vs Luminal B (4)
  • SigClust Results for Luminal A vs Luminal B (5)
  • SigClust Real Data Results (2)
  • SigClust Real Data Results (3)
  • SigClust Real Data Results (4)
  • SigClust Real Data Results (5)
  • SigClust Overview
  • SigClust Overview (2)
  • SigClust Overview (3)
  • SigClust Overview (4)
  • SigClust Open Problems
  • HDLSS Asymptotics
  • HDLSS Asymptotics (2)
  • HDLSS Asymptotics (3)
  • HDLSS Asymptotics (4)
  • HDLSS Asymptotics (5)
  • HDLSS Asymptotics (6)
  • HDLSS Asymptotics (7)
  • HDLSS Asymptotics (8)
  • HDLSS Asymptotics (9)
  • HDLSS Asymptotics (10)
  • HDLSS Asymptotics (11)
  • HDLSS Asymptotics (12)
  • HDLSS Asymptotics (13)
  • HDLSS Asymptotics (14)
  • HDLSS Asymptotics (15)
  • HDLSS Asymptotics (16)
  • HDLSS Asymptotics Simple Paradoxes
  • HDLSS Asymptotics Simple Paradoxes (2)
  • HDLSS Asymptotics Simple Paradoxes (3)
  • HDLSS Asymptotics Simple Paradoxes (4)
  • HDLSS Asymptotics Simple Paradoxes (5)
  • HDLSS Asymptotics Simple Paradoxes (6)
  • HDLSS Asymptotics Simple Paradoxes (7)
  • HDLSS Asymptotics Simple Paradoxes (8)
  • HDLSS Asymptotics Simple Paradoxes (9)
  • HDLSS Asymptotics Simple Paradoxes (10)
  • HDLSS Asymptotics Simple Paradoxes (11)
  • HDLSS Asymptotics Simple Paradoxes (12)
  • HDLSS Asymptotics Simple Paradoxes (13)
  • HDLSS Asymptotics Simple Paradoxes (14)
  • HDLSS Asymptotics Simple Paradoxes (15)
  • HDLSS Asymptotics Simple Paradoxes (16)
  • HDLSS Asymptotics Simple Paradoxes (17)
  • HDLSS Asymptotics Simple Paradoxes (18)
  • HDLSS Asymptotics Simple Paradoxes (19)
  • HDLSS Asymptotics Simple Paradoxes (20)
  • HDLSS Asymptotics Simple Paradoxes (21)
  • HDLSS Asymptotics Simple Paradoxes (22)
  • HDLSS Asymptotics Simple Paradoxes (23)
  • HDLSS Asymptotics Simple Paradoxes (24)
  • HDLSS Asymptotics Simple Paradoxes (25)
  • HDLSS Asymptotics Simple Paradoxes (26)
  • HDLSS Asymptotics Simple Paradoxes (27)
  • HDLSS Asymptotics Simple Paradoxes (28)
  • HDLSS Asymptotics Simple Paradoxes (29)
  • HDLSS Asymptotics Simple Paradoxes (30)
  • HDLSS Asymptotics Simple Paradoxes (31)
  • HDLSS Asyrsquos Geometrical Representrsquon
  • HDLSS Asyrsquos Geometrical Representrsquon (2)
  • HDLSS Asyrsquos Geometrical Representrsquon (3)
  • HDLSS Asyrsquos Geometrical Representrsquon (4)
  • HDLSS Asyrsquos Geometrical Representrsquon (5)
  • HDLSS Asyrsquos Geometrical Representrsquon (6)
  • HDLSS Asyrsquos Geometrical Representrsquon (7)
  • HDLSS Asyrsquos Geometrical Representrsquon (8)
  • HDLSS Asyrsquos Geometrical Representrsquon (9)
  • HDLSS Asyrsquos Geometrical Representrsquon (10)
  • HDLSS Asyrsquos Geometrical Representrsquon (11)
  • HDLSS Asyrsquos Geometrical Represenrsquotion
  • HDLSS Asyrsquos Geometrical Represenrsquotion (2)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (3)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (4)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (5)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (6)
Page 109: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

1n

d ddn INZZ 0~1

d2~

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distributrsquon
  • SigClust Gaussian null distributrsquon (2)
  • SigClust Gaussian null distributrsquon (3)
  • SigClust Gaussian null distributrsquon (4)
  • SigClust Gaussian null distributrsquon (5)
  • SigClust Gaussian null distributrsquon (6)
  • SigClust Gaussian null distributrsquon (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distributrsquon (8)
  • SigClust Estimation of Eigenvalrsquos
  • SigClust Estimation of Eigenvalrsquos (2)
  • SigClust Estimation of Eigenvalrsquos (3)
  • SigClust Estimation of Eigenvalrsquos (4)
  • SigClust Estimation of Eigenvalrsquos (5)
  • SigClust Gaussian null distribution - Simulation
  • SigClust Gaussian null distribution - Simulation (2)
  • An example (details to follow)
  • SigClust Modalities
  • SigClust Modalities (2)
  • SigClust Real Data Results
  • Perou 500 PCA View ndash real clusters
  • Perou 500 DWD Dirrsquons View ndash real clusters
  • Perou 500 ndash Fundamental Question
  • SigClust Results for Luminal A vs Luminal B
  • SigClust Results for Luminal A vs Luminal B (2)
  • SigClust Results for Luminal A vs Luminal B (3)
  • SigClust Results for Luminal A vs Luminal B (4)
  • SigClust Results for Luminal A vs Luminal B (5)
  • SigClust Real Data Results (2)
  • SigClust Real Data Results (3)
  • SigClust Real Data Results (4)
  • SigClust Real Data Results (5)
  • SigClust Overview
  • SigClust Overview (2)
  • SigClust Overview (3)
  • SigClust Overview (4)
  • SigClust Open Problems
  • HDLSS Asymptotics
  • HDLSS Asymptotics (2)
  • HDLSS Asymptotics (3)
  • HDLSS Asymptotics (4)
  • HDLSS Asymptotics (5)
  • HDLSS Asymptotics (6)
  • HDLSS Asymptotics (7)
  • HDLSS Asymptotics (8)
  • HDLSS Asymptotics (9)
  • HDLSS Asymptotics (10)
  • HDLSS Asymptotics (11)
  • HDLSS Asymptotics (12)
  • HDLSS Asymptotics (13)
  • HDLSS Asymptotics (14)
  • HDLSS Asymptotics (15)
  • HDLSS Asymptotics (16)
  • HDLSS Asymptotics Simple Paradoxes
  • HDLSS Asymptotics Simple Paradoxes (2)
  • HDLSS Asymptotics Simple Paradoxes (3)
  • HDLSS Asymptotics Simple Paradoxes (4)
  • HDLSS Asymptotics Simple Paradoxes (5)
  • HDLSS Asymptotics Simple Paradoxes (6)
  • HDLSS Asymptotics Simple Paradoxes (7)
  • HDLSS Asymptotics Simple Paradoxes (8)
  • HDLSS Asymptotics Simple Paradoxes (9)
  • HDLSS Asymptotics Simple Paradoxes (10)
  • HDLSS Asymptotics Simple Paradoxes (11)
  • HDLSS Asymptotics Simple Paradoxes (12)
  • HDLSS Asymptotics Simple Paradoxes (13)
  • HDLSS Asymptotics Simple Paradoxes (14)
  • HDLSS Asymptotics Simple Paradoxes (15)
  • HDLSS Asymptotics Simple Paradoxes (16)
  • HDLSS Asymptotics Simple Paradoxes (17)
  • HDLSS Asymptotics Simple Paradoxes (18)
  • HDLSS Asymptotics Simple Paradoxes (19)
  • HDLSS Asymptotics Simple Paradoxes (20)
  • HDLSS Asymptotics Simple Paradoxes (21)
  • HDLSS Asymptotics Simple Paradoxes (22)
  • HDLSS Asymptotics Simple Paradoxes (23)
  • HDLSS Asymptotics Simple Paradoxes (24)
  • HDLSS Asymptotics Simple Paradoxes (25)
  • HDLSS Asymptotics Simple Paradoxes (26)
  • HDLSS Asymptotics Simple Paradoxes (27)
  • HDLSS Asymptotics Simple Paradoxes (28)
  • HDLSS Asymptotics Simple Paradoxes (29)
  • HDLSS Asymptotics Simple Paradoxes (30)
  • HDLSS Asymptotics Simple Paradoxes (31)
  • HDLSS Asyrsquos Geometrical Representrsquon
  • HDLSS Asyrsquos Geometrical Representrsquon (2)
  • HDLSS Asyrsquos Geometrical Representrsquon (3)
  • HDLSS Asyrsquos Geometrical Representrsquon (4)
  • HDLSS Asyrsquos Geometrical Representrsquon (5)
  • HDLSS Asyrsquos Geometrical Representrsquon (6)
  • HDLSS Asyrsquos Geometrical Representrsquon (7)
  • HDLSS Asyrsquos Geometrical Representrsquon (8)
  • HDLSS Asyrsquos Geometrical Representrsquon (9)
  • HDLSS Asyrsquos Geometrical Representrsquon (10)
  • HDLSS Asyrsquos Geometrical Representrsquon (11)
  • HDLSS Asyrsquos Geometrical Represenrsquotion
  • HDLSS Asyrsquos Geometrical Represenrsquotion (2)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (3)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (4)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (5)
  • HDLSS Asyrsquos Geometrical Represenrsquotion (6)
Page 110: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Representrsquon

Assume let

Study Hyperplane Generated by Data

dimensional hyperplane

Points are pairwise equidistant dist

Points lie at vertices of

ldquoregular hedronrdquo

Again ldquorandomness in datardquo is only in rotation

Surprisingly rigid structure in random data

1n

d ddn INZZ 0~1

d2

d2~

n

>

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screen

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquo

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View study ldquorigidity after rotationrdquobull Simple 3 point data setsbull In dimensions d = 2 20 200 20000bull Generate hyperplane of dimension 2bull Rotate that to plane of screenbull Rotate within plane to make ldquocomparablerdquobull Repeat 10 times use different colors

HDLSS Asyrsquos Geometrical Represenrsquotion

Simulation View Shows ldquoRigidity after Rotationrdquo

  • Interesting Statistical Problem
  • Statistical Significance of Clusters
  • SigClust Gaussian null distributrsquon
  • SigClust Gaussian null distributrsquon (2)
  • SigClust Gaussian null distributrsquon (3)
  • SigClust Gaussian null distributrsquon (4)
  • SigClust Gaussian null distributrsquon (5)
  • SigClust Gaussian null distributrsquon (6)
  • SigClust Gaussian null distributrsquon (7)
  • SigClust Estimation of Background Noise
  • Q-Q plots
  • Q-Q plots (2)
  • Q-Q plots (3)
  • Q-Q plots (4)
  • Q-Q plots (5)
  • Q-Q plots (6)
  • Alternate Terminology
  • Q-Q plots (7)
  • Q-Q plots (8)
  • Q-Q plots (9)
  • Q-Q plots (10)
  • Q-Q plots (11)
  • SigClust Estimation of Background Noise (2)
  • SigClust Estimation of Background Noise (3)
  • SigClust Estimation of Background Noise (4)
  • SigClust Gaussian null distributrsquon (8)
  • SigClust Estimation of Eigenvalrsquos
  • SigClust Estimation of Eigenvalrsquos (2)
  • SigClust Estimation of Eigenvalrsquos (3)
  • SigClust Estimation of Eigenvalrsquos (4)
  • SigClust Estimation of Eigenvalrsquos (5)
  • SigClust Gaussian null distribution - Simulation
  • SigClust Gaussian null distribution - Simulation (2)
  • An example (details to follow)
  • SigClust Modalities
  • SigClust Modalities (2)
  • SigClust Real Data Results
  • Perou 500 PCA View ndash real clusters
  • Perou 500 DWD Dirrsquons View ndash real clusters
  • Perou 500 ndash Fundamental Question
  • SigClust Results for Luminal A vs Luminal B
  • SigClust Results for Luminal A vs Luminal B (2)
  • SigClust Results for Luminal A vs Luminal B (3)
  • SigClust Results for Luminal A vs Luminal B (4)
  • SigClust Results for Luminal A vs Luminal B (5)
  • SigClust Real Data Results (2)
  • SigClust Real Data Results (3)
  • SigClust Real Data Results (4)
  • SigClust Real Data Results (5)
  • SigClust Overview
  • SigClust Overview (2)
  • SigClust Overview (3)
  • SigClust Overview (4)
  • SigClust Open Problems
  • HDLSS Asymptotics
  • HDLSS Asymptotics (2)
  • HDLSS Asymptotics (3)
  • HDLSS Asymptotics (4)
  • HDLSS Asymptotics (5)
  • HDLSS Asymptotics (6)
  • HDLSS Asymptotics (7)
  • HDLSS Asymptotics (8)
  • HDLSS Asymptotics (9)
  • HDLSS Asymptotics (10)
  • HDLSS Asymptotics (11)
  • HDLSS Asymptotics (12)
  • HDLSS Asymptotics (13)
  • HDLSS Asymptotics (14)
HDLSS Asy's: Geometrical Represent'n

Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$

Study Hyperplane Generated by Data:

$(n-1)$-dimensional hyperplane

Points are pairwise equidistant, dist $\sim \sqrt{2d}$

Points lie at vertices of "regular $n$-hedron"

Again "randomness in data" is only in rotation

Surprisingly rigid structure in random data
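A short supporting calculation for the $\sqrt{2d}$ distance, sketched under the slide's standard Gaussian assumption. The notation $Z_{ik}$ for the $k$-th coordinate of $Z_i$ is introduced here and is not from the slides:

```latex
\begin{align*}
\|Z_i - Z_j\|^2 &= \sum_{k=1}^{d} (Z_{ik} - Z_{jk})^2,
  \qquad Z_{ik} - Z_{jk} \sim N(0, 2) \ \text{independently over } k, \\
\|Z_i - Z_j\|^2 &\sim 2\,\chi^2_d,
  \qquad \text{mean } 2d, \quad \text{standard deviation } \sqrt{8d}, \\
\|Z_i - Z_j\| &= \sqrt{2d} + O_p(1) \qquad (d \to \infty).
\end{align*}
```

The relative fluctuation is of order $d^{-1/2}$, so after scaling by $\sqrt{2d}$ all $n(n-1)/2$ pairwise distances agree in the limit. That is the sense in which the points sit at the vertices of a regular simplex.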

HDLSS Asy's: Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane to make "comparable"
• Repeat 10 times, use different colors

HDLSS Asy's: Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
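A minimal sketch of this kind of simulation, assuming standard Gaussian data as on the preceding slide. Instead of rotating each triangle to the plane of the screen, it summarizes the triangle by its three side lengths divided by $\sqrt{2d}$, which are unchanged by rotation. The function name `pairwise_distance_ratios` and the seed are illustrative choices, not from the slides.

```python
import numpy as np


def pairwise_distance_ratios(n, d, rng):
    """Draw n points from N_d(0, I_d) and return the n(n-1)/2 pairwise
    Euclidean distances, each divided by sqrt(2d)."""
    Z = rng.standard_normal((n, d))
    diffs = Z[:, None, :] - Z[None, :, :]          # shape (n, n, d)
    dists = np.sqrt((diffs ** 2).sum(axis=-1))     # shape (n, n)
    upper = np.triu_indices(n, k=1)                # each pair counted once
    return dists[upper] / np.sqrt(2 * d)


rng = np.random.default_rng(0)
results = {}
for d in (2, 20, 200, 20000):
    # 10 replications of a 3 point data set, as in the slide
    ratios = np.vstack([pairwise_distance_ratios(3, d, rng) for _ in range(10)])
    results[d] = ratios
    print(f"d = {d:6d}: side / sqrt(2d) ranges over "
          f"[{ratios.min():.3f}, {ratios.max():.3f}]")
```

If the geometric representation holds, the printed range should tighten around 1 as d grows, meaning every replication is close to the same equilateral triangle up to rotation.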

Page 117: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asy's Geometrical Represen'tion

Simulation View: study "rigidity after rotation"
• Simple 3 point data sets
• In dimensions d = 2, 20, 200, 20000
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make "comparable"
• Repeat 10 times, use different colors
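
The simulation steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code behind the slides. It assumes standard Gaussian data, takes the "hyperplane of dimension 2" to be the plane through the three points, and makes the triangles "comparable" by putting the centroid at the origin, pointing the first vertex straight up, and dividing by sqrt(d). The function names (planar_triangle, pairwise_distances, simulate) and the scaling are assumptions made here for illustration.

```python
import numpy as np


def planar_triangle(points):
    """Map 3 points in R^d to 2-D coordinates in the plane they span."""
    centered = points - points.mean(axis=0)      # put the centroid at the origin
    # The centered rows sum to zero, so they span at most 2 dimensions;
    # the leading right singular vectors give an orthonormal basis of that plane.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T                 # shape (3, 2), distances preserved
    # Rotate within the plane so that vertex 0 points straight up.
    angle = np.pi / 2 - np.arctan2(coords[0, 1], coords[0, 0])
    c, s = np.cos(angle), np.sin(angle)
    coords = coords @ np.array([[c, -s], [s, c]]).T
    # Fix the mirror-image ambiguity: keep vertex 1 on the right-hand side.
    if coords[1, 0] < 0:
        coords[:, 0] *= -1
    return coords


def pairwise_distances(xy):
    """Return the distances between all pairs of rows of xy."""
    i, j = np.triu_indices(len(xy), k=1)
    return np.linalg.norm(xy[i] - xy[j], axis=1)


def simulate(d, n_rep=10, seed=0):
    """Draw n_rep three-point Gaussian data sets in R^d, rotated to the screen plane."""
    rng = np.random.default_rng(seed)
    triangles = []
    for _ in range(n_rep):
        pts = rng.standard_normal((3, d))
        # Dividing by sqrt(d) is an assumed normalization to compare across d.
        triangles.append(planar_triangle(pts) / np.sqrt(d))
    return np.stack(triangles)                   # shape (n_rep, 3, 2)


if __name__ == "__main__":
    for d in (2, 20, 200, 20000):
        tri = simulate(d)
        dists = np.array([pairwise_distances(t) for t in tri])
        print(d, dists.min().round(3), dists.max().round(3))
```

Under these assumptions the scaled side lengths vary widely for d = 2 and concentrate near sqrt(2) for d = 20000, so the ten triangles look almost identical (nearly equilateral) once rotated to the screen. Plotting each repetition in a different color, as on the slide, is left out to keep the sketch short.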

Page 118: Interesting Statistical Problem For HDLSS data: When clusters seem to appear E.g. found by clustering method How do we know they are really there? Question

HDLSS Asy's Geometrical Represen'tion

Simulation View: Shows "Rigidity after Rotation"
