geometric statistics for high-dimensional data …...geometric statistics for high-dimensional data...

29
Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint work with Lindsey Dietz, Megan Heyman, Subhabrata (Subho) Majumdar, and Ujjal Mukherjee April 25, 2018

Upload: others

Post on 07-Apr-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Geometric Statistics for High-DimensionalData Analysis

Snigdhansu Chatterjee

School of Statistics, University of Minnesota

Joint work with Lindsey Dietz, Megan Heyman, Subhabrata (Subho) Majumdar,and Ujjal Mukherjee

April 25, 2018

Page 2: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Major contributors

Page 3: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Outline

Quantiles: univariate, multivariate

Geometric quantiles for classification

The Indian Summer Monsoons: GSQ for feature selection

fMRI data: GSQ for spatio-temporal modeling

Page 4: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles

I Suppose X ∈ R is a random variable.I For any α ∈ (0,1), the αth quantile Qα is the number below

which X is observed with probability α, i .e.Qα = inf{q : P [X ≤ q] ≥ α.

Theorem

If X is (absolutely) continuous with cumulative distributionfunction F (·), then F (X ) ∼ Uniform(0,1), and there is aone-to-one relationship between α and Qα.

Page 5: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.

I (An extension) The αth quantile Qα is the (unique)minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.I Define quantiles in any inner-product space as minimizers

of Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 6: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.I (An extension) The αth quantile Qα is the (unique)

minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.I Define quantiles in any inner-product space as minimizers

of Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 7: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.I (An extension) The αth quantile Qα is the (unique)

minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.I Define quantiles in any inner-product space as minimizers

of Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 8: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.I (An extension) The αth quantile Qα is the (unique)

minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.

I Define quantiles in any inner-product space as minimizersof Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 9: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate quantiles: an alternative view

I The median is the (unique) minimizer of Ψ(q) = E|X − q|.I (An extension) The αth quantile Qα is the (unique)

minimizer of

Ψ(q) = E{|X − q|+ (2α− 1)(X − q)}.

I (Alternative notation) Define u = 2α− 1 ∈ (−1,1). Theuth quantile Qu is the (unique) minimizer of

Ψ(q) = E{|X − q|+ u(X − q)}

I = E{||X − q||+ < u,X − q >}.I Define quantiles in any inner-product space as minimizers

of Ψu(q) = E{||X − q||+ < u,X − q >}. (Haldane (1948),Chaudhuri (1996).)

Page 10: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Univariate to multivariate quantiles

Univariate quantiles:

For every u ∈ {x : ||x || < 1} ⊂ R, Q(u) minimizesΨu(q) = E [||X − q||+ < u,X − q >].

Write x = xuu/||u||+ xu⊥ .Generally, for some λ ≥ 0, the generalized spatial quantile(GSQ) are:

1. indexed by vectors in the unit ball u ∈ Bp = {x : ||x || < 1},and

2. the u-th quantile Q(u) is the minimizer of

Ψuλ(q) = E[|Xu − qu|

{1 + λ(Xu − qu)−2||Xu⊥ − qu⊥ ||2

}1/2

+||u||(Xu − qu)].

Page 11: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Bivariate quantiles

−4 −2 0 2 4

−4−2

02

4

Support of Distn

Q(u)

−1.0 −0.5 0.0 0.5 1.0

Domain

u

Page 12: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Bahadur representation of generalized spatial quantiles

TheoremThe following asymptotic Bahadur-type representation holdswith probability 1 for any u:

n1/2(Q(u)−Q(u)) = −n−1/2H−1Sn + O(n−(1+s)/4(log n)1/2(log log n)(1+s)/4)

as n→∞.

(Apologies for not including the details.)

Page 13: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Projection quantiles

Generalized spatial quantiles minimize:

Ψuλ(q) = E[|Xu − qu|

{1 + λ(Xu − qu)−2||Xu⊥ − qu⊥ ||2

}1/2+ ||u||(Xu − qu)

].

Set λ = 0 to get projection quantiles.

I Computationally extremely simple, no limitations fromsample size and dimension (high p, low n allowed).

I Projection quantiles based confidence sets have exactcoverage.

I Works on infinite-dimensional spaces.

Page 14: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Projection quantiles

Theorem

Projection quantiles have a one-to-one relationship with the unitball, like univariate quantiles.

Page 15: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Example: simulated data plots

Figure: Simulated data with a few GSQ (covered areas are deliberatelydifferent)

Page 16: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Outline

Quantiles: univariate, multivariate

Geometric quantiles for classification

The Indian Summer Monsoons: GSQ for feature selection

fMRI data: GSQ for spatio-temporal modeling

Page 17: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

GSQ-depths are great for classification

Figure: A simulated 2-class classification problem with GSQ-depth classifier

Page 18: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

GSQ-depth based classification: some results

Method CPU Time AccuracyGSQ 3.67 0.925Random Forest 16714.20 0.895SVM 966.86 0.842LDA 0.28 0.74Logit 0.35 0.69

Table: Arcene classification without feature selection (neural nets didnot converge)

Page 19: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Outline

Quantiles: univariate, multivariate

Geometric quantiles for classification

The Indian Summer Monsoons: GSQ for feature selection

fMRI data: GSQ for spatio-temporal modeling

Page 20: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

The data on monsoons

Figure: Air from the eastern Indian Ocean (yellow) and airdescending over Arabia (blue) converge in the Somali jet. Lowpressure at 30S. {Courtesy: UMn Climate Expeditions team.}

Page 21: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Variable dropped en(S−j )- Tmax 0.1490772- X120W 0.2190159- ELEVATION 0.2288938- X120E 0.2290021- ∆TT_Deg_Celsius 0.2371846- X80E 0.2449195- LATITUDE 0.2468698- TNH 0.2538924- Nino34 0.2541503- X10W 0.2558397- LONGITUDE 0.2563105- X100E 0.2565388- EAWR 0.2565687- X70E 0.2596766- v_wind_850 0.2604214- X140E 0.2609039- X40W 0.261159- SolarFlux 0.2624313- X160E 0.2626321- EPNP 0.2630901- TempAnomaly 0.2633658- u_wind_850 0.2649837- WP 0.2660394<none> 0.2663496- POL 0.2677756- Tmin 0.268231- X20E 0.2687891- EA 0.2690791- u_wind_200 0.2692731- u_wind_600 0.2695297- SCA 0.2700276- DMI 0.2700579- PNA 0.2715089- v_wind_200 0.2731708- v_wind_600 0.2748239- NAO 0.2764488

Table: Ordered values of en(S−j ) after dropping the j-th variable fromthe full model in the Indian summer precipitation data

Page 22: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

2004 2006 2008 2010 2012

−3

−2

−1

01

23

year

Bia

s

● ●●

Full modelReduced model

● ●

2004 2006 2008 2010 2012

02

46

8

year

MS

E

● ● ●● ●

●●

Full modelReduced model

(a) (b)

Figure: Comparing full model rolling predictions with reducedmodels: (a) Bias across years, (b) MSE across years.

Page 23: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

−2 0 2 4 6 8 10

0.0

0.1

0.2

0.3

0.4

0.5

Year 2012

log(PRCP+1)

dens

ity

TruthFull model predReduced model pred

2012

●●

●●

Positive residnegative resid

(c) (d)

Figure: Comparing full model rolling predictions with reducedmodels: (c) density plots for 2012, (d) stationwise residuals for 2012

Page 24: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Outline

Quantiles: univariate, multivariate

Geometric quantiles for classification

The Indian Summer Monsoons: GSQ for feature selection

fMRI data: GSQ for spatio-temporal modeling

Page 25: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

A brief outline

I We consider 19 tests subjects, with 2 kinds of visualstasks.

I Each subject went through 9 runs, where they saw faces orscrambled images, and had to react.

I We fit a spatio-temporal model. Temporally, we fit a AR(5)with quadratic drift. Spatially, we consider different layersnearest neighbor voxels.

I We measure the degree of spatial dependency in differentregions of the brain.

I The figures below are for one subject in one run.

Page 26: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

x = 48

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

y = 7

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

z = 12

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

z = 8

Figure: Plot of significant p-values at 95% confidence level at thespecified cross-sections.

Page 27: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Figure: A smoothed surface obtained from the p-values clearlyshows high spatial dependence in right optic nerve, auditory nerves,auditory cortex and left visual cortex areas

Page 28: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Acknowledgment:

I This research is partially supported by the NationalScience Foundation (NSF) under grants # DMS-1622483,# DMS-1737918, and by the National Aeronautics andSpace Administration (NASA).

I This research is partially supported by the Institute on theEnvironment (IonE).

Page 29: Geometric Statistics for High-Dimensional Data …...Geometric Statistics for High-Dimensional Data Analysis Snigdhansu Chatterjee School of Statistics, University of Minnesota Joint

Thank you