Dimensionality Reduction II: ICA
Sam Norman-Haignere
Jan 21, 2016


Page 1: Dimensionality Reduction II: ICA

Dimensionality Reduction II: ICA
Sam Norman-Haignere
Jan 21, 2016

Page 2: Dimensionality Reduction II: ICA

•  Many signals reflect linear mixtures of multiple ‘sources’:

⇒  Audio signals from multiple speakers

⇒  Neuroimaging measures of neural activity

(e.g. EEG, fMRI, calcium imaging)

•  Often want to recover underlying sources from the mixed signal

⇒  ICA algorithms provide general-purpose statistical machinery

(given certain key assumptions)

Motivation

Page 3: Dimensionality Reduction II: ICA

•  Several sounds being played simultaneously

•  Microphones at different locations record the mixed signal

Classic Example: Cocktail Party Problem

Page 4: Dimensionality Reduction II: ICA

•  Several sounds being played simultaneously

•  Microphones at different locations record the mixed signal

ICA can recover the individual sound sources

⇒  Only true if # microphones ≥ # sources

http://research.ics.aalto.fi/ica/cocktail/cocktail_en.cgi

Classic Example: Cocktail Party Problem
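Below is a minimal sketch of this kind of unmixing. The synthetic signals, the mixing matrix, and the use of scikit-learn's FastICA are all illustrative assumptions; the linked demo uses its own recordings and implementation.

```python
# Cocktail-party sketch: mix 3 synthetic sources into 3 "microphones",
# then unmix with FastICA. All signals here are placeholders.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(2 * t),                 # source 1: sinusoid
          np.sign(np.sin(3 * t)),        # source 2: square wave
          rng.laplace(size=t.size)]      # source 3: sparse noise
A = rng.uniform(0.5, 1.5, size=(3, 3))   # unknown mixing matrix
X = S @ A.T                              # the 3 "microphone" recordings

ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X)             # recovered sources, up to order and scale
```

Note that ICA only determines sources up to permutation and scaling, so the recovered waveforms need not match the originals' order or amplitude.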

Page 5: Dimensionality Reduction II: ICA

Why is this a classic demo?

⇒ Impressive

⇒ Assumptions of ICA fit the problem well:

1.  Sound source waveforms close to independent

2.  Audio mixing truly linear

3.  Sound source waveforms have non-Gaussian amplitude distribution

Classic Example: Cocktail Party Problem

[Figure: histogram of speech amplitudes with a Gaussian overlaid]

Page 6: Dimensionality Reduction II: ICA

Frequently used to denoise EEG timecourses

⇒ Artifacts (e.g. eye blinks) are mostly independent of neural activity and have non-Gaussian amplitude distributions

⇒ EEG channels modeled as linear mixture of artifacts and neural activity

Neuroimaging Examples: EEG

Page 7: Dimensionality Reduction II: ICA

fMRI ‘voxels’ contain hundreds of thousands of neurons

⇒  Plausibly contain neural populations with distinct selectivity

fMRI responses (which reflect blood oxygenation) are an approximately linear function of neural activity

⇒  Use component analysis to unmix responses from different neural populations?

Neuroimaging Examples: fMRI

Page 8: Dimensionality Reduction II: ICA

Working Example from My Research (for Convenience)

•  Measured fMRI responses to 165 natural sounds:

1. Man speaking
2. Flushing toilet
3. Pouring liquid
4. Tooth-brushing
5. Woman speaking
6. Car accelerating
7. Biting and chewing
8. Laughing
9. Typing
10. Car engine starting
11. Running water
12. Breathing
13. Keys jangling
14. Dishes clanking
15. Ringtone
16. Microwave
17. Dog barking
18. Walking (hard surface)
19. Road traffic
20. Zipper
21. Cellphone vibrating
22. Water dripping
23. Scratching
24. Car windows
25. Telephone ringing
26. Chopping food
27. Telephone dialing
28. Girl speaking
29. Car horn
30. Writing
31. Computer startup sound
32. Background speech
33. Songbird
34. Pouring water
35. Pop song
36. Water boiling
37. Guitar
38. Coughing
39. Crumpling paper
40. Siren
…

Page 9: Dimensionality Reduction II: ICA

•  Measured fMRI responses to 165 natural sounds:

•  For each voxel, measure average response to each sound

Working Example from My Research (for Convenience)

[Figure: responses of three example voxels; x-axis: all 165 sounds; y-axis: response (% signal change), 0 to 3]

Page 10: Dimensionality Reduction II: ICA

•  Measured fMRI responses to 165 natural sounds:

•  For each voxel, measure average response to each sound

•  Compile all voxel responses into a matrix

Working Example from My Research (for Convenience)

[Figure: the full data matrix, 165 sounds × 11,065 voxels; color: response magnitude]

Hypothesis: perhaps a small number of neural populations, each with a canonical response to the sound set, explains the responses of thousands of voxels?

Page 11: Dimensionality Reduction II: ICA

Linear Model of Voxel Responses

Voxel responses are modeled as a weighted sum of component response profiles

[Figure: a voxel's response (response magnitude across sounds) depicted as a weighted sum of component response profiles]

$$v = \sum_{i=1}^{N} w_i r_i$$

where $v$ is the voxel's response across all sounds, $r_i$ is the response profile of component $i$, and $w_i$ is that component's weight in the voxel.
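A toy version of this model in code, with the matrix shapes from the slides and random placeholders standing in for the real data:

```python
# Each voxel's 165-sound response is a weighted sum of N component profiles.
import numpy as np

rng = np.random.default_rng(0)
n_sounds, n_voxels, N = 165, 11065, 6
R = rng.random((n_sounds, N))     # component response profiles r_1..r_N (columns)
W = rng.random((N, n_voxels))     # voxel weights w_i for every voxel

D = R @ W                         # all voxels at once
v0 = R @ W[:, 0]                  # one voxel: v = sum_i w_i r_i
assert np.allclose(D[:, 0], v0)
```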

Page 12: Dimensionality Reduction II: ICA

Matrix Factorization

Factor the response matrix into a set of components, each with:
1. A response profile across all 165 sounds
2. Voxel weights specifying the contribution of each component to each voxel

[Figure: the 165 sounds × 11,065 voxels matrix factored into response profiles (165 sounds × number of components) times voxel weight vectors (number of components × 11,065 voxels)]


Page 15: Dimensionality Reduction II: ICA

Matrix Factorization

The matrix approximation is ill-posed (many equally good solutions)
⇒ Must be constrained with additional assumptions
⇒ Different techniques make different assumptions

[Figure: the same factorization schematic as above]
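A quick numerical illustration of the ill-posedness: any invertible matrix A produces different factors with an identical product, so the data alone cannot pick out the components.

```python
# (R A) and (A^{-1} W) fit the data exactly as well as (R, W).
import numpy as np

rng = np.random.default_rng(1)
R = rng.random((165, 6))                 # response profiles
W = rng.random((6, 11065))               # voxel weights
A = rng.random((6, 6)) + 6 * np.eye(6)   # an arbitrary invertible matrix

D1 = R @ W
D2 = (R @ A) @ (np.linalg.inv(A) @ W)    # different factors, same product
assert np.allclose(D1, D2)
```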

Page 16: Dimensionality Reduction II: ICA

Principal Components Analysis (PCA)

For PCA to infer the underlying components, they must:
1.  Have uncorrelated response profiles and voxel weights
2.  Explain different amounts of response variance

[Figure: the same factorization schematic as above]

Page 17: Dimensionality Reduction II: ICA

Independent Components Analysis (ICA)

For ICA to infer the underlying components, they must:
1.  Have non-Gaussian and statistically independent voxel weights

[Figure: the same factorization schematic as above]

Page 18: Dimensionality Reduction II: ICA

An Aside on Statistical Independence

Saying that voxel weights are independent means the weight of one component tells you nothing about the weight of another:

$$p(w_1, w_2) = p(w_1)\,p(w_2)$$

Statistical independence is a stronger assumption than uncorrelatedness:

⇒ All independent variables are uncorrelated
⇒ Not all uncorrelated variables are independent
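A numerical sketch of the distinction, using the standard textbook example: with w1 symmetric about zero and w2 = w1², the two are uncorrelated yet completely dependent.

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.standard_normal(100_000)
w2 = w1 ** 2                            # a deterministic function of w1

r = np.corrcoef(w1, w2)[0, 1]
print(f"correlation ~ {r:.3f}")         # near 0, despite total dependence
```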

Page 19: Dimensionality Reduction II: ICA

An Aside on Non-Gaussianity

Many ways for a distribution to be non-Gaussian:

[Figure: three histograms of voxel weights with a Gaussian overlaid: more sparse than Gaussian, less sparse than Gaussian, and skewed; x-axis: voxel weight, y-axis: probability]

Page 20: Dimensionality Reduction II: ICA

Non-Gaussianity and Statistical Independence

Central limit theorem (non-technical):

Sums of independent non-Gaussian distributions become more Gaussian

Consequence: Maximally non-Gaussian projections of the data are more likely to be sources

What if the sources have a Gaussian distribution?

Out of luck: Sums of Gaussian distributions remain Gaussian
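A small simulation of this intuition (the Laplace sources and variance-preserving scaling are my choices): the excess kurtosis of an average of independent sparse sources shrinks toward the Gaussian value of 0 as sources are added.

```python
import numpy as np
from scipy.stats import kurtosis   # Fisher convention: Gaussian -> 0

rng = np.random.default_rng(0)
for k in (1, 2, 8, 32):
    # average k independent Laplace sources, rescaled to keep variance fixed
    s = rng.laplace(size=(100_000, k)).mean(axis=1) * np.sqrt(k)
    print(f"{k:2d} sources mixed: excess kurtosis = {kurtosis(s):+.2f}")
```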

Page 21: Dimensionality Reduction II: ICA

Toy Example

Speech-selective population vs. music-selective population

[Figure: each population's response (0 to 1) to piano, man speaking, and pop song, alongside a 3-D scatter of simulated voxel responses with one axis per sound]


Page 26: Dimensionality Reduction II: ICA

Toy Example

[Figure: 3-D scatter of simulated voxel responses; axes: piano, man speaking, pop song]

Page 27: Dimensionality Reduction II: ICA

Toy Example

[Figure: the same 3-D scatter with the two "true" source dimensions drawn in]

Page 28: Dimensionality Reduction II: ICA

Limitation of PCA

•  PCA infers the right subspace
•  But the specific directions are misaligned

[Figure: left, the 3-D scatter with the "true" dimensions and the PCA dimensions overlaid; right, the data projected onto PC1 and PC2]

Page 29: Dimensionality Reduction II: ICA

Limitation of PCA

[Figure: the same 3-D scatter and PC1/PC2 projection as above]

Can recover the "true" dimensions by rotating in the whitened PCA space

Page 30: Dimensionality Reduction II: ICA

Rotating PCA Dimensions

[Figure: left, the 3-D scatter with a candidate pair of rotated dimensions; right, independence (negentropy) as a function of rotation angle, from 0 to 0.5 pi radians]
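A sketch of the search behind this plot, with two simplifying assumptions: two Laplace sources stand in for the toy data, and summed excess kurtosis stands in for the negentropy estimate used in the slides.

```python
# Whiten 2-D mixed data with PCA, then sweep rotation angles and score
# the non-Gaussianity of the rotated projections.
import numpy as np

rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 50_000))             # two independent sparse sources
X = np.array([[1.0, 0.6], [0.4, 1.0]]) @ S    # mixed data

evals, evecs = np.linalg.eigh(np.cov(X))      # PCA of the covariance
Z = np.diag(evals ** -0.5) @ evecs.T @ X      # whitened: cov(Z) = identity

def non_gaussianity(theta):
    c, s = np.cos(theta), np.sin(theta)
    Y = np.array([[c, -s], [s, c]]) @ Z       # rotate the whitened data
    return np.sum(np.mean(Y ** 4, axis=1) - 3)  # summed excess kurtosis

thetas = np.linspace(0, np.pi / 2, 181)
best = thetas[np.argmax([non_gaussianity(t) for t in thetas])]
print(f"most non-Gaussian rotation ~ {best:.3f} rad")
```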


Page 35: Dimensionality Reduction II: ICA

ICA Dimensions

[Figure: left, the data projected onto IC1 and IC2; right, the 3-D scatter with the ICA dimensions aligned to the "true" dimensions]

ICA rotates PCA components to maximize statistical independence / non-Gaussianity

Page 36: Dimensionality Reduction II: ICA

What if the data is Gaussian?

[Figure: left, a 3-D Gaussian point cloud over the piano / man speaking / pop song axes with the PCA dimensions drawn; right, the projection onto the PCA1 and PCA2 weights, a circularly symmetric cloud]

Page 37: Dimensionality Reduction II: ICA

For Gaussian distributions:

⇒ Projections on PCA components are circularly symmetric

⇒ No "special directions"

For non-Gaussian distributions:

⇒ Can recover latent components by searching for "special directions" that have maximally non-Gaussian projections

[Figure: PC1 vs. PC2 projections for Gaussian data (circularly symmetric) and non-Gaussian data (clear special directions)]

Page 38: Dimensionality Reduction II: ICA

A Simple 2-Step Recipe

1.  PCA: whiten data

⇒  Possibly discard low-variance components

⇒  How many components to discard?

2.  ICA: rotate whitened PCA components to maximize non-Gaussianity

⇒  How to measure non-Gaussianity?

⇒  How to maximize your non-Gaussianity measure?
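A library-level sketch of the recipe (scikit-learn and the placeholder data are my assumptions here; FastICA normally runs both steps internally, but splitting them out mirrors the two steps above):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

X = np.random.default_rng(0).laplace(size=(5000, 20))   # placeholder data

# Step 1: PCA, keeping the top K components and whitening them
K = 6
Z = PCA(n_components=K, whiten=True).fit_transform(X)

# Step 2: ICA, a pure rotation of the whitened scores
S = FastICA(n_components=K, whiten=False, random_state=0).fit_transform(Z)
```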

Page 39: Dimensionality Reduction II: ICA

Measuring Non-Gaussianity: Negentropy (gold standard)

•  Definition: difference in entropy from a Gaussian

•  Gaussian distribution is maximally entropic (for fixed variance)

⇒ All non-Gaussian distributions have positive negentropy

•  Maximizing negentropy closely related to minimizing mutual information

•  Cons: in practice, can be hard to measure and optimize

$$J(y) = H(y_{\text{gauss}}) - H(y)$$
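Negentropy requires estimating an entropy, so in practice it is approximated. One common stand-in (used by FastICA) is J(y) ≈ (E[G(y)] - E[G(v)])² with G(u) = log cosh(u) and v a standard Gaussian; a self-contained sketch:

```python
import numpy as np

def negentropy_approx(y, n_gauss=1_000_000, seed=0):
    """Log-cosh approximation to negentropy; ~0 for Gaussian y."""
    y = (y - y.mean()) / y.std()              # negentropy assumes unit variance
    g_y = np.mean(np.log(np.cosh(y)))
    nu = np.random.default_rng(seed).standard_normal(n_gauss)
    g_gauss = np.mean(np.log(np.cosh(nu)))    # Monte Carlo Gaussian baseline
    return (g_y - g_gauss) ** 2

print(negentropy_approx(np.random.default_rng(1).laplace(size=100_000)))  # > 0
```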

Page 40: Dimensionality Reduction II: ICA

Measuring Non-Gaussianity: Kurtosis (approximation)

•  Definition: 4th moment of the distribution, $E[y^4]$

•  Useful for sparse, ‘heavy-tailed’ distributions (which are common)

[Figure: histogram of sparse (heavy-tailed) weights with a Gaussian overlaid; x-axis: weights, y-axis: probability]

Page 41: Dimensionality Reduction II: ICA

Measuring Non-Gaussianity: Kurtosis (approximation)

•  Definition: 4th moment of the distribution, $E[y^4]$

•  Useful for sparse, ‘heavy-tailed’ distributions (which are common)

⇒  Many audio sources have a sparse distribution of amplitudes

[Figure: histogram of speech waveform amplitudes (Bell & Sejnowski, 1995); x-axis: waveform amplitude, y-axis: probability]

Page 42: Dimensionality Reduction II: ICA

Measuring Non-Gaussianity: Kurtosis (approximation)

•  Definition: 4th moment of the distribution, $E[y^4]$

•  Useful for sparse, ‘heavy-tailed’ distributions (which are common)

⇒  Many audio sources have a sparse distribution of amplitudes

⇒  Natural images tend to be sparse (e.g. Olshausen & Field, 1997)

•  Very easy to measure and optimize

•  Cons: only useful if the source distributions are sparse; sensitive to outliers
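A two-line check of this behavior; scipy's kurtosis uses the Fisher convention, so a Gaussian scores about 0 while a sparse Laplace distribution scores about 3.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
print(kurtosis(rng.standard_normal(100_000)))   # ~ 0  (Gaussian)
print(kurtosis(rng.laplace(size=100_000)))      # ~ 3  (sparse, heavy-tailed)
```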

Page 43: Dimensionality Reduction II: ICA

Measuring Non-Gaussianity: Skew (approximation)

•  Definition: 3rd moment of the distribution, $E[y^3]$

•  Useful for distributions with a single heavy tail

[Figure: a skewed histogram with a Gaussian overlaid; x-axis: weights, y-axis: probability]

Page 44: Dimensionality Reduction II: ICA

Measuring Non-Gaussianity: Skew (approximation)

•  Definition: 3rd moment of the distribution, $E[y^3]$

•  Useful for distributions with a single heavy tail

•  Again easy to measure and optimize

•  Only useful if the source distributions are skewed
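The analogous check for skew (the exponential distribution is just one convenient single-heavy-tailed example):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
print(skew(rng.standard_normal(100_000)))    # ~ 0  (symmetric)
print(skew(rng.exponential(size=100_000)))   # ~ 2  (single heavy tail)
```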

Page 45: Dimensionality Reduction II: ICA

Measuring Non-Gaussianity

Bottom line:

•  Negentropy is a general-purpose measure of non-Gaussianity, but is often hard to use in practice

•  Parametric measures can be more effective if tailored to the non-Gaussianity of the source distribution

Page 46: Dimensionality Reduction II: ICA

Non-Gaussianity Maximization

•  Brute-force search

⇒  e.g. iteratively rotate pairs of components to maximize non-Gaussianity

⇒  Easy to implement, effective in low-dimensional spaces

•  Gradient-based (many variants)

⇒ More complicated to implement, effective in high dimensions

•  All of these optimization algorithms find local, not global, optima

⇒  Useful to test stability of local optima

⇒  e.g. run algorithm many times from random starting points
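A hedged sketch of that stability test with scikit-learn's FastICA on toy data: run from several random seeds and check that the recovered components agree up to sign and order.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.laplace(size=(5000, 3)) @ rng.random((3, 3))       # toy mixed data

runs = [FastICA(n_components=3, random_state=s).fit_transform(X)
        for s in range(5)]
for S in runs[1:]:
    C = np.abs(np.corrcoef(runs[0].T, S.T))[:3, 3:]        # cross-run correlations
    print(np.round(C.max(axis=1), 2))                      # ~1.0 if stable
```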

Page 47: Dimensionality Reduction II: ICA

A Simple 2-Step Recipe Applied to fMRI Data!

1.  PCA: whiten data

⇒  Possibly discard low-variance components

⇒  How many components to discard?

2.  ICA: rotate whitened PCA components to maximize non-Gaussianity

⇒  How to measure non-Gaussianity?

⇒  How to maximize your non-Gaussianity measure?


Page 49: Dimensionality Reduction II: ICA

Choosing the Number of Components

Using cross-validation to select components

⇒ Project voxel responses onto principal components (using subset of data)

⇒ Predict responses from left-out data using different numbers of components

[Figure: voxel prediction accuracy (correlation r) vs. number of components, measured on left-out data; accuracy peaks at N = 6 and declines beyond that due to over-fitting]
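A simulation sketch of this procedure under assumed conditions (6 true components, two independent measurement halves): components are fit on one half and prediction accuracy is scored against the other, peaking near the true dimensionality and falling off as extra components fit noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
true = rng.random((165, 6)) @ rng.random((6, 2000))     # 6 latent components
X1 = true + 0.5 * rng.standard_normal(true.shape)       # measurement half 1
X2 = true + 0.5 * rng.standard_normal(true.shape)       # measurement half 2

for K in (2, 4, 6, 8, 12):
    pca = PCA(n_components=K).fit(X1.T)                 # components from half 1
    pred = pca.inverse_transform(pca.transform(X1.T)).T # top-K reconstruction
    r = np.corrcoef(pred.ravel(), X2.ravel())[0, 1]     # test against half 2
    print(f"K = {K:2d}: r = {r:.3f}")
```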

Page 50: Dimensionality Reduction II: ICA

A Simple 2-Step Recipe Applied to fMRI Data!

1.  PCA: whiten data

⇒  Possibly discard low-variance components

⇒  Keep top 6 Components

2.  ICA: rotate whitened PCA components to maximize non-Gaussianity

⇒  How to measure non-Gaussianity?

⇒  How to maximize your non-Gaussianity measure?

Page 51: Dimensionality Reduction II: ICA

A Simple 2-Step Recipe Applied to fMRI Data!

1.  PCA: whiten data

⇒  Possibly discard low-variance components

⇒  Keep top 6 Components

2.  ICA: rotate whitened PCA components to maximize non-Gaussianity

⇒  Use negentropy to measure non-Gaussianity

⇒  Maximize negentropy via brute-force rotation

⇒  Feasible because:

1.  Many voxels / data points (>10,000 voxels)

2.  Low-dimensional data (just 6 dimensions)

Page 52: Dimensionality Reduction II: ICA

Rotating the Top 6 Principal Components

[Figure: left, non-Gaussianity (negentropy) of the six principal components vs. the components discovered via iterative rotation; right, negentropy as a single pair of components is rotated from -45° to 45°]

•  Negentropy of principal components can be increased by rotation

•  Rotation algorithm discovers clear optimum

Page 53: Dimensionality Reduction II: ICA

We now have 6 dimensions, each with:

1.  A response profile (165-dimensional vector)

2.  A weight vector, specifying its contribution to each voxel

⇒ Both the response profiles and their anatomical weights are left unconstrained

[Figure: schematic response profile (relative response, a.u., across all 165 sounds) and schematic weight vector (component weight across voxels)]

Probing the Inferred Components

Page 54: Dimensionality Reduction II: ICA

All 6 inferred components have interpretable properties
⇒  2 components highly selective for speech and music, respectively

[Figure: for the music-selective and speech-selective populations, the component's location in the brain and its response to all sounds (response magnitude), with music vs. non-music and speech vs. non-speech sounds distinguished]

Page 55: Dimensionality Reduction II: ICA

All 6 inferred components have interpretable properties
⇒  2 components highly selective for speech and music, respectively

Music-selectivity is highly diluted in raw voxels
⇒  fMRI signals likely blur activity from different neural populations

PCA components differ qualitatively from ICA components, with less clear functional/anatomical properties
⇒  The ICA components' response profiles are substantially correlated, violating PCA's uncorrelatedness assumption

Page 56: Dimensionality Reduction II: ICA

Conclusions

1.  Core assumptions of ICA

⇒ Measured signals reflect a linear mixture of underlying sources

⇒ Sources are non-Gaussian and statistically independent

2.  When core assumptions hold, the method is very effective, and requires few additional assumptions about the nature of the underlying sources

Page 57: Dimensionality Reduction II: ICA

Exercise: 2-Step Recipe Applied to the Cocktail Party

Dataset

⇒  5 audio soundtracks mixed from 3 sources

2-step recipe:

1.  Project data onto the top 3 principal components

2.  Iteratively rotate pairs of components to maximize negentropy

⇒ Listen to the unmixed signals!