2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Modeling: Theory, Algorithms and Applications. Irina Rish, Computational Biology Center (CBC), IBM T.J. Watson Research Center, NY; Genady Grabarnik, Department of Math and CS, CUNY, NY.


TRANSCRIPT

Page 1: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Modeling: Theory, Algorithms and Applications

Irina Rish, Computational Biology Center (CBC), IBM T.J. Watson Research Center, NY

Genady Grabarnik, Department of Math and CS, CUNY, NY

Page 2: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Introduction

Sparse Linear Regression: Lasso

Sparse Signal Recovery and Lasso: Some Theory

Sparse Modeling: Beyond Lasso

Consistency-improving extensions

Beyond l1-regularization (l1/lq, Elastic Net, fused Lasso)

Beyond linear model (GLMs, MRFs)

Sparse Matrix Factorizations

Beyond variable-selection: variable construction

Summary and Open Issues

Outline

Presenter Notes: Introduction (10 min). Why do we need sparse modeling? Historical background and motivation for sparse modeling and compressive sensing. Examples from several application domains, including both statistical and signal processing applications and associated problems that involve sparsity.
Page 3: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Schedule

9:00-9:40   Introduction; Lasso
9:40-10:20  Sparse Signal Recovery and Lasso: Some Theory
10:20-10:30 Coffee Break
10:30-11:45 Sparse Modeling: Beyond Lasso

Page 4: 2012 10 04_machine_learning_lecture05_icml10_tutorial

A Common Problem

Unknown state of world: X1 X2 X3 X4 X5
(Noisy) encoding: Y = f(X)
Observations: Y1 Y2
Decoding (inference, learning): x* = argmax_x P(x | y)

Can we recover a high-dimensional X from a low-dimensional Y?

Yes, if:
- X is structured, e.g. sparse (few Xi ≠ 0) or compressible (few large Xi)
- the encoding preserves information about X

Examples: sparse signal recovery (compressed sensing, rare-event diagnosis); sparse model learning

Page 5: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Example 1: Diagnosis in Computer Networks (Beygelzimer, Kephart and Rish 2007)

Model: y = Ax + noise

- Y: vector of end-to-end probe delays (M probes)
- X: vector of (unknown) link delays (N links, possible bottlenecks)
- A: routing matrix, e.g.

      X1 X2 X3 X4 X5 X6
  Y1   1  0  1  1  0  0
  Y2   0  1  1  0  1  0
  Y3   1  1  0  0  0  0
  Y4   1  0  0  1  0  1

Task: find bottlenecks (extremely slow links) using probes (M << N)

Problem structure: X is nearly sparse - small number of large delays

Recover the sparse state (``signal'') X from noisy linear observations
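A minimal sketch of this recovery problem in Python (my own illustration, not code from the tutorial), using scikit-learn's Lasso; the routing matrix, delays, noise level and reporting threshold are all made up:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical setup: N links, M end-to-end probes (M << N).
N, M = 200, 50
A = rng.integers(0, 2, size=(M, N)).astype(float)   # routing matrix (1 = probe uses link)

# Nearly sparse "true" link delays: a few bottlenecks with large delays.
x_true = np.zeros(N)
bottlenecks = rng.choice(N, size=5, replace=False)
x_true[bottlenecks] = rng.uniform(5.0, 10.0, size=5)

y = A @ x_true + 0.1 * rng.standard_normal(M)        # noisy probe delays

# l1-regularized regression; positive=True since delays are non-negative.
lasso = Lasso(alpha=0.1, positive=True, max_iter=10000)
lasso.fit(A, y)

suspected = np.flatnonzero(lasso.coef_ > 1.0)
print("true bottlenecks:     ", sorted(bottlenecks))
print("suspected bottlenecks:", sorted(suspected))
```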

Page 6: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Example 2: Sparse Model Learning from fMRI Data

Data: high-dimensional, small-sample
- 10,000 - 100,000 variables (voxels)
- 100s of samples (time points, or TRs)

Task: given fMRI, predict mental states
- emotional: angry, happy, anxious, etc.
- cognitive: reading a sentence vs viewing an image
- mental disorders (schizophrenia, autism, etc.)

fMRI image courtesy of fMRI Research Center @ Columbia University

Issues:

Overfitting: can we learn a predictive model that generalizes well?

Interpretability: can we identify brain areas predictive of mental states?

Page 7: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Statistical Models: Prediction + Interpretability

Data: x - fMRI voxels, y - mental state (sad / happy)
Predictive model: y = f(x), using a small number of predictive variables

[Figure: fMRI samples labeled 'sad' (-) and 'happy' (+), separated by a predictive model]

Sparsity → variable selection → model interpretability
Sparsity → regularization → less overfitting / better prediction

Page 8: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Linear Regression

y = Ax + noise

- y: measurements (mental states, behavior, tasks or stimuli)
- A: fMRI data (``encoding''); rows - samples (~500), columns - voxels (~30,000)
- x: unknown parameters (``signal'')

fMRI activation image and time-course courtesy of Steve Smith, FMRIB

Goal: find a small number of the most relevant voxels (brain areas)
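For concreteness, a small synthetic sketch of this p >> n regression setting (the sizes and data here are illustrative, not the tutorial's fMRI data), using scikit-learn's LassoCV to pick the regularization strength by cross-validation:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)

# Illustrative sizes: many more columns ("voxels") than rows ("time points").
n_samples, n_voxels = 200, 2000
A = rng.standard_normal((n_samples, n_voxels))

# Only a handful of voxels actually drive the response.
w_true = np.zeros(n_voxels)
w_true[:10] = rng.standard_normal(10)
y = A @ w_true + 0.5 * rng.standard_normal(n_samples)

# LassoCV chooses alpha (the l1 penalty weight) by cross-validation.
model = LassoCV(cv=5, n_alphas=50).fit(A, y)
print("chosen alpha:", model.alpha_)
print("nonzero coefficients:", np.count_nonzero(model.coef_), "out of", n_voxels)
```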

Page 9: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Recovery in a Nutshell

Can we recover a sparse input efficiently from a small number of measurements?

y = Ax: noiseless observations y, design (measurement) matrix A, sparse input x

Page 10: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Recovery in a Nutshell

``Compressed Sensing Surprise'': given a random A (i.i.d. Gaussian entries), an x with N dimensions and only K nonzeros can be reconstructed exactly (with high probability):

- from just M = O(K log(N/K)) measurements
- efficiently, by solving a convex problem (a linear program):

      min ||x||_1   subject to   y = Ax

Page 11: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Recovery in a Nutshell

In general, if A is ``good'' (e.g., satisfies the Restricted Isometry Property with a proper constant), a sparse x (N dimensions, K nonzeros) can be reconstructed with M << N measurements by solving the linear program:

      min ||x||_1   subject to   y = Ax
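A toy sketch of this noiseless l1 recovery as a linear program (my own illustration, not code from the tutorial): split x = u - v with u, v >= 0 and minimize sum(u + v) subject to A(u - v) = y, here via scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)

# Illustrative sizes: N-dimensional signal with K nonzeros, M random measurements.
N, K, M = 100, 5, 40
A = rng.standard_normal((M, N))            # i.i.d. Gaussian design
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
y = A @ x_true                             # noiseless observations

# Basis pursuit as an LP:  min sum(u + v)  s.t.  A(u - v) = y,  u, v >= 0.
c = np.ones(2 * N)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
x_hat = res.x[:N] - res.x[N:]

print("max reconstruction error:", np.max(np.abs(x_hat - x_true)))
```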

Page 12: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Recovery in a Nutshell

And what if there is noise in the observations?

y = Ax + noise: observations y, design (measurement) matrix A, sparse input x, noise

Page 13: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Recovery in a Nutshell

Still, the input can be reconstructed accurately (in the l2-sense) for A satisfying RIP; just solve a noisy version of our l1-optimization:

      min ||x||_1   subject to   ||y - Ax||_2 <= epsilon

or, equivalently (for a suitable lambda), its Lagrangian form (Basis Pursuit, aka Lasso):

      min ||y - Ax||_2^2 + lambda ||x||_1

Page 14: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Linear Regression vs Sparse Signal Recovery

Both solve the same optimization problem

Both share efficient algorithms and theoretical results

However, sparse learning setting is more challenging:

We do not design the “design” matrix, but rather deal with the given data

Thus, nice matrix properties may not be satisfied (and they are hard to test on a given matrix, anyway)

We don’t really know the ground truth (``signal”) – but rather assume it is sparse (to interpret and to regularize)

Sparse learning includes a wide range of problems beyond sparse linear regression (part 2 of this tutorial)

Page 15: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Introduction

Sparse Linear Regression: Lasso

Sparse Signal Recovery and Lasso: Some Theory

Sparse Modeling: Beyond Lasso

Consistency-improving extensions

Beyond l1-regularization (l1/lq, Elastic Net, fused Lasso)

Beyond linear model (GLMs, MRFs)

Sparse Matrix Factorizations

Beyond variable-selection: variable construction

Summary and Open Issues

Outline

Page 16: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 17: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Motivation: Variable Selection

Page 18: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Model Selection as Regularized Optimization

Page 19: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Bayesian Interpretation: MAP Estimation

Page 20: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Log-likelihood Losses: Examples

Page 21: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 22: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 23: 2012 10 04_machine_learning_lecture05_icml10_tutorial

lq-norm constraints for different values of q

Image courtesy of [Hastie, Friedman and Tibshirani, 2009]

What is special about the l1-norm? Sparsity + computational efficiency

Page 24: 2012 10 04_machine_learning_lecture05_icml10_tutorial

[Plot: curves for lambda = 2, lambda = 1, and lambda = 0.5, over the range -10 to 10]

Page 25: 2012 10 04_machine_learning_lecture05_icml10_tutorial


Page 26: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 27: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Lasso vs Ridge and Best-Subset in the Case of Orthonormal Designs

Best-subset: hard thresholding; Ridge: shrinkage; Lasso: soft thresholding

Image courtesy of [Hastie, Friedman and Tibshirani, 2009]
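As an illustration of these three estimators in the orthonormal-design case (a sketch based on the standard formulas in Hastie, Friedman and Tibshirani, 2009; the threshold values are arbitrary):

```python
import numpy as np

def best_subset(beta_ols, t):
    """Hard thresholding: keep OLS coefficients with |beta| > t, zero out the rest."""
    return np.where(np.abs(beta_ols) > t, beta_ols, 0.0)

def ridge(beta_ols, lam):
    """Ridge with orthonormal design: uniform shrinkage by 1 / (1 + lambda)."""
    return beta_ols / (1.0 + lam)

def lasso(beta_ols, lam):
    """Lasso with orthonormal design: soft thresholding, sign(beta) * max(|beta| - lambda, 0)."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

beta_ols = np.array([-3.0, -0.5, 0.2, 1.5, 4.0])   # made-up OLS estimates
print("hard :", best_subset(beta_ols, t=1.0))
print("ridge:", ridge(beta_ols, lam=1.0))
print("soft :", lasso(beta_ols, lam=1.0))
```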

Page 28: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 29: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 30: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Geometric View of LARS

Image courtesy of [Hastie, 2007]

Page 31: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Piecewise Linear Solution Path: LARS vs LASSO

LARS vs LASSO for pain perception prediction from fMRI data [Rish, Cecchi, Baliki, Apkarian, 2010]: for illustration purposes, we use just n=9 (out of 120) samples, but p=4000 variables; LARS selects n-1=8 variables.

[Figure, left: Least Angle Regression (LARS) - coefficient paths]
[Figure, right: LARS with the LASSO modification - coefficient paths; a variable is deleted when its coefficient crosses zero]
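A small sketch of computing such piecewise-linear paths with scikit-learn's lars_path (synthetic data, not the fMRI data from the slide). With method='lasso', a variable can be dropped when its coefficient crosses zero; with method='lar' it never is:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(3)
X = rng.standard_normal((9, 40))           # tiny n, larger p, as in the illustration
w = np.zeros(40)
w[:3] = [2.0, -1.5, 1.0]
y = X @ w + 0.1 * rng.standard_normal(9)

alphas_lar, _, _ = lars_path(X, y, method="lar")
alphas_lasso, _, _ = lars_path(X, y, method="lasso")

# LARS adds one variable per step and never removes any;
# the LASSO modification may delete variables, yielding extra breakpoints.
print("LARS  path breakpoints:", len(alphas_lar))
print("Lasso path breakpoints:", len(alphas_lasso))
```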

Page 32: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 33: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Introduction

Sparse Linear Regression: Lasso

Sparse Signal Recovery and Lasso: Some Theory

Sparse Modeling: Beyond Lasso

Consistency-improving extensions

Beyond l1-regularization (l1/lq, Elastic Net, fused Lasso)

Beyond linear model (GLMs, MRFs)

Sparse Matrix Factorizations

Beyond variable-selection: variable construction

Summary and Open Issues

Outline

Page 34: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 35: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 36: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 37: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 38: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 39: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 40: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 41: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 42: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 43: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 44: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 45: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 46: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 47: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 48: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 49: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 50: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 51: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 52: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 53: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 54: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 55: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 56: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 57: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 58: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 59: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 60: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 61: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 62: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 63: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Introduction

Sparse Linear Regression: Lasso

Sparse Signal Recovery and Lasso: Some Theory

Sparse Modeling: Beyond Lasso

Consistency-improving extensions

Beyond l1-regularization (l1/lq, Elastic Net, fused Lasso)

Beyond linear model (GLMs, MRFs)

Sparse Matrix Factorizations

Beyond variable-selection: variable construction

Summary and Open Issues

Outline

Page 64: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 65: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 66: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Introduction

Sparse Linear Regression: Lasso

Sparse Signal Recovery and Lasso: Some Theory

Sparse Modeling: Beyond Lasso

Consistency-improving extensions

Beyond l1-regularization (l1/lq, Elastic Net, fused Lasso)

Beyond linear model (GLMs, MRFs)

Sparse Matrix Factorizations

Beyond variable-selection: variable construction

Summary and Open Issues

Outline

Page 67: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Beyond LASSO

Adding structure beyond sparsity (other penalties):
- Elastic Net
- Fused Lasso
- block l1-lq norms: group Lasso, simultaneous Lasso

Other likelihoods (loss functions):
- Generalized Linear Models (exponential-family noise)
- Multivariate Gaussians (Gaussian MRFs)

Page 68: 2012 10 04_machine_learning_lecture05_icml10_tutorial

[Figure: LASSO vs Truth - the true model contains a relevant cluster of correlated predictors, while LASSO tends to pick only one predictor from the cluster]

Page 69: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Elastic Net penalty: a combination of the l1 and l2 penalties, λ1 ||x||_1 + λ2 ||x||_2^2

- lasso penalty: λ1 > 0, λ2 = 0
- ridge penalty: λ1 = 0, λ2 > 0
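A sketch of fitting this penalty with scikit-learn's ElasticNet (synthetic data). Note that sklearn parameterizes the penalty as alpha and l1_ratio rather than (λ1, λ2); roughly, λ1 corresponds to alpha * l1_ratio and λ2 to alpha * (1 - l1_ratio), up to sklearn's 1/(2n) scaling of the squared loss:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(4)

# Two highly correlated predictors plus irrelevant ones.
n = 100
z = rng.standard_normal(n)
X = np.column_stack([z + 0.01 * rng.standard_normal(n),
                     z + 0.01 * rng.standard_normal(n),
                     rng.standard_normal((n, 8))])
y = z + 0.1 * rng.standard_normal(n)

# Pure lasso (l1_ratio=1) vs an elastic net that mixes in an l2 term.
for l1_ratio in (1.0, 0.5):
    model = ElasticNet(alpha=0.1, l1_ratio=l1_ratio).fit(X, y)
    print(f"l1_ratio={l1_ratio}: coefficients of the correlated pair =",
          np.round(model.coef_[:2], 3))
```

With the l2 term mixed in, the two correlated columns tend to receive similar weights (the grouping effect discussed on the following slides).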

Page 70: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 71: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Example: Application to fMRI Analysis

Pittsburgh Brain Activity Interpretation Competition (PBAIC-07):
- subjects playing a videogame in a scanner (17 minutes)
- 24 continuous response variables, e.g.: Annoyance, Sadness, Anxiety, Dog, Faces, Instructions, Correct hits
- Goal: predict responses from fMRI data

Page 72: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Grouping Effect on PBAIC data (Carroll, Cecchi, Rish, Garg, Rao, 2009)

Predicting ‘Instructions’ (auditory stimulus)

Higher λ2 → selection of more voxels from correlated clusters → larger, more spatially coherent clusters

[Maps: small grouping effect (λ2 = 0.1) vs larger grouping effect (λ2 = 2.0)]

Page 73: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Grouping Tends to Improve Model Stability (Carroll, Cecchi, Rish, Garg, Rao, 2009)

Among almost equally predictive models, increasing λ2 can significantly improve model stability. Stability is measured here by the average % overlap between models for 2 runs by the same subject.

[Bar plot: test correlation for the Instructions, VRFixation and Velocity responses, comparing OLS, Ridge, LASSO, EN 0.1 and EN 2.0]

Page 74: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Another Application: Sparse Models of Pain Perception from fMRI

Predicting pain ratings from fMRI in the presence of a thermal pain stimulus (Rish, Cecchi, Baliki, Apkarian, BI-2010)

Including more correlated voxels (increasing λ2) often improves the prediction accuracy as well; the best prediction is obtained for higher λ2.

[Plot: predictive accuracy (correlation with response) vs number of voxels (sparsity), comparing OLS and Elastic Net with lambda2 = 0.1, 1, 5, 10]

Page 75: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Image courtesy of [Tibshirani et al, 2005]

Page 76: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 77: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Group Lasso: Examples

Page 78: 2012 10 04_machine_learning_lecture05_icml10_tutorial

More on Group Lasso
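As a rough illustration of the block l1/l2 (group-lasso) penalty referred to here, a minimal NumPy sketch (my own, not material from the slides) of its proximal operator, block soft-thresholding, which zeroes out or shrinks whole groups of coefficients together:

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Proximal operator of lam * sum_g ||beta_g||_2 (the l1/l2 group-lasso penalty).

    Each group is either zeroed out entirely or shrunk as a block.
    `groups` is a list of index arrays, one per group.
    """
    out = beta.copy()
    for g in groups:
        norm = np.linalg.norm(beta[g])
        out[g] = 0.0 if norm <= lam else (1.0 - lam / norm) * beta[g]
    return out

beta = np.array([3.0, -2.0, 0.3, -0.2, 1.5])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4])]
print(group_soft_threshold(beta, groups, lam=1.0))
# The small second group is removed as a whole; the others are shrunk blockwise.
```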

Page 79: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 80: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 81: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Introduction

Sparse Linear Regression: Lasso

Sparse Signal Recovery and Lasso: Some Theory

Sparse Modeling: Beyond Lasso

Consistency-improving extensions

Beyond l1-regularization (l1/lq, Elastic Net, fused Lasso)

Beyond linear model (GLMs, MRFs)

Sparse Matrix Factorizations

Beyond variable-selection: variable construction

Summary and Open Issues

Outline

Page 82: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Beyond Lasso: General Log-likelihood Losses (l1-regularized M-estimators)

1. Gaussian → Lasso
2. Bernoulli → logistic regression
3. Exponential-family → Generalized Linear Models (includes 1 and 2)
4. Multivariate Gaussian → Gaussian MRFs
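For case 2, a minimal sketch of an l1-regularized logistic regression with scikit-learn (synthetic data; note that sklearn's C is the inverse of the regularization strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.standard_normal((300, 50))
w_true = np.zeros(50)
w_true[:5] = [2, -2, 1.5, -1.5, 1]                 # only 5 relevant features
y = (X @ w_true + 0.5 * rng.standard_normal(300) > 0).astype(int)

# penalty='l1' gives a sparse weight vector; liblinear supports the l1 penalty.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("nonzero weights:", np.count_nonzero(clf.coef_), "out of 50")
```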

Page 83: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Beyond LASSO

Adding structure beyond sparsity (other penalties):
- Elastic Net
- Fused Lasso
- block l1-lq norms: group Lasso, simultaneous Lasso

Other likelihoods (loss functions):
- Generalized Linear Models (exponential-family noise)
- Multivariate Gaussians (Gaussian MRFs)

Page 84: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Signal Recovery with M-estimators

Can l1-regularization accurately recover sparse signals given general log P(y|x) losses?

Yes! (under proper conditions)

- risk consistency of generalized linear models (Van de Geer, 2008)
- model-selection consistency of Gaussian MRFs (Ravikumar et al, 2008a)
- generalized linear models: recovery in l2-norm (non-asymptotic regime) for exponential-family noise and standard RIP conditions on the design matrix (Rish and Grabarnik, 2009)
- asymptotic consistency of general losses satisfying restricted strong convexity, with decomposable regularizers (Negahban et al., 2009)

Page 85: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Exponential Family Distributions

p(y | θ) = h(y) exp( θᵀ y − A(θ) ),

where θ is the natural parameter, A(θ) the log-partition function, and h(y) the base measure.

Examples: Gaussian, exponential, Bernoulli, multinomial, gamma, chi-square, beta, Weibull, Dirichlet, Poisson, etc.
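For instance, the Bernoulli distribution fits this form (a standard derivation, added here for concreteness):

```latex
p(y \mid \mu) = \mu^{y}(1-\mu)^{1-y}
  = \exp\!\Big( y \underbrace{\log\tfrac{\mu}{1-\mu}}_{\theta}
                \;-\; \underbrace{\log\!\big(1 + e^{\theta}\big)}_{A(\theta)} \Big),
  \qquad h(y) = 1 .
```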

Page 86: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Generalized Linear Models (GLMs)

Page 87: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Summary: Exponential Family, GLMs, and Bregman Divergences

Bijection Theorem (Banerjee et al, 2005): via Legendre duality of the log-partition function, there is a bijection between exponential-family distributions and Bregman divergences.

Fitting a GLM ⇔ maximizing the exp-family likelihood ⇔ minimizing the corresponding Bregman divergence
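For reference, the Bregman divergence generated by a strictly convex function φ (standard definition, added for completeness); the Gaussian case recovers the squared Euclidean distance:

```latex
d_{\phi}(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle,
\qquad
\phi(x) = \tfrac{1}{2}\lVert x \rVert_2^2 \;\Rightarrow\; d_{\phi}(x, y) = \tfrac{1}{2}\lVert x - y \rVert_2^2 .
```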

Page 88: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 89: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Signal Recovery with Exponential-Family Noise

Can we recover a sparse signal from a small number of noisy observations?

Here the observations y follow an exponential-family noise distribution whose natural parameters are given by Ax (sparse signal x, design matrix A).

Page 90: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sufficient Conditions

1. Noise is small
2. Restricted Isometry Property (RIP) for 2s-sparse vectors
3. [condition not extracted]
4*. bounded

*otherwise, different proofs for some specific cases (e.g., Bernoulli, exponential, etc.)

Page 91: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 92: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Beyond LASSO

Adding structure beyond sparsity (other penalties):
- Elastic Net
- Fused Lasso
- block l1-lq norms: group Lasso, simultaneous Lasso

Other likelihoods (loss functions):
- Generalized Linear Models (exponential-family noise)
- Multivariate Gaussians (Gaussian MRFs)

Page 93: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Markov Networks (Markov Random Fields)

Page 94: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Markov Networks in Practical Applications

- Social networks: US senate voting data (Banerjee et al, 2008) - democrats (blue) and republicans (red)
- Genetic networks: Rosetta Inpharmatics compendium of gene expression profiles (Banerjee et al, 2008)
- Brain networks from fMRI: monetary reward task (Honorio et al., 2009) - drug addicts show more connections in cerebellum (yellow) vs control subjects (more connections in prefrontal cortex - green)

[Figures: (a) drug addicts, (b) controls]

Page 95: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse MRFs Can Predict Well

- Classifying schizophrenia (Cecchi et al., 2009): 86% accuracy
- Mental state prediction (sentence vs picture)* (Scheinberg and Rish, submitted): 90% accuracy

[Plot: classification error vs K top voxels (t-test), ``Sentence vs. Picture'', Subject 04820, sparse MRF (1.0)]

*Data @ www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/, from T. Mitchell et al., Learning to Decode Cognitive States from Brain Images, Machine Learning, 2004.

MRF classifiers can often exploit informative interactions among variables and often outperform state-of-the-art linear classifiers (e.g., SVM).

Page 96: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Discriminative Network Models of Schizophrenia (Cecchi et al., 2009)

Network properties as biomarkers (predictive features):

- Voxel degrees in functional networks (thresholded covariance matrices) are statistically significantly different in schizophrenic patients, who appear to lack ``hubs'' in auditory/language areas.

FDR-corrected degree maps: a 2-sample t-test is performed for each voxel in the degree maps, followed by FDR correction. Red/yellow: normal subjects have higher values than schizophrenics.

Also, abnormal MRF connectivity has been observed in Alzheimer's patients (Huang 2009), in drug addicts (Honorio 2009), etc.

Page 97: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Inverse Covariance Selection Problem

Page 98: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Maximum Likelihood Estimation
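A small sketch of the sparse inverse covariance selection problem solved via scikit-learn's GraphicalLasso (an l1-penalized maximum-likelihood estimator; one of several available solvers). The data here are synthetic:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(6)

# Ground-truth sparse precision matrix: a simple chain dependency structure.
p = 10
prec_true = np.eye(p)
for i in range(p - 1):
    prec_true[i, i + 1] = prec_true[i + 1, i] = 0.4
cov_true = np.linalg.inv(prec_true)

X = rng.multivariate_normal(np.zeros(p), cov_true, size=500)

# l1-penalized maximum-likelihood estimate of the inverse covariance.
model = GraphicalLasso(alpha=0.05).fit(X)
est_prec = model.precision_
print("nonzero off-diagonal entries:",
      np.count_nonzero(np.abs(est_prec[np.triu_indices(p, k=1)]) > 1e-3))
```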

Page 99: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 100: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Block-Coordinate Descent on the Dual Problem

Page 101: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Projected Gradient on the Dual Problem

Page 102: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Alternatives: Solving Primal Problem Directly


Page 103: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Additional Related Work

Page 104: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Selecting the Proper Regularization Parameter

``…the general issue of selecting a proper amount of regularization for getting a right-sized structure or model has largely remained a problem with unsatisfactory solutions'' (Meinshausen and Buehlmann, 2008)

``asymptotic considerations give little advice on how to choose a specific penalty parameter for a given problem'' (Meinshausen and Buehlmann, 2006)

Bayesian approach (N. Bani Asadi, K. Scheinberg and I. Rish, 2009):
- assume a Bayesian prior on the regularization parameter
- find the maximum a posteriori probability (MAP) solution

Result: a more ``balanced'' solution (False Positive vs False Negative error) than
- cross-validation, which is too dense, and
- theoretical choices (Meinshausen & Buehlmann 2006, Banerjee et al 2008), which are too sparse

Unlike the stability selection approach (Meinshausen and Buehlmann 2008), it does not require solving multiple optimization problems over data subsets.

Page 105: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Existing Approaches

Page 106: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 107: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Results on Random Networks

[Plots: False Negative error (missed links) and False Positive error (``noisy'' links) vs N, for random networks with P=100 and density=4%; methods compared: Flat Prior (Reg. Likelihood), Exp. Prior, Theoretical, CV]

- Cross-validation (green) overfits drastically, producing an almost complete C matrix
- Theoretical (black) is too conservative: misses too many edges (near-diagonal C)
- Prior-based approaches (red and blue) are much more ``balanced'': low FP and FN

Page 108: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 109: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Introduction

Sparse Linear Regression: Lasso

Sparse Signal Recovery and Lasso: Some Theory

Sparse Modeling: Beyond Lasso

Consistency-improving extensions

Beyond l1-regularization (l1/lq, Elastic Net, fused Lasso)

Beyond linear model (GLMs, MRFs)

Sparse Matrix Factorizations

Beyond variable-selection: variable construction

Summary and Open Issues

Outline

Page 110: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Sparse Matrix Factorization

X ≈ U Vᵀ

where X is the data matrix (n samples × p variables), one factor holds m basis vectors (the dictionary), and the other holds a sparse representation (sparse codes) of the samples.
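A sketch of such a factorization with scikit-learn's DictionaryLearning (synthetic data; parameter values are arbitrary). In sklearn's convention, components_ holds the m dictionary atoms and fit_transform() returns the sparse codes:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(7)

# Synthetic data: 100 samples in 30 dimensions, built from 8 latent atoms.
atoms = rng.standard_normal((8, 30))
codes = rng.standard_normal((100, 8)) * (rng.random((100, 8)) < 0.3)  # sparse codes
X = codes @ atoms + 0.01 * rng.standard_normal((100, 30))

dl = DictionaryLearning(n_components=8, alpha=1.0, max_iter=200, random_state=0)
U = dl.fit_transform(X)        # sparse representation (n_samples x m)
V = dl.components_             # dictionary (m x p)

print("average nonzeros per sample code:", np.mean(np.count_nonzero(U, axis=1)))
print("relative reconstruction error:", np.linalg.norm(X - U @ V) / np.linalg.norm(X))
```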

Page 111: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Introduction

Sparse Linear Regression: Lasso

Sparse Signal Recovery and Lasso: Some Theory

Sparse Modeling: Beyond Lasso

Consistency-improving extensions

Beyond l1-regularization (l1/lq, Elastic Net, fused Lasso)

Beyond linear model (GLMs, MRFs)

Sparse Matrix Factorizations

Beyond variable-selection: variable construction

Summary and Open Issues

Outline

Page 112: 2012 10 04_machine_learning_lecture05_icml10_tutorial

From Variable Selection to Variable Construction

Supervised Dimensionality Reduction (SDR):

- Assume there is an inherent low-dimensional structure in the data that is predictive about the target Y
- Learn a predictor (a mapping from U to Y) simultaneously with the dimensionality reduction
- Idea: dimensionality reduction (DR) guided by the class label may result in better predictive features than unsupervised DR

[Diagram: observed variables X1 … XD, latent variables U1 … UL, targets Y1 … YK]

Page 113: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Particular Mappings X → U and U → Y

1. F. Pereira and G. Gordon. The Support Vector Decomposition Machine, ICML-06.
   Real-valued X, discrete Y (linear map from X to U, SVM for Y(U))
2. E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning with application to clustering with side information, NIPS-02.
3. K. Weinberger, J. Blitzer and L. Saul. Distance Metric Learning for Large Margin Nearest Neighbor Classification, NIPS-05.
   Real-valued X, discrete Y (linear map from X to U, nearest-neighbor Y(U))
4. K. Weinberger and G. Tesauro. Metric Learning for Kernel Regression, AISTATS-07.
   Real-valued X, real-valued Y (linear map from X to U, kernel regression Y(U))
5. Sajama and A. Orlitsky. Supervised Dimensionality Reduction using Mixture Models, ICML-05.
   Multi-type X (exp. family), discrete Y (modeled as a mixture of exp-family distributions)
6. M. Collins, S. Dasgupta and R. Schapire. A generalization of PCA to the exponential family, NIPS-01.
7. A. Schein, L. Saul and L. Ungar. A generalized linear model for PCA of binary data, AISTATS-03.
   Unsupervised dimensionality reduction beyond Gaussian data (nonlinear GLM mappings)
8. I. Rish, G. Grabarnik, G. Cecchi, F. Pereira and G. Gordon. Closed-form Supervised Dimensionality Reduction with Generalized Linear Models, ICML-08.

Page 114: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Example: SDR with Generalized Linear Models (Rish et al., 2008)

[Diagram: latent factors U; mappings V1 … VD from U to the observed X1 … XD, and W1 … Wk from U to the targets Y1 … YK]

E.g., in the linear case, we have: X ~ U V and Y ~ U W
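A rough sketch of this linear (Gaussian-Gaussian) case, written as alternating least squares on the joint objective ||X - UV||^2 + ||Y - UW||^2 (my own illustration under those assumptions, not the closed-form GLM algorithm of Rish et al., 2008):

```python
import numpy as np

rng = np.random.default_rng(8)

# Synthetic data with a shared low-dimensional structure U driving both X and Y.
n, D, K, L = 200, 50, 3, 5
U_true = rng.standard_normal((n, L))
X = U_true @ rng.standard_normal((L, D)) + 0.1 * rng.standard_normal((n, D))
Y = U_true @ rng.standard_normal((L, K)) + 0.1 * rng.standard_normal((n, K))

# Alternating least squares on ||X - U V||^2 + ||Y - U W||^2.
U = rng.standard_normal((n, L))
for _ in range(50):
    V = np.linalg.lstsq(U, X, rcond=None)[0]          # update mapping to X
    W = np.linalg.lstsq(U, Y, rcond=None)[0]          # update mapping to Y
    VW = np.hstack([V, W])                            # update U against both targets jointly
    U = np.linalg.lstsq(VW.T, np.hstack([X, Y]).T, rcond=None)[0].T

print("X reconstruction error:", np.linalg.norm(X - U @ V) / np.linalg.norm(X))
print("Y reconstruction error:", np.linalg.norm(Y - U @ W) / np.linalg.norm(Y))
```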

Page 115: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Supervised DR Outperforms Unsupervised DR on Simulated Data

Setup:
- Generate a separable 2-D dataset U
- Blow it up into D-dimensional data X by adding exponential-family noise (e.g., Bernoulli)
- Compare SDR with different noise models (Gaussian, Bernoulli) vs. unsupervised DR (UDR) followed by SVM or logistic regression

Results:
- SDR outperforms unsupervised DR by 20-45%
- Using the proper data model (e.g., Bernoulli-SDR for binary data) matters
- SDR ``gets'' the structure (0% error), while SVM does not (20% error)

Page 116: 2012 10 04_machine_learning_lecture05_icml10_tutorial

…and on Real-Life Data from fMRI Experiments

Real-valued data, classification task: predict the type of word (tools or buildings) the subject is seeing; 84 samples (words presented to a subject), 14043 dimensions (voxels); latent dimensionality L = 5, 10, 15, 20, 25.

- Gaussian-SDR achieves the overall best performance
- SDR matches SVM's performance using only 5 dimensions, while SVDM needs 15
- SDR greatly outperforms unsupervised DR followed by learning a classifier

Page 117: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Introduction

Sparse Linear Regression: Lasso

Sparse Signal Recovery and Lasso: Some Theory

Sparse Modeling: Beyond Lasso

Consistency-improving extensions

Beyond l1-regularization (l1/lq, Elastic Net, fused Lasso)

Beyond linear model (GLMs, MRFs)

Sparse Matrix Factorizations

Beyond variable-selection: variable construction

Summary and Open Issues

Outline

Page 118: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Summary and Open Issues

Common problem: small-sample, high-dimensional inference

Feasible if the input is structured – e.g. sparse in some basis

Efficient recovery of a sparse input via l1-relaxation

Sparse modeling with l1-regularization: interpretability + prediction

Beyond l1-regularization: adding more structure

Beyond Lasso: M-estimators, dictionary learning, variable construction

Open issues, still:

choice of regularization parameter?

choice of proper dictionary?

Is interpretability the same as sparsity? (NO!)

Page 119: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Interpretability: Much More than Sparsity?

Data: x - fMRI voxels, y - mental state (sad / happy)
Predictive model: y = f(x)

[Figure: from data and a predictive model to interpretable predictive patterns]

Page 120: 2012 10 04_machine_learning_lecture05_icml10_tutorial

References

Page 121: 2012 10 04_machine_learning_lecture05_icml10_tutorial

References

Page 122: 2012 10 04_machine_learning_lecture05_icml10_tutorial

References

Page 123: 2012 10 04_machine_learning_lecture05_icml10_tutorial

References

Page 124: 2012 10 04_machine_learning_lecture05_icml10_tutorial

References

Page 125: 2012 10 04_machine_learning_lecture05_icml10_tutorial

References

Page 126: 2012 10 04_machine_learning_lecture05_icml10_tutorial

References

Page 127: 2012 10 04_machine_learning_lecture05_icml10_tutorial

References

Page 128: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Appendix A

Page 129: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Why Exponential Family Loss?

Variety of data types: real-valued, binary, nominal, non-negative, etc. Noise model: exponential family.

- Network management, problem diagnosis: binary failures - Bernoulli; non-negative delays - exponential
  [Diagram: probing station, router, hub, web server, DB server]
- Collaborative prediction: discrete rankings - multinomial
- DNA microarray data analysis: real-valued expression level - Gaussian
- fMRI data analysis: real-valued voxel intensities; binary, nominal and continuous responses

Presenter Notes: Red box on left top: which data were actually used in experiments? Color version; histograms.
Page 130: 2012 10 04_machine_learning_lecture05_icml10_tutorial
Page 131: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Legendre duality: exponential-family distribution ⇔ Bregman divergence

Image courtesy of Arindam Banerjee

Page 132: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Appendix A

Page 133: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Appendix B

Page 134: 2012 10 04_machine_learning_lecture05_icml10_tutorial


Page 135: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Appendix B

Page 136: 2012 10 04_machine_learning_lecture05_icml10_tutorial

Beyond LASSO

- Other penalties (structured sparsity): Elastic Net penalty, Fused Lasso penalty, block l1-lq norms (group & multi-task penalties)
- Improving consistency and stability w.r.t. the sparsity parameter choice: Adaptive Lasso, Relaxed Lasso, Bootstrap Lasso, Randomized Lasso with stability selection
- Other losses (other data likelihoods): Generalized Linear Models (exponential-family noise), Multivariate Gaussians (Gaussian MRFs)