Sparsity Models
Tong Zhang
Rutgers University
Topics
- Standard sparse regression model
  - algorithms: convex relaxation and greedy algorithm
  - sparse recovery analysis: high-level view
- Some extensions (complex regularization)
  - structured sparsity
  - graphical model
  - matrix regularization
Modern Sparsity Analysis: Motivation
- Modern datasets are often high dimensional: statistical estimation suffers from the curse of dimensionality
- Sparsity: a popular assumption to address the curse of dimensionality, motivated by real applications
- Challenges:
  - formulation, focusing on efficient computation
  - mathematical analysis
Standard Sparse Regression
Model: Y = X β̄ + ε
- Y ∈ R^n: observation
- X ∈ R^{n×p}: design matrix
- β̄ ∈ R^p: parameter vector to be estimated
- ε ∈ R^n: zero-mean stochastic noise with variance σ²

High dimensional setting: n ≪ p
Sparsity: β̄ has few nonzero components
- supp(β̄) = {j : β̄_j ≠ 0}
- ‖β̄‖₀ = |supp(β̄)| is small: ≪ n
Algorithms for Standard Sparsity
L0 regularization: the natural method (computationally inefficient)

    β̂_L0 = arg min_β ‖Y − Xβ‖₂²  subject to ‖β‖₀ ≤ k

L1 regularization (Lasso): convex relaxation (computationally efficient)

    β̂_L1 = arg min_β [ ‖Y − Xβ‖₂² + λ‖β‖₁ ]

Theoretical question: how well can we estimate the parameter β̄ (recovery performance)?
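To make the convex relaxation concrete, here is a minimal sketch (assuming scikit-learn; the data and names are illustrative, not from the slides). Note that sklearn's `alpha` corresponds to the slide's λ only up to the 1/(2n) rescaling of the squared loss that sklearn uses.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p, k = 100, 500, 5                      # high dimensional: n << p
    X = rng.standard_normal((n, p))
    beta_bar = np.zeros(p)
    beta_bar[:k] = 1.0                         # k nonzero components
    Y = X @ beta_bar + 0.1 * rng.standard_normal(n)

    # sklearn minimizes (1/(2n)) ||Y - X b||_2^2 + alpha ||b||_1,
    # i.e. alpha plays the role of lambda/(2n) in the slide's objective
    beta_hat = Lasso(alpha=0.05).fit(X, Y).coef_
    print(np.flatnonzero(beta_hat))            # estimated support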
Greedy Algorithms for Standard Sparse Regularization
Reformulation: find a variable set F ⊂ {1, …, p} to minimize

    min_β ‖Xβ − Y‖₂²  s.t. supp(β) ⊂ F, |F| ≤ k

Forward Greedy Algorithm (OMP): select variables one by one
- Initialize the variable set F⁰ = ∅ at k = 0
- Iterate k = 1, …, p:
  - find the best variable j to add to F^{k−1} (maximum reduction of squared error)
  - F^k = F^{k−1} ∪ {j}
- Terminate with some stopping criterion; output β̂ using least-squares regression on the selected variables F^k
Theoretical question: recovery performance?
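A minimal NumPy sketch of the greedy loop above (illustrative, not the slides' code). For unit-norm columns, the variable giving the maximum reduction of squared error is the one most correlated with the current residual.

    import numpy as np

    def omp(X, Y, k):
        """Forward greedy selection (OMP): add k variables one by one."""
        n, p = X.shape
        F, residual = [], Y.copy()
        for _ in range(k):
            # best variable: max |X_j^T residual| (equals the maximum
            # squared-error reduction when columns are unit-norm)
            j = int(np.argmax(np.abs(X.T @ residual)))
            F.append(j)
            beta_F, *_ = np.linalg.lstsq(X[:, F], Y, rcond=None)
            residual = Y - X[:, F] @ beta_F    # refit on F, update residual
        beta_hat = np.zeros(p)
        beta_hat[F] = beta_F
        return beta_hat, F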
Conditions and Results
Types of results (sparse recovery):
- Variable selection (can we find the nonzero variables): can we recover the true support F̄?

    supp(β̂) ≈ F̄?

- Parameter estimation (how well can we estimate β̄): can we recover the parameters?

    ‖β̂ − β̄‖₂² ≤ ?

Are efficient algorithms (such as L1 or OMP) good enough?
Yes, but conditions are required:
- Irrepresentable condition: for support recovery
- RIP (Restricted Isometry Property): for parameter recovery
KKT Condition for Lasso Solution
Lasso solution:

    β̂_L1 = arg min_β [ ‖Y − Xβ‖₂² + λ‖β‖₁ ]

KKT condition at β̂ = β̂_L1: there exists a subgradient that is zero, i.e. for all j = 1, …, p (X_j is the j-th column of X):

    2 X_jᵀ (X β̂ − y) + λ ∇|β̂_j| = 0.

Subgradient of the L1 norm:

    ∇|u| = sign(u) = { 1 if u > 0;  −1 if u < 0;  any value in [−1, 1] if u = 0 }.

If we can find a β̂ that satisfies the KKT condition, then it is a Lasso solution. A slightly stronger condition implies uniqueness.
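The KKT condition can be checked numerically for any candidate β̂; a minimal sketch (illustrative names; λ as in the slide's objective):

    import numpy as np

    def kkt_violation(X, Y, beta_hat, lam):
        """Largest violation of the Lasso KKT condition at beta_hat."""
        g = 2 * X.T @ (X @ beta_hat - Y)       # gradient of ||Y - X b||_2^2
        on = beta_hat != 0
        # on the support: g_j + lam * sign(beta_hat_j) must equal 0
        v_on = np.abs(g[on] + lam * np.sign(beta_hat[on]))
        # off the support: |g_j| <= lam must hold
        v_off = np.maximum(np.abs(g[~on]) - lam, 0.0)
        return max(v_on.max(initial=0.0), v_off.max(initial=0.0))

A value near zero certifies that beta_hat is (numerically) a Lasso solution for this λ.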
Feature Selection Consistency of Lasso
Idea: construct a solution and check the KKT condition.
Define β̂ such that β̂_F̄ satisfies

    2 X_F̄ᵀ (X_F̄ β̂_F̄ − y) + λ sign(β̄)_F̄ = 0,

and set β̂_{F̄ᶜ} = 0.
Condition A:
- X_F̄ᵀ X_F̄ is full rank
- sign(β̂_F̄) = sign(β̄_F̄)
- 2 |X_jᵀ (X_F̄ β̂_F̄ − y)| < λ for j ∉ F̄.
Under Condition A, β̂ is the unique Lasso solution (it satisfies the KKT condition); since supp(β̂) = F̄ by construction, feature selection is consistent.
Irrepresentable Condition
The condition

    µ = sup_{j ∉ F̄} |X_jᵀ X_F̄ (X_F̄ᵀ X_F̄)⁻¹ sign(β̄)_F̄| < 1

is called the irrepresentable condition. It implies Condition A when y = X β̄ and λ is sufficiently small.
Under the irrepresentable condition, if
- the noise is sufficiently small, and
- min_{j ∈ F̄} |β̄_j| is larger than the noise level,
then there exists an appropriate λ such that Condition A holds. Thus the Lasso solution is unique and feature-selection consistent.
A condition similar to the irrepresentable condition can be derived for OMP.
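For a given design and support, µ can be computed directly; a small sketch (illustrative names, assuming X_F̄ᵀ X_F̄ is invertible):

    import numpy as np

    def irrepresentable_mu(X, F_bar, sign_F):
        """mu = max_{j not in F} |X_j^T X_F (X_F^T X_F)^{-1} sign(beta_bar)_F|."""
        F = np.asarray(F_bar)
        outside = np.setdiff1d(np.arange(X.shape[1]), F)
        XF = X[:, F]
        w = np.linalg.solve(XF.T @ XF, sign_F)   # (X_F^T X_F)^{-1} sign(beta_bar)_F
        return float(np.max(np.abs(X[:, outside].T @ (XF @ w))))

    # mu < 1 means the irrepresentable condition holds for (X, F_bar)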
RIP Conditions
Feature selection consistency implies good parameter estimation. However, the irrepresentable condition is too strong.
RIP (restricted isometry property): a weaker condition that can be used to obtain parameter estimation results.
Definition of RIP: for some c > 1, the following condition holds:

    ρ₊(c k̄) / ρ₋(c k̄) < ∞,

where

    ρ₊(s) = sup { βᵀ Xᵀ X β / βᵀ β : ‖β‖₀ ≤ s },
    ρ₋(s) = inf { βᵀ Xᵀ X β / βᵀ β : ‖β‖₀ ≤ s },

with k̄ = |F̄| = ‖β̄‖₀.
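ρ₊(s) and ρ₋(s) are the extreme eigenvalues over all s × s principal submatrices of Xᵀ X, so for very small problems they can be computed by brute force; a sketch for illustration only (the enumeration is exponential in s):

    import numpy as np
    from itertools import combinations

    def restricted_eigs(X, s):
        """Return (rho_minus(s), rho_plus(s)) by enumerating size-s supports."""
        G = X.T @ X
        lo, hi = np.inf, -np.inf
        for S in combinations(range(X.shape[1]), s):
            w = np.linalg.eigvalsh(G[np.ix_(S, S)])
            lo, hi = min(lo, w[0]), max(hi, w[-1])
        return lo, hi   # RIP asks rho_plus(c*k) / rho_minus(c*k) to be bounded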
Results under Restricted Isometry Property
Parameter estimation under RIP:

    ‖β̄ − β̂‖₂² = O(σ² ‖β̄‖₀ ln p / n),

where σ² is the noise variance.
- this result can be obtained both for Lasso and for OMP
- it is the best possible rate
Feature selection under RIP: neither procedure achieves feature selection consistency.
Improvement:
- a non-convex formulation is needed for optimal feature selection under RIP
- it is trickier to analyze: a general theory appeared only very recently
Complex Regularization: structured sparsity
Wavelet domain: sparsity pattern not random (structured)
[Figure: an image in the image domain and its wavelet-domain coefficients]
Can we take advantage of this structure?
Structured Sparsity Characterization
Observation:
- the sparsity pattern is the set of nonzero coefficients
- not all sparsity patterns are equally likely
Our proposal: an information-theoretic characterization of "structure":
- a sparsity pattern F is associated with a cost c(F)
- c(F) is the negative log-likelihood of F (or a multiple of it)
Optimization problem:

    min_β ‖Xβ − Y‖₂²  subject to ‖β‖₀ + c(supp(β)) ≤ s.

- c(supp(β)): cost for selecting the support supp(β)
- ‖β‖₀: cost for estimation after feature selection
Example: Group Structure
Variables are divided into pre-defined groups G₁, …, G_{p/m}, with m variables per group.
Example (m = 4): [diagram: groups G₁, G₂, …, G_{p/m}; nodes are variables, gray nodes are the selected variables (groups 1, 2, 4)]
Assumption:
- coefficients are not completely random
- coefficients in each group are simultaneously (or nearly simultaneously) zero or nonzero
How can we take advantage of the group structure?
Example: Group Structure
Variables are divided into pre-defined groups G₁, …, G_{p/m}, with m variables per group.
Assumption: coefficients in each group are simultaneously zero or nonzero.
- Group sparsity pattern cost: ‖β‖₀ + m⁻¹ ‖β‖₀ ln p
- Standard sparsity pattern cost (for Lasso): ‖β‖₀ ln p
Theoretical question: can we take advantage of the group sparsity structure to improve on Lasso? (A worked comparison of the two costs follows.)
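For illustration, plugging arbitrary numbers into the two costs above: with p = 10⁶, m = 100, and ‖β‖₀ = 1000, we have ln p ≈ 13.8, so the standard cost is ‖β‖₀ ln p ≈ 13,800, while the group cost is ‖β‖₀ + m⁻¹ ‖β‖₀ ln p ≈ 1000 + 138 ≈ 1138, roughly a 12× reduction when the group structure is correct.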
Convex Relaxation for group sparsity
L1-L2 convex relaxation (group Lasso):

    β̂ = arg min_β ‖Xβ − Y‖₂² + λ Σ_j ‖β_{G_j}‖₂.

This is designed to take advantage of the group sparsity structure:
- within each group: L2 regularization (does not encourage sparsity)
- across groups: L1 regularization (encourages sparsity)
Question: what is the benefit of the group Lasso formulation?
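One standard solver for the group-Lasso objective above is proximal gradient descent with blockwise soft-thresholding; a minimal NumPy sketch (illustrative, assuming disjoint groups given as index arrays):

    import numpy as np

    def group_lasso(X, Y, groups, lam, n_iter=500):
        """Proximal gradient for ||X b - Y||_2^2 + lam * sum_j ||b_{G_j}||_2."""
        n, p = X.shape
        L = 2 * np.linalg.eigvalsh(X.T @ X)[-1]   # Lipschitz constant of the gradient
        beta = np.zeros(p)
        for _ in range(n_iter):
            z = beta - 2 * X.T @ (X @ beta - Y) / L   # gradient step
            for G in groups:                          # blockwise shrinkage (prox)
                nrm = np.linalg.norm(z[G])
                z[G] = 0.0 if nrm == 0 else max(0.0, 1 - lam / (L * nrm)) * z[G]
            beta = z
        return beta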
Recovery Analysis for Lasso and Group Lasso
Simple sparsity: s = ‖β̄‖₀ variables out of p variables
- information-theoretic complexity (log of the number of choices): O(s ln p)
- statistical recovery performance:

    ‖β̂ − β̄‖₂² = O(σ² ‖β̄‖₀ ln p / n)

Group sparsity: g groups out of p/m groups (ideally g = ‖β̄‖₀ / m)
- information-theoretic complexity (log of the number of choices): O(g ln(p/m))
- statistical recovery performance for group Lasso: if supp(β̄) is covered by g groups, then under group RIP (weaker than RIP)

    ‖β̂ − β̄‖₂² = O( (σ²/n) [ g ln(p/m) + m g ] ),

where g ln(p/m) is the cost of group selection and m g is the cost of estimation after group selection.
Group sparsity: correct group structure
[Figure: coefficient plots over indices 1–500 with values in [−2, 2]; panels: (a) Original, (b) Lasso, (c) Group Lasso]
Group sparsity: incorrect group structure
[Figure: coefficient plots over indices 0–500 with values in [−2, 2]; panels: (a) Original, (b) Lasso, (c) Group Lasso]
Matrix Formulation: Graphical Model Example
Learning gene interaction network structure
Formulation: Gaussian Graphical Model
Multi-dimensional Gaussian vectors: X₁, …, Xₙ ∼ N(µ, Σ). Precision matrix: Θ = Σ⁻¹.
The nonzeros of the precision matrix give the graphical model structure:

    P(Xᵢ) ∝ |Θ|^{1/2} exp[ −(1/2) (Xᵢ − µ)ᵀ Θ (Xᵢ − µ) ],

where |·| denotes the determinant.
Estimation: L1-regularized maximum likelihood estimator

    Θ̂ = arg min_Θ [ −ln |Θ| + tr(Σ̂ Θ) + λ ‖Θ‖₁ ],

- ‖·‖₁: element-wise L1 regularization to encourage sparsity
- Σ̂: empirical covariance matrix
Analyses exist (feature selection and parameter estimation): the techniques are similar to the L1 analysis, but the results are not fully satisfactory.
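A minimal sketch (assuming scikit-learn): the penalized estimator above is available as GraphicalLasso, with alpha playing the role of λ; the data here are illustrative.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    Z = rng.standard_normal((200, 10))          # n = 200 samples, 10 variables
    model = GraphicalLasso(alpha=0.1).fit(Z)    # L1-penalized max likelihood
    Theta_hat = model.precision_                # sparse precision estimate
    edges = np.abs(Theta_hat) > 1e-8            # nonzero pattern = graph edges
    print(edges.sum())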
Matrix Completion
User × movie rating matrix (m users, n movies):

          M1  M2  M3  M4  M5  M6  M7  M8
    U1     1   ?   0   4   2   1   ?   ?
    U2     2   4   ?   ?   4   ?   1   2
    U3     2   3   2   ?   4   ?   ?   2
    U4     ?   1   0   3   2   ?   1   1
    U5     1   ?   ?   1   ?   ?   3   2

m × n matrix with incomplete ratings: can we fill in the missing values?
This requires assumptions:
- intuition: U2 and U3 have similar ratings on the observed entries, so assume they have similar preferences
- low-rank (rank-r) structure: map user i to uᵢ ∈ Rʳ and movie j to vⱼ ∈ Rʳ, with rating ≈ uᵢᵀ vⱼ. Let X be the true rating matrix:

    X ≈ U Vᵀ   (U: m × r, V: n × r)
Formulation
Let S = {observed (i, j) entries}; let y_{ij} be the observed value for (i, j) ∈ S; let X be the true underlying rating matrix.
We want to find X to fit the observed y_{ij}, assuming X is low-rank:

    min_{X ∈ R^{m×n}} [ Σ_{(i,j)∈S} (X_{ij} − y_{ij})² + λ · rank(X) ].

- rank(X): a nonconvex function of X
- convex relaxation: the trace norm ‖X‖∗, defined as the sum of the singular values of X
The convex reformulation is

    min_{X ∈ R^{m×n}} [ Σ_{(i,j)∈S} (X_{ij} − y_{ij})² + λ ‖X‖∗ ].

The solution of trace-norm regularization is low-rank.
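The convex reformulation can be solved by proximal gradient, whose trace-norm proximal step soft-thresholds the singular values; a minimal sketch (illustrative names: S is a boolean mask of observed entries, Y holds the observed values; the observed-loss gradient is 2-Lipschitz, so eta = 0.5 is a safe step size):

    import numpy as np

    def complete(Y, S, lam, n_iter=200, eta=0.5):
        """Proximal gradient for sum_{(i,j) in S}(X_ij - y_ij)^2 + lam ||X||_*."""
        X = np.zeros_like(Y)
        for _ in range(n_iter):
            grad = 2 * S * (X - Y)                 # gradient of the observed loss
            U, s, Vt = np.linalg.svd(X - eta * grad, full_matrices=False)
            s = np.maximum(s - eta * lam, 0.0)     # soft-threshold singular values
            X = (U * s) @ Vt                       # low-rank next iterate
        return X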
Sparsity versus Low-rank
A vector β ∈ Rᵖ: p parameters
- reduce dimension (sparsity): ‖β‖₀ is small
- the constraint ‖β‖₀ ≤ s is nonconvex
- convex relaxation: the convex hull of unit 1-sparse vectors, which gives the L1 regularization ‖β‖₁ ≤ 1
- the vector solution with L1 regularization is sparse
A matrix X ∈ R^{m×n}: m × n parameters
- reduce dimension (low rank): X = Σ_{j=1}^{r} uⱼ vⱼᵀ, where uⱼ ∈ Rᵐ and vⱼ ∈ Rⁿ are vectors
- number of parameters: no more than rm + rn
- the rank constraint is nonconvex
- convex relaxation: the convex hull of unit rank-one matrices, which gives the trace-norm regularization ‖X‖∗ ≤ 1
- the matrix solution with trace-norm regularization is low-rank
Matrix Regularization Example: mixed sparsity and low rank
Y (observed) = X_L (low-rank) + X_S (sparse)

    [X̂_S, X̂_L] = arg min_{X_S, X_L} (1/(2µ)) ‖(X_S + X_L) − Y‖₂² + λ ‖X_S‖₁ + ‖X_L‖∗,

where the trace norm ‖·‖∗ (the sum of the singular values of a matrix) encourages a low-rank X_L.
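A minimal proximal-gradient sketch of the decomposition above (illustrative; both proximal steps are taken from the same gradient point, and the joint gradient is (2/µ)-Lipschitz, so eta = µ/2 is a safe step size):

    import numpy as np

    def sparse_plus_lowrank(Y, lam, mu=1.0, n_iter=300):
        """Minimize (1/(2*mu))||XS + XL - Y||^2 + lam*||XS||_1 + ||XL||_*."""
        XS, XL = np.zeros_like(Y), np.zeros_like(Y)
        eta = mu / 2
        for _ in range(n_iter):
            G = (XS + XL - Y) / mu               # shared gradient of the coupling
            Z = XS - eta * G
            XS = np.sign(Z) * np.maximum(np.abs(Z) - eta * lam, 0)   # L1 prox
            U, s, Vt = np.linalg.svd(XL - eta * G, full_matrices=False)
            XL = (U * np.maximum(s - eta, 0)) @ Vt                   # trace-norm prox
        return XS, XL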
Theoretical Analysis
We want to know under what conditions we can recover X_S and X_L. The matrix is m × n.
- X_S: sparse (spike-like outliers)
  - no more than n₀ outliers per row
  - no more than m₀ outliers per column
- X_L: rank r
  - incoherence: X_L is "flat", i.e. no single component is large
Question: how many outliers per row (n₀) and per column (m₀) can be allowed while still recovering X_S and X_L?
Partial answer (not completely satisfactory):
- if the sparsity pattern supp(X_S) is random: exact recovery under m₀ = O(m) and n₀ = O(n)
- if supp(X_S) does not have to be random: m₀ ≤ c(m/r) and n₀ ≤ c(n/r) for some constant c (r is the rank of X_L)
References
Statistical Science Special Issue on Sparsity and Regularization, 2012. http://www.imstat.org/sts/future_papers.html
- Structured sparsity: F. Bach et al.
- General theoretical analysis: S. Negahban et al.
- Graphical models: J. Lafferty et al.
- Nonconvex methods: C.-H. Zhang and T. Zhang
- ...