Sparsity Models
Tong Zhang
Rutgers University
Topics
- Standard sparse regression model
  - algorithms: convex relaxation and greedy algorithm
  - sparse recovery analysis: high-level view
- Some extensions (complex regularization)
  - structured sparsity
  - graphical model
  - matrix regularization
Modern Sparsity Analysis: Motivation
- Modern datasets are often high dimensional: statistical estimation suffers from the curse of dimensionality
- Sparsity: a popular assumption to address the curse of dimensionality, motivated by real applications
- Challenges:
  - formulation, focusing on efficient computation
  - mathematical analysis
Standard Sparse Regression
Model: Y = X β̄ + ε
- Y ∈ R^n: observation
- X ∈ R^{n×p}: design matrix
- β̄ ∈ R^p: parameter vector to be estimated
- ε ∈ R^n: zero-mean stochastic noise with variance σ²

High dimensional setting: n ≪ p
Sparsity: β̄ has few nonzero components
- supp(β̄) = {j : β̄_j ≠ 0}
- ‖β̄‖₀ = |supp(β̄)| is small: ≪ n
Algorithms for Standard Sparsity
L0 regularization: the natural method (computationally inefficient)

    β̂_L0 = arg min_β ‖Y − Xβ‖₂²  subject to ‖β‖₀ ≤ k

L1 regularization (Lasso): convex relaxation (computationally efficient)

    β̂_L1 = arg min_β [ ‖Y − Xβ‖₂² + λ‖β‖₁ ]

Theoretical question: how well can we estimate the parameter β̄ (recovery performance)?
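To make the convex relaxation concrete, here is a minimal sketch (assuming scikit-learn; the data and names are illustrative, not from the slides). Note that sklearn's `alpha` corresponds to the slide's λ only up to the 1/(2n) rescaling of the squared loss that sklearn uses.

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n, p, k = 100, 500, 5                      # high dimensional: n << p
    X = rng.standard_normal((n, p))
    beta_bar = np.zeros(p)
    beta_bar[:k] = 1.0                         # k nonzero components
    Y = X @ beta_bar + 0.1 * rng.standard_normal(n)

    # sklearn minimizes (1/(2n)) ||Y - X b||_2^2 + alpha ||b||_1,
    # i.e. alpha plays the role of lambda/(2n) in the slide's objective
    beta_hat = Lasso(alpha=0.05).fit(X, Y).coef_
    print(np.flatnonzero(beta_hat))            # estimated support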
Greedy Algorithms for Standard Sparse Regularization
Reformulation: find a variable set F ⊂ {1, …, p} to minimize

    min_β ‖Xβ − Y‖₂²  s.t. supp(β) ⊂ F, |F| ≤ k

Forward Greedy Algorithm (OMP): select variables one by one
- Initialize the variable set F⁰ = ∅ at k = 0
- Iterate k = 1, …, p:
  - find the best variable j to add to F^{k−1} (maximum reduction of squared error)
  - F^k = F^{k−1} ∪ {j}
- Terminate with some stopping criterion; output β̂ using least-squares regression on the selected variables F^k
Theoretical question: recovery performance?
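A minimal NumPy sketch of the greedy loop above (illustrative, not the slides' code). For unit-norm columns, the variable giving the maximum reduction of squared error is the one most correlated with the current residual.

    import numpy as np

    def omp(X, Y, k):
        """Forward greedy selection (OMP): add k variables one by one."""
        n, p = X.shape
        F, residual = [], Y.copy()
        for _ in range(k):
            # best variable: max |X_j^T residual| (equals the maximum
            # squared-error reduction when columns are unit-norm)
            j = int(np.argmax(np.abs(X.T @ residual)))
            F.append(j)
            beta_F, *_ = np.linalg.lstsq(X[:, F], Y, rcond=None)
            residual = Y - X[:, F] @ beta_F    # refit on F, update residual
        beta_hat = np.zeros(p)
        beta_hat[F] = beta_F
        return beta_hat, F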
Conditions and Results
Types of results (sparse recovery):
- Variable selection (can we find the nonzero variables): can we recover the true support F̄?

    supp(β̂) ≈ F̄?

- Parameter estimation (how well can we estimate β̄): can we recover the parameters?

    ‖β̂ − β̄‖₂² ≤ ?

Are efficient algorithms (such as L1 or OMP) good enough?
Yes, but conditions are required:
- Irrepresentable condition: for support recovery
- RIP (Restricted Isometry Property): for parameter recovery
KKT Condition for Lasso Solution
Lasso solution:

    β̂_L1 = arg min_β [ ‖Y − Xβ‖₂² + λ‖β‖₁ ]

KKT condition at β̂ = β̂_L1: there exists a subgradient that is zero, i.e. for all j = 1, …, p (X_j is the j-th column of X):

    2 X_jᵀ (X β̂ − y) + λ ∇|β̂_j| = 0.

Subgradient of the L1 norm:

    ∇|u| = sign(u) = { 1 if u > 0;  −1 if u < 0;  any value in [−1, 1] if u = 0 }.

If we can find a β̂ that satisfies the KKT condition, then it is a Lasso solution. A slightly stronger condition implies uniqueness.
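The KKT condition can be checked numerically for any candidate β̂; a minimal sketch (illustrative names; λ as in the slide's objective):

    import numpy as np

    def kkt_violation(X, Y, beta_hat, lam):
        """Largest violation of the Lasso KKT condition at beta_hat."""
        g = 2 * X.T @ (X @ beta_hat - Y)       # gradient of ||Y - X b||_2^2
        on = beta_hat != 0
        # on the support: g_j + lam * sign(beta_hat_j) must equal 0
        v_on = np.abs(g[on] + lam * np.sign(beta_hat[on]))
        # off the support: |g_j| <= lam must hold
        v_off = np.maximum(np.abs(g[~on]) - lam, 0.0)
        return max(v_on.max(initial=0.0), v_off.max(initial=0.0))

A value near zero certifies that beta_hat is (numerically) a Lasso solution for this λ.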
Feature Selection Consistency of Lasso
Idea: construct a solution and check the KKT condition.
Define β̂ such that β̂_F̄ satisfies

    2 X_F̄ᵀ (X_F̄ β̂_F̄ − y) + λ sign(β̄)_F̄ = 0,

and set β̂_{F̄ᶜ} = 0.
Condition A:
- X_F̄ᵀ X_F̄ is full rank
- sign(β̂_F̄) = sign(β̄_F̄)
- 2 |X_jᵀ (X_F̄ β̂_F̄ − y)| < λ for j ∉ F̄.
Under Condition A, β̂ is the unique Lasso solution (it satisfies the KKT condition); since supp(β̂) = F̄ by construction, feature selection is consistent.
Irrepresentable Condition
The condition

    µ = sup_{j ∉ F̄} |X_jᵀ X_F̄ (X_F̄ᵀ X_F̄)⁻¹ sign(β̄)_F̄| < 1

is called the irrepresentable condition. It implies Condition A when y = X β̄ and λ is sufficiently small.
Under the irrepresentable condition, if
- the noise is sufficiently small, and
- min_{j ∈ F̄} |β̄_j| is larger than the noise level,
then there exists an appropriate λ such that Condition A holds. Thus the Lasso solution is unique and feature-selection consistent.
A condition similar to the irrepresentable condition can be derived for OMP.
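For a given design and support, µ can be computed directly; a small sketch (illustrative names, assuming X_F̄ᵀ X_F̄ is invertible):

    import numpy as np

    def irrepresentable_mu(X, F_bar, sign_F):
        """mu = max_{j not in F} |X_j^T X_F (X_F^T X_F)^{-1} sign(beta_bar)_F|."""
        F = np.asarray(F_bar)
        outside = np.setdiff1d(np.arange(X.shape[1]), F)
        XF = X[:, F]
        w = np.linalg.solve(XF.T @ XF, sign_F)   # (X_F^T X_F)^{-1} sign(beta_bar)_F
        return float(np.max(np.abs(X[:, outside].T @ (XF @ w))))

    # mu < 1 means the irrepresentable condition holds for (X, F_bar)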
RIP Conditions
Feature selection consistency implies good parameter estimation. However, the irrepresentable condition is too strong.
RIP (restricted isometry property): a weaker condition that can be used to obtain parameter estimation results.
Definition of RIP: for some c > 1, the following condition holds:

    ρ₊(c k̄) / ρ₋(c k̄) < ∞,

where

    ρ₊(s) = sup { βᵀ Xᵀ X β / βᵀ β : ‖β‖₀ ≤ s },
    ρ₋(s) = inf { βᵀ Xᵀ X β / βᵀ β : ‖β‖₀ ≤ s },

with k̄ = |F̄| = ‖β̄‖₀.
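ρ₊(s) and ρ₋(s) are the extreme eigenvalues over all s × s principal submatrices of Xᵀ X, so for very small problems they can be computed by brute force; a sketch for illustration only (the enumeration is exponential in s):

    import numpy as np
    from itertools import combinations

    def restricted_eigs(X, s):
        """Return (rho_minus(s), rho_plus(s)) by enumerating size-s supports."""
        G = X.T @ X
        lo, hi = np.inf, -np.inf
        for S in combinations(range(X.shape[1]), s):
            w = np.linalg.eigvalsh(G[np.ix_(S, S)])
            lo, hi = min(lo, w[0]), max(hi, w[-1])
        return lo, hi   # RIP asks rho_plus(c*k) / rho_minus(c*k) to be bounded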
Results under Restricted Isometry Property
Parameter estimation under RIP:

    ‖β̄ − β̂‖₂² = O(σ² ‖β̄‖₀ ln p / n),

where σ² is the noise variance.
- this result can be obtained both for Lasso and for OMP
- it is the best possible rate
Feature selection under RIP: neither procedure achieves feature selection consistency.
Improvement:
- a non-convex formulation is needed for optimal feature selection under RIP
- it is trickier to analyze: a general theory appeared only very recently
Complex Regularization: structured sparsity
Wavelet domain: sparsity pattern not random (structured)
[Figure: an image in the image domain and its wavelet-domain coefficients]
Can we take advantage of this structure?
Structured Sparsity Characterization
Observation:
- the sparsity pattern is the set of nonzero coefficients
- not all sparsity patterns are equally likely
Our proposal: an information-theoretic characterization of "structure":
- a sparsity pattern F is associated with a cost c(F)
- c(F) is the negative log-likelihood of F (or a multiple of it)
Optimization problem:

    min_β ‖Xβ − Y‖₂²  subject to ‖β‖₀ + c(supp(β)) ≤ s.

- c(supp(β)): cost for selecting the support supp(β)
- ‖β‖₀: cost for estimation after feature selection
Example: Group Structure
Variables are divided into pre-defined groups G₁, …, G_{p/m}, with m variables per group.
Example (m = 4): [diagram: groups G₁, G₂, …, G_{p/m}; nodes are variables, gray nodes are the selected variables (groups 1, 2, 4)]
Assumption:
- coefficients are not completely random
- coefficients in each group are simultaneously (or nearly simultaneously) zero or nonzero
How can we take advantage of the group structure?
Example: Group Structure
Variables are divided into pre-defined groups G₁, …, G_{p/m}, with m variables per group.
Assumption: coefficients in each group are simultaneously zero or nonzero.
- Group sparsity pattern cost: ‖β‖₀ + m⁻¹ ‖β‖₀ ln p
- Standard sparsity pattern cost (for Lasso): ‖β‖₀ ln p
Theoretical question: can we take advantage of the group sparsity structure to improve on Lasso? (A worked comparison of the two costs follows.)
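For illustration, plugging arbitrary numbers into the two costs above: with p = 10⁶, m = 100, and ‖β‖₀ = 1000, we have ln p ≈ 13.8, so the standard cost is ‖β‖₀ ln p ≈ 13,800, while the group cost is ‖β‖₀ + m⁻¹ ‖β‖₀ ln p ≈ 1000 + 138 ≈ 1138, roughly a 12× reduction when the group structure is correct.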
Convex Relaxation for group sparsity
L1-L2 convex relaxation (group Lasso):

    β̂ = arg min_β ‖Xβ − Y‖₂² + λ Σ_j ‖β_{G_j}‖₂.

This is designed to take advantage of the group sparsity structure:
- within each group: L2 regularization (does not encourage sparsity)
- across groups: L1 regularization (encourages sparsity)
Question: what is the benefit of the group Lasso formulation?
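One standard solver for the group-Lasso objective above is proximal gradient descent with blockwise soft-thresholding; a minimal NumPy sketch (illustrative, assuming disjoint groups given as index arrays):

    import numpy as np

    def group_lasso(X, Y, groups, lam, n_iter=500):
        """Proximal gradient for ||X b - Y||_2^2 + lam * sum_j ||b_{G_j}||_2."""
        n, p = X.shape
        L = 2 * np.linalg.eigvalsh(X.T @ X)[-1]   # Lipschitz constant of the gradient
        beta = np.zeros(p)
        for _ in range(n_iter):
            z = beta - 2 * X.T @ (X @ beta - Y) / L   # gradient step
            for G in groups:                          # blockwise shrinkage (prox)
                nrm = np.linalg.norm(z[G])
                z[G] = 0.0 if nrm == 0 else max(0.0, 1 - lam / (L * nrm)) * z[G]
            beta = z
        return beta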
Recovery Analysis for Lasso and Group Lasso
Simple sparsity: s = ‖β̄‖₀ variables out of p variables
- information-theoretic complexity (log of the number of choices): O(s ln p)
- statistical recovery performance:

    ‖β̂ − β̄‖₂² = O(σ² ‖β̄‖₀ ln p / n)

Group sparsity: g groups out of p/m groups (ideally g = ‖β̄‖₀ / m)
- information-theoretic complexity (log of the number of choices): O(g ln(p/m))
- statistical recovery performance for group Lasso: if supp(β̄) is covered by g groups, then under group RIP (weaker than RIP)

    ‖β̂ − β̄‖₂² = O( (σ²/n) [ g ln(p/m) + m g ] ),

where g ln(p/m) is the cost of group selection and m g is the cost of estimation after group selection.
Group sparsity: correct group structure
[Figure: coefficient plots over indices 1–500 with values in [−2, 2]; panels: (a) Original, (b) Lasso, (c) Group Lasso]
Group sparsity: incorrect group structure
[Figure: coefficient plots over indices 0–500 with values in [−2, 2]; panels: (a) Original, (b) Lasso, (c) Group Lasso]
Matrix Formulation: Graphical Model Example
Learning gene interaction network structure
Formulation: Gaussian Graphical Model
Multi-dimensional Gaussian vectors: X₁, …, Xₙ ∼ N(µ, Σ). Precision matrix: Θ = Σ⁻¹.
The nonzeros of the precision matrix give the graphical model structure:

    P(Xᵢ) ∝ |Θ|^{1/2} exp[ −(1/2) (Xᵢ − µ)ᵀ Θ (Xᵢ − µ) ],

where |·| denotes the determinant.
Estimation: L1-regularized maximum likelihood estimator

    Θ̂ = arg min_Θ [ −ln |Θ| + tr(Σ̂ Θ) + λ ‖Θ‖₁ ],

- ‖·‖₁: element-wise L1 regularization to encourage sparsity
- Σ̂: empirical covariance matrix
Analyses exist (feature selection and parameter estimation): the techniques are similar to the L1 analysis, but the results are not fully satisfactory.
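A minimal sketch (assuming scikit-learn): the penalized estimator above is available as GraphicalLasso, with alpha playing the role of λ; the data here are illustrative.

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    rng = np.random.default_rng(0)
    Z = rng.standard_normal((200, 10))          # n = 200 samples, 10 variables
    model = GraphicalLasso(alpha=0.1).fit(Z)    # L1-penalized max likelihood
    Theta_hat = model.precision_                # sparse precision estimate
    edges = np.abs(Theta_hat) > 1e-8            # nonzero pattern = graph edges
    print(edges.sum())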
Matrix Completion
User × movie rating matrix (m users, n movies):

          M1  M2  M3  M4  M5  M6  M7  M8
    U1     1   ?   0   4   2   1   ?   ?
    U2     2   4   ?   ?   4   ?   1   2
    U3     2   3   2   ?   4   ?   ?   2
    U4     ?   1   0   3   2   ?   1   1
    U5     1   ?   ?   1   ?   ?   3   2

m × n matrix with incomplete ratings: can we fill in the missing values?
This requires assumptions:
- intuition: U2 and U3 have similar ratings on the observed entries, so assume they have similar preferences
- low-rank (rank-r) structure: map user i to uᵢ ∈ Rʳ and movie j to vⱼ ∈ Rʳ, with rating ≈ uᵢᵀ vⱼ. Let X be the true rating matrix:

    X ≈ U Vᵀ   (U: m × r, V: n × r)
Formulation
Let S = {observed (i, j) entries}; let y_{ij} be the observed value for (i, j) ∈ S; let X be the true underlying rating matrix.
We want to find X to fit the observed y_{ij}, assuming X is low-rank:

    min_{X ∈ R^{m×n}} [ Σ_{(i,j)∈S} (X_{ij} − y_{ij})² + λ · rank(X) ].

- rank(X): a nonconvex function of X
- convex relaxation: the trace norm ‖X‖∗, defined as the sum of the singular values of X
The convex reformulation is

    min_{X ∈ R^{m×n}} [ Σ_{(i,j)∈S} (X_{ij} − y_{ij})² + λ ‖X‖∗ ].

The solution of trace-norm regularization is low-rank.
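The convex reformulation can be solved by proximal gradient, whose trace-norm proximal step soft-thresholds the singular values; a minimal sketch (illustrative names: S is a boolean mask of observed entries, Y holds the observed values; the observed-loss gradient is 2-Lipschitz, so eta = 0.5 is a safe step size):

    import numpy as np

    def complete(Y, S, lam, n_iter=200, eta=0.5):
        """Proximal gradient for sum_{(i,j) in S}(X_ij - y_ij)^2 + lam ||X||_*."""
        X = np.zeros_like(Y)
        for _ in range(n_iter):
            grad = 2 * S * (X - Y)                 # gradient of the observed loss
            U, s, Vt = np.linalg.svd(X - eta * grad, full_matrices=False)
            s = np.maximum(s - eta * lam, 0.0)     # soft-threshold singular values
            X = (U * s) @ Vt                       # low-rank next iterate
        return X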
Sparsity versus Low-rank
A vector β ∈ Rᵖ: p parameters
- reduce dimension (sparsity): ‖β‖₀ is small
- the constraint ‖β‖₀ ≤ s is nonconvex
- convex relaxation: the convex hull of unit 1-sparse vectors, which gives the L1 regularization ‖β‖₁ ≤ 1
- the vector solution with L1 regularization is sparse
A matrix X ∈ R^{m×n}: m × n parameters
- reduce dimension (low rank): X = Σ_{j=1}^{r} uⱼ vⱼᵀ, where uⱼ ∈ Rᵐ and vⱼ ∈ Rⁿ are vectors
- number of parameters: no more than rm + rn
- the rank constraint is nonconvex
- convex relaxation: the convex hull of unit rank-one matrices, which gives the trace-norm regularization ‖X‖∗ ≤ 1
- the matrix solution with trace-norm regularization is low-rank
Matrix Regularization Example: mixed sparsity and low rank
Y (observed) = X_L (low-rank) + X_S (sparse)

    [X̂_S, X̂_L] = arg min_{X_S, X_L} (1/(2µ)) ‖(X_S + X_L) − Y‖₂² + λ ‖X_S‖₁ + ‖X_L‖∗,

where the trace norm ‖·‖∗ (the sum of the singular values of a matrix) encourages a low-rank X_L.
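A minimal proximal-gradient sketch of the decomposition above (illustrative; both proximal steps are taken from the same gradient point, and the joint gradient is (2/µ)-Lipschitz, so eta = µ/2 is a safe step size):

    import numpy as np

    def sparse_plus_lowrank(Y, lam, mu=1.0, n_iter=300):
        """Minimize (1/(2*mu))||XS + XL - Y||^2 + lam*||XS||_1 + ||XL||_*."""
        XS, XL = np.zeros_like(Y), np.zeros_like(Y)
        eta = mu / 2
        for _ in range(n_iter):
            G = (XS + XL - Y) / mu               # shared gradient of the coupling
            Z = XS - eta * G
            XS = np.sign(Z) * np.maximum(np.abs(Z) - eta * lam, 0)   # L1 prox
            U, s, Vt = np.linalg.svd(XL - eta * G, full_matrices=False)
            XL = (U * np.maximum(s - eta, 0)) @ Vt                   # trace-norm prox
        return XS, XL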
Theoretical Analysis
We want to know under what conditions we can recover X_S and X_L. The matrix is m × n.
- X_S: sparse (spike-like outliers)
  - no more than n₀ outliers per row
  - no more than m₀ outliers per column
- X_L: rank r
  - incoherence: X_L is "flat", i.e. no single component is large
Question: how many outliers per row (n₀) and per column (m₀) can be allowed while still recovering X_S and X_L?
Partial answer (not completely satisfactory):
- if the sparsity pattern supp(X_S) is random: exact recovery under m₀ = O(m) and n₀ = O(n)
- if supp(X_S) does not have to be random: m₀ ≤ c(m/r) and n₀ ≤ c(n/r) for some constant c (r is the rank of X_L)
References
Statistical Science Special Issue on Sparsity and Regularization, 2012. http://www.imstat.org/sts/future_papers.html
- Structured sparsity: F. Bach et al.
- General theoretical analysis: S. Negahban et al.
- Graphical models: J. Lafferty et al.
- Nonconvex methods: C.-H. Zhang and T. Zhang
- ...