Sparse methods for machine learning
Theory and algorithms

Francis Bach
Willow project, INRIA - Ecole Normale Superieure
NIPS Tutorial - December 2009
Special thanks to R. Jenatton, J. Mairal, G. Obozinski
Supervised learning and regularization

• Data: x_i ∈ X, y_i ∈ Y, i = 1, ..., n

• Minimize with respect to function f: X → Y:
    Σ_{i=1}^n ℓ(y_i, f(x_i)) + (λ/2) ‖f‖²
    Error on data + Regularization
  Loss & function space? Norm?

• Two theoretical/algorithmic issues:
  1. Loss
  2. Function space / norm
Regularizations

• Main goal: avoid overfitting

• Two main lines of work:

  1. Euclidean and Hilbertian norms (i.e., ℓ2-norms)
     – Possibility of nonlinear predictors
     – Nonparametric supervised learning and kernel methods
     – Well-developed theory and algorithms (see, e.g., Wahba, 1990; Scholkopf and Smola, 2001; Shawe-Taylor and Cristianini, 2004)

  2. Sparsity-inducing norms
     – Usually restricted to linear predictors on vectors f(x) = w⊤x
     – Main example: ℓ1-norm ‖w‖_1 = Σ_{j=1}^p |w_j|
     – Performs model selection as well as regularization
     – Theory and algorithms "in the making"
ℓ2 vs. ℓ1 - Gaussian hare vs. Laplacian tortoise

• First-order methods (Fu, 1998; Wu and Lange, 2008)

• Homotopy methods (Markowitz, 1956; Efron et al., 2004)
Lasso - Two main recent theoretical results

1. Support recovery condition (Zhao and Yu, 2006; Wainwright, 2009; Zou, 2006; Yuan and Lin, 2007): the Lasso is sign-consistent if and only if
     ‖Q_{J^c J} Q_{JJ}^{-1} sign(w_J)‖_∞ ≤ 1,
   where Q = lim_{n→+∞} (1/n) Σ_{i=1}^n x_i x_i⊤ ∈ R^{p×p}

2. Exponentially many irrelevant variables (Zhao and Yu, 2006; Wainwright, 2009; Bickel et al., 2009; Lounici, 2008; Meinshausen and Yu, 2008): under appropriate assumptions, consistency is possible as long as
     log p = O(n)
Going beyond the Lasso

• ℓ1-norm for linear feature selection in high dimensions
  – Lasso usually not applicable directly

• Non-linearities

• Dealing with exponentially many features

• Sparse learning on matrices
Going beyond the Lasso
Non-linearity - Multiple kernel learning

• Multiple kernel learning
  – Learn sparse combination of matrices k(x, x′) = Σ_{j=1}^p η_j k_j(x, x′)
  – Mixing positive aspects of ℓ1-norms and ℓ2-norms

• Equivalent to group Lasso
  – p multi-dimensional features Φ_j(x), where k_j(x, x′) = Φ_j(x)⊤ Φ_j(x′)
  – learn predictor Σ_{j=1}^p w_j⊤ Φ_j(x)
  – Penalization by Σ_{j=1}^p ‖w_j‖_2
Going beyond the Lasso
Structured set of features

• Dealing with exponentially many features
  – Can we design efficient algorithms for the case log p ≈ n?
  – Use structure to reduce the number of allowed patterns of zeros
  – Recursivity, hierarchies and factorization

• Prior information on sparsity patterns
  – Grouped variables with overlapping groups
Going beyond the Lasso
Sparse methods on matrices

• Learning problems on matrices
  – Multi-task learning
  – Multi-category classification
  – Matrix completion
  – Image denoising
  – NMF, topic models, etc.

• Matrix factorization
  – Two types of sparsity (low-rank or dictionary learning)
Sparse methods for machine learning
Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)
Why do ℓ1-norm constraints lead to sparsity?

• Example: minimize quadratic function Q(w) subject to ‖w‖_1 ≤ T
  – coupled soft thresholding

• Geometric interpretation
  – NB: penalizing is "equivalent" to constraining

[Figure: level sets of Q(w) in the (w1, w2) plane intersecting the ℓ1-ball (diamond, corners on the axes) vs. the ℓ2-ball (disk)]
ℓ1-norm regularization (linear setting)

• Data: covariates x_i ∈ R^p, responses y_i ∈ Y, i = 1, ..., n

• Minimize with respect to loadings/weights w ∈ R^p:
    J(w) = Σ_{i=1}^n ℓ(y_i, w⊤x_i) + λ‖w‖_1
    Error on data + Regularization

• Including a constant term b? Penalizing or constraining?

• Square loss ⇒ basis pursuit in signal processing (Chen et al., 2001), Lasso in statistics/machine learning (Tibshirani, 1996) (a small numerical sketch follows)
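A minimal sketch of the square-loss objective above, written in Python/numpy; the call to scikit-learn is an assumption (its Lasso minimizes the rescaled objective (1/(2n))‖y − Xw‖² + α‖w‖_1, hence α = λ/n):

```python
import numpy as np
from sklearn.linear_model import Lasso  # assumed available

# J(w) = 0.5 * ||y - Xw||^2 + lam * ||w||_1  (square loss case of the slide)
def lasso_objective(w, X, y, lam):
    r = y - X @ w
    return 0.5 * r @ r + lam * np.sum(np.abs(w))

rng = np.random.default_rng(0)
n, p = 64, 128
X = rng.standard_normal((n, p))
w_true = np.zeros(p); w_true[:4] = 1.0
y = X @ w_true + 0.1 * rng.standard_normal(n)

lam = 10.0
w_hat = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y).coef_
print(lasso_objective(w_hat, X, y, lam), np.count_nonzero(w_hat))
```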
A review of nonsmooth convex analysis and optimization

• Analysis: optimality conditions

• Optimization: algorithms
  – First-order methods

• Books: Boyd and Vandenberghe (2004), Bonnans et al. (2003), Bertsekas (1995), Borwein and Lewis (2000)
Optimality conditions for smooth optimization: zero gradient

• Example: ℓ2-regularization:
    min_{w ∈ R^p} Σ_{i=1}^n ℓ(y_i, w⊤x_i) + (λ/2)‖w‖_2²

  – Gradient ∇J(w) = Σ_{i=1}^n ℓ′(y_i, w⊤x_i) x_i + λw, where ℓ′(y_i, w⊤x_i) is the partial derivative of the loss w.r.t. the second variable
  – If square loss, Σ_{i=1}^n ℓ(y_i, w⊤x_i) = (1/2)‖y − Xw‖_2²
    ∗ gradient = −X⊤(y − Xw) + λw
    ∗ normal equations ⇒ w = (X⊤X + λI)^{-1} X⊤y (sketch below)

• ℓ1-norm is non-differentiable!
  – cannot compute the gradient of the absolute value
  ⇒ Directional derivatives (or subgradient)
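A minimal numpy sketch of the normal equations just derived, assuming X and y are given as arrays:

```python
import numpy as np

# Closed-form solution of the l2-regularized least-squares problem:
# w = (X^T X + lam * I)^{-1} X^T y
def ridge_closed_form(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```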
Directional derivatives - convex functions on R^p

• Directional derivative in the direction ∆ at w:
    ∇J(w, ∆) = lim_{ε→0+} [J(w + ε∆) − J(w)] / ε

• Always exists when J is convex and continuous

• Main idea: in nonsmooth situations, may need to look at all directions ∆ and not simply p independent ones

• Proposition: J is differentiable at w if and only if ∆ ↦ ∇J(w, ∆) is linear. Then, ∇J(w, ∆) = ∇J(w)⊤∆
Optimality conditions for convex functions

• Unconstrained minimization (function defined on R^p):
  – Proposition: w is optimal if and only if ∀∆ ∈ R^p, ∇J(w, ∆) ≥ 0
  – Go up locally in all directions

• Reduces to zero gradient for smooth problems

• Constrained minimization (function defined on a convex set K)
  – restrict ∆ to directions so that w + ε∆ ∈ K for small ε
Directional derivatives for ℓ1-norm regularization

• Function J(w) = Σ_{i=1}^n ℓ(y_i, w⊤x_i) + λ‖w‖_1 = L(w) + λ‖w‖_1

• ℓ1-norm: ‖w + ε∆‖_1 − ‖w‖_1 = Σ_{j, w_j ≠ 0} {|w_j + ε∆_j| − |w_j|} + Σ_{j, w_j = 0} |ε∆_j|

• Thus,
    ∇J(w, ∆) = ∇L(w)⊤∆ + λ Σ_{j, w_j ≠ 0} sign(w_j) ∆_j + λ Σ_{j, w_j = 0} |∆_j|
             = Σ_{j, w_j ≠ 0} [∇L(w)_j + λ sign(w_j)] ∆_j + Σ_{j, w_j = 0} [∇L(w)_j ∆_j + λ|∆_j|]

• Separability of optimality conditions
Optimality conditions for ℓ1-norm regularization

• General loss: w optimal if and only if for all j ∈ {1, ..., p},
    sign(w_j) ≠ 0 ⇒ ∇L(w)_j + λ sign(w_j) = 0
    sign(w_j) = 0 ⇒ |∇L(w)_j| ≤ λ

• Square loss: w optimal if and only if for all j ∈ {1, ..., p},
    sign(w_j) ≠ 0 ⇒ −X_j⊤(y − Xw) + λ sign(w_j) = 0
    sign(w_j) = 0 ⇒ |X_j⊤(y − Xw)| ≤ λ
  (a small sketch of this check follows)

  – For J ⊂ {1, ..., p}, X_J ∈ R^{n×|J|} = X(:, J) denotes the columns of X indexed by J, i.e., variables indexed by J
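A minimal sketch (assuming numpy arrays X, y and a candidate w) of the separable square-loss optimality check above:

```python
import numpy as np

# Check the Lasso optimality conditions (square loss) up to a tolerance.
def is_lasso_optimal(w, X, y, lam, tol=1e-8):
    correlation = X.T @ (y - X @ w)        # equals -grad of the smooth part
    active = w != 0
    ok_active = np.all(np.abs(correlation[active] - lam * np.sign(w[active])) <= tol)
    ok_inactive = np.all(np.abs(correlation[~active]) <= lam + tol)
    return ok_active and ok_inactive
```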
First-order methods for convex optimization on R^p
Smooth optimization

• Gradient descent: w_{t+1} = w_t − α_t ∇J(w_t)
  – with line search: search for a decent (not necessarily best) α_t
  – fixed diminishing stepsize, e.g., α_t = a(t + b)^{-1}

• Convergence of f(w_t) to f* = min_{w ∈ R^p} f(w) (Nesterov, 2003)
  – f convex and M-Lipschitz: f(w_t) − f* = O(M/√t)
  – and, differentiable with L-Lipschitz gradient: f(w_t) − f* = O(L/t)
  – and, f µ-strongly convex: f(w_t) − f* = O(L exp(−4t µ/L))

• µ/L = condition number of the optimization problem

• Coordinate descent: similar properties

• NB: "optimal scheme" f(w_t) − f* = O(L min{exp(−4t√(µ/L)), t^{-2}})
First-order methods for convex optimization on R^p
Nonsmooth optimization

• First-order methods for nondifferentiable objective
  – Subgradient descent: w_{t+1} = w_t − α_t g_t, with g_t ∈ ∂J(w_t), i.e., such that ∀∆, g_t⊤∆ ≤ ∇J(w_t, ∆) (sketch below)
    ∗ with exact line search: not always convergent (see counter-example)
    ∗ diminishing stepsize, e.g., α_t = a(t + b)^{-1}: convergent
  – Coordinate descent: not always convergent (see counter-example)

• Convergence rates (f convex and M-Lipschitz): f(w_t) − f* = O(M/√t)
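A minimal sketch of subgradient descent applied to the Lasso objective, with the diminishing stepsize a(t + b)^{-1} mentioned above (numpy only; the choice sign(0) = 0 picks one valid subgradient of |·| at zero):

```python
import numpy as np

# Subgradient descent for J(w) = 0.5*||y - Xw||^2 + lam*||w||_1
def lasso_subgradient_descent(X, y, lam, iters=2000, a=1.0, b=10.0):
    w = np.zeros(X.shape[1])
    for t in range(iters):
        grad_smooth = -X.T @ (y - X @ w)
        g = grad_smooth + lam * np.sign(w)   # one element of the subdifferential
        w -= a / (t + b) * g
    return w
```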
Counter-example
Coordinate descent for nonsmooth objectives

[Figure: level sets of a nonsmooth convex objective in the (w1, w2) plane; coordinate descent gets stuck at a point where neither coordinate direction is a descent direction]
Counter-example (Bertsekas, 1995)
Steepest descent for nonsmooth objectives

• q(x_1, x_2) = −5(9x_1² + 16x_2²)^{1/2}  if x_1 > |x_2|
               −(9x_1 + 16|x_2|)^{1/2}    if x_1 ≤ |x_2|

• Steepest descent starting from any x such that x_1 > |x_2| > (9/16)² |x_1|

[Figure: level sets of q on [−5, 5]²; steepest-descent iterates zigzag and converge to a non-optimal point]
Sparsity-inducing norms
Using the structure of the problem

• Problems of the form
    min_{w ∈ R^p} L(w) + λ‖w‖   or   min_{‖w‖ ≤ µ} L(w)
  – L smooth
  – Orthogonal projections on the ball or the dual ball can be performed in semi-closed form, e.g., ℓ1-norm (Maculan and Galdino de Paula, 1989) or mixed ℓ1-ℓ2 (see, e.g., van den Berg et al., 2009)

• May use similar techniques as smooth optimization
  – Projected gradient descent
  – Proximal methods (Beck and Teboulle, 2009) (sketch below)
  – Dual ascent methods

• Similar convergence rates
  – depends on the condition number of the loss
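A minimal sketch of a proximal method for the Lasso (ISTA, in the spirit of Beck and Teboulle, 2009), assuming numpy arrays X, y; the proximal operator of the ℓ1-norm is soft-thresholding:

```python
import numpy as np

def soft_threshold(v, a):
    # proximal operator of a*||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - a, 0.0)

# ISTA with constant stepsize 1/L, L = largest eigenvalue of X^T X
def ista(X, y, lam, iters=500):
    L = np.linalg.norm(X, 2) ** 2
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (y - X @ w)
        w = soft_threshold(w - grad / L, lam / L)
    return w
```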
Cheap (and not dirty) algorithms for all losses

• Coordinate descent (Fu, 1998; Wu and Lange, 2008; Friedman et al., 2007)
  – convergent here under reasonable assumptions! (Bertsekas, 1995)
  – separability of optimality conditions
  – equivalent to iterative thresholding (sketch below)

• "η-trick" (Micchelli and Pontil, 2006; Rakotomamonjy et al., 2008; Jenatton et al., 2009b)
  – Notice that Σ_{j=1}^p |w_j| = min_{η > 0} (1/2) Σ_{j=1}^p { w_j²/η_j + η_j }
  – Alternating minimization with respect to η (closed form) and w (weighted squared ℓ2-norm regularized problem)

• Dedicated algorithms that use sparsity (active sets and homotopy methods)
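A minimal numpy sketch of coordinate descent for the square-loss Lasso; each coordinate update is exactly a one-dimensional soft-thresholding step, which is the "iterative thresholding" interpretation above:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, sweeps=100):
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # assumed nonzero columns
    residual = y - X @ w
    for _ in range(sweeps):
        for j in range(p):
            residual += X[:, j] * w[j]      # remove coordinate j from the fit
            rho = X[:, j] @ residual
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            residual -= X[:, j] * w[j]      # put it back
    return w
```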
Special case of square loss

• Quadratic programming formulation: minimize
    (1/2)‖y − Xw‖² + λ Σ_{j=1}^p (w_j^+ + w_j^−)   such that   w = w^+ − w^−, w^+ ≥ 0, w^− ≥ 0
  – generic toolboxes ⇒ very slow

• Main property: if the sign pattern s ∈ {−1, 0, 1}^p of the solution is known, the solution can be obtained in closed form
  – Lasso equivalent to minimizing (1/2)‖y − X_J w_J‖² + λ s_J⊤ w_J w.r.t. w_J, where J = {j, s_j ≠ 0}
  – Closed-form solution w_J = (X_J⊤ X_J)^{-1}(X_J⊤ y − λ s_J)

• Algorithm: "guess" s and check optimality conditions
Optimality conditions for the sign vectors (Lasso)

• For s ∈ {−1, 0, 1}^p sign vector, J = {j, s_j ≠ 0} the nonzero pattern

• Potential closed-form solution: w_J = (X_J⊤ X_J)^{-1}(X_J⊤ y − λ s_J) and w_{J^c} = 0

• s is optimal if and only if
  – active variables: sign(w_J) = s_J
  – inactive variables: ‖X_{J^c}⊤(y − X_J w_J)‖_∞ ≤ λ
  (a small sketch of this check follows)

• Active set algorithms (Lee et al., 2007; Roth and Fischer, 2008)
  – Construct J iteratively by adding variables to the active set
  – Only requires to invert small linear systems
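A minimal numpy sketch of the "guess the sign pattern and check" step above (arguments X, y, λ and a candidate sign vector s are assumed given):

```python
import numpy as np

def check_sign_pattern(s, X, y, lam):
    J = np.flatnonzero(s)
    Jc = np.flatnonzero(s == 0)
    XJ = X[:, J]
    wJ = np.linalg.solve(XJ.T @ XJ, XJ.T @ y - lam * s[J])   # closed form
    sign_ok = np.all(np.sign(wJ) == s[J])                    # active condition
    residual = y - XJ @ wJ
    subgrad_ok = np.all(np.abs(X[:, Jc].T @ residual) <= lam)  # inactive condition
    w = np.zeros(X.shape[1]); w[J] = wJ
    return sign_ok and subgrad_ok, w
```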
Homotopy methods for the square loss
(Markowitz, 1956; Osborne et al., 2000; Efron et al., 2004)

• Goal: get all solutions for all possible values of the regularization parameter λ

• Same idea as before: if the sign vector is known,
    w*_J(λ) = (X_J⊤ X_J)^{-1}(X_J⊤ y − λ s_J)
  is valid as long as
  – sign condition: sign(w*_J(λ)) = s_J
  – subgradient condition: ‖X_{J^c}⊤(X_J w*_J(λ) − y)‖_∞ ≤ λ
  – this defines an interval on λ: the path is thus piecewise affine

• Simply need to find breakpoints and directions (see the sketch below)
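A minimal sketch using scikit-learn's LARS implementation of this homotopy idea (an assumption that scikit-learn is available); lars_path returns the breakpoints of the piecewise-affine path and the coefficients at each breakpoint:

```python
import numpy as np
from sklearn.linear_model import lars_path  # assumed available

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 20))
y = X[:, :3] @ np.array([1.0, -2.0, 1.5]) + 0.1 * rng.standard_normal(64)

alphas, active, coefs = lars_path(X, y, method="lasso")
print(alphas.shape, coefs.shape)   # breakpoints, and weights at each breakpoint
```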
Piecewise linear paths

[Figure: Lasso regularization path — each weight as a piecewise-linear function of the regularization parameter (x-axis: regularization parameter from 0 to 0.6, y-axis: weights from −0.6 to 0.6)]
Algorithms for ℓ1-norms (square loss):
Gaussian hare vs. Laplacian tortoise

• Coordinate descent: O(pn) per iteration for ℓ1 and ℓ2

• "Exact" algorithms: O(kpn) for ℓ1 vs. O(p²n) for ℓ2
Additional methods - Softwares

• Many contributions in signal processing, optimization, machine learning
  – Proximal methods (Nesterov, 2007; Beck and Teboulle, 2009)
  – Extensions to stochastic setting (Bottou and Bousquet, 2008)

• Extensions to other sparsity-inducing norms

• Softwares
  – Many available codes
  – SPAMS (SPArse Modeling Software) - note the difference with SpAM (Ravikumar et al., 2008)
    http://www.di.ens.fr/willow/SPAMS/
Sparse methods for machine learning
Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)
Theoretical results - Square loss

• Main assumption: data generated from a certain sparse w

• Three main problems:
  1. Regular consistency: convergence of estimator ŵ to w, i.e., ‖ŵ − w‖ tends to zero when n tends to ∞
  2. Model selection consistency: convergence of the sparsity pattern of ŵ to the pattern of w
  3. Efficiency: convergence of predictions with ŵ to the predictions with w, i.e., (1/n)‖Xŵ − Xw‖_2² tends to zero

• Main results:
  – Condition for model consistency (support recovery)
  – High-dimensional inference
Model selection consistency (Lasso)

• Assume w sparse and denote J = {j, w_j ≠ 0} the nonzero pattern

• Support recovery condition (Zhao and Yu, 2006; Wainwright, 2009; Zou, 2006; Yuan and Lin, 2007): the Lasso is sign-consistent if and only if
    ‖Q_{J^c J} Q_{JJ}^{-1} sign(w_J)‖_∞ ≤ 1
  where Q = lim_{n→+∞} (1/n) Σ_{i=1}^n x_i x_i⊤ ∈ R^{p×p} (covariance matrix)
  (a small sketch for evaluating this quantity follows)

• Condition depends on w and J (may be relaxed)
  – may be relaxed by maximizing out sign(w) or J

• Valid in low- and high-dimensional settings

• Requires a lower bound on the magnitude of nonzero w_j

• The Lasso is usually not model-consistent
  – Selects more variables than necessary (see, e.g., Lv and Fan, 2009)
  – Fixing the Lasso: adaptive Lasso (Zou, 2006), relaxed Lasso (Meinshausen, 2008), thresholding (Lounici, 2008), Bolasso (Bach, 2008a), stability selection (Meinshausen and Buhlmann, 2008), Wasserman and Roeder (2009)
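A minimal numpy sketch that evaluates the support-recovery ("irrepresentable") quantity for a finite-sample design X and a sparse w, using the empirical Q = X⊤X/n in place of the population limit:

```python
import numpy as np

def irrepresentable_value(X, w):
    n = X.shape[0]
    Q = X.T @ X / n
    J = np.flatnonzero(w)
    Jc = np.flatnonzero(w == 0)
    v = Q[np.ix_(Jc, J)] @ np.linalg.solve(Q[np.ix_(J, J)], np.sign(w[J]))
    return np.max(np.abs(v))   # sign-consistency requires this value <= 1
```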
Adaptive Lasso and concave penalization

• Adaptive Lasso (Zou, 2006; Huang et al., 2008)
  – Weighted ℓ1-norm: min_{w ∈ R^p} L(w) + λ Σ_{j=1}^p |w_j| / |ŵ_j|^α
  – ŵ estimator obtained from ℓ2 or ℓ1 regularization

• Reformulation in terms of concave penalization
    min_{w ∈ R^p} L(w) + Σ_{j=1}^p g(|w_j|)
  – Example: g(|w_j|) = |w_j|^{1/2} or log |w_j|. Closer to the ℓ0 penalty
  – Concave-convex procedure: replace g(|w_j|) by an affine upper bound
  – Better sparsity-inducing properties (Fan and Li, 2001; Zou and Li, 2008; Zhang, 2008b)
Bolasso (Bach, 2008a)

• Property: for a specific choice of regularization parameter λ ≈ √n:
  – all variables in J are always selected with high probability
  – all other ones selected with probability in (0, 1)

• Use the bootstrap to simulate several replications (sketch below)
  – Intersecting supports of variables
  – Final estimation of w on the entire dataset

[Figure: supports J1, ..., J5 estimated on bootstrap replications 1–5 and their intersection J]
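A minimal sketch of the Bolasso idea (assuming scikit-learn is available): run the Lasso on bootstrap replications, intersect the selected supports, then refit on that support:

```python
import numpy as np
from sklearn.linear_model import Lasso  # assumed available

def bolasso_support(X, y, alpha, n_bootstrap=32, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    support = np.ones(p, dtype=bool)
    for _ in range(n_bootstrap):
        idx = rng.integers(0, n, size=n)                    # bootstrap sample
        coef = Lasso(alpha=alpha, fit_intercept=False).fit(X[idx], y[idx]).coef_
        support &= (coef != 0)                              # intersect supports
    return np.flatnonzero(support)   # final unregularized fit restricted to this set
```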
Model selection consistency of the Lasso/Bolasso

• Probabilities of selection of each variable vs. regularization parameter µ

[Figure: selection probability maps (variable index vs. −log(µ)) for the Lasso (top row) and the Bolasso (bottom row), when the support recovery condition is satisfied (left column) and not satisfied (right column)]
High-dimensional inference
Going beyond exact support recovery

• Theoretical results usually assume that non-zero w_j are large enough, i.e., |w_j| ≥ σ√(log p / n)

• May include too many variables but still predict well

• Oracle inequalities
  – Predict as well as the estimator obtained with the knowledge of J
  – Assume i.i.d. Gaussian noise with variance σ²
  – We have: (1/n) E‖Xŵ_oracle − Xw‖_2² = σ²|J|/n
High-dimensional inference
Variable selection without computational limits

• Approaches based on penalized criteria (close to BIC)
    min_{J ⊂ {1,...,p}} { min_{w_J ∈ R^{|J|}} ‖y − X_J w_J‖_2² } + C σ² |J| (1 + log(p/|J|))

• Oracle inequality if data generated by w with k non-zeros (Massart, 2003; Bunea et al., 2007):
    (1/n)‖Xŵ − Xw‖_2² ≤ C (kσ²/n)(1 + log(p/k))

• Gaussian noise - No assumptions regarding correlations

• Scaling between dimensions: (k log p)/n small

• Optimal in the minimax sense
High-dimensional inference
Variable selection with orthogonal design

• Orthogonal design: assume that (1/n) X⊤X = I

• Lasso is equivalent to soft-thresholding (1/n) X⊤Y ∈ R^p
  – Solution: w_j = soft-thresholding of (1/n) X_j⊤ y = w_j + (1/n) X_j⊤ ε at λ/n
  – Soft-thresholding: min_{w ∈ R} (1/2)w² − wt + a|w| has solution w = (|t| − a)_+ sign(t)

[Figure: the soft-thresholding function t ↦ (|t| − a)_+ sign(t), flat on [−a, a]]

  – Take λ = Aσ√(n log p)

• Where does the log p = O(n) come from?
  – Expectation of the maximum of p Gaussian variables ≈ √(log p)
  – Union bound:
      P(∃ j ∈ J^c, |X_j⊤ ε| > λ) ≤ Σ_{j ∈ J^c} P(|X_j⊤ ε| > λ)
                                 ≤ |J^c| e^{−λ²/(2nσ²)} ≤ p e^{−(A²/2) log p} = p^{1 − A²/2}
High-dimensional inference (Lasso)

• Main result: we only need k log p = O(n)
  – if w is sufficiently sparse
  – and input variables are not too correlated

• Precise conditions on the covariance matrix Q = (1/n) X⊤X:
  – Mutual incoherence (Lounici, 2008)
  – Restricted eigenvalue conditions (Bickel et al., 2009)
  – Sparse eigenvalues (Meinshausen and Yu, 2008)
  – Null space property (Donoho and Tanner, 2005)

• Links with signal processing and compressed sensing (Candes and Wakin, 2008)

• Assume that Q has unit diagonal
Mutual incoherence (uniform low correlations)

• Theorem (Lounici, 2008):
  – y_i = w⊤x_i + ε_i, ε i.i.d. normal with mean zero and variance σ²
  – Q = X⊤X/n with unit diagonal and cross-terms less than 1/(14k)
  – if ‖w‖_0 ≤ k, and A² ≥ 8, then, with λ = Aσ√(n log p),
      P( ‖ŵ − w‖_∞ ≤ 5Aσ (log p / n)^{1/2} ) ≥ 1 − p^{1 − A²/8}

• Model consistency by thresholding if min_{j, w_j ≠ 0} |w_j| > Cσ√(log p / n)

• Mutual incoherence condition depends strongly on k

• Improved result by averaging over sparsity patterns (Candes and Plan, 2009b)
Restricted eigenvalue conditions

• Theorem (Bickel et al., 2009):
  – assume κ(k)² = min_{|J| ≤ k} min_{∆, ‖∆_{J^c}‖_1 ≤ ‖∆_J‖_1} (∆⊤Q∆) / ‖∆_J‖_2² > 0
  – assume λ = Aσ√(n log p) and A² ≥ 8
  – then, with probability 1 − p^{1 − A²/8}, we have
      estimation error:  ‖ŵ − w‖_1 ≤ (16A / κ²(k)) σ k √(log p / n)
      prediction error:  (1/n)‖Xŵ − Xw‖_2² ≤ (16A² / κ²(k)) σ² (k/n) log p

• Condition imposes a potentially hidden scaling between (n, p, k)

• Condition always satisfied for Q = I
Checking sufficient conditions

• Most of the conditions are not computable in polynomial time

• Random matrices
  – Sample X ∈ R^{n×p} from the Gaussian ensemble
  – Conditions satisfied with high probability for certain (n, p, k)
  – Example from Wainwright (2009): n ≥ C k log p

• Checking with convex optimization
  – Relax conditions to convex optimization problems (d'Aspremont et al., 2008; Juditsky and Nemirovski, 2008; d'Aspremont and El Ghaoui, 2008)
  – Example: sparse eigenvalues min_{|J| ≤ k} λ_min(Q_JJ)
  – Open problem: verifiable assumptions still lead to weaker results
Sparse methods
Common extensions

• Removing bias of the estimator
  – Keep the active set, and perform unregularized restricted estimation (Candes and Tao, 2007)
  – Better theoretical bounds
  – Potential problems of robustness

• Elastic net (Zou and Hastie, 2005)
  – Replace λ‖w‖_1 by λ‖w‖_1 + ε‖w‖_2²
  – Makes the optimization strongly convex with unique solution
  – Better behavior with heavily correlated variables
Relevance of theoretical results

• Most results only for the square loss
  – Extend to other losses (Van De Geer, 2008; Bach, 2009b)

• Most results only for ℓ1-regularization
  – May be extended to other norms (see, e.g., Huang and Zhang, 2009; Bach, 2008b)

• Condition on correlations
  – very restrictive, far from results for the BIC penalty

• Non-sparse generating vector
  – little work on robustness to lack of sparsity

• Estimation of regularization parameter
  – No satisfactory solution ⇒ open problem
Alternative sparse methods
Greedy methods

• Forward selection (sketch below)

• Forward-backward selection

• Non-convex method
  – Harder to analyze
  – Simpler to implement
  – Problems of stability

• Positive theoretical results (Zhang, 2009, 2008a)
  – Similar sufficient conditions as for the Lasso
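A minimal numpy sketch of forward greedy selection for the square loss: at each step, add the variable that most reduces the residual sum of squares of the restricted least-squares fit:

```python
import numpy as np

def forward_selection(X, y, k):
    n, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            r = y - X[:, cols] @ w
            if r @ r < best_rss:
                best_j, best_rss = j, r @ r
        selected.append(best_j)
    return selected
```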
Alternative sparse methods
Bayesian methods

• Lasso: minimize Σ_{i=1}^n (y_i − w⊤x_i)² + λ‖w‖_1
  – Equivalent to MAP estimation with Gaussian likelihood and factorized Laplace prior p(w) ∝ ∏_{j=1}^p e^{−λ|w_j|} (Seeger, 2008)
  – However, the posterior puts zero weight on exact zeros

• Heavy-tailed distributions as a proxy to sparsity
  – Student distributions (Caron and Doucet, 2008)
  – Generalized hyperbolic priors (Archambeau and Bach, 2008)
  – Instance of automatic relevance determination (Neal, 1996)

• Mixtures of "Diracs" and another absolutely continuous distribution, e.g., "spike and slab" (Ishwaran and Rao, 2005)

• Less theory than frequentist methods
Comparing Lasso and other strategies for linear regression

• Compared methods to reach the least-squares solution
  – Ridge regression: min_{w ∈ R^p} (1/2)‖y − Xw‖_2² + (λ/2)‖w‖_2²
  – Lasso: min_{w ∈ R^p} (1/2)‖y − Xw‖_2² + λ‖w‖_1
  – Forward greedy:
    ∗ Initialization with empty set
    ∗ Sequentially add the variable that best reduces the square loss

• Each method builds a path of solutions from 0 to the ordinary least-squares solution

• Regularization parameters selected on the test set
Simulation results

• i.i.d. Gaussian design matrix, k = 4, n = 64, p ∈ [2, 256], SNR = 1

• Note stability to non-sparsity and variability

[Figure: mean square error vs. log_2(p) for L1, L2, greedy and the oracle, for a sparse generating vector (left) and a rotated, non-sparse one (right)]
Summary
ℓ1-norm regularization

• ℓ1-norm regularization leads to nonsmooth optimization problems
  – analysis through directional derivatives or subgradients
  – optimization may or may not take advantage of sparsity

• ℓ1-norm regularization allows high-dimensional inference

• Interesting problems for ℓ1-regularization
  – Stable variable selection
  – Weaker sufficient conditions (for weaker results)
  – Estimation of regularization parameter (all bounds depend on the unknown noise variance σ²)
Extensions

• Sparse methods are not limited to the square loss
  – e.g., theoretical results for logistic loss (Van De Geer, 2008; Bach, 2009b)

• Sparse methods are not limited to supervised learning
  – Learning the structure of Gaussian graphical models (Meinshausen and Buhlmann, 2006; Banerjee et al., 2008)
  – Sparsity on matrices (last part of the tutorial)

• Sparse methods are not limited to variable selection in a linear model
  – See next part of the tutorial
Questions?
Sparse methods for machine learning
Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)
Penalization with grouped variables
(Yuan and Lin, 2006)

• Assume that {1, ..., p} is partitioned into m groups G_1, ..., G_m

• Penalization by Σ_{i=1}^m ‖w_{G_i}‖_2, often called the ℓ1-ℓ2 norm (sketch below)

• Induces group sparsity
  – Some groups entirely set to zero
  – no zeros within groups

• In this tutorial:
  – Groups may have infinite size ⇒ MKL
  – Groups may overlap ⇒ structured sparsity
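A minimal numpy sketch of the ℓ1-ℓ2 penalty for a partition, together with its proximal operator (group-wise soft-thresholding), which sets whole groups to zero but creates no zeros inside a kept group:

```python
import numpy as np

def group_l1_l2(w, groups):
    # groups: list of index arrays forming a partition of {0, ..., p-1}
    return sum(np.linalg.norm(w[g]) for g in groups)

def prox_group_l1_l2(w, groups, a):
    out = w.copy()
    for g in groups:
        norm_g = np.linalg.norm(w[g])
        out[g] = 0.0 if norm_g <= a else (1 - a / norm_g) * w[g]
    return out

# example partition of 6 variables into 3 groups
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
```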
Linear vs. non-linear methods

• All methods in this tutorial are linear in the parameters

• By replacing x by features Φ(x), they can be made nonlinear in the data

• Implicit vs. explicit features
  – ℓ1-norm: explicit features
  – ℓ2-norm: representer theorem allows to consider implicit features if their dot products can be computed easily (kernel methods)
Kernel methods: regularization by ℓ2-norm

• Data: x_i ∈ X, y_i ∈ Y, i = 1, ..., n, with features Φ(x) ∈ F = R^p
  – Predictor f(x) = w⊤Φ(x) linear in the features

• Optimization problem:
    min_{w ∈ R^p} Σ_{i=1}^n ℓ(y_i, w⊤Φ(x_i)) + (λ/2)‖w‖_2²

• Representer theorem (Kimeldorf and Wahba, 1971): the solution must be of the form w = Σ_{i=1}^n α_i Φ(x_i)
  – Equivalent to solving: min_{α ∈ R^n} Σ_{i=1}^n ℓ(y_i, (Kα)_i) + (λ/2) α⊤Kα
  – Kernel matrix K_ij = k(x_i, x_j) = Φ(x_i)⊤Φ(x_j)
  (a small sketch for the square loss follows)
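A minimal sketch of the representer theorem in action for the square loss (plain numpy, RBF kernel); with K_ij = k(x_i, x_j), a standard closed-form choice of dual variables is α = (K + λI)^{-1} y (kernel ridge regression):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)   # alpha

def kernel_ridge_predict(alpha, X_train, X_test, gamma=1.0):
    return rbf_kernel(X_test, X_train, gamma) @ alpha     # f(x) = sum_i alpha_i k(x_i, x)
```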
Multiple kernel learning (MKL)
(Lanckriet et al., 2004b; Bach et al., 2004a)

• Sparse methods are linear!

• Sparsity with non-linearities
  – replace f(x) = Σ_{j=1}^p w_j⊤ x_j with x ∈ R^p and w_j ∈ R
  – by f(x) = Σ_{j=1}^p w_j⊤ Φ_j(x) with x ∈ X, Φ_j(x) ∈ F_j and w_j ∈ F_j

• Replace the ℓ1-norm Σ_{j=1}^p |w_j| by the "block" ℓ1-norm Σ_{j=1}^p ‖w_j‖_2

• Remarks
  – Hilbert space extension of the group Lasso (Yuan and Lin, 2006)
  – Alternative sparsity-inducing norms (Ravikumar et al., 2008)
Multiple kernel learning (MKL)
(Lanckriet et al., 2004b; Bach et al., 2004a)

• Multiple feature maps / kernels on x ∈ X:
  – p "feature maps" Φ_j: X ↦ F_j, j = 1, ..., p
  – Minimization with respect to w_1 ∈ F_1, ..., w_p ∈ F_p
  – Predictor: f(x) = w_1⊤Φ_1(x) + ··· + w_p⊤Φ_p(x)

    x → Φ_1(x)⊤w_1, ..., Φ_j(x)⊤w_j, ..., Φ_p(x)⊤w_p → w_1⊤Φ_1(x) + ··· + w_p⊤Φ_p(x)

  – Generalized additive models (Hastie and Tibshirani, 1990)
Regularization for multiple features

    x → Φ_1(x)⊤w_1, ..., Φ_j(x)⊤w_j, ..., Φ_p(x)⊤w_p → w_1⊤Φ_1(x) + ··· + w_p⊤Φ_p(x)

• Regularization by Σ_{j=1}^p ‖w_j‖_2² is equivalent to using K = Σ_{j=1}^p K_j
  – Summing kernels is equivalent to concatenating feature spaces (numerical check below)

• Regularization by Σ_{j=1}^p ‖w_j‖_2 imposes sparsity at the group level

• Main questions when regularizing by the block ℓ1-norm:
  1. Algorithms
  2. Analysis of sparsity-inducing properties (Ravikumar et al., 2008; Bach, 2008b)
  3. Does it correspond to a specific combination of kernels?
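A minimal numerical check (numpy) that summing kernels corresponds to concatenating feature spaces: K_1 + K_2 equals the Gram matrix of the stacked features [Φ_1, Φ_2]:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi1 = rng.standard_normal((10, 3))
Phi2 = rng.standard_normal((10, 5))

K_sum = Phi1 @ Phi1.T + Phi2 @ Phi2.T
Phi_concat = np.hstack([Phi1, Phi2])
K_concat = Phi_concat @ Phi_concat.T
print(np.allclose(K_sum, K_concat))   # True
```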
General kernel learning

• Proposition (Lanckriet et al., 2004; Bach et al., 2005; Micchelli and Pontil, 2005):
    G(K) = min_{w ∈ F} Σ_{i=1}^n ℓ(y_i, w⊤Φ(x_i)) + (λ/2)‖w‖_2²
         = max_{α ∈ R^n} −Σ_{i=1}^n ℓ_i*(λα_i) − (λ/2) α⊤Kα
  is a convex function of the kernel matrix K

• Theoretical learning bounds (Lanckriet et al., 2004; Srebro and Ben-David, 2006)
  – Less assumptions than sparsity-based bounds, but slower rates
Equivalence with kernel learning (Bach et al., 2004a)

• Block ℓ1-norm problem:
    Σ_{i=1}^n ℓ(y_i, w_1⊤Φ_1(x_i) + ··· + w_p⊤Φ_p(x_i)) + (λ/2)(‖w_1‖_2 + ··· + ‖w_p‖_2)²

• Proposition: block ℓ1-norm regularization is equivalent to minimizing with respect to η the optimal value G(Σ_{j=1}^p η_j K_j)

• (sparse) weights η obtained from optimality conditions

• dual parameters α optimal for K = Σ_{j=1}^p η_j K_j

• Single optimization problem for learning both η and α
Proof of equivalence

min_{w_1,...,w_p} Σ_{i=1}^n ℓ(y_i, Σ_{j=1}^p w_j⊤Φ_j(x_i)) + λ (Σ_{j=1}^p ‖w_j‖_2)²
  = min_{w_1,...,w_p} min_{Σ_j η_j = 1} Σ_{i=1}^n ℓ(y_i, Σ_{j=1}^p w_j⊤Φ_j(x_i)) + λ Σ_{j=1}^p ‖w_j‖_2² / η_j
  = min_{Σ_j η_j = 1} min_{w̃_1,...,w̃_p} Σ_{i=1}^n ℓ(y_i, Σ_{j=1}^p η_j^{1/2} w̃_j⊤Φ_j(x_i)) + λ Σ_{j=1}^p ‖w̃_j‖_2²   with w̃_j = w_j η_j^{-1/2}
  = min_{Σ_j η_j = 1} min_w Σ_{i=1}^n ℓ(y_i, w⊤Ψ_η(x_i)) + λ‖w‖_2²

with Ψ_η(x) = (η_1^{1/2} Φ_1(x), ..., η_p^{1/2} Φ_p(x))

• We have: Ψ_η(x)⊤Ψ_η(x′) = Σ_{j=1}^p η_j k_j(x, x′) with Σ_{j=1}^p η_j = 1 (and η ≥ 0)
Algorithms for the group Lasso / MKL

• Group Lasso
  – Block coordinate descent (Yuan and Lin, 2006)
  – Active set method (Roth and Fischer, 2008; Obozinski et al., 2009)
  – Nesterov's accelerated method (Liu et al., 2009)

• MKL
  – Dual ascent, e.g., sequential minimal optimization (Bach et al., 2004a)
  – η-trick + cutting planes (Sonnenburg et al., 2006)
  – η-trick + projected gradient descent (Rakotomamonjy et al., 2008) (a small η-trick sketch follows)
  – Active set (Bach, 2008c)
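A minimal sketch of the η-trick with explicit features and the square loss, for the squared block ℓ1-norm formulation used in the equivalence proof, i.e. (1/2)‖y − Xw‖² + λ(Σ_j ‖w_{G_j}‖)²: alternate between a weighted ridge problem in w (closed form) and the closed-form η update (η_j ∝ ‖w_{G_j}‖, Σ_j η_j = 1); the ε guard against division by zero is an added numerical convenience:

```python
import numpy as np

def group_lasso_eta_trick(X, y, groups, lam, iters=50, eps=1e-8):
    p = X.shape[1]
    eta = np.ones(len(groups)) / len(groups)
    w = np.zeros(p)
    for _ in range(iters):
        # weighted ridge step: penalty lam * sum_j ||w_Gj||^2 / eta_j
        d = np.zeros(p)
        for j, g in enumerate(groups):
            d[g] = 2.0 * lam / (eta[j] + eps)
        w = np.linalg.solve(X.T @ X + np.diag(d), X.T @ y)
        # closed-form eta step
        norms = np.array([np.linalg.norm(w[g]) for g in groups])
        eta = norms / (norms.sum() + eps)
    return w, eta
```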
Applications of multiple kernel learning

• Selection of hyperparameters for kernel methods

• Fusion from heterogeneous data sources (Lanckriet et al., 2004a)

• Two strategies for kernel combinations:
  – Uniform combination ⇔ ℓ2-norm
  – Sparse combination ⇔ ℓ1-norm
  – MKL always leads to more interpretable models
  – MKL does not always lead to better predictive performance
    ∗ In particular, with few well-designed kernels
    ∗ Be careful with normalization of kernels (Bach et al., 2004b)

• Sparse methods: new possibilities and new features

• See NIPS 2009 workshop "Understanding MKL methods"
Sparse methods for machine learning
Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)
Lasso - Two main recent theoretical results

1. Support recovery condition

2. Exponentially many irrelevant variables: under appropriate assumptions, consistency is possible as long as log p = O(n)

• Question: is it possible to build a sparse algorithm that can learn from more than 10^80 features?
  – Some type of recursivity/factorization is needed!
Hierarchical kernel learning (Bach, 2008c)

• Many kernels can be decomposed as a sum of many "small" kernels indexed by a certain set V:
    k(x, x′) = Σ_{v ∈ V} k_v(x, x′)

• Example with x = (x_1, ..., x_q) ∈ R^q (⇒ nonlinear variable selection)
  – Gaussian/ANOVA kernels: p = #(V) = 2^q
      ∏_{j=1}^q (1 + e^{−α(x_j − x_j′)²}) = Σ_{J ⊂ {1,...,q}} ∏_{j ∈ J} e^{−α(x_j − x_j′)²} = Σ_{J ⊂ {1,...,q}} e^{−α‖x_J − x_J′‖_2²}
  – NB: the decomposition is related to Cosso (Lin and Zhang, 2006)

• Goal: learning a sparse combination Σ_{v ∈ V} η_v k_v(x, x′)

• Universally consistent non-linear variable selection requires all subsets
Restricting the set of active kernels

• With flat structure
  – Consider the block ℓ1-norm: Σ_{v ∈ V} d_v ‖w_v‖_2
  – cannot avoid being linear in p = #(V) = 2^q

• Using the structure of the small kernels
  1. for computational reasons
  2. to allow more irrelevant variables
Restricting the set of active kernels

• V is endowed with a directed acyclic graph (DAG) structure: select a kernel only after all of its ancestors have been selected

• Gaussian kernels: V = power set of {1, ..., q} with inclusion DAG
  – Select a subset only after all its subsets have been selected

[Figure: DAG of the subsets of {1, 2, 3, 4} ordered by inclusion, from singletons down to 1234]
DAG-adapted norm (Zhao & Yu, 2008)

• Graph-based structured regularization
  – D(v) is the set of descendants of v ∈ V:
      Σ_{v ∈ V} d_v ‖w_{D(v)}‖_2 = Σ_{v ∈ V} d_v ( Σ_{t ∈ D(v)} ‖w_t‖_2² )^{1/2}

• Main property: if v is selected, so are all its ancestors

[Figure: DAG of subsets of {1, 2, 3, 4} with a selected nonzero pattern (a subset together with all of its ancestors)]

• Hierarchical kernel learning (Bach, 2008c):
  – polynomial-time algorithm for this norm
  – necessary/sufficient conditions for consistent kernel selection
  – Scaling between p, q, n for consistency
  – Applications to variable selection or other kernels
Scaling between p, n and other graph-related quantities

    n = number of observations
    p = number of vertices in the DAG
    deg(V) = maximum out-degree in the DAG
    num(V) = number of connected components in the DAG

• Proposition (Bach, 2009a): assume the consistency condition is satisfied, Gaussian noise and data generated from a sparse function; then the support is recovered with high probability as soon as:
    log deg(V) + log num(V) = O(n)

• Unstructured case: num(V) = p ⇒ log p = O(n)

• Power set of q elements: deg(V) = q ⇒ log q = log log p = O(n)
Mean-square errors (regression)

dataset         n     p   k     #(V)     L2         greedy       MKL         HKL
abalone         4177  10  pol4  ≈10^7    44.2±1.3   43.9±1.4     44.5±1.1    43.3±1.0
abalone         4177  10  rbf   ≈10^10   43.0±0.9   45.0±1.7     43.7±1.0    43.0±1.1
boston          506   13  pol4  ≈10^9    17.1±3.6   24.7±10.8    22.2±2.2    18.1±3.8
boston          506   13  rbf   ≈10^12   16.4±4.0   32.4±8.2     20.7±2.1    17.1±4.7
pumadyn-32fh    8192  32  pol4  ≈10^22   57.3±0.7   56.4±0.8     56.4±0.7    56.4±0.8
pumadyn-32fh    8192  32  rbf   ≈10^31   57.7±0.6   72.2±22.5    56.5±0.8    55.7±0.7
pumadyn-32fm    8192  32  pol4  ≈10^22   6.9±0.1    6.4±1.6      7.0±0.1     3.1±0.0
pumadyn-32fm    8192  32  rbf   ≈10^31   5.0±0.1    46.2±51.6    7.1±0.1     3.4±0.0
pumadyn-32nh    8192  32  pol4  ≈10^22   84.2±1.3   73.3±25.4    83.6±1.3    36.7±0.4
pumadyn-32nh    8192  32  rbf   ≈10^31   56.5±1.1   81.3±25.0    83.7±1.3    35.5±0.5
pumadyn-32nm    8192  32  pol4  ≈10^22   60.1±1.9   69.9±32.8    77.5±0.9    5.5±0.1
pumadyn-32nm    8192  32  rbf   ≈10^31   15.7±0.4   67.3±42.4    77.6±0.9    7.2±0.1
Extensions to other kernels

• Extension to graph kernels, string kernels, pyramid match kernels

[Figure: DAG of substrings over the alphabet {A, B}, from single letters to longer strings]

• Exploring large feature spaces with structured sparsity-inducing norms
  – Opposite view than traditional kernel methods
  – Interpretable models

• Other structures than hierarchies or DAGs
Grouped variables

• Supervised learning with known groups:
  – The ℓ1-ℓ2 norm
      Σ_{G ∈ G} ‖w_G‖_2 = Σ_{G ∈ G} ( Σ_{j ∈ G} w_j² )^{1/2},
    with G a partition of {1, ..., p}
  – The ℓ1-ℓ2 norm sets to zero non-overlapping groups of variables (as opposed to single variables for the ℓ1-norm)

• However, the ℓ1-ℓ2 norm encodes fixed/static prior information; it requires to know in advance how to group the variables

• What happens if the set of groups G is not a partition anymore?
Structured Sparsity (Jenatton et al., 2009a)

• When penalizing by the ℓ1-ℓ2 norm Σ_{G ∈ G} ‖w_G‖_2 = Σ_{G ∈ G} ( Σ_{j ∈ G} w_j² )^{1/2}
  – The ℓ1 norm induces sparsity at the group level:
    ∗ Some w_G's are set to zero
  – Inside the groups, the ℓ2 norm does not promote sparsity

• Intuitively, the zero pattern of w is given by
    {j ∈ {1, ..., p}; w_j = 0} = ⋃_{G ∈ G′} G   for some G′ ⊆ G

• This intuition is actually true and can be formalized
Examples of sets of groups G (1/3)

• Selection of contiguous patterns on a sequence, p = 6
  – G is the set of blue groups
  – Any union of blue groups set to zero leads to the selection of a contiguous pattern
Examples of sets of groups G (2/3)

• Selection of rectangles on a 2-D grid, p = 25
  – G is the set of blue/green groups (with their complements, not displayed)
  – Any union of blue/green groups set to zero leads to the selection of a rectangle
Examples of sets of groups G (3/3)

• Selection of diamond-shaped patterns on a 2-D grid, p = 25
  – It is possible to extend such settings to 3-D space, or more complex topologies
  – See applications later (sparse PCA)
Relationship between G and Zero Patterns
(Jenatton, Audibert, and Bach, 2009a)

• G → Zero patterns:
  – by generating the union-closure of G

• Zero patterns → G:
  – Design groups G from any union-closed set of zero patterns
  – Design groups G from any intersection-closed set of non-zero patterns
Overview of other work on structured sparsity

• Specific hierarchical structure (Zhao et al., 2009; Bach, 2008c)

• Union-closed (as opposed to intersection-closed) family of nonzero patterns (Jacob et al., 2009; Baraniuk et al., 2008)

• Nonconvex penalties based on information-theoretic criteria with greedy optimization (Huang et al., 2009)
Sparse methods for machine learning
Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)
Learning on matrices - Collaborative Filtering (CF)

• Given n_X "movies" x ∈ X and n_Y "customers" y ∈ Y,

• predict the "rating" z(x, y) ∈ Z of customer y for movie x

• Training data: large n_X × n_Y incomplete matrix Z that describes the known ratings of some customers for some movies

• Goal: complete the matrix.
Learning on matrices - Multi-task learning

• k prediction tasks on the same covariates x ∈ R^p
  – k weight vectors w_j ∈ R^p
  – Joint matrix of predictors W = (w_1, ..., w_k) ∈ R^{p×k}

• Many applications
  – "transfer learning"
  – Multi-category classification (one task per class) (Amit et al., 2007)

• Share parameters between various tasks
  – similar to fixed effect/random effect models (Raudenbush and Bryk, 2002)
  – joint variable or feature selection (Obozinski et al., 2009; Pontil et al., 2007)
Learning on matrices - Image denoising

• Simultaneously denoise all patches of a given image

• Example from Mairal, Bach, Ponce, Sapiro, and Zisserman (2009c)
Two types of sparsity for matrices M ∈ R^{n×p}
I - Directly on the elements of M

• Many zero elements: M_ij = 0

• Many zero rows (or columns): (M_i1, ..., M_ip) = 0
Two types of sparsity for matrices M ∈ R^{n×p}
II - Through a factorization of M = UV⊤

• M = UV⊤, U ∈ R^{n×m} and V ∈ R^{p×m}

• Low rank: m small

• Sparse decomposition: U sparse
Structured matrix factorizations - Many instances

• M = UV⊤, U ∈ R^{n×m} and V ∈ R^{p×m}

• Structure on U and/or V
  – Low-rank: U and V have few columns
  – Dictionary learning / sparse PCA: U or V has many zeros
  – Clustering (k-means): U ∈ {0, 1}^{n×m}, U1 = 1
  – Pointwise positivity: non-negative matrix factorization (NMF)
  – Specific patterns of zeros
  – etc.

• Many applications
  – e.g., source separation (Fevotte et al., 2009), exploratory data analysis
Multi-task learning

• Joint matrix of predictors W = (w_1, ..., w_k) ∈ R^{p×k}

• Joint variable selection (Obozinski et al., 2009)
  – Penalize by the sum of the norms of the rows of W (group Lasso)
  – Select variables which are predictive for all tasks

• Joint feature selection (Pontil et al., 2007)
  – Penalize by the trace-norm (see later)
  – Construct linear features common to all tasks

• Theory: allows a number of observations which is sublinear in the number of tasks (Obozinski et al., 2008; Lounici et al., 2009)

• Practice: more interpretable models, slightly improved performance
Low-rank matrix factorizations
Trace norm

• Given a matrix M ∈ R^{n×p}
  – Rank of M is the minimum size m of all factorizations of M into M = UV⊤, U ∈ R^{n×m} and V ∈ R^{p×m}
  – Singular value decomposition: M = U Diag(s) V⊤ where U and V have orthonormal columns and s ∈ R_+^m are singular values

• Rank of M is equal to the number of non-zero singular values

• Trace-norm (a.k.a. nuclear norm) = sum of singular values (sketch below)

• Convex function, leads to a semi-definite program (Fazel et al., 2001)

• First used for collaborative filtering (Srebro et al., 2005)
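A minimal numpy sketch of the trace norm and of its proximal operator (soft-thresholding of the singular values), the basic building block of first-order methods for trace-norm-regularized problems:

```python
import numpy as np

def trace_norm(M):
    return np.linalg.svd(M, compute_uv=False).sum()

def prox_trace_norm(M, a):
    # prox of a*||.||_* : shrink each singular value by a
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - a, 0.0)) @ Vt
```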
Results for the trace norm

• Rank recovery condition (Bach, 2008d)
  – The Hessian of the loss around the asymptotic solution should be close to diagonal

• Sufficient condition for exact rank minimization (Recht et al., 2009)

• High-dimensional inference for noisy matrix completion (Srebro et al., 2005; Candes and Plan, 2009a)
  – May recover the entire matrix from slightly more entries than the minimum of the two dimensions

• Efficient algorithms:
  – First-order methods based on the singular value decomposition (see, e.g., Mazumder et al., 2009)
  – Low-rank formulations (Rennie and Srebro, 2005; Abernethy et al., 2009)
Spectral regularizations

• Extensions to any functions of singular values

• Extensions to bilinear forms (Abernethy et al., 2009)
    (x, y) ↦ Φ(x)⊤ B Ψ(y)
  on features Φ(x) ∈ R^{f_X} and Ψ(y) ∈ R^{f_Y}, and B ∈ R^{f_X × f_Y}

• Collaborative filtering with attributes

• Representer theorem: the solution must be of the form
    B = Σ_{i=1}^{n_X} Σ_{j=1}^{n_Y} α_ij Φ(x_i) Ψ(y_j)⊤

• Only norms invariant by orthogonal transforms (Argyriou et al., 2009)
Sparse principal component analysis

• Given data matrix X = (x_1⊤, ..., x_n⊤)⊤ ∈ R^{n×p}, principal component analysis (PCA) may be seen from two perspectives:
  – Analysis view: find the projection v ∈ R^p of maximum variance (with deflation to obtain more components)
  – Synthesis view: find the basis v_1, ..., v_k such that all x_i have low reconstruction error when decomposed on this basis

• For regular PCA, the two views are equivalent

• Sparse extensions
  – Interpretability
  – High-dimensional inference
  – The two views are different
Sparse principal component analysis
Analysis view

• DSPCA (d'Aspremont et al., 2007), with A = (1/n) X⊤X ∈ R^{p×p}
  – max_{‖v‖_2 = 1, ‖v‖_0 ≤ k} v⊤Av  relaxed into  max_{‖v‖_2 = 1, ‖v‖_1 ≤ k^{1/2}} v⊤Av
  – using M = vv⊤, itself relaxed into  max_{M ≽ 0, tr M = 1, 1⊤|M|1 ≤ k} tr AM

• Requires deflation for multiple components (Mackey, 2009)

• More refined convex relaxation (d'Aspremont et al., 2008)

• Nonconvex analysis (Moghaddam et al., 2006b)

• Applications beyond interpretable principal components
  – used as sufficient conditions for high-dimensional inference
Sparse principal component analysis
Synthesis view

• Find v_1, ..., v_m ∈ R^p sparse so that
    Σ_{i=1}^n min_{u ∈ R^m} ‖ x_i − Σ_{j=1}^m u_j v_j ‖_2²   is small

• Equivalent to looking for U ∈ R^{n×m} and V ∈ R^{p×m} such that V is sparse and ‖X − UV⊤‖_F² is small

• Sparse formulation (Witten et al., 2009; Bach et al., 2008)
  – Penalize columns v_i of V by the ℓ1-norm for sparsity
  – Penalize columns u_i of U by the ℓ2-norm to avoid trivial solutions
      min_{U,V} ‖X − UV⊤‖_F² + λ Σ_{i=1}^m { ‖u_i‖_2² + ‖v_i‖_1² }
Structured matrix factorizations

    min_{U,V} ‖X − UV⊤‖_F² + λ Σ_{i=1}^m { ‖u_i‖_2 + ‖v_i‖_2 }

• Penalizing by ‖u_i‖_2 + ‖v_i‖_2 is equivalent to constraining ‖u_i‖ ≤ 1 and penalizing by ‖v_i‖ (Bach et al., 2008)

• Optimization by alternating minimization (non-convex) (sketch below)

• u_i: decomposition coefficients (or "code"); v_i: dictionary elements

• Sparse PCA = sparse dictionary (ℓ1-norm on v_i)

• Dictionary learning = sparse decompositions (ℓ1-norm on u_i)
  – Olshausen and Field (1997); Elad and Aharon (2006); Raina et al. (2007)
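A minimal numpy sketch of alternating minimization for dictionary learning, min_{U,V} ‖X − UV⊤‖_F² + λ Σ_i ‖U(i,:)‖_1 with unit-norm dictionary columns; the sparse coding step uses a few ISTA passes, and the renormalization of V is a common heuristic to stay in the constraint set:

```python
import numpy as np

def soft_threshold(v, a):
    return np.sign(v) * np.maximum(np.abs(v) - a, 0.0)

def dictionary_learning(X, m, lam, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    V = rng.standard_normal((p, m))
    V /= np.linalg.norm(V, axis=0, keepdims=True)
    U = np.zeros((n, m))
    for _ in range(iters):
        # sparse coding step: ISTA on U with V fixed (stepsize 1/(2*||V||^2))
        L = np.linalg.norm(V, 2) ** 2
        for _ in range(50):
            U = soft_threshold(U - (U @ V.T - X) @ V / L, lam / (2 * L))
        # dictionary update step: least squares on V with U fixed, then renormalize
        V = np.linalg.lstsq(U, X, rcond=None)[0].T
        V /= np.maximum(np.linalg.norm(V, axis=0), 1e-12)
    return U, V
```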
Dictionary learning for image denoising

    x̃ (measurements) = x (original image) + ε (noise)
Dictionary learning for image denoising

• Solving the denoising problem (Elad and Aharon, 2006)
  – Extract all overlapping 8×8 patches x_i ∈ R^64
  – Form the matrix X = [x_1, ..., x_n]⊤ ∈ R^{n×64}
  – Solve a matrix factorization problem:
      min_{U,V} ||X − UV⊤||_F² = Σ_{i=1}^n ||x_i − V U(i,:)||_2²
    where U is sparse, and V is the dictionary
  – Each patch is decomposed into x_i = V U(i,:)
  – Average the reconstructions V U(i,:) of each patch x_i to reconstruct a full-sized image

• The number of patches n is large (= number of pixels)
Online optimization for dictionary learning

    min_{U ∈ R^{n×m}, V ∈ C} Σ_{i=1}^n ||x_i − V U(i,:)||_2² + λ||U(i,:)||_1
    C = {V ∈ R^{p×m} s.t. ∀j = 1, ..., m, ||V(:,j)||_2 ≤ 1}

• Classical optimization alternates between U and V

• Good results, but very slow!

• Online learning (Mairal, Bach, Ponce, and Sapiro, 2009a) can
  – handle potentially infinite datasets
  – adapt to dynamic training sets
Denoising result (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009c)

[Figure: denoising results on natural images]
What does the dictionary V look like?

[Figure: learned dictionary elements (8×8 patches)]
Inpainting a 12-Mpixel photograph

[Figure: inpainting results on a 12-megapixel photograph]
Alternative usages of dictionary learning

• Use the "code" U as a representation of observations for subsequent processing (Raina et al., 2007; Yang et al., 2009)

• Adapt dictionary elements to specific tasks (Mairal et al., 2009b)
  – Discriminative training for weakly supervised pixel classification (Mairal et al., 2008)
Sparse Structured PCA (Jenatton, Obozinski, and Bach, 2009b)

• Learning sparse and structured dictionary elements:
    min_{U,V} ‖X − UV⊤‖_F² + λ Σ_{i=1}^m { ‖u_i‖_2 + ‖v_i‖_2 }

• Structured norm on the dictionary elements
  – grouped penalty with overlapping groups to select specific classes of sparsity patterns
  – use prior information for better reconstruction and/or added robustness

• Efficient learning procedures through η-tricks (closed-form updates)
Application to face databases (1/3)

[Figure: dictionary elements from raw data and from (unstructured) NMF]

• NMF obtains partially local features
Application to face databases (2/3)

[Figure: dictionary elements from (unstructured) sparse PCA vs. structured sparse PCA]

• Enforce selection of convex nonzero patterns ⇒ robustness to occlusion
Application to face databases (3/3)

• Quantitative performance evaluation on a classification task

[Figure: % correct classification vs. dictionary size (20 to 140) for raw data, PCA, NMF, SPCA, shared-SPCA, SSPCA, shared-SSPCA]
Topic models and matrix factorization

• Latent Dirichlet allocation (Blei et al., 2003)
  – For a document, sample θ ∈ R^k from a Dirichlet(α)
  – For the n-th word of the same document,
    ∗ sample a topic z_n from a multinomial with parameter θ
    ∗ sample a word w_n from a multinomial with parameter β(z_n, :)

• Interpretation as multinomial PCA (Buntine and Perttu, 2003)
  – Marginalizing over the topic z_n, given θ, each word w_n is selected from a multinomial with parameter Σ_{z=1}^k θ_z β(z, :) = β⊤θ
  – Rows of β = dictionary elements, θ = code for a document
Topic models and matrix factorization

• Two different views on the same problem
  – Interesting parallels to be made
  – Common problems to be solved

• Structure on dictionary/decomposition coefficients with adapted priors, e.g., nested Chinese restaurant processes (Blei et al., 2004)

• Other priors and probabilistic formulations (Griffiths and Ghahramani, 2006; Salakhutdinov and Mnih, 2008; Archambeau and Bach, 2008)

• Identifiability and interpretation/evaluation of results

• Discriminative tasks (Blei and McAuliffe, 2008; Lacoste-Julien et al., 2008; Mairal et al., 2009b)

• Optimization and local minima
Sparsifying linear methods

• Same pattern as with kernel methods
  – High-dimensional inference rather than non-linearities

• Main difference: in general no unique way

• Sparse CCA (Sriperumbudur et al., 2009; Hardoon and Shawe-Taylor, 2008; Archambeau and Bach, 2008)

• Sparse LDA (Moghaddam et al., 2006a)

• Sparse ...
Sparse methods for matrices
Summary

• Structured matrix factorization has many applications

• Algorithmic issues
  – Dealing with large datasets
  – Dealing with structured sparsity

• Theoretical issues
  – Identifiability of structures, dictionaries or codes
  – Other approaches to sparsity and structure

• Non-convex optimization versus convex optimization
  – Convexification through unbounded dictionary size (Bach et al., 2008; Bradley and Bagnell, 2009) - few performance improvements
Sparse methods for machine learning
Outline

• Introduction - Overview

• Sparse linear estimation with the ℓ1-norm
  – Convex optimization and algorithms
  – Theoretical results

• Structured sparse methods on vectors
  – Groups of features / Multiple kernel learning
  – Extensions (hierarchical or overlapping groups)

• Sparse methods on matrices
  – Multi-task learning
  – Matrix factorization (low-rank, sparse PCA, dictionary learning)
Links with compressed sensing
(Baraniuk, 2007; Candes and Wakin, 2008)

• Goal of compressed sensing: recover a signal w ∈ R^p from only n measurements y = Xw ∈ R^n

• Assumptions: the signal is k-sparse, n much smaller than p

• Algorithm: min_{w ∈ R^p} ‖w‖_1 such that y = Xw (sketch below)

• Sufficient condition on X and (k, n, p) for perfect recovery:
  – Restricted isometry property (all small submatrices of X⊤X must be well-conditioned)
  – Such matrices are hard to come up with deterministically, but random ones are OK with k log p = O(n)

• Random X for machine learning?
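A minimal sketch of noiseless basis pursuit, min ‖w‖_1 s.t. Xw = y, written as a linear program in split variables w = w⁺ − w⁻ ≥ 0 (assuming SciPy is available):

```python
import numpy as np
from scipy.optimize import linprog  # assumed available

def basis_pursuit(X, y):
    n, p = X.shape
    c = np.ones(2 * p)                        # sum of w_plus and w_minus = ||w||_1
    A_eq = np.hstack([X, -X])                 # X (w_plus - w_minus) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p), method="highs")
    return res.x[:p] - res.x[p:]
```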
Why use sparse methods?

• Sparsity as a proxy to interpretability
  – Structured sparsity

• Sparse methods are not limited to least-squares regression

• Faster training/testing

• Better predictive performance?
  – Problems are sparse if you look at them the right way
  – Problems are sparse if you make them sparse
Conclusion - Interesting questions/issues

• Implicit vs. explicit features
  – Can we algorithmically achieve log p = O(n) with explicit unstructured features?

• Norm design
  – What type of behavior may be obtained with sparsity-inducing norms?

• Overfitting and convexity
  – Do we actually need convexity for matrix factorization problems?
Hiring postdocs and PhD students
European Research Council project on Sparse structured methods for machine learning

• PhD positions

• 1-year and 2-year postdoctoral positions

• Machine learning (theory and algorithms), computer vision, audio processing, signal processing

• Located in downtown Paris (Ecole Normale Superieure - INRIA)

• http://www.di.ens.fr/~fbach/sierra/
Refe
rence
s
J.A
bern
ethy,
F.B
ach,T
.Evg
enio
u,an
dJ.-P
.Vert.
Anew
appro
achto
collab
orativefilterin
g:
Operator
estimatio
nw
ithsp
ectralreg
ularizatio
n.
Journ
alofM
achin
eLearn
ing
Research
,10:8
03–826,2009.
Y.A
mit,
M.Fin
k,N
.Srebro
,an
dS.U
llman
.U
nco
vering
shared
structu
resin
multiclass
classificatio
n.
InPro
ceedin
gs
ofth
e24th
intern
ational
conferen
ceon
Mach
ine
Learn
ing
(ICM
L),
2007.
C.
Arch
ambeau
and
F.
Bach
.Sparse
probab
ilisticpro
jections.
InA
dvan
cesin
Neu
ralIn
formatio
n
Pro
cessing
System
s21
(NIP
S),
2008.
A.
Arg
yriou,
C.A
.M
icchelli,
and
M.
Pontil.
On
spectral
learnin
g.
Journ
alof
Mach
ine
Learn
ing
Research
,2009.
To
appear.
F.
Bach
.H
igh-D
imen
sional
Non-L
inear
Variab
leSelectio
nth
rough
Hierarch
icalK
ernel
Learn
ing.
Tech
nical
Rep
ort0909.0
844,arX
iv,2009a.
F.
Bach
.B
olasso
:m
odel
consisten
tlasso
estimatio
nth
rough
the
bootstrap
.In
Pro
ceedin
gs
of
the
Twen
ty-fifth
Intern
ational
Conferen
ceon
Mach
ine
Learn
ing
(ICM
L),
2008a.
F.
Bach
.Consisten
cyof
the
gro
up
Lasso
and
multip
lekern
ellearn
ing.
Journ
alof
Mach
ine
Learn
ing
Research
,9:1
179–1225,2008b.
F.
Bach
.Exp
loring
large
feature
spaces
with
hierarch
icalm
ultip
lekern
ellearn
ing.
InA
dvan
cesin
Neu
ralIn
formatio
nPro
cessing
System
s,2008c.
F.B
ach.
Self-co
ncord
ant
analysis
forlo
gistic
regressio
n.
Tech
nical
Rep
ort0910.4
627,A
rXiv,
2009b.
F.B
ach.Consisten
cyoftrace
norm
min
imizatio
n.Jo
urn
alofM
achin
eLearn
ing
Research
,9:1
019–1048,
2008d.
F.B
ach,G
.R.G
.Lan
ckriet,an
dM
.I.
Jordan
.M
ultip
lekern
ellearn
ing,
conic
duality,
and
the
SM
O
algorith
m.
InPro
ceedin
gs
ofth
eIn
ternatio
nal
Conferen
ceon
Mach
ine
Learn
ing
(ICM
L),
2004a.
F.B
ach,R.T
hib
aux,
and
M.I.
Jordan
.Com
putin
greg
ularizatio
npath
sfor
learnin
gm
ultip
lekern
els.
InA
dvan
cesin
Neu
ralIn
formatio
nPro
cessing
System
s17,2004b.
F.B
ach,J.
Mairal,
and
J.Ponce.
Convex
sparse
matrix
factorizations.
Tech
nical
Rep
ort0812.1
869,
ArX
iv,2008.
O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9:485–516, 2008.

R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118–121, 2007.

R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical report, arXiv:0808.3572, 2008.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

D. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.

P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2009.

D. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems, 16:106, 2004.

D. M. Blei and J. McAuliffe. Supervised topic models. In Advances in Neural Information Processing Systems (NIPS), volume 20, 2008.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, 2003.

J. F. Bonnans, J. C. Gilbert, C. Lemarechal, and C. A. Sagastizábal. Numerical Optimization: Theoretical and Practical Aspects. Springer, 2003.

J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization. Number 3 in CMS Books in Mathematics. Springer-Verlag, 2000.

L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), volume 20, 2008.

S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

D. Bradley and J. D. Bagnell. Convex coding. In Proceedings of the Twenty-Fifth Annual Conference on Uncertainty in Artificial Intelligence (UAI-09), 2009.

F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. Annals of Statistics, 35(4):1674–1697, 2007.
W. Buntine and S. Perttu. Is multinomial PCA multi-faceted clustering or dimensionality reduction. In International Workshop on Artificial Intelligence and Statistics (AISTATS), 2003.

E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351, 2007.

E. Candes and M. Wakin. An introduction to compressive sampling. IEEE Signal Processing Magazine, 25(2):21–30, 2008.

E. J. Candes and Y. Plan. Matrix completion with noise. 2009a. Submitted.

E. J. Candes and Y. Plan. Near-ideal model selection by ℓ1 minimization. The Annals of Statistics, 37(5A):2145–2177, 2009b.

F. Caron and A. Doucet. Sparse Bayesian nonparametric regression. In 25th International Conference on Machine Learning (ICML), 2008.

S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.

A. d'Aspremont and L. El Ghaoui. Testing the nullspace property using semidefinite programming. Technical Report 0807.3520v5, arXiv, 2008.

A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3):434–48, 2007.

A. d'Aspremont, F. Bach, and L. El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9:1269–1294, 2008.

D. L. Donoho and J. Tanner. Neighborliness of randomly projected simplices in high dimensions. Proceedings of the National Academy of Sciences of the United States of America, 102(27):9452, 2005.

B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407–451, 2004.
M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

J. Fan and R. Li. Variable Selection Via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96(456):1348–1361, 2001.

M. Fazel, H. Hindi, and S. P. Boyd. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, volume 6, pages 4734–4739, 2001.

C. Fevotte, N. Bertin, and J.-L. Durrieu. Nonnegative matrix factorization with the Itakura-Saito divergence, with application to music analysis. Neural Computation, 21(3), 2009.

J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.

W. Fu. Penalized regressions: the bridge vs. the Lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. Advances in Neural Information Processing Systems (NIPS), 18, 2006.

D. R. Hardoon and J. Shawe-Taylor. Sparse canonical correlation analysis. In Sparsity and Inverse Problems in Statistical Theory and Econometrics, 2008.

T. J. Hastie and R. J. Tibshirani. Generalized Additive Models. Chapman & Hall, 1990.

J. Huang and T. Zhang. The benefit of group sparsity. Technical Report 0901.2962v2, arXiv, 2009.

J. Huang, S. Ma, and C.-H. Zhang. Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica, 18:1603–1618, 2008.
J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.

H. Ishwaran and J. S. Rao. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.

L. Jacob, G. Obozinski, and J.-P. Vert. Group Lasso with overlaps and graph Lasso. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009a.

R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009b.

A. Juditsky and A. Nemirovski. On verifiable sufficient conditions for sparse signal recovery via ℓ1-minimization. Technical Report 0809.2650v1, arXiv, 2008.

G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applicat., 33:82–95, 1971.

S. Lacoste-Julien, F. Sha, and M. I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. Advances in Neural Information Processing Systems (NIPS) 21, 2008.

G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20:2626–2635, 2004a.

G. R. G. Lanckriet, N. Cristianini, L. El Ghaoui, P. Bartlett, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004b.
H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. In Advances in Neural Information Processing Systems (NIPS), 2007.

Y. Lin and H. H. Zhang. Component selection and smoothing in multivariate nonparametric regression. Annals of Statistics, 34(5):2272–2297, 2006.

J. Liu, S. Ji, and J. Ye. Multi-task feature learning via efficient ℓ2,1-norm minimization. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

K. Lounici. Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators. Electronic Journal of Statistics, 2:90–102, 2008.

K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In Conference on Computational Learning Theory (COLT), 2009.

J. Lv and Y. Fan. A unified approach to model selection and sparse recovery using regularized least squares. Annals of Statistics, 37(6A):3498–3528, 2009.

L. Mackey. Deflation methods for sparse PCA. Advances in Neural Information Processing Systems (NIPS), 21, 2009.

N. Maculan and G. J. R. Galdino de Paula. A linear-time median-finding algorithm for projecting a vector on the simplex of R^n. Operations Research Letters, 8(4):219–222, 1989.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009a.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Supervised dictionary learning. Advances in Neural Information Processing Systems (NIPS), 21, 2009b.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV), 2009c.
H. M. Markowitz. The optimization of a quadratic function subject to linear constraints. Naval Research Logistics Quarterly, 3:111–133, 1956.

P. Massart. Concentration Inequalities and Model Selection: Ecole d'Ete de Probabilites de Saint-Flour 23. Springer, 2003.

R. Mazumder, T. Hastie, and R. Tibshirani. Spectral regularization algorithms for learning large incomplete matrices. 2009. Submitted.

N. Meinshausen. Relaxed Lasso. Computational Statistics and Data Analysis, 52(1):374–393, 2008.

N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436, 2006.

N. Meinshausen and P. Buhlmann. Stability selection. Technical report, arXiv:0809.2932, 2008.

N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics, 37(1):246–270, 2008.

C. A. Micchelli and M. Pontil. Learning the kernel function via regularization. Journal of Machine Learning Research, 6(2):1099, 2006.

B. Moghaddam, Y. Weiss, and S. Avidan. Generalized spectral bounds for sparse LDA. In Proceedings of the 23rd International Conference on Machine Learning (ICML), 2006a.

B. Moghaddam, Y. Weiss, and S. Avidan. Spectral bounds for sparse PCA: Exact and greedy algorithms. In Advances in Neural Information Processing Systems, volume 18, 2006b.

R. M. Neal. Bayesian Learning for Neural Networks. Springer Verlag, 1996.

Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Pub, 2003.
Y. Nesterov. Gradient methods for minimizing composite objective function. Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, Tech. Rep. 76, 2007.

G. Obozinski, M. J. Wainwright, and M. I. Jordan. High-dimensional union support recovery in multivariate regression. In Advances in Neural Information Processing Systems (NIPS), 2008.

G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, pages 1–22, 2009.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.

M. R. Osborne, B. Presnell, and B. A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000.

M. Pontil, A. Argyriou, and T. Evgeniou. Multi-task feature learning. In Advances in Neural Information Processing Systems, 2007.

R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.

A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491–2521, 2008.

S. W. Raudenbush and A. S. Bryk. Hierarchical Linear Models: Applications and Data Analysis Methods. Sage Pub., 2002.

P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. SpAM: Sparse additive models. In Advances in Neural Information Processing Systems (NIPS), 2008.

B. Recht, W. Xu, and B. Hassibi. Null space conditions and thresholds for rank minimization. 2009. Submitted.

J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005.
V. Roth and B. Fischer. The group-Lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.

R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, volume 20, 2008.

B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, 2001.

M. W. Seeger. Bayesian inference and optimal design for the sparse linear model. The Journal of Machine Learning Research, 9:759–813, 2008.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

S. Sonnenburg, G. Raetsch, C. Schaefer, and B. Schoelkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.

N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems 17, 2005.

B. K. Sriperumbudur, D. A. Torres, and G. R. G. Lanckriet. A d.c. programming approach to the sparse generalized eigenvalue problem. Technical Report 0901.1504v2, arXiv, 2009.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B, 58(1):267–288, 1996.

S. A. Van De Geer. High-dimensional generalized linear models and the Lasso. Annals of Statistics, 36(2):614, 2008.

E. van den Berg, M. Schmidt, M. P. Friedlander, and K. Murphy. Group sparsity via linear-time projection. Technical Report TR-2008-09, Department of Computer Science, University of British Columbia, 2009.

G. Wahba. Spline Models for Observational Data. SIAM, 1990.
M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory, 55(5):2183, 2009.

L. Wasserman and K. Roeder. High dimensional variable selection. Annals of Statistics, 37(5A):2178, 2009.

D. M. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515–534, 2009.

T. T. Wu and K. Lange. Coordinate descent algorithms for lasso penalized regression. Annals of Applied Statistics, 2(1):224–244, 2008.

J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49–67, 2006.

M. Yuan and Y. Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society Series B, 69(2):143–161, 2007.

T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. Advances in Neural Information Processing Systems, 22, 2008a.

T. Zhang. Multi-stage convex relaxation for learning with sparse regularization. Advances in Neural Information Processing Systems, 22, 2008b.

T. Zhang. On the consistency of feature selection using greedy least squares regression. The Journal of Machine Learning Research, 10:555–568, 2009.

P. Zhao and B. Yu. On model selection consistency of Lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.

P. Zhao, G. Rocha, and B. Yu. Grouped and hierarchical model selection through composite absolute penalties. Annals of Statistics, 37(6A):3468–3497, 2009.

H. Zou. The adaptive Lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B (Statistical Methodology), 67(2):301–320, 2005.

H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509–1533, 2008.