Parametric Submodular Optimization in Machine Learning

Page 1:

Parametric Submodular Optimization in Machine Learning

YOSHINOBU KAWAHARA

Trends in Machine Learning @ Kyoto Univ. (17 Mar., 2014)

Email: [email protected] web: http://www.ar.sanken.osaka-u.ac.jp/~kawahara/jp/

Page 2:

References

  Today’s talk is mainly about
–  “Minimum average cost clustering,” in Advances in Neural Information Processing Systems 23 (Proc. of NIPS’10),
–  “Size-constrained submodular minimization through minimum norm base,” in Proc. of ICML’11, and
–  “Structured convex optimization under submodular constraints,” in Proc. of UAI’13.

2

You can download the slides from

http://www.ar.sanken.osaka-u.ac.jp/~kawahara/jp/20130317tml.pdf

Page 3:

Outline

  Brief review on a submodular function.

  Illustrative example of parametric submodular minimization (PSM) (densest k-subgraph problem).

  Structure of PSM and connection to the minimum-norm-point problem.

  Application to structured sparse coding.

  Fast implementation and some extensions.

3

Page 4:

Submodularity (Convexity in Set Function)

  Submodularity is the structure in a set function that corresponds to convexity of a continuous function.
–  It has useful properties, such as the equivalence of global and local optimality, duality, and the separation theorem.

4

Corresponding concepts (useful to develop efficient algorithms):
Convexity (continuous function)  ↔  Submodularity (set function $f : 2^V \to \mathbb{R}$)

$f(S) + f(T) \ge f(S \cap T) + f(S \cup T) \qquad (S, T \subseteq V)$

[Figure: a ground set $V = \{1, \ldots, 7\}$ with a highlighted subset.]

Page 5:

Submodularity (Convexity in Set Function)

  A set function that satisfies the following inequality is called a submodular function (Lovász, 1983):

5

Diminishing-returns property (adding element $i$ to a smaller set gives a larger improvement than adding it to a larger set):

$f(S \cup \{i\}) - f(S) \;\ge\; f(T \cup \{i\}) - f(T) \qquad (S \subseteq T \subseteq V,\; i \in V \setminus T)$
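A minimal sketch, not from the slides, that numerically checks this diminishing-returns inequality on a small coverage function; the universe, groups, and weights below are made-up illustration values.

```python
import itertools

# Hypothetical coverage function: f(S) = total weight of the points covered
# by the sets indexed by S (coverage functions are submodular).
points = {"a": 2.0, "b": 1.0, "c": 3.0, "d": 1.5}
covers = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}, 4: {"d"}}
V = set(covers)

def f(S):
    covered = set().union(*(covers[i] for i in S)) if S else set()
    return sum(points[p] for p in covered)

# Check f(S ∪ {i}) - f(S) >= f(T ∪ {i}) - f(T) for all S ⊆ T ⊆ V, i ∈ V \ T.
for r in range(len(V) + 1):
    for T in itertools.combinations(sorted(V), r):
        T = set(T)
        for s in range(len(T) + 1):
            for S in itertools.combinations(sorted(T), s):
                S = set(S)
                for i in V - T:
                    gain_S = f(S | {i}) - f(S)
                    gain_T = f(T | {i}) - f(T)
                    assert gain_S >= gain_T - 1e-12
print("diminishing returns holds for this coverage function")
```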

Page 6:

Submodularity (Convexity in Set Function) 6

Example: sensor placement (variance reduction). $V$ is the set of locations where sensors can be placed, and
$f(S) := \mathrm{Var}(\emptyset) - \mathrm{Var}(S)$
is the reduction of the variance of the observation noise achieved by placing sensors at $S \subseteq V$. Adding a new location $i$ to a small set $S$ (e.g., $S = \{Y_1, Y_2\}$) gives a ‘large’ improvement, while adding it to a larger set $T$ (e.g., $T = \{Y_1, \ldots, Y_5\}$) gives a ‘small’ improvement:
$f(S \cup \{i\}) - f(S) \ge f(T \cup \{i\}) - f(T).$

[Figure: two copies of a floor plan (server, lab, kitchen, copy/elec, phone/quiet, storage, conference, offices) showing candidate sensor locations; from the IJCAI’09 tutorial “Intelligent Information Gathering and Submodular Function Optimization”.]

Page 7:

Other Submodular Functions

  Submodular functions are found in a variety of fields:
–  Information theory: (joint) entropy, (symmetric) mutual information, information gain, variance reduction, etc.

–  Graph theory: Cut function, cut capacity in network, rank of matroid etc.

–  Others: Utility function (supermodular), convex game, coverage function, determinant of matrix, multiple correlation coeff. (negative squared loss), etc.

7

Page 8:

Example (Submodular Function) 8

Cut function:
$f(S) = \sum \{\, c_e : e \in E(S, V \setminus S) \,\} \qquad (S \subseteq V)$
where $c_e$ is the weight on edge $e$ and $E(S, V \setminus S)$ is the set of edges with one endpoint in $S$ and the other in $V \setminus S$; e.g., $f(\{1, 2\}) = 3$ in the pictured graph on nodes $\{1, \ldots, 5\}$.

Coverage function:
$f(S) = \sum \{\, c_u : u \in \bigcup_{i \in S} S_i \,\} \qquad (S \subseteq V)$
where $c_u$ is the weight on each point $u$ and $V = \{S_1, S_2, S_3\}$; e.g., $f(\{1, 2\}) = 8$ in the pictured instance.
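A small sketch evaluating a cut function and checking its submodularity; the graph and weights are illustrative and are not the slide's exact 5-node example.

```python
import itertools

# Illustrative undirected weighted graph (made-up weights).
V = {1, 2, 3, 4, 5}
w = {(1, 2): 1.0, (1, 3): 2.0, (2, 3): 1.0, (3, 4): 3.0, (4, 5): 1.0}

def cut(S):
    """f(S) = sum of weights of edges with exactly one endpoint in S."""
    return sum(c for (u, v), c in w.items() if (u in S) != (v in S))

# Submodularity check: f(S) + f(T) >= f(S ∩ T) + f(S ∪ T) for all S, T.
subsets = [set(c) for r in range(len(V) + 1)
           for c in itertools.combinations(sorted(V), r)]
assert all(cut(S) + cut(T) >= cut(S & T) + cut(S | T) - 1e-12
           for S in subsets for T in subsets)
print("cut({1, 2}) =", cut({1, 2}))
```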

Page 9:

Other Definitions

  For any $S, T \subseteq V$, it satisfies
$f(S) + f(T) \ge f(S \cap T) + f(S \cup T).$
  The Lovász extension $\hat{f} : \mathbb{R}^V \to \mathbb{R}$ (to be described later) is convex.
  Other equivalent definitions are also known (cf. (Lovász, 1983), (Fujishige, 2005), (Bach, 2013), etc.).

9

*) If $(-f)$ is a submodular function, $f$ is called a supermodular function.

Page 10:

Submodular Optimization

  Just like a convex continuous function, a submodular set function can be minimized efficiently.

10

Discrete convex optimization: just as in the continuous domain, a submodular set function (and some other discrete functions) can be minimized efficiently.

[Diagram: difficulty of problems, continuous domain vs. set-function domain:
–  convex minimization ↔ submodular set-function minimization (e.g., energy minimization, discriminative learning);
–  convex maximization / D.C. problems ↔ submodular maximization, discrete D.C. problems, (constrained) submodular minimization (e.g., clustering, structure learning, densest subgraph, feature selection, active learning, MaxCut).]

Page 11:

Outline

  Brief review on a submodular function.

  Illustrative example of parametric submodular minimization (PSM) (densest k-subgraph problem).

  Structure of PSM and connection to the minimum-norm-point problem.

  Application to structured sparse coding.

  Fast implementation and some extensions.

11

Page 12:

Example (Densest k -subgraph) [Densest k-subgraph problem]

–  Find a subgraph with $k$ nodes of maximal intensity:

12

$\max_{S \subseteq V} I(S) \quad \text{s.t.}\ |S| = k$

where the intensity $I(S)$ is the sum of the weights on edges whose endpoints are both included in $S$ (a supermodular function). This problem is NP-hard.

[Figure: a weighted graph on nodes $\{1, \ldots, 6\}$; e.g., $I(\{2, 3, 6\}) = 2$.]

Page 13:

Example (Densest k -subgraph) [Densest k-subgraph problem]

13

(P1) $\max_{S \subseteq V} I(S) \ \ \text{s.t.}\ |S| = k$  ⇒ NP-hard problem

(P2) $\min_{S \subseteq V} \{\, -I(S) + \alpha|S| \,\} \quad (\forall \alpha \ge 0)$  ⇒ polynomially solvable (by a parametric flow algorithm)

13

[Figure: on the example graph, sweeping $\alpha$ with breakpoints at $\alpha = 0, 1, 3$ yields the nested sets $\emptyset$, $\{1, 5, 6\}$, $\{1, 2, 5, 6\}$, $V$; these are exact solutions of (P1) for $k = 0, 3, 4$ & $6$.]

Page 14:

Example (Densest k -subgraph) [Densest k-subgraph problem]

14

(P1) $\max_{S \subseteq V} I(S) \ \ \text{s.t.}\ |S| = k$  ⇒ NP-hard problem

(P2) $\min_{S \subseteq V} \{\, -I(S) + \alpha|S| \,\} \quad (\forall \alpha \ge 0)$  ⇒ polynomially solvable (by a parametric flow algorithm)

That is, we can recover part of the solutions of (P1) (for certain values of $k$) by solving (P2), as sketched below.
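A rough sketch of the parametric sweep on a toy instance. Brute force over subsets stands in for the parametric flow algorithm, and the graph and weights are made up (they are not the slide's example); per the slides, each minimizer found this way solves (P1) exactly for its own cardinality $k = |S|$.

```python
import itertools

# Toy graph: I(S) = total weight of edges with both endpoints in S (supermodular).
V = [1, 2, 3, 4, 5, 6]
w = {(1, 2): 1.0, (2, 3): 1.0, (1, 5): 3.0, (5, 6): 2.0, (1, 6): 2.0, (3, 4): 1.0}

def I(S):
    return sum(c for (u, v), c in w.items() if u in S and v in S)

def best_set(alpha):
    """Brute-force minimizer of -I(S) + alpha*|S| (stand-in for parametric flow)."""
    subsets = (set(c) for r in range(len(V) + 1)
               for c in itertools.combinations(V, r))
    return min(subsets, key=lambda S: -I(S) + alpha * len(S))

# Sweep alpha and report each distinct optimal set together with its size k.
seen = []
for alpha in [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 5.0]:
    S = best_set(alpha)
    if S not in seen:
        seen.append(S)
        print(f"alpha={alpha:.1f}  k={len(S)}  S={sorted(S)}  I(S)={I(S)}")
```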

Page 15:

Example (Densest k-subgraph)

  Empirical example: exact densest k-subgraphs in a huge graph can be found in a short time.

15

[Figure: number of edges in the found subgraphs plotted against the number of nodes of the found subgraphs, for two datasets.]
–  Amazon’s customer graph (735,323 nodes, 5,158,388 edges)
–  DBLP co-authors graph (326,186 nodes, 1,615,400 edges)
The two runs took 19.68 and 127.51 seconds.

Page 16:

More Generally… 16

(P1) Constrained submodular minimization (NP-hard), with $f$ submodular:
$\min_{S \subseteq V} f(S) \ \ \text{s.t.}\ |S| = k$

(P2) Parametric submodular minimization (PSM) (polynomially solvable):
$\min_{S \subseteq V} \{\, f(S) - \alpha|S| \,\} \quad (\forall \alpha \ge 0)$

Solving (P2) for all $\alpha \ge 0$ yields a nested chain
$(\emptyset =)\, S_0 \subseteq S_1 \subseteq \cdots \subseteq S_l \,(= V),$
and each $S_j$ is an exact solution of (P1) for some $k$.

Similar techniques are useful for many problems in ML, such as
−  the proximal gradient method (structured regularization),
−  clustering, and so on.

Page 17:

Outline

  Brief review on a submodular function.

  Illustrative example of parametric submodular minimization (PSM) (densest k-subgraph problem).

  Structure of PSM and its relation to the minimum-norm-point (MNP) problem.
–  Connection between the constrained problem and PSM.
–  Connection between the MNP problem and PSM.

  Application to structured sparse coding.

  Fast implementation and some extensions.

17

Page 18:

Structure of Constrained Problem

  Consider the following problem:
$\min_{S \subseteq V} f(S) \ \ \text{s.t.}\ |S| = k,$
and, for a parameter $\gamma$, define
$S(\gamma) := \arg\min_{S \subseteq V} \left\{ \frac{f(S)}{|S| - \gamma} : |S| > \gamma \right\}.$

18

Lemma (Nagano et al., 2011)
For $\gamma \in [0, |V|)$, $S(\gamma)$ is an optimal solution to the size-constrained problem w.r.t. $k = |S(\gamma)|$.

This lemma can be proven because $k = |S(\gamma)| > \gamma$ and
$\frac{f(S(\gamma))}{k - \gamma} \le \frac{f(S)}{k - \gamma}$
for any subset $S$ with $|S| = k$.

Page 19:

Structure of Constrained Problem

  Consider the following problem:

19

$S(\gamma) := \arg\min_{S \subseteq V} \left\{ \frac{f(S)}{|S| - \gamma} : |S| > \gamma \right\}$

$h(\alpha) = \min_{S \subseteq V} \{\, f(S) - \alpha|S| \,\}, \qquad \alpha(\gamma) = \max\{\alpha \in \mathbb{R} : -\gamma\alpha \le h(\alpha)\}$

Indeed,
$\alpha(\gamma) := \min_{S \subseteq V} \left\{ \frac{f(S)}{|S| - \gamma} : |S| > \gamma \right\}$
$= \max\left\{ \alpha \in \mathbb{R}_{\ge 0} : \alpha \le \frac{f(S)}{|S| - \gamma} \ \text{for all } S \text{ with } |S| > \gamma \right\}$
$= \max\left\{ \alpha \in \mathbb{R}_{\ge 0} : -\gamma\alpha \le f(S) - \alpha|S| \ \text{for all } S \right\}.$

Page 20:

Structure of Constrained Problem 20

$h(\alpha) = \min_{S \subseteq V} \{\, f(S) - \alpha|S| \,\}, \qquad \alpha(\gamma) = \max\{\alpha \in \mathbb{R} : -\gamma\alpha \le h(\alpha)\}$

$\alpha(\gamma)$ is found by solving $-\gamma\alpha = h(\alpha)$.

[Figure: $h(\alpha)$ is a piecewise-linear concave function of $\alpha$; its linear pieces are $f(S_0) - \alpha|S_0|$, $f(S_1) - \alpha|S_1|$, $f(S_2) - \alpha|S_2|$, $f(S_3) - \alpha|S_3|$, and $\alpha(\gamma)$ is the intersection of $h(\alpha)$ with the line $-\gamma\alpha$.]
Each linear part of $h(\alpha)$ corresponds to a solution of the size-constrained problem.

Page 21:

Base Polyhedron

  For a submodular function $f$ with $f(\emptyset) = 0$, the base polyhedron is defined by
$B(f) = \{\, x \in \mathbb{R}^V : x(S) \le f(S)\ (\forall S \subseteq V),\ x(V) = f(V) \,\}.$

21

[Figure: $B(f)$ for $|V| = 2$ (a segment in the $(x_1, x_2)$-plane) and for $|V| = 3$ (a polygon in $(x_1, x_2, x_3)$-space).]
For $|V| = 2$: $2^2 - 2 = 2$ inequalities and 1 equality; for $|V| = 3$: $2^3 - 2 = 6$ inequalities and 1 equality.
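A small sketch of the base polyhedron, using the standard fact that Edmonds' greedy algorithm produces extreme points of $B(f)$; the cut-function instance is made up for illustration.

```python
import itertools

# Small normalized submodular function: a cut function on a weighted triangle.
V = [1, 2, 3]
w = {(1, 2): 1.0, (2, 3): 2.0, (1, 3): 1.0}
f = lambda S: sum(c for (u, v), c in w.items() if (u in S) != (v in S))

def greedy_vertex(order):
    """Edmonds' greedy algorithm: each ordering of V yields an extreme point of B(f)."""
    x, prefix = {}, set()
    for i in order:
        x[i] = f(prefix | {i}) - f(prefix)
        prefix.add(i)
    return x

subsets = [set(c) for r in range(len(V) + 1) for c in itertools.combinations(V, r)]
for order in itertools.permutations(V):
    x = greedy_vertex(order)
    # Membership in B(f): x(S) <= f(S) for all S, and x(V) = f(V).
    assert all(sum(x[i] for i in S) <= f(S) + 1e-12 for S in subsets)
    assert abs(sum(x.values()) - f(set(V))) < 1e-12
print("all greedy vertices lie in B(f)")
```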

Page 22:

Minimum-Norm-Point Problem   The minimum-norm-point (MNP) problem:

22

$\min_{x \in B(f)} \sum_{i \in V} \frac{x_i^2}{b_i}$

[Figure: for $|V| = 2$, the weighted minimum-norm point $x^*$ on the base polyhedron $B(f)$.]
Efficiently solvable by the minimum-norm-point (MNP) algorithm (Fujishige, 1980).

Page 23:

Minimum-Norm-Point Problem

  The parametric submodular problem and the MNP problem are essentially equivalent:

23

Parametric submodular minimization:
$\min_{S \subseteq V} \{\, f(S) - \alpha \cdot b(S) \,\} \quad (\forall \alpha \ge 0)$
Sequence of solutions: $(\emptyset =)\, S_0 \subseteq S_1 \subseteq \cdots \subseteq S_l \,(= V)$

Minimum norm base:
$x^* = \arg\min_{x \in B(f)} \sum_{i \in V} \frac{x_i^2}{b_i}$

1. Let $\lambda_1 < \lambda_2 < \cdots < \lambda_m$ be the distinct values of $x^*_i / b_i$.
2. Set $S_0 = \emptyset$ and $S_j = \{\, i \in V : x^*_i / b_i \le \lambda_j \,\}$.

Page 24:

Minimum-Norm-Point Problem

  The parametric submodular problem and the MNP problem are essentially equivalent:

24

Parametric submodular minimization:
$\min_{S \subseteq V} \{\, f(S) - \alpha \cdot b(S) \,\} \quad (\forall \alpha \ge 0)$
Sequence of solutions: $(\emptyset =)\, S_0 \subseteq S_1 \subseteq \cdots \subseteq S_l \,(= V)$

Minimum norm base:
$x^* = \arg\min_{x \in B(f)} \sum_{i \in V} \frac{x_i^2}{b_i}$

$x^*_i = \frac{f(S_{j+1}) - f(S_j)}{b(S_{j+1} \setminus S_j)} \cdot b_i$ for each $i \in S_{j+1} \setminus S_j$ $(j = 0, 1, \ldots, l-1)$.
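A minimal sketch of the thresholding step on this slide. The minimum-norm base $x^*$ and weights $b$ below are hypothetical values chosen only to illustrate the recipe; in practice $x^*$ would come from an MNP solver.

```python
# Hypothetical minimum-norm base x* and weights b (illustration only).
xstar = {1: -1.0, 2: -1.0, 3: 0.5, 4: 0.5, 5: 2.0}
b = {i: 1.0 for i in xstar}

# Slide's recipe: sort the distinct values lam_1 < ... < lam_m of x*_i / b_i
# and take S_j = {i : x*_i / b_i <= lam_j}; the S_j are nested and, per the
# slides, solve the parametric problem min f(S) - alpha * b(S).
levels = sorted({xstar[i] / b[i] for i in xstar})
chain = [set()]
for lam in levels:
    chain.append({i for i in xstar if xstar[i] / b[i] <= lam + 1e-12})
for S in chain:
    print(sorted(S))
```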

Page 25:

Summary of the point so far 25

Minimum-norm-point problem:
$\min_{x \in B(f)} \sum_{i \in V} \frac{x_i^2}{b_i}$
⇕ (equivalent)
Parametric submodular minimization:
$\min_{S \subseteq V} \{\, f(S) - \alpha \cdot b(S) \,\} \quad (\forall \alpha \ge 0)$
whose nested solutions $(\emptyset =)\, S_0 \subseteq S_1 \subseteq \cdots \subseteq S_l \,(= V)$ give exact solutions of the size-constrained submodular minimization
$\min_{S \subseteq V} f(S) \ \ \text{s.t.}\ |S| = k$
w.r.t. some values of $k$.

Page 26:

Outline

  Brief review on a submodular function.

  Illustrative example of parametric submodular minimization (PSM) (densest k-subgraph problem).

  Structure of PSM and connection to the minimum-norm-point problem.

  Application to structured sparse coding.

  Fast implementation and some extensions.

26

Page 27:

Structured Regularization 27

  Structured regularization allows us to incorporate prior information on combinatorial structures in variables into learning processes.

Examples of such structures: graph structure, group structure, hierarchical structure; also paths on a directed graph, block structures in 2-d grids, etc.

[Embedded figure from “Sparsity tutorial II” (ECML 2010, Barcelona): sparsity patterns induced for $L(w) + \Omega(w)$. Lasso: $\Omega(w) = \sum_i |w_i|$; Group Lasso (Yuan and Lin, 2006): $\Omega(w) = \sum_{g \in \mathcal{G}} \|w_g\|$; Group Lasso when groups overlap: the support obtained is an intersection of the complements of the groups set to 0, not a union of groups (cf. Jenatton et al. (2009)).]

Learning problem with structured regularization:
$\min_{w \in \mathbb{R}^V} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, g(x_i; w)) + \lambda \cdot \Omega(w)$
where $\ell$ is the loss for the learning task, $g$ is the learning model, $w$ are the model parameters ($V$ are their indices), and the structured regularizer $\Omega$ incorporates these combinatorial structures.

Page 28:

Example (Background Subtraction) (1)

  Use the adjacency structure in an image (Mairal+ 2011):

28

Estimate the background of a test image $y$ from a training video sequence $X$ ($N$ frames):
$\min_{a \in \mathbb{R}^N,\, e \in \mathbb{R}^d} \ \frac{1}{2}\|y - Xa - e\|_2^2 + \lambda \cdot \Omega(e)$
[Figure: test image $y$, estimated background, and the training video sequence $X$.]

Page 29:

  Use the adjacency structure in an image (Mairal+ 2011):

Example (Background Subtraction) (2) 29

$\ell_1$ regularization (Olshausen & Field, 1996) vs. structured (group) regularization:

[Figure 4 from Mairal et al., “Convex and Network Flow Optimization for Structured Sparsity”: for two videos, the original image $y$, the estimated background (i.e., $Xw$) reconstructed by the method, and the foreground (i.e., the sparsity pattern of $e$ as a mask on the original image) detected with $\ell_1$ (87.1% / 90.5%), $\ell_1$ + $\Omega$ with non-overlapping groups (96.3% / 92.6%), and $\ell_1$ + $\Omega$ with overlapping groups (98.9% / 93.8%); panels (f) and (l) show another foreground found with $\Omega$ on a different image with the same values of $\lambda_1, \lambda_2$.]


Group regularization, where the groups are given as overlapping (3×3) patches:
$\Omega(w) = \sum_{g \in \mathcal{G}} \|w_g\|_2$
(accuracy 98.9%, vs. 87.1% with $\ell_1$ regularization alone).

Page 30:

Lovász extension (1)   Continuous relaxation for set functions (Lovász, 1983)

30

Def.) Given any real vector $p \in \mathbb{R}^n$, denote its distinct values by $p_1 > p_2 > \cdots > p_m$. Then the Lovász extension is defined by
$\hat{f}(p) = \sum_{k=1}^{m-1} (p_k - p_{k+1})\, f(U_k) + p_m f(U_m),$
where $U_k = \{\, i \in V : p_i \ge p_k \,\}$.

Theorem (Lovász, 1983)
A set function $f : 2^V \to \mathbb{R}$ defined on a finite set $V$ is submodular if and only if its Lovász extension $\hat{f} : \mathbb{R}^{|V|} \to \mathbb{R}$ is convex.

Set func. $f$ is submodular  ⇔  Lovász ext. $\hat{f}$ is convex.

Page 31:

Lovász extension (2)   Continuous relaxation for set functions (Lovász, 1983)

31

Ex.) $|V| = 2$, with $f(\emptyset) = 0$, $f(\{1\}) = 0.8$, $f(\{2\}) = 0.5$, $f(V) = 0.2$.
For $p = (0.2, 0.6)$:
1. Align the values in decreasing order: $p_1 = 0.6 > p_2 = 0.2$, so $U_1 = \{2\}$, $U_2 = \{1, 2\}$.
2. From the definition:
$\hat{f}(p) = (0.6 - 0.2)\, f(\{2\}) + 0.2\, f(V) = 0.24.$
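A short sketch that transcribes the slide's definition of the Lovász extension and reproduces the worked example above (the value 0.24); the function values are taken from the slide.

```python
def lovasz_extension(f, p):
    """Evaluate the Lovász extension at p, following the slide's definition:
    sort the distinct values p_1 > ... > p_m and sum (p_k - p_{k+1}) f(U_k),
    plus p_m f(U_m), where U_k = {i : p_i >= p_k}."""
    vals = sorted(set(p.values()), reverse=True)
    total = 0.0
    for k, v in enumerate(vals):
        U_k = frozenset(i for i in p if p[i] >= v)
        weight = v - vals[k + 1] if k + 1 < len(vals) else v
        total += weight * f(U_k)
    return total

# The slide's |V| = 2 example: f(∅)=0, f({1})=0.8, f({2})=0.5, f({1,2})=0.2.
f_table = {frozenset(): 0.0, frozenset({1}): 0.8,
           frozenset({2}): 0.5, frozenset({1, 2}): 0.2}
f = lambda S: f_table[frozenset(S)]
print(lovasz_extension(f, {1: 0.2, 2: 0.6}))   # -> 0.24
```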

Page 32:

Structured Regularization

  Many of the known structured norms are the Lovász extensions of submodular functions (Bach, NIPS’10-11).

32

If the structured regularizer $\Omega(w)$ is the Lovász extension $\hat{f}(w)$ of a submodular function $f$, the learning problem
$\min_{w \in \mathbb{R}^V} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, g(x_i; w)) + \lambda \cdot \Omega(w)$
becomes
$\min_{w \in \mathbb{R}^V} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, g(x_i; w)) + \lambda \cdot \hat{f}(w).$

Page 33:

Ex. of Structured Reg. by the Lovász Ext.

(Generalized) Fused Regularization: regularize variables so that neighboring ones on a graph $G = (V, E)$, where each node corresponds to a variable, tend to be similar.

33

The values of neighboring variables become similar:
$\Omega(w) = \sum_{(i,j) \in E} a_{ij}\, |w_i - w_j|$
where $a_{ij}$ is an element of the adjacency matrix. This (generalized) fused regularization is equivalent to the Lovász extension of the cut function
$f(S) = \sum \{\, a_{ij} : i \in S,\ j \in V \setminus S \,\}.$
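A small numerical check of this equivalence (the graph and weights are made-up illustration values): the fused penalty is compared against the Lovász extension of the cut function, evaluated with the definition from the earlier slide.

```python
import itertools, random

# Undirected graph with weights a_ij (illustrative values).
a = {(1, 2): 1.0, (2, 3): 2.0, (3, 4): 0.5, (1, 4): 1.5}
V = [1, 2, 3, 4]

cut = lambda S: sum(c for (i, j), c in a.items() if (i in S) != (j in S))
fused = lambda w: sum(c * abs(w[i] - w[j]) for (i, j), c in a.items())

def lovasz(f, p):
    # Same definition as on the Lovász-extension slide (distinct values, decreasing order).
    vals = sorted(set(p.values()), reverse=True)
    total = 0.0
    for k, v in enumerate(vals):
        U = {i for i in p if p[i] >= v}
        total += (v - vals[k + 1] if k + 1 < len(vals) else v) * f(U)
    return total

# The fused penalty should coincide with the Lovász extension of the cut function.
random.seed(0)
for _ in range(100):
    w = {i: random.uniform(-1, 1) for i in V}
    assert abs(fused(w) - lovasz(cut, w)) < 1e-9
print("fused penalty == Lovász extension of the cut function (on random points)")
```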

Page 34:

Ex. of Structured Reg. by the Lovász Ext.

Group Regularization: given a set of groups $\mathcal{G}$, where each element $g \in \mathcal{G}$ is a subset of $V$, it regularizes variables so that the variables in each group tend to be zero simultaneously.

34

[Embedded figure from “Sparsity tutorial II” (ECML 2010, Barcelona), as on the Structured Regularization slide: sparsity patterns induced by the Lasso and by the (overlapping) Group Lasso; the variables in a group tend to be zero simultaneously.]

($\ell_\infty$) Group regularization:
$\Omega(w) = \sum_{g \in \mathcal{G}} d_g \|w_g\|_\infty$
This is equivalent to the Lovász extension of a coverage function:
$f(S) = \sum \{\, d_g : g \in \mathcal{G},\ g \cap S \ne \emptyset \,\}.$

Page 35:

Proximal Gradient Method

  For the optimization in structured regularized learning, the proximal gradient method is usually applied because the objective function is not differentiable.

35

$\min_{w \in \mathbb{R}^V} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, g(x_i; w)) + \lambda \cdot \hat{f}(w)$
(the loss term is differentiable convex; the regularizer $\lambda \hat{f}(w)$ is non-differentiable convex)

Update in the gradient descent method: a gradient step on the smooth loss followed by a proximal step on the regularizer, which is reduced to
$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|u - w\|_2^2 + \lambda \cdot \hat{f}(w) \qquad (u \in \mathbb{R}^d).$
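A generic sketch of the proximal gradient loop described above, for a least-squares loss. The prox operator below is soft-thresholding (the $\ell_1$ prox), used only as a runnable placeholder; for a structured penalty $\hat{f}$ it would be replaced by the MNP / parametric-flow computation discussed on the following slides. All names and data are made up.

```python
import numpy as np

def prox_l1(u, t):
    """Soft-thresholding, the prox of t*||.||_1 (placeholder for the structured prox)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def proximal_gradient(X, y, lam, prox=prox_l1, n_iter=300):
    """Minimize (1/2n)||Xw - y||^2 + lam * Omega(w) with a fixed step size."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n            # gradient step on the smooth part
        w = prox(w - step * grad, step * lam)   # proximal step on the non-smooth part
    return w

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
w_true = np.zeros(20); w_true[:3] = 1.0
y = X @ w_true + 0.01 * rng.standard_normal(50)
print(proximal_gradient(X, y, lam=0.1)[:5])
```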

Page 36:

Reduction to the MNP Problem 36

  Calculation of the proximal operator is equivalent to that of an MNP problem (a kind of duality) (Bach, 2013):

$\min_{w \in \mathbb{R}^d} \frac{1}{2}\|u - w\|_2^2 + \lambda \hat{f}(w)
= \min_{w \in \mathbb{R}^d} \max_{s \in B(f)} \frac{1}{2}\|u - w\|_2^2 + \lambda\, w^\top s$   (definition of the Lovász extension as the support function of the base polyhedron $B(f)$)
$= \max_{s \in B(f)} \min_{w \in \mathbb{R}^d} \frac{1}{2}\|u - w\|_2^2 + \lambda\, w^\top s$
$= \max_{s \in B(f)} \frac{1}{2}\|u\|_2^2 - \frac{1}{2}\|\lambda s - u\|_2^2$

This is the MNP problem for the base polyhedron of $f - \frac{1}{\lambda}u$:
$\min_{t \in B(f - \frac{1}{\lambda}u)} \|t\|_2^2,$
with the optimal solutions related by $w^* = -\lambda t^*$. ⇒ Solvable by the MNP algorithm.
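A brute-force sanity check of this duality on a two-element ground set, assuming the standard fact that $B(f)$ for $|V| = 2$ is the segment between its two greedy vertices; the grid search, the value of $u$, and $\lambda$ are illustration choices only (the function values are reused from the Lovász-extension example).

```python
import numpy as np

# Submodular f on V = {1, 2} (same values as the Lovász-extension example).
f = {(): 0.0, (1,): 0.8, (2,): 0.5, (1, 2): 0.2}

def f_hat(w):
    """Lovász extension for |V| = 2 (support function of B(f))."""
    hi = (1,) if w[0] >= w[1] else (2,)
    a, b = max(w), min(w)
    return (a - b) * f[hi] + b * f[(1, 2)]

u, lam = np.array([0.3, -0.2]), 0.5

# Primal: grid search over w of 0.5||u - w||^2 + lam * f_hat(w).
grid = np.linspace(-2.0, 2.0, 401)
primal = min(0.5 * np.sum((u - np.array([a, b])) ** 2) + lam * f_hat((a, b))
             for a in grid for b in grid)

# Dual: B(f) for two elements is the segment between its two greedy vertices.
v1 = np.array([f[(1,)], f[(1, 2)] - f[(1,)]])     # greedy order 1, 2
v2 = np.array([f[(1, 2)] - f[(2,)], f[(2,)]])     # greedy order 2, 1
dual = max(0.5 * np.dot(u, u) - 0.5 * np.sum((lam * s - u) ** 2)
           for t in np.linspace(0.0, 1.0, 2001)
           for s in [t * v1 + (1 - t) * v2])
print(primal, dual)   # the two values should (approximately) coincide
```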

Page 37:

Outline

  Brief review on a submodular function.

  Illustrative example of parametric submodular minimization (PSM) (densest k-subgraph problem).

  Structure of PSM and connection to the minimum-norm-point problem.

  Application to structured sparse coding.

  Fast implementation and some extensions.

37

Page 38:

Efficiently Solvable Subclass

  Cut function (quadratic submodular function) and generalized graph-cut function are subclasses that can be (parametrically) minimized much more efficiently.

38

Cost for minimization (equivalently, for parametric minimization), where $m$ is the number of edges and EO is the cost of one function evaluation:

–  General submodular func.: $O(|V|^5\,\mathrm{EO} + |V|^6)$ (Orlin, 2009)
–  Cut func.: $O(|V|\, m \log(|V|^2/m))$ etc. (Goldberg & Tarjan, 1986)
–  Generalized graph-cut func.: $O(|V \cup U|\, m \log(|V \cup U|^2/m))$ (cf. the generalized graph-cut slides below)

*) The minimum-norm-point algorithm (Fujishige+ 2006) runs more efficiently in practice, though its theoretical cost is unknown.

Page 39:

Generalized Graph-Cuts (1)

  A more general class of functions that can be minimized with a maximum-flow algorithm (Jegelka+ 2011, Nagano & Kawahara, 2013):

39

$f(S) = \min_{A \subseteq U} \sum \{\, c_e : e \in \mathrm{out}_G(\{s\} \cup S \cup A) \,\}$
where $\mathrm{out}_G(\cdot)$ is the set of edges heading out from the given nodes and $c_e$ is the capacity on each edge.

[Figure: a capacitated digraph with source $s$, sink $t$, ground-set nodes $V = \{1, 2, 3\}$, and auxiliary nodes $U = \{u_1, u_2, u_3\}$.]

Many submodular functions that appear in practice are of this form (including the structured norms described above). It is equivalent to a cut function with additional nodes $U$.
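A rough sketch of evaluating such a function with a single s-t minimum cut (networkx is used for the cut computation). The node names, edge set, and capacities are made up; the construction simply forces $S$ onto the source side and $V \setminus S$ onto the sink side with effectively infinite arcs, so the remaining freedom in the cut is exactly the minimization over $A \subseteq U$. A true parametric implementation would instead use a parametric flow algorithm, as on the next slide.

```python
import networkx as nx

# Hypothetical generalized graph-cut instance: ground set V, auxiliary nodes U,
# source s and sink t, with made-up capacities.
V, U = ["v1", "v2", "v3"], ["u1", "u2"]
G = nx.DiGraph()
G.add_edge("s", "v1", capacity=2.0)
G.add_edge("s", "v2", capacity=1.0)
G.add_edge("v1", "u1", capacity=1.5)
G.add_edge("v2", "u1", capacity=1.0)
G.add_edge("v3", "u2", capacity=2.0)
G.add_edge("u1", "t", capacity=2.0)
G.add_edge("u2", "t", capacity=1.0)
G.add_edge("v3", "t", capacity=0.5)

BIG = 1e9  # effectively infinite capacity, used to force nodes onto one side of the cut

def f(S):
    """f(S) = min over A ⊆ U of the total capacity leaving {s} ∪ S ∪ A,
    computed as one s-t minimum cut with S forced onto the source side
    and V \\ S forced onto the sink side."""
    H = G.copy()
    for v in S:
        H.add_edge("s", v, capacity=BIG)
    for v in set(V) - set(S):
        H.add_edge(v, "t", capacity=BIG)
    cut_value, _ = nx.minimum_cut(H, "s", "t")
    return cut_value

for S in [set(), {"v1"}, {"v1", "v2"}, set(V)]:
    print(sorted(S), f(S))
```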

Page 40:

Generalized Graph-Cuts (2)

  If the function $f$ is a generalized graph-cut function, then the parametric submodular optimization can be solved via a parametric flow algorithm ((Gallo+ 1989) etc.):

40

$\min_{S \subseteq V} \{\, f(S) - \alpha \cdot b(S) \,\}$

[Figure: the same capacitated digraph ($s$, $t$, $V = \{1, 2, 3\}$, $U = \{u_1, u_2, u_3\}$) augmented with parametric arcs of capacity $\alpha b_1$, $\alpha b_2$, $\alpha b_3$.]

The cost is the same as that of a single max-flow:
$O(|V \cup U|\, m \log(|V \cup U|^2 / m)).$

Page 41:

Empirical Example

  Comparison of computational time (DA: parametric optimization, MNP: minimum-norm-point, Homo.: homotopy algorithm, NFA: network-flow):

41

[Excerpt from the UAI’13 paper shown on the slide:]

Owing to the monotonicity of $G_{-\alpha}$ in $\alpha \ge 0$, we can utilize the parametric minimum cut algorithm [10] (see also [15]). As a result, the problem (12) can be solved in $O((n + |U|)|E| \log \frac{(n+|U|)^2}{|E|})$ time.

In view of Theorem 1, we can solve the convex minimization problem under constraints with respect to the structured submodular function $\gamma$,
$\min_x \left\{ \sum_{i \in V} w_i(x_i) : x \in B(\gamma) \right\},$
in $O(n(n + |U|)|E|)$ or $O((n + |U|)|E| \log \frac{(n+|U|)^2}{|E|})$ time for a number of separable convex objective functions.

7 Experimental results

We investigated the empirical performance of the proposed scheme using synthetic and real-world datasets. In Section 7.1, we compare the proposed method, applied to proximal methods for structured regularized least-squares regression, with the state-of-the-art algorithms. In Section 7.2, we apply the proposed algorithm to the densest subgraph problem for large real web-network data. The experiments below were run on a 2.3 GHz 64-bit workstation using Matlab with Mex implementations. We used SPAMS (SPArse Modeling Software) [footnote 2] for the implementations of the proximal methods in the first experiment.

7.1 Comparison in proximal methods

In the first experiment, we compared the proposed algorithm, applied to proximal methods, with the state-of-the-art algorithms. As the regularization term, we used the fused regularization $\Omega_{\mathrm{fused}}(\beta)$ and the group regularization (with $\ell_\infty$-norm) $\Omega_{\mathrm{group}}(\beta)$ (for a given set of groups $\mathcal{G}$), respectively represented as
$\Omega_{\mathrm{fused}}(\beta) = \sum_{i=1}^{n-1} |\beta_i - \beta_{i+1}|$ and $\Omega_{\mathrm{group}}(\beta) = \sum_{g \in \mathcal{G}} d_g \|\beta_g\|_\infty,$
where $d_g$ is the weight of the group $g$. As comparison partners, we used the proximal methods for the above regularizations: the one based on the homotopy algorithm for $\Omega_{\mathrm{fused}}(\beta)$ [6] (Homo.) and the one by Mairal et al. [17] for $\Omega_{\mathrm{group}}(\beta)$ (NFA), as well as the minimum-norm-point algorithm (MNP) for the calculation of the proximal operator. Since both regularizations can be represented as decomposable submodular functions, we applied the parametric flow algorithm for computing the proximal operators (DA).

We generated data as follows. For the evaluation with fused regularization, one feature is first selected randomly, and the next one is selected with probability 0.4 from each neighboring feature or with probability 0.2/(N − 2) from the remaining ones; this procedure is repeated until k features are selected. For group regularization, the features are covered by 20-200 overlapping groups of size 15. The causal features are chosen to be the union of 2 of these groups. Here, we assign weights dg = 2 to those causal groups and dg = 1 to all other groups. We then simulate N data points (x(i), y(i)), with y(i) = β⊤x(i) + ε, ε ∼ N(0, σ²), where β is 0 for non-causal features and normally distributed otherwise.

Since all methods calculate the same objectives in principle, here we report only the comparison of the empirical runtime. Table 1 shows the runtime of the algorithms for reaching a duality gap within 10⁻⁴, averaged over 20 datasets each. We can see that the algorithms based on the parametric-flow algorithm, including ours, run much faster than the others. Note that our scheme can be applied to a more general form of structured regularization (Eq. (8) for the graph-cut implementation) than $\Omega_{\mathrm{fused}}(\beta)$ and $\Omega_{\mathrm{group}}(\beta)$.

Table 1: Comparison of running time (seconds) for the proposed and existing methods.

Fused regularization:
n     | N     | k  | DA    | MNP     | Homo.
500   | 500   | 20 | 0.024 | 5.083   | 0.084
500   | 1,000 | 20 | 0.062 | 146.969 | 0.531
500   | 5,000 | 20 | 1.085 | —       | 32.676
1,000 | 500   | 20 | 0.019 | 3.891   | 0.058
1,000 | 1,000 | 20 | 0.059 | 98.310  | 0.266
1,000 | 5,000 | 20 | 1.064 | —       | 12.372

Group regularization:
n     | N     | k   | DA    | MNP     | NFA
500   | 500   | ~20 | 0.021 | 8.910   | 0.015
500   | 1,000 | ~20 | 0.056 | 280.117 | 0.052
500   | 5,000 | ~20 | 1.091 | —       | 1.112
1,000 | 500   | ~20 | 0.020 | 6.108   | 0.015
1,000 | 1,000 | ~20 | 0.054 | 198.010 | 0.051
1,000 | 5,000 | ~20 | 1.003 | —       | 0.896

7.2 Densest subgraphs in web graphs

In the second experiment, we applied the proposed algorithm to the densest subgraph problem using public web-graph and social-network datasets [footnote 3]. The characteristics of each dataset are shown in Table 2. Although the minimum-norm-point algorithm was applied to the same problem on one of the datasets (cnr-2000) in [21], the data was sub-sampled to 5,000 nodes due to its computational cost. In this experiment, however, we used the full datasets for the analyses, which was possible because our framework runs much more efficiently.

2 http://spams-devel.gforge.inria.fr
3 http://law.dsi.unimi.it/datasets.php


Page 42:

Outline

  Brief review on a submodular function.

  Illustrative example of parametric submodular minimization (PSM) (densest k-subgraph problem).

  Structure of PSM and connection to the minimum-norm-point problem.

  Application to structured sparse coding.

  Fast implementation and some extensions.

42

Page 43:

Multi-way Clustering (1)   K-way clustering:

–  This problem is known to be NP-hard.
–  It has a similar structure to the size-constrained submodular minimization, with the analogous parametric quantity defined over partitions (below).

43

$\min_{\mathcal{P}} f(\mathcal{P}) \Big(= \sum_{i=1}^{k} f(S_i)\Big) \quad \text{s.t.}\ |\mathcal{P}| = k$
over partitions $\mathcal{P} = \{S_1, \ldots, S_k\}$ of $V$, where $f$ is a non-negative submodular function.

$\alpha(\gamma) := \min_{\mathcal{P}} \left\{ \frac{f(\mathcal{P})}{|\mathcal{P}| - \gamma} : \mathcal{P} \text{ is a partition of } V,\ |\mathcal{P}| > \gamma \right\}$

Page 44:

Multi-way Clustering (2)   K-way clustering:

44

$\min_{\mathcal{P}} f(\mathcal{P}) \Big(= \sum_{i=1}^{k} f(S_i)\Big) \quad \text{s.t.}\ |\mathcal{P}| = k$
over partitions $\mathcal{P} = \{S_1, \ldots, S_k\}$ of $V$, with $f$ a non-negative submodular function.

As before, consider
$h(\alpha) = \min_{\mathcal{P}} \{\, f(\mathcal{P}) - \alpha|\mathcal{P}| \,\}.$

[Figure: $h(\alpha)$ is piecewise linear in $\alpha$ with pieces $f(\mathcal{P}_0) - \alpha|\mathcal{P}_0|$, $f(\mathcal{P}_1) - \alpha|\mathcal{P}_1|$, $f(\mathcal{P}_2) - \alpha|\mathcal{P}_2|$, $f(\mathcal{P}_3) - \alpha|\mathcal{P}_3|$, and $\alpha(\gamma)$ is its intersection with $-\gamma\alpha$.]

Solvable by the decomposition algorithm.
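A brute-force sketch of the partition-level parametric sweep; exhaustive enumeration of partitions stands in for the decomposition algorithm, and the ground set and the cut-function weights are made up for illustration.

```python
# Toy ground set and a nonnegative submodular f (a cut function), illustration only.
V = [1, 2, 3, 4, 5]
w = {(1, 2): 3.0, (2, 3): 0.5, (3, 4): 3.0, (4, 5): 0.5, (1, 5): 0.5}
f = lambda S: sum(c for (i, j), c in w.items() if (i in S) != (j in S))

def partitions(items):
    """Enumerate all partitions of a list (exponential; fine for a toy example)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in partitions(rest):
        for k in range(len(smaller)):
            yield smaller[:k] + [smaller[k] | {first}] + smaller[k + 1:]
        yield smaller + [{first}]

def best_partition(alpha):
    """Brute-force minimizer of f(P) - alpha*|P| (stand-in for the decomposition algorithm)."""
    cost = lambda P: sum(f(S) for S in P) - alpha * len(P)
    return min(partitions(V), key=cost)

for alpha in [0.0, 1.0, 2.5, 7.0]:
    P = best_partition(alpha)
    print(f"alpha={alpha}:  |P|={len(P)}  P={[sorted(S) for S in P]}")
```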

Page 45:

Multi-way Clustering (3)

  Empirical example:
–  Three circles with different radii plus a line (310 samples).

45

[Figure: the obtained clusterings at $\alpha = 0.87$, $\alpha = 3.22$, and $\alpha = 4.90$.]

Page 46:

Separable Convex Minimization

  More general non-differentiable convex problems can be solved via the MNP problem (Nagano & Kawahara, 2013).
  Also, a more general class of optimization problems on the base polyhedron can be reduced to the MNP problem.

46

Non-differentiable convex problem (with each $\psi_i$ convex and $\hat{f}$ the Lovász extension of a submodular function):
$\min_{w \in \mathbb{R}^d} \sum_{i \in V} \psi_i(w_i) + \hat{f}(w)$
⇒ reduced to the MNP problem
$\min_{x \in B(f)} \sum_{i \in V} \frac{x_i^2}{b_i}.$

Separable convex minimization over the base polyhedron (with each $w_i(\cdot)$ differentiable convex):
$\min_{x \in B(f)} \sum_{i \in V} w_i(x_i)$
⇒ also reduced to the MNP problem
$\min_{x \in B(f)} \sum_{i \in V} \frac{x_i^2}{b_i}.$

Page 47:

Conclusions

  Parametric submodular optimization appears in several situations when solving machine learning problems.
  Combined with a known efficient algorithm (such as a parametric flow algorithm), these problems can be solved fast in practice.

  Collaborators (on the works mentioned in this talk):
–  K. Nagano (Future Univ. of Hakodate)
–  S. Iwata (The Univ. of Tokyo)

–  K. Aihara (The Univ. of Tokyo)

47