TRANSCRIPT
TASK: SUBSET SELECTION
Two goals: relevance and diversity
Quality learning: learn a weight for each row of K (Kulesza and Taskar, UAI 2011)
Code + Data: https://code.google.com/p/em-for-dpps
Relevance only: [figure: example item set]
Relevance + diversity: [figure: example item set]
SIMILARITY KERNEL
DETERMINANTAL POINT PROCESS
DPP INFERENCE
Exact and efficient inference, $O(N^3)$ or faster:
• marginalization: $P(Y \subseteq \mathbf{Y}) = \det(K_Y)$
• normalized probabilities
• conditioning: $P(\mathbf{Y} = Y \mid A \subseteq \mathbf{Y})$, $P(\mathbf{Y} = Y \mid A \cap \mathbf{Y} = \emptyset)$
• sampling
But: DPP maximum-likelihood learning is (conjectured to be) NP-hard.
DPP LOG-LIKELIHOOD
EXPECTATION-MAXIMIZATION
Ground set of N items:
Training data (example subsets):
Expectation-Maximization for Learning Determinantal Point Processes
PRIOR WORK
Sum-of-experts model: weighted sum of fixed DPPs
Kulesza and Taskar (ICML 2011)
Parametric kernel: K assumed Gaussian, polynomial, etc.
Affandi et al. (ICML 2014)
This work: no fixed values or restrictive parameterizations for K
LOG-LIKELIHOOD IS NON-CONCAVE
BASELINE: PROJECTED GRADIENT
Training data:
PRODUCT RECOMMENDATION DATASET
LOWER-BOUNDING
COORDINATE ASCENT
E-STEP
M-STEP
LOG-LIKELIHOOD RESULTS
EX: “SAFETY” CATEGORY
VTech Comm. Audio Monitor
Infant Optics Video Monitor
Graco Sweet Slumber Sound
Boppy Noggin Nest Head Support
Cloud b Twilight Night Light
Braun ThermoScan Lens Filters
Britax EZ-Cling Sun Shades
TL Care Organic Cotton Mittens
Regalo Easy Step Walk Thru Gate
Aquatopia Bath Thermometer Alarm
~30,000 Amazon baby registries, 13 product categories
[figure: category icons, e.g. furniture, carseats, toys]
Left: 10-item registry for the “safety” category under the model learned by projected gradient. Right: The EM model replaces two redundant
items (crossed out) with two more diverse items.
Goal: learn a DPP for each category
RUNTIME RESULTS
NAIVE INITIALIZATION
MOMENTS-MATCHING INITIALIZATION
Draw starting point from a Wishart distribution with N degrees of freedom
Data-dependent starting point
$m_i = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}(i \in Y_t), \qquad m_{ij} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}(i \in Y_t \wedge j \in Y_t)$
$\tilde{K}_{ii} = m_i, \qquad \tilde{K}_{ij} = \sqrt{\max\big(\tilde{K}_{ii}\tilde{K}_{jj} - m_{ij},\, 0\big)}$
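In code, the moments-matching initializer above looks like the sketch below (the function name and data layout are illustrative, not from the poster):

```python
import numpy as np

def moments_matching_init(subsets, N):
    """Data-dependent starting kernel from first and second moments."""
    T = len(subsets)
    m = np.zeros(N)        # m_i  = fraction of example sets containing item i
    M = np.zeros((N, N))   # m_ij = fraction of example sets containing both i and j
    for Y in subsets:
        for i in Y:
            m[i] += 1.0
            for j in Y:
                if i != j:
                    M[i, j] += 1.0
    m /= T
    M /= T
    K = np.zeros((N, N))
    np.fill_diagonal(K, m)  # K~_ii = m_i
    for i in range(N):
        for j in range(N):
            if i != j:
                # K~_ij = sqrt(max(K~_ii K~_jj - m_ij, 0))
                K[i, j] = np.sqrt(max(m[i] * m[j] - M[i, j], 0.0))
    return K
```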
T1 = iterations to converge
T2 = iterations to find a step size
T = # of example sets
N = ground set size
Low-data setting: T = 2N
Using all data: EM and baseline are comparable
GENERATIVE DPP SAMPLING
Step 1: Eigendecompose $K = V \Lambda V^\top$.
Step 2: Sample "hidden" eigenvectors $J$: include each index $j$ independently with probability $\lambda_j$.
Step 3: Form $K^J = V^J \mathrm{diag}(\mathbf{1}) (V^J)^\top$, where $V^J$ stacks the selected eigenvectors.
Step 4: Sample subset $Y \sim \mathrm{DPP}(K^J)$.
[Figure: eigenvectors $v_1, \dots, v_6$ with eigenvalues $\lambda_1, \dots, \lambda_6$; the sampling process is illustrated over Steps 1–8.]
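The two phases above can be sketched in NumPy. This follows the standard spectral DPP sampler (as in Kulesza and Taskar's survey), not the authors' own implementation, and the projection update is a simple rather than numerically hardened variant:

```python
import numpy as np

def sample_dpp(K, rng):
    """Draw Y ~ DPP(K) via the two-phase generative procedure."""
    # Phase 1: eigendecompose K and keep eigenvector j with probability lambda_j.
    lam, V = np.linalg.eigh(K)
    keep = [j for j in range(len(lam)) if rng.random() < lam[j]]
    V = V[:, keep]
    Y = []
    # Phase 2: elementary DPP K^J = V^J (V^J)^T selects one item per kept vector.
    while V.shape[1] > 0:
        # P(pick item i) is proportional to the squared norm of row i of V.
        p = np.sum(V ** 2, axis=1)
        p = p / p.sum()
        i = int(rng.choice(len(p), p=p))
        Y.append(i)
        # Project remaining columns onto the subspace orthogonal to e_i,
        # then re-orthonormalize.
        j = int(np.argmax(np.abs(V[i, :])))
        V = V - np.outer(V[:, j], V[i, :] / V[i, j])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(Y)
```

A useful property for checking correctness: the marginal frequency of item $i$ across many samples should approach $K_{ii}$.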
Main idea: Exploit hidden variable J to develop EM-style optimization
$$L(K) = L(V,\Lambda) = \sum_{t=1}^{T} \log p_K(Y_t) = \sum_{t=1}^{T} \log\!\left(\sum_{J} p_K(J, Y_t)\right) = \sum_{t=1}^{T} \log\!\left(\sum_{J} q(J \mid Y_t)\,\frac{p_K(J, Y_t)}{q(J \mid Y_t)}\right) \geq \sum_{t=1}^{T} \sum_{J} q(J \mid Y_t)\,\log\!\left(\frac{p_K(J, Y_t)}{q(J \mid Y_t)}\right) \equiv F(q, V, \Lambda)$$
E-step: $\max_q F = \min_q \sum_{t=1}^{T} \mathrm{KL}\big(q(J \mid Y_t)\,\|\,p_K(J \mid Y_t)\big)$
M-step: $\max_{V,\Lambda} F = \max_{V,\Lambda} \sum_{t=1}^{T} \mathbb{E}_q[\log p_K(J, Y_t)]$ s.t. $0 \preceq \Lambda \preceq 1$, $V^\top V = I$
Set distributions equal to minimize KL: $q(J \mid Y_t) = p_K(J \mid Y_t)$
Theorem: If $p_K(Y \mid J)$ is a DPP distribution compactly parameterized by $K$, then $p_K(J \mid Y)$ is also a DPP distribution with a related compact parameterization computable in at most $O(N^3)$ time.
The eigenvalue update is closed-form. Constraints are implicitly enforced, since marginals of q are in [0,1]. (No need for a projection step.)
Because q is a DPP, the marginalization required for this update is efficient.
$O(TNk^2)$, for $k = \max_t |Y_t|$
The eigenvector update is more complicated, but still efficient. The constraints correspond to optimization over a Stiefel manifold, and Edelman et al. (SIMAX 1998) show that the update below stays on the manifold.
[Figure: similarity-kernel illustration over baby products (Infant Optics Video Monitor, Motorola Video Monitor, Summer Infant Video Monitor, Aquatopia Bath Thermometer Alarm, Regalo Easy Step Walk Thru Gate), with displayed kernel entries 0.5, 0.38, 0.38, 0.4, 0.3, and 0.0 for dissimilar pairs.]
$P(\mathbf{Y} = Y) = |\det(K - I_{\bar{Y}})|$
Example: product recommendation
$\mathbf{Y} \sim \mathrm{DPP}(K) \implies P(Y \subseteq \mathbf{Y}) = \det(K_Y)$
$$L(K) = \sum_{t=1}^{T} \log\big(|\det(K - I_{\bar{Y}_t})|\big)$$
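This objective is direct to evaluate in NumPy; the sketch below builds $I_{\bar{Y}_t}$ as the identity with zeros on the diagonal entries indexed by $Y_t$ (the function name is illustrative):

```python
import numpy as np

def dpp_log_likelihood(K, subsets):
    # L(K) = sum_t log |det(K - I_{Ybar_t})|
    N = K.shape[0]
    total = 0.0
    for Y in subsets:
        I_bar = np.eye(N)
        I_bar[Y, Y] = 0.0  # zero the diagonal on items in Y
        total += np.log(abs(np.linalg.det(K - I_bar)))
    return total
```

As a sanity check: for $K = \mathrm{diag}(0.5, 0.3)$ and the single example set $\{1\}$, $\exp(L) = 0.5 \cdot 0.7 = 0.35$, the probability of drawing exactly item 1.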
[Figure: log-likelihood versus $\eta$ for $K = \begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 0.4 \end{pmatrix} + \eta \begin{pmatrix} 0.1 & 0.6 \\ 0.6 & -0.1 \end{pmatrix}$, $\eta \in [0, 0.5]$; the curve (values roughly $-1.49$ to $-1.47$) is visibly non-concave. Eigenvalues $\lambda_1, \lambda_2$ are annotated.]
Gradient step: $K^{(1)} = K^{(0)} + \eta\, \frac{\partial L(K^{(0)})}{\partial K}$
Gradient: $\frac{\partial L(K)}{\partial K} = \sum_{t=1}^{T} (K - I_{\bar{Y}_t})^{-1}$
Project to satisfy constraints on the eigenvalues of $K$: $0 \leq \lambda_i \leq 1$
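One iteration of this baseline can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the projection simply clips the eigenvalues of the updated kernel into $[0, 1]$:

```python
import numpy as np

def projected_gradient_step(K, subsets, eta):
    """One step: ascend the gradient of L(K), then project eigenvalues to [0, 1]."""
    N = K.shape[0]
    grad = np.zeros((N, N))
    for Y in subsets:
        I_bar = np.eye(N)
        I_bar[Y, Y] = 0.0                 # identity on the complement of Y
        grad += np.linalg.inv(K - I_bar)  # d/dK of log|det(K - I_Ybar)|
    K_new = K + eta * grad
    # Projection: eigendecompose and clip eigenvalues to [0, 1].
    lam, V = np.linalg.eigh(K_new)
    lam = np.clip(lam, 0.0, 1.0)
    return V @ np.diag(lam) @ V.T
```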
[Bar charts: relative log-likelihood difference, 100(EM - projgrad) / |projgrad|, for each of the 13 categories (safety, furniture, carseats, strollers, health, bath, media, toys, bedding, apparel, diaper, gear, feeding). One chart's values range from 0.0 to 11.0, the other's from -0.1 to 16.5; EM matches or beats the projected-gradient baseline on nearly every category.]
[Bar chart: speedup factor (projgrad runtime / EM runtime) for each category; values range from 0.4 to 7.4.]
$$P(Y) = |\det(K - I_{\bar{Y}})| = \sum_{J \subseteq \{1, \dots, N\}} P_{K^J}(Y) \prod_{j \in J} \lambda_j \prod_{j \notin J} (1 - \lambda_j) = \sum_{J \subseteq \{1, \dots, N\}} p_K(Y \mid J)\, p_K(J)$$
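This mixture decomposition can be verified numerically by brute-force enumeration on a small kernel (the entries below are illustrative). Since an elementary DPP $K^J$ puts all of its mass on sets of size $|J|$, only terms with $|J| = |Y|$ contribute:

```python
import numpy as np
from itertools import combinations

K = np.array([[0.5, 0.38, 0.0],
              [0.38, 0.4, 0.0],
              [0.0, 0.0, 0.3]])
N = K.shape[0]
lam, V = np.linalg.eigh(K)

def exact_prob(Y):
    # Left-hand side: |det(K - I_Ybar)|.
    I_bar = np.eye(N)
    I_bar[Y, Y] = 0.0
    return abs(np.linalg.det(K - I_bar))

def mixture_prob(Y):
    # Right-hand side: elementary DPPs with |J| = |Y|, where P(Y) = det([K^J]_Y).
    total = 0.0
    for J in combinations(range(N), len(Y)):
        VJ = V[:, list(J)]
        KJ = VJ @ VJ.T  # K^J = V^J (V^J)^T
        weight = np.prod(lam[list(J)]) * np.prod(1 - np.delete(lam, list(J)))
        total += np.linalg.det(KJ[np.ix_(Y, Y)]) * weight
    return total

for Y in ([0], [0, 2], [0, 1, 2]):
    assert np.isclose(exact_prob(Y), mixture_prob(Y))
```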
EM: $O(T_1 T_2 N^3 + T_1 T N k^2)$
Proj. grad.: $O(T_1 T_2 N^3 + T_1 T N^3)$
$$\lambda_j = \frac{1}{T} \sum_{t=1}^{T} \sum_{J : j \in J} q(J \mid Y_t) = \frac{1}{T} \sum_{t=1}^{T} q(j \in J \mid Y_t)$$
$F$ is: concave in $q$; concave in $\Lambda$; not concave in $V$.
$$V' \leftarrow V \exp\!\left[\eta \left(V^\top \frac{\partial L}{\partial V} - \left(\frac{\partial L}{\partial V}\right)^{\!\top} V\right)\right]$$
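A quick sanity check that this multiplicative update stays on the manifold: the exponential of a skew-symmetric matrix is orthogonal, so $V'^\top V' = I$ is preserved. The sketch below uses a random stand-in for $\partial L / \partial V$ and a simple series-based matrix exponential (all names and values are illustrative, not the authors' code):

```python
import numpy as np

def expm_series(A, terms=30):
    # Truncated Taylor series for the matrix exponential; adequate here
    # because the step eta*A is small.
    E = np.eye(A.shape[0])
    P = np.eye(A.shape[0])
    for k in range(1, terms):
        P = P @ A / k
        E = E + P
    return E

rng = np.random.default_rng(1)
N = 4
V, _ = np.linalg.qr(rng.standard_normal((N, N)))  # orthogonal starting V
G = rng.standard_normal((N, N))                   # stand-in for dL/dV
A = V.T @ G - G.T @ V                             # skew-symmetric direction
V_new = V @ expm_series(0.1 * A)                  # the update above, eta = 0.1
# Orthonormality is preserved without any projection step.
assert np.allclose(V_new.T @ V_new, np.eye(N))
```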
EIGENVALUES
EIGENVECTORS
$K = V \Lambda V^\top$
$$K_Y = [V \Lambda V^\top]_Y = V_Y \Lambda (V_Y)^\top \implies q(J \mid Y) \propto \det\!\left(\left[V_Y^\top V_Y\, \mathrm{diag}\!\left(\frac{\lambda}{1 - \lambda}\right)\right]_J\right)$$
[Figure: item images, including Motorola Video Monitor and Summer Infant Video Monitor, with the similar pair marked.]
$\mathbf{Y} \sim \mathrm{DPP}(K)$
$P(\{1, 2\} \subseteq \mathbf{Y}) = \det(K_{\{1,2\}}) = \det\begin{pmatrix} 0.5 & 0.38 \\ 0.38 & 0.4 \end{pmatrix} = 0.05$
$P(\{1, 3\} \subseteq \mathbf{Y}) = \det(K_{\{1,3\}}) = \det\begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 0.3 \end{pmatrix} = 0.15$
$P(\{1\} \subseteq \mathbf{Y}) = \det(K_{\{1\}}) = K_{11} = 0.5$
Constraints on $K$: $0 \preceq K \preceq I$
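These inclusion probabilities are easy to reproduce. The full $3 \times 3$ kernel below is an assumption assembled from the displayed $2 \times 2$ blocks, so the value for the two similar monitors comes out near, but not exactly at, the poster's rounded 0.05:

```python
import numpy as np

# Kernel assembled from the displayed blocks (assumed ordering: items 1 and 2
# are the similar video monitors, item 3 is a dissimilar product).
K = np.array([[0.5, 0.38, 0.0],
              [0.38, 0.4, 0.0],
              [0.0, 0.0, 0.3]])

def inclusion_prob(K, Y):
    # P(Y subset of the sampled set) = det(K_Y)
    return np.linalg.det(K[np.ix_(Y, Y)])

print(round(inclusion_prob(K, [0]), 2))     # 0.5
print(round(inclusion_prob(K, [0, 2]), 2))  # 0.15
print(round(inclusion_prob(K, [0, 1]), 2))  # small: similar items repel
```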