TRANSCRIPT
TASK: SUBSET SELECTION
Two goals: relevance and diversity
Quality learning: learn a weight for each row of K (Kulesza and Taskar, UAI 2011)
Code + Data: https://code.google.com/p/em-for-dpps
Relevance only: [figure: example item set]
Relevance + diversity: [figure: example item set]
SIMILARITY KERNEL
DETERMINANTAL POINT PROCESS
DPP INFERENCE
Exact and efficient inference, $O(N^3)$ or faster:
• marginalization: $P(Y \subseteq \mathbf{Y}) = \det(K_Y)$
• normalized probabilities
• conditioning: $P(\mathbf{Y} = Y \mid A \subseteq \mathbf{Y})$, $P(\mathbf{Y} = Y \mid A \cap \mathbf{Y} = \emptyset)$
• sampling
But: DPP maximum-likelihood learning is (conjectured to be) NP-hard.
DPP LOG-LIKELIHOOD
EXPECTATION-MAXIMIZATION
Ground set of N items:
Training data (example subsets):
Expectation-Maximization for Learning Determinantal Point Processes
PRIOR WORK
Sum-of-experts model: weighted sum of fixed DPPs
Kulesza and Taskar (ICML 2011)
Parametric kernel: K assumed Gaussian, polynomial, etc.
Affandi et al. (ICML 2014)
This work: no fixed values or restrictive parameterizations for K
LOG-LIKELIHOOD IS NON-CONCAVE
BASELINE: PROJECTED GRADIENT
Training data:
PRODUCT RECOMMENDATION DATASET
LOWER-BOUNDING
COORDINATE ASCENT
E-STEP
M-STEP
LOG-LIKELIHOOD RESULTS
EX: “SAFETY” CATEGORY
VTech Comm. Audio Monitor
Infant Optics Video Monitor
Graco Sweet Slumber Sound
Boppy Noggin Nest Head Support
Cloud b Twilight Night Light
Braun ThermoScan Lens Filters
Britax EZ-Cling Sun Shades
TL Care Organic Cotton Mittens
Regalo Easy Step Walk Thru Gate
Aquatopia Bath Thermometer Alarm
~30,000 Amazon baby registries, 13 product categories
[figure: category icons, e.g. furniture, carseats, toys]
Left: 10-item registry for the “safety” category under the model learned by projected gradient. Right: The EM model replaces two redundant
items (crossed out) with two more diverse items.
Goal: learn a DPP for each category
RUNTIME RESULTS
NAIVE INITIALIZATION
MOMENTS-MATCHING INITIALIZATION
Draw starting point from a Wishart distribution with N degrees of freedom
Data-dependent starting point
$m_i = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}(i \in Y_t), \qquad m_{ij} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}(i \in Y_t \wedge j \in Y_t)$
$\tilde{K}_{ii} = m_i, \qquad \tilde{K}_{ij} = \sqrt{\max\big(\tilde{K}_{ii}\tilde{K}_{jj} - m_{ij},\, 0\big)}$
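In code, the moments-matching initializer above looks like the sketch below (the function name and data layout are illustrative, not from the poster):

```python
import numpy as np

def moments_matching_init(subsets, N):
    """Data-dependent starting kernel from first and second moments."""
    T = len(subsets)
    m = np.zeros(N)        # m_i  = fraction of example sets containing item i
    M = np.zeros((N, N))   # m_ij = fraction of example sets containing both i and j
    for Y in subsets:
        for i in Y:
            m[i] += 1.0
            for j in Y:
                if i != j:
                    M[i, j] += 1.0
    m /= T
    M /= T
    K = np.zeros((N, N))
    np.fill_diagonal(K, m)  # K~_ii = m_i
    for i in range(N):
        for j in range(N):
            if i != j:
                # K~_ij = sqrt(max(K~_ii K~_jj - m_ij, 0))
                K[i, j] = np.sqrt(max(m[i] * m[j] - M[i, j], 0.0))
    return K
```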
T1 = iterations to converge
T2 = iterations to find a step size
T = # of example sets
N = ground set size
Low-data setting: T = 2N
Using all data: EM and baseline are comparable
GENERATIVE DPP SAMPLING
Step 1: Eigendecompose $K = V \Lambda V^\top$.
Step 2: Sample "hidden" eigenvectors $J$: include each index $j$ independently with probability $\lambda_j$.
Step 3: Form $K^J = V^J \mathrm{diag}(\mathbf{1}) (V^J)^\top$, where $V^J$ stacks the selected eigenvectors.
Step 4: Sample subset $Y \sim \mathrm{DPP}(K^J)$.
[Figure: eigenvectors $v_1, \dots, v_6$ with eigenvalues $\lambda_1, \dots, \lambda_6$; the sampling process is illustrated over Steps 1–8.]
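The two phases above can be sketched in NumPy. This follows the standard spectral DPP sampler (as in Kulesza and Taskar's survey), not the authors' own implementation, and the projection update is a simple rather than numerically hardened variant:

```python
import numpy as np

def sample_dpp(K, rng):
    """Draw Y ~ DPP(K) via the two-phase generative procedure."""
    # Phase 1: eigendecompose K and keep eigenvector j with probability lambda_j.
    lam, V = np.linalg.eigh(K)
    keep = [j for j in range(len(lam)) if rng.random() < lam[j]]
    V = V[:, keep]
    Y = []
    # Phase 2: elementary DPP K^J = V^J (V^J)^T selects one item per kept vector.
    while V.shape[1] > 0:
        # P(pick item i) is proportional to the squared norm of row i of V.
        p = np.sum(V ** 2, axis=1)
        p = p / p.sum()
        i = int(rng.choice(len(p), p=p))
        Y.append(i)
        # Project remaining columns onto the subspace orthogonal to e_i,
        # then re-orthonormalize.
        j = int(np.argmax(np.abs(V[i, :])))
        V = V - np.outer(V[:, j], V[i, :] / V[i, j])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(Y)
```

A useful property for checking correctness: the marginal frequency of item $i$ across many samples should approach $K_{ii}$.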
Main idea: Exploit hidden variable J to develop EM-style optimization
$$L(K) = L(V,\Lambda) = \sum_{t=1}^{T} \log p_K(Y_t) = \sum_{t=1}^{T} \log\!\left(\sum_{J} p_K(J, Y_t)\right) = \sum_{t=1}^{T} \log\!\left(\sum_{J} q(J \mid Y_t)\,\frac{p_K(J, Y_t)}{q(J \mid Y_t)}\right) \geq \sum_{t=1}^{T} \sum_{J} q(J \mid Y_t)\,\log\!\left(\frac{p_K(J, Y_t)}{q(J \mid Y_t)}\right) \equiv F(q, V, \Lambda)$$
E-step: $\max_q F = \min_q \sum_{t=1}^{T} \mathrm{KL}\big(q(J \mid Y_t)\,\|\,p_K(J \mid Y_t)\big)$
M-step: $\max_{V,\Lambda} F = \max_{V,\Lambda} \sum_{t=1}^{T} \mathbb{E}_q[\log p_K(J, Y_t)]$ s.t. $0 \preceq \Lambda \preceq 1$, $V^\top V = I$
Set distributions equal to minimize KL: $q(J \mid Y_t) = p_K(J \mid Y_t)$
Theorem: If $p_K(Y \mid J)$ is a DPP distribution compactly parameterized by $K$, then $p_K(J \mid Y)$ is also a DPP distribution with a related compact parameterization computable in at most $O(N^3)$ time.
The eigenvalue update is closed-form. Constraints are implicitly enforced, since marginals of q are in [0,1]. (No need for a projection step.)
Because q is a DPP, the marginalization required for this update is efficient.
$O(TNk^2)$, for $k = \max_t |Y_t|$
The eigenvector update is more complicated, but still efficient. The constraints correspond to optimization over a Stiefel manifold, and Edelman et al. (SIMAX 1998) show that the update below stays on the manifold.
[Figure: similarity-kernel illustration over baby products (Infant Optics Video Monitor, Motorola Video Monitor, Summer Infant Video Monitor, Aquatopia Bath Thermometer Alarm, Regalo Easy Step Walk Thru Gate), with displayed kernel entries 0.5, 0.38, 0.38, 0.4, 0.3, and 0.0 for dissimilar pairs.]
$P(\mathbf{Y} = Y) = |\det(K - I_{\bar{Y}})|$
Example: product recommendation
$\mathbf{Y} \sim \mathrm{DPP}(K) \implies P(Y \subseteq \mathbf{Y}) = \det(K_Y)$
$$L(K) = \sum_{t=1}^{T} \log\big(|\det(K - I_{\bar{Y}_t})|\big)$$
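This objective is direct to evaluate in NumPy; the sketch below builds $I_{\bar{Y}_t}$ as the identity with zeros on the diagonal entries indexed by $Y_t$ (the function name is illustrative):

```python
import numpy as np

def dpp_log_likelihood(K, subsets):
    # L(K) = sum_t log |det(K - I_{Ybar_t})|
    N = K.shape[0]
    total = 0.0
    for Y in subsets:
        I_bar = np.eye(N)
        I_bar[Y, Y] = 0.0  # zero the diagonal on items in Y
        total += np.log(abs(np.linalg.det(K - I_bar)))
    return total
```

As a sanity check: for $K = \mathrm{diag}(0.5, 0.3)$ and the single example set $\{1\}$, $\exp(L) = 0.5 \cdot 0.7 = 0.35$, the probability of drawing exactly item 1.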
[Figure: log-likelihood versus $\eta$ for $K = \begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 0.4 \end{pmatrix} + \eta \begin{pmatrix} 0.1 & 0.6 \\ 0.6 & -0.1 \end{pmatrix}$, $\eta \in [0, 0.5]$; the curve (values roughly $-1.49$ to $-1.47$) is visibly non-concave. Eigenvalues $\lambda_1, \lambda_2$ are annotated.]
Gradient step: $K^{(1)} = K^{(0)} + \eta\, \frac{\partial L(K^{(0)})}{\partial K}$
Gradient: $\frac{\partial L(K)}{\partial K} = \sum_{t=1}^{T} (K - I_{\bar{Y}_t})^{-1}$
Project to satisfy constraints on the eigenvalues of $K$: $0 \leq \lambda_i \leq 1$
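One iteration of this baseline can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the projection simply clips the eigenvalues of the updated kernel into $[0, 1]$:

```python
import numpy as np

def projected_gradient_step(K, subsets, eta):
    """One step: ascend the gradient of L(K), then project eigenvalues to [0, 1]."""
    N = K.shape[0]
    grad = np.zeros((N, N))
    for Y in subsets:
        I_bar = np.eye(N)
        I_bar[Y, Y] = 0.0                 # identity on the complement of Y
        grad += np.linalg.inv(K - I_bar)  # d/dK of log|det(K - I_Ybar)|
    K_new = K + eta * grad
    # Projection: eigendecompose and clip eigenvalues to [0, 1].
    lam, V = np.linalg.eigh(K_new)
    lam = np.clip(lam, 0.0, 1.0)
    return V @ np.diag(lam) @ V.T
```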
[Bar charts: relative log-likelihood difference, 100(EM - projgrad) / |projgrad|, for each of the 13 categories (safety, furniture, carseats, strollers, health, bath, media, toys, bedding, apparel, diaper, gear, feeding). One chart's values range from 0.0 to 11.0, the other's from -0.1 to 16.5; EM matches or beats the projected-gradient baseline on nearly every category.]
[Bar chart: speedup factor (projgrad runtime / EM runtime) for each category; values range from 0.4 to 7.4.]
$$P(Y) = |\det(K - I_{\bar{Y}})| = \sum_{J \subseteq \{1, \dots, N\}} P_{K^J}(Y) \prod_{j \in J} \lambda_j \prod_{j \notin J} (1 - \lambda_j) = \sum_{J \subseteq \{1, \dots, N\}} p_K(Y \mid J)\, p_K(J)$$
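This mixture decomposition can be verified numerically by brute-force enumeration on a small kernel (the entries below are illustrative). Since an elementary DPP $K^J$ puts all of its mass on sets of size $|J|$, only terms with $|J| = |Y|$ contribute:

```python
import numpy as np
from itertools import combinations

K = np.array([[0.5, 0.38, 0.0],
              [0.38, 0.4, 0.0],
              [0.0, 0.0, 0.3]])
N = K.shape[0]
lam, V = np.linalg.eigh(K)

def exact_prob(Y):
    # Left-hand side: |det(K - I_Ybar)|.
    I_bar = np.eye(N)
    I_bar[Y, Y] = 0.0
    return abs(np.linalg.det(K - I_bar))

def mixture_prob(Y):
    # Right-hand side: elementary DPPs with |J| = |Y|, where P(Y) = det([K^J]_Y).
    total = 0.0
    for J in combinations(range(N), len(Y)):
        VJ = V[:, list(J)]
        KJ = VJ @ VJ.T  # K^J = V^J (V^J)^T
        weight = np.prod(lam[list(J)]) * np.prod(1 - np.delete(lam, list(J)))
        total += np.linalg.det(KJ[np.ix_(Y, Y)]) * weight
    return total

for Y in ([0], [0, 2], [0, 1, 2]):
    assert np.isclose(exact_prob(Y), mixture_prob(Y))
```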
EM: $O(T_1 T_2 N^3 + T_1 T N k^2)$
Proj. grad.: $O(T_1 T_2 N^3 + T_1 T N^3)$
$$\lambda_j = \frac{1}{T} \sum_{t=1}^{T} \sum_{J : j \in J} q(J \mid Y_t) = \frac{1}{T} \sum_{t=1}^{T} q(j \in J \mid Y_t)$$
$F$ is: concave in $q$; concave in $\Lambda$; not concave in $V$.
$$V' \leftarrow V \exp\!\left[\eta \left(V^\top \frac{\partial L}{\partial V} - \left(\frac{\partial L}{\partial V}\right)^{\!\top} V\right)\right]$$
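A quick sanity check that this multiplicative update stays on the manifold: the exponential of a skew-symmetric matrix is orthogonal, so $V'^\top V' = I$ is preserved. The sketch below uses a random stand-in for $\partial L / \partial V$ and a simple series-based matrix exponential (all names and values are illustrative, not the authors' code):

```python
import numpy as np

def expm_series(A, terms=30):
    # Truncated Taylor series for the matrix exponential; adequate here
    # because the step eta*A is small.
    E = np.eye(A.shape[0])
    P = np.eye(A.shape[0])
    for k in range(1, terms):
        P = P @ A / k
        E = E + P
    return E

rng = np.random.default_rng(1)
N = 4
V, _ = np.linalg.qr(rng.standard_normal((N, N)))  # orthogonal starting V
G = rng.standard_normal((N, N))                   # stand-in for dL/dV
A = V.T @ G - G.T @ V                             # skew-symmetric direction
V_new = V @ expm_series(0.1 * A)                  # the update above, eta = 0.1
# Orthonormality is preserved without any projection step.
assert np.allclose(V_new.T @ V_new, np.eye(N))
```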
EIGENVALUES
EIGENVECTORS
$K = V \Lambda V^\top$
$$K_Y = [V \Lambda V^\top]_Y = V_Y \Lambda (V_Y)^\top \implies q(J \mid Y) \propto \det\!\left(\left[V_Y^\top V_Y\, \mathrm{diag}\!\left(\frac{\lambda}{1 - \lambda}\right)\right]_J\right)$$
[Figure: item images, including Motorola Video Monitor and Summer Infant Video Monitor, with the similar pair marked.]
$\mathbf{Y} \sim \mathrm{DPP}(K)$
$P(\{1, 2\} \subseteq \mathbf{Y}) = \det(K_{\{1,2\}}) = \det\begin{pmatrix} 0.5 & 0.38 \\ 0.38 & 0.4 \end{pmatrix} = 0.05$
$P(\{1, 3\} \subseteq \mathbf{Y}) = \det(K_{\{1,3\}}) = \det\begin{pmatrix} 0.5 & 0.0 \\ 0.0 & 0.3 \end{pmatrix} = 0.15$
$P(\{1\} \subseteq \mathbf{Y}) = \det(K_{\{1\}}) = K_{11} = 0.5$
Constraints on $K$: $0 \preceq K \preceq I$
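These inclusion probabilities are easy to reproduce. The full $3 \times 3$ kernel below is an assumption assembled from the displayed $2 \times 2$ blocks, so the value for the two similar monitors comes out near, but not exactly at, the poster's rounded 0.05:

```python
import numpy as np

# Kernel assembled from the displayed blocks (assumed ordering: items 1 and 2
# are the similar video monitors, item 3 is a dissimilar product).
K = np.array([[0.5, 0.38, 0.0],
              [0.38, 0.4, 0.0],
              [0.0, 0.0, 0.3]])

def inclusion_prob(K, Y):
    # P(Y subset of the sampled set) = det(K_Y)
    return np.linalg.det(K[np.ix_(Y, Y)])

print(round(inclusion_prob(K, [0]), 2))     # 0.5
print(round(inclusion_prob(K, [0, 2]), 2))  # 0.15
print(round(inclusion_prob(K, [0, 1]), 2))  # small: similar items repel
```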