Principled Regularization for Probabilistic Matrix Factorization
Robert Bell, Suhrid BalakrishnanAT&T Labs-Research
Duke Workshop on Sensing and Analysis of High-Dimensional Data
July 26-28, 2011
Probabilistic Matrix Factorization (PMF)
• Approximate a large n-by-m matrix R by
– M = P′Q
– P and Q each have k rows, k << n, m
– m_ui = p_u′q_i
– R may be sparsely populated
• Prime tool in Netflix Prize
– 99% of ratings were missing
Regularization for PMF
• Needed to avoid overfitting
– Even after limiting rank of M
– Critical for sparse, imbalanced data
• Penalized least squares
– Minimize
  Σ_{(u,i) observed} (r_ui − p_u′q_i)² + λ (Σ_u ‖p_u‖² + Σ_i ‖q_i‖²)
– or
  Σ_{(u,i) observed} (r_ui − p_u′q_i)² + λ_P Σ_u ‖p_u‖² + λ_Q Σ_i ‖q_i‖²
– λ's selected by cross validation
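The penalized least-squares objective above is typically minimized by alternating ridge regressions: with Q fixed, each p_u has a closed-form update, and symmetrically for each q_i. A minimal NumPy sketch (hypothetical names `pmf_als` and `R_obs`; this is an illustration of the single-λ objective, not the authors' code):

```python
import numpy as np

def pmf_als(R_obs, n, m, k, lam, n_iters=50, seed=0):
    """Alternating least squares for the penalized PMF objective:
    sum over observed (u, i) of (r_ui - p_u' q_i)^2
      + lam * (sum_u ||p_u||^2 + sum_i ||q_i||^2).
    R_obs is a dict mapping (u, i) -> rating."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n, k))
    Q = 0.1 * rng.standard_normal((m, k))
    # Index the observed entries by user and by item once up front.
    by_user = {u: [] for u in range(n)}
    by_item = {i: [] for i in range(m)}
    for (u, i), r in R_obs.items():
        by_user[u].append((i, r))
        by_item[i].append((u, r))
    I = np.eye(k)
    for _ in range(n_iters):
        for u, obs in by_user.items():      # ridge regression per user
            if obs:
                Qi = Q[[i for i, _ in obs]]
                r = np.array([r for _, r in obs])
                P[u] = np.linalg.solve(Qi.T @ Qi + lam * I, Qi.T @ r)
        for i, obs in by_item.items():      # ridge regression per item
            if obs:
                Pu = P[[u for u, _ in obs]]
                r = np.array([r for _, r in obs])
                Q[i] = np.linalg.solve(Pu.T @ Pu + lam * I, Pu.T @ r)
    return P, Q
```

Each inner solve is an ordinary ridge regression, which is why λ is selected by cross validation just as in penalized linear models.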
Research Questions
• Should we use separate λ_P and λ_Q?
Research Questions
• Should we use separate λ_P and λ_Q?
• Should we use k separate λ's for each dimension of P and Q?
Matrix Completion with Noise (Candes and Plan, Proc IEEE, 2010)
• Rank reduction without explicit factors
– No pre-specification of k, rank(M)
• Regularization applied directly to M
– Trace norm, aka nuclear norm
– Sum of the singular values of M
• Minimize ‖M‖_* subject to Σ_{(u,i) observed} (r_ui − m_ui)² ≤ δ
• "Equivalent" to L2 regularization for P, Q
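A short NumPy illustration of the trace norm, and of singular-value soft-thresholding, the proximal step that generic trace-norm solvers iterate (this is a sketch of the standard operator, not Candes and Plan's specific algorithm; the function names are made up):

```python
import numpy as np

def nuclear_norm(M):
    # Trace (nuclear) norm: the sum of the singular values of M.
    return np.linalg.svd(M, compute_uv=False).sum()

def svt(M, tau):
    # Singular-value soft-thresholding: the proximal operator of
    # tau * ||.||_*. Shrinks every singular value by tau and floors
    # at zero, which is what drives the recovered rank down.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

Soft-thresholding all singular values by the same amount is the sense in which the trace norm cannot regularize different factor dimensions differently.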
Research Questions
• Should we use separate λ_P and λ_Q?
• Should we use k separate λ's for each dimension of P and Q?
• Should we use the trace norm for regularization?
Bayesian Matrix Factorization (BPMF) (Salakhutdinov and Mnih, ICML 2008)
• Let r_ui ~ N(p_u′q_i, σ²)
• No PMF-type regularization
• p_u ~ N(μ_P, Λ_P⁻¹) and q_i ~ N(μ_Q, Λ_Q⁻¹)
• Priors for σ², μ_P, μ_Q, Λ_P, Λ_Q
• Fit by Gibbs sampling
• Substantial reduction in prediction error relative to PMF with L2 regularization
Research Questions
• Should we use separate λ_P and λ_Q?
• Should we use k separate reg. parameters for each dimension of P and Q?
• Should we use the trace norm for regularization?
• Does BPMF "regularize" appropriately?
Matrix Factorization with Biases
• Let m_ui = μ + a_u + b_i + p_u′q_i
• Regularization similar to before
– Minimize
  Σ_{(u,i) observed} (r_ui − m_ui)² + λ (Σ_u a_u² + Σ_i b_i² + Σ_u ‖p_u‖² + Σ_i ‖q_i‖²)
– or
  Σ_{(u,i) observed} (r_ui − m_ui)² + λ_a Σ_u a_u² + λ_b Σ_i b_i² + λ_P Σ_u ‖p_u‖² + λ_Q Σ_i ‖q_i‖²
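A literal transcription of the second objective, with separate λ's for biases and factors, makes the bookkeeping concrete (the helper name `biased_objective` is hypothetical):

```python
import numpy as np

def biased_objective(R_obs, mu, a, b, P, Q, lam_a, lam_b, lam_P, lam_Q):
    """Penalized least squares for the biased model
    m_ui = mu + a_u + b_i + p_u' q_i, with a separate lambda
    for each exchangeable parameter set (a, b, P, Q).
    R_obs is a dict mapping (u, i) -> rating."""
    sse = sum((r - (mu + a[u] + b[i] + P[u] @ Q[i])) ** 2
              for (u, i), r in R_obs.items())
    return (sse
            + lam_a * (a ** 2).sum() + lam_b * (b ** 2).sum()
            + lam_P * (P ** 2).sum() + lam_Q * (Q ** 2).sum())
```

Setting all four λ's equal recovers the first, single-λ objective as a special case.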
Research Questions
• Should we use separate λ_P and λ_Q?
• Should we use k separate reg. parameters for each dimension of P and Q?
• Should we use the trace norm for regularization?
• Does BPMF "regularize" appropriately?
• Should we use separate λ's for the biases?
Some Things this Talk Will Not Cover
• Various extensions of PMF
– Combining explicit and implicit feedback
– Time-varying factors
– Non-negative matrix factorization
– L1 regularization
– λ's depending on user or item sample sizes
• Efficiency of optimization algorithms
– Use Newton's method, each coordinate separately
– Iterate to convergence
No Need for Separate λ_P and λ_Q
• M = (cP)′(c⁻¹Q) is invariant for c ≠ 0
• For initial P and Q
– Solve for c to minimize λ_P ‖cP‖² + λ_Q ‖c⁻¹Q‖²
– c = (λ_Q ‖Q‖² / (λ_P ‖P‖²))^{1/4}
– Gives 2 (λ_P λ_Q)^{1/2} ‖P‖ ‖Q‖
• Sufficient to let λ_P = λ_Q = λ_PQ
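The rescaling argument is easy to verify numerically. The sketch below (hypothetical names `penalty` and `best_c`) checks that the closed-form c attains the stated minimum 2(λ_P λ_Q)^{1/2}‖P‖‖Q‖, which depends on λ_P and λ_Q only through their product:

```python
import numpy as np

def penalty(c, lam_P, lam_Q, P, Q):
    # lam_P * ||c P||_F^2 + lam_Q * ||c^{-1} Q||_F^2
    return lam_P * c**2 * (P**2).sum() + lam_Q * (Q**2).sum() / c**2

def best_c(lam_P, lam_Q, P, Q):
    # Closed-form minimizer of penalty over c:
    # c = (lam_Q ||Q||^2 / (lam_P ||P||^2))^{1/4}
    return (lam_Q * (Q**2).sum() / (lam_P * (P**2).sum())) ** 0.25
```

Since the fit term P′Q is unchanged by the rescaling, only the product λ_P λ_Q matters, so one shared λ_PQ suffices.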
Bayesian Motivation for L2 Regularization
• Simplest case: only one item
– R is n-by-1
– r_u1 = a_1 + ε_u1, a_1 ~ N(0, τ²), ε_u1 ~ N(0, σ²)
• Posterior mean (or MAP) of a_1 satisfies
– minimizes Σ_{u=1}^n (r_u1 − a_1)² + λ_a a_1²
– λ_a = σ²/τ²
– â_1 = (n/(n + λ_a)) r̄_1
• Best λ is inversely proportional to τ²
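The one-item calculation can be checked directly: the ridge minimizer with λ_a = σ²/τ² coincides with the Gaussian posterior mean (function names here are illustrative):

```python
import numpy as np

def ridge_bias(r, lam):
    # Closed-form minimizer of sum_u (r_u - a)^2 + lam * a^2:
    # a_hat = (n / (n + lam)) * mean(r)
    n = len(r)
    return n * np.mean(r) / (n + lam)

def posterior_mean(r, sigma2, tau2):
    # Posterior mean of a ~ N(0, tau2) given r_u = a + N(0, sigma2)
    # noise: precision-weighted shrinkage of the sample mean.
    n = len(r)
    return (n / sigma2) / (n / sigma2 + 1 / tau2) * np.mean(r)
```

The identity λ_a = σ²/τ² is what justifies different λ's for parameter sets with different prior variances.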
Implications for Regularization of PMF
• Allow λ_a ≠ λ_b
– If τ_a² ≠ τ_b²
• Allow λ_a ≠ λ_b ≠ λ_PQ
• Allow λ_PQ1 ≠ λ_PQ2 ≠ … ≠ λ_PQk?
– Trace norm does not
– BPMF appears to
Simulation Experiment Structure
• n = 2,500 users, m = 400 items
• 250,000 observed ratings
– 150,000 in Training (to estimate a, b, P, Q)
– 50,000 in Validation (to tune λ's)
– 50,000 in Test (to estimate MSE)
• Substantial imbalance in ratings
– 8 to 134 ratings per user in Training data
– 33 to 988 ratings per item in Training data
Simulation Model
• r_ui = a_u + b_i + p_u1 q_i1 + p_u2 q_i2 + ε_ui
• Elements of a, b, P, Q, and ε
– Independent normals with mean 0
– Var(a_u) = 0.09
– Var(b_i) = 0.16
– Var(p_u1 q_i1) = 0.04
– Var(p_u2 q_i2) = 0.01
– Var(ε_ui) = 1.00
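A sketch of a generator for this model. The slides fix only the variance of each factor product; since Var(XY) = Var(X)Var(Y) for independent zero-mean X and Y, the sketch splits each product variance evenly, giving each factor standard deviation Var(pq)^{1/4} (an assumption). It draws the full matrix; subsampling the 250,000 observed entries would follow:

```python
import numpy as np

def simulate_ratings(n=2500, m=400, seed=0):
    """Draw r_ui = a_u + b_i + p_u1 q_i1 + p_u2 q_i2 + eps_ui with the
    component variances from the simulation model (0.09, 0.16, 0.04,
    0.01, 1.00)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 0.3, n)               # Var(a_u)  = 0.09
    b = rng.normal(0.0, 0.4, m)               # Var(b_i)  = 0.16
    s1 = 0.04 ** 0.25                         # Var(p1*q1) = 0.04
    s2 = 0.01 ** 0.25                         # Var(p2*q2) = 0.01
    p1, q1 = rng.normal(0, s1, n), rng.normal(0, s1, m)
    p2, q2 = rng.normal(0, s2, n), rng.normal(0, s2, m)
    eps = rng.normal(0.0, 1.0, (n, m))        # Var(eps)  = 1.00
    return (a[:, None] + b[None, :]
            + np.outer(p1, q1) + np.outer(p2, q2) + eps)
```

The component variances sum to 1.30, so the overall rating variance of a simulated matrix should land near that value.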
Evaluation
• Test MSE for estimation of m_ui = E(r_ui)
– MSE = (1/|Test|) Σ_{(u,i) ∈ Test} (m̂_ui − m_ui)²
• Limitations
– Not real data
– Only one replication
– No standard errors
PMF Results for k = 0

Restrictions on λ's   | Values of λ_a, λ_b | MSE for m | ΔMSE
Grand mean; no (a, b) | NA                 | .2979     |
λ_a = λ_b = 0         | 0                  | .0712     | −.2267
λ_a = λ_b             | 9.32               | .0678     | −.0034
Separate λ_a, λ_b     | 9.26, 9.70         | .0678     | .0000
PMF Results for k = 1

Restrictions on λ's      | Values of λ_a, λ_b, λ_PQ1 | MSE for m | ΔMSE
Separate λ_a, λ_b        | 9.26, 9.70                | .0678     |
λ_a = λ_b = λ_PQ1        | 11.53                     | .0439     | −.0239
Separate λ_a, λ_b, λ_PQ1 | 8.50, 10.13, 13.44        | .0439     | .0000
PMF Results for k = 2

Restrictions on λ's             | Values of λ_a, λ_b, λ_PQ1, λ_PQ2 | MSE for m | ΔMSE
Separate λ_a, λ_b, λ_PQ1        | 8.50, 10.13, 13.44, NA           | .0439     |
λ_a, λ_b, λ_PQ1 = λ_PQ2         | 8.44, 9.94, 19.84, 19.84         | .0441     | +.0002
Separate λ_a, λ_b, λ_PQ1, λ_PQ2 | 8.43, 10.24, 13.38, 27.30        | .0428     | −.0013
Results for Matrix Completion
• Performs poorly on raw ratings
– MSE = .0693
– Not designed to estimate biases
• Fit to residuals from PMF with k = 0
– MSE = .0477
– "Recovered" rank was 1
– Worse than MSEs from PMF: .0428 to .0439
Results for BPMF
• Raw ratings
– MSE = .0498, using k = 3
– Early stopping
– Not designed to estimate biases
• Fit to residuals from PMF with k = 0
– MSE = .0433, using k = 2
– Near .0428, for best PMF w/ biases
Summary
• No need for separate λ_P and λ_Q
• Theory suggests using separate λ's for distinct sets of exchangeable parameters
– Biases vs. factors
– For individual factors
• Tentative simulation results support the need for separate λ's across factors
– BPMF does so automatically
– PMF requires a way to do efficient tuning