The Self-Normalized Estimator for Counterfactual Learning
Adith Swaminathan and Thorsten Joachims
Department of Computer Science, Cornell University

Setting: Batch learning from logged bandit feedback

Can we re-use the logs of interactive systems to reliably train them offline?

Use the logged data $\{(x_i, y_i, \delta_i)\}_{i=1}^{n}$ to find a good policy $h(y \mid x)$.
Challenge: Logs are biased and incomplete.

Approach: Importance sampling

Inject randomization in the system, $y_i \sim h_0(y \mid x_i)$. Log the propensities $p_i = h_0(y_i \mid x_i)$ [5].
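
As a minimal sketch (in Python/numpy; `h0_probs` and `actions` are illustrative names, not from the poster), the logging step amounts to sampling an action from the stochastic logging policy and recording its probability alongside the observed feedback:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_interaction(x, h0_probs, actions):
    """Sample y ~ h0(. | x) and record the propensity p = h0(y | x).

    h0_probs : probabilities of the logging policy h0(. | x), one per entry of `actions`
    """
    idx = rng.choice(len(actions), p=h0_probs)
    return actions[idx], h0_probs[idx]  # (logged action y_i, propensity p_i)
```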

$$\underbrace{\hat{R}(h)}_{\text{Risk of new policy } h} \;=\; \frac{1}{n}\sum_{i=1}^{n} \underbrace{\delta_i}_{\text{Feedback}} \;\underbrace{\frac{1}{h_0(y_i \mid x_i)}}_{\text{Propensity } p_i} \;\underbrace{h(y_i \mid x_i)}_{\text{New policy}}.$$
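
A minimal numpy sketch of this estimator, assuming the log is stored as flat arrays and that `h_new[i]` holds $h(y_i \mid x_i)$ for the policy being evaluated (the names are illustrative, not from the released code):

```python
import numpy as np

def ips_risk(delta, p, h_new):
    """Vanilla importance-sampling (IPS) estimate of the risk of a new policy h.

    delta : (n,) logged feedback values delta_i
    p     : (n,) logging propensities p_i = h0(y_i | x_i)
    h_new : (n,) new-policy probabilities h(y_i | x_i)
    """
    return np.mean(delta * h_new / p)
```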

Minimize an upper bound on the empirical risk [2].

$$h^{\mathrm{erm}} \;=\; \operatorname*{argmin}_{h \in \mathcal{H}} \;\underbrace{\hat{R}(h)}_{\text{Empirical risk}} \;+\; \lambda \underbrace{\operatorname{Reg}(h)}_{\text{Regularizer}}.$$

$\hat{R}(h)$ is unbiased but flawed.

Problem → Fix
Unbounded variance → Threshold the propensities [4] (sketched below).
Non-uniform variance → Use empirical variance regularizers [2].
Propensity overfitting → Deal-breaker.
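
For example, the propensity-thresholding fix of [4] can be sketched as clipping the importance weights at a constant `M` (the constant and the function name are our illustration, not the exact rule from [4]):

```python
import numpy as np

def clipped_ips_risk(delta, p, h_new, M=10.0):
    """IPS estimate with importance weights h/p truncated at M to bound the variance."""
    w = np.minimum(h_new / p, M)
    return np.mean(delta * w)
```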

Propensity overfitting

Self-Normalized estimator

Idea: Use importance sampling diagnostics to detect overfitting [1].

$$\hat{S}(h) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{h(y_i \mid x_i)}{p_i}, \qquad \forall h \in \mathcal{H}:\; \mathbb{E}\big[\hat{S}(h)\big] = 1.$$

Employ $\hat{S}(h)$ as a multiplicative control variate to get the self-normalized estimator [6].

$$\hat{R}^{\mathrm{sn}}(h) \;=\; \frac{\sum_{i=1}^{n} \delta_i\, h(y_i \mid x_i)/p_i}{\sum_{i=1}^{n} h(y_i \mid x_i)/p_i}.$$

$\hat{R}^{\mathrm{sn}}(h)$:
• is biased but asymptotically consistent,
• typically has lower variance than $\hat{R}(h)$,
• is equivariant.
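
Both the diagnostic $\hat{S}(h)$ and the self-normalized risk estimate are one-liners on the same logged arrays as in the IPS sketch above (again an illustrative sketch, not the released POEM code):

```python
import numpy as np

def control_variate(p, h_new):
    """S_hat(h): the mean importance weight, whose expectation is 1 for any policy h."""
    return np.mean(h_new / p)

def snips_risk(delta, p, h_new):
    """Self-normalized IPS estimate: an importance-weighted average of the feedback."""
    w = h_new / p
    return np.sum(delta * w) / np.sum(w)
```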

Norm-POEM: Normalized Policy Optimizer for Exponential Models

Exponential models assume: $h_w(y \mid x) \propto \exp\big(w \cdot \phi(x, y)\big)$.

$$w^{*} \;=\; \operatorname*{argmin}_{w}\; \hat{R}^{\mathrm{sn}}(h_w) \;+\; \lambda \sqrt{\frac{\widehat{\mathrm{Var}}\big(\hat{R}^{\mathrm{sn}}(h_w)\big)}{n}} \;+\; \mu \|w\|^2.$$

Non-convex optimization over $w$. Gradient-based optimization (e.g., L-BFGS) still works well.
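
A minimal sketch of this objective for a softmax (exponential-model) policy over a small discrete action set, minimized with SciPy's L-BFGS. The feature map `phi`, the delta-method plug-in for $\widehat{\mathrm{Var}}(\hat{R}^{\mathrm{sn}})$, and the hyperparameter values are assumptions for illustration and need not match the actual Norm-POEM implementation:

```python
import numpy as np
from scipy.optimize import minimize

def policy_prob(w, x, y, phi, actions):
    """h_w(y | x) for the exponential model h_w(y | x) ∝ exp(w · phi(x, y))."""
    scores = np.array([w @ phi(x, a) for a in actions])
    scores -= scores.max()                      # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[actions.index(y)]

def norm_poem_objective(w, xs, ys, deltas, ps, phi, actions, lam=0.1, mu=1e-3):
    """Self-normalized risk + variance regularizer + L2 penalty (the w* objective above)."""
    h = np.array([policy_prob(w, x, y, phi, actions) for x, y in zip(xs, ys)])
    wgt = h / ps                                # importance weights h_w(y_i|x_i) / p_i
    n = len(deltas)
    r_sn = np.sum(deltas * wgt) / np.sum(wgt)   # self-normalized risk estimate
    # Delta-method style plug-in for Var_hat(R_sn); an assumed, common choice.
    var_hat = np.mean((wgt * (deltas - r_sn)) ** 2) / (np.mean(wgt) ** 2)
    return r_sn + lam * np.sqrt(var_hat / n) + mu * np.dot(w, w)

# Non-convex in w; in practice L-BFGS from a random start works well, e.g.:
# w_star = minimize(norm_poem_objective, w0,
#                   args=(xs, ys, deltas, ps, phi, actions), method="L-BFGS-B").x
```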

For a simple Conditional Random Field prototype that can learn from logged bandit feedback to predict structured outputs, please visit http://www.cs.cornell.edu/~adith/poem.

Experiments

Supervised → Bandit conversion [3]. Multi-label classification with $\delta \equiv$ Hamming loss on four datasets.
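
A hedged sketch of this conversion, assuming multi-label examples with binary label vectors: sample a predicted labeling from the logging policy $h_0$, keep only its Hamming loss against the true labels, and record the propensity (the sampling details in the actual experiments may differ):

```python
import numpy as np

def supervised_to_bandit(x, y_true, h0_probs, candidate_labelings, rng):
    """Turn one supervised example into one logged bandit record <x, y, delta, p>.

    y_true              : binary label vector of the supervised example
    candidate_labelings : list of binary label vectors that h0 can propose
    h0_probs            : h0's probability for each candidate labeling, given x
    """
    idx = rng.choice(len(candidate_labelings), p=h0_probs)
    y = candidate_labelings[idx]
    delta = float(np.sum(y != y_true))      # Hamming loss of the sampled labeling
    return x, y, delta, h0_probs[idx]
```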

[Figure: Test Hamming loss of $h_0$, POEM, and Norm-POEM. Left panel: all four datasets (Scene, Yeast, LYRL, TMC). Right panel: Hamming loss on Yeast as the number of logged samples $n$ ($\times 1500$) grows from $2^0$ to $2^8$.]

Norm-POEM significantly outperforms the usual estimator (POEM) on all datasets.

[Figure: Value of $\hat{S}(h)$ for POEM and Norm-POEM on Scene, Yeast, LYRL, and TMC. Left panel: $\delta > 0$; right panel: $\delta < 0$.]

Norm-POEM indeed avoids overfitting $\hat{S}(h)$ and is equivariant.

Hamming loss   Norm-IPS   Norm-POEM
Scene          1.072      1.045
Yeast          3.905      3.876
TMC            3.609      2.072
LYRL           0.806      0.799

Time (s)             Scene   Yeast   TMC      LYRL
POEM (L-BFGS)        78.69   98.65   716.51   617.30
Norm-POEM (L-BFGS)    7.28   10.15   227.88   142.50
CRF (scikit-learn)    4.94    3.43    89.24    72.34

Norm-POEM still benefits from variance regularization and is quick to optimize.

Open questions

• What property (apart from equivariance) of an estimator ensures good optimization?
• Can we make a more informed bias-variance trade-off when constructing these estimators?
• How can we reliably optimize these objectives at scale?

References

[1] Adith Swaminathan and Thorsten Joachims. The self-normalized estimator for counterfactual learning. In NIPS, 2015.
[2] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In ICML, 2015.
[3] Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In KDD, 2009.
[4] Alexander L. Strehl, John Langford, Lihong Li, and Sham Kakade. Learning from logged implicit exploration data. In NIPS, 2010.
[5] Léon Bottou, Jonas Peters, Joaquin Q. Candela, Denis X. Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Y. Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(1):3207–3260, 2013.
[6] Tim Hesterberg. Weighted average importance sampling and defensive mixture distributions. Technometrics, 37:185–194, 1995.

Acknowledgment

This research was funded through NSF Awards IIS-1247637, IIS-1217686, and IIS-1513692, the JTCII Cornell-Technion Research Fund, and a gift from Bloomberg.