
Directed Regression
Yi-hao Kao, Benjamin Van Roy, and Xiang Yan
Stanford University

1 Introduction

When used to guide decisions, linear regression analysis typically involves estimation of regression coefficients and their subsequent use to make decisions. We propose directed regression, an efficient algorithm that accounts for the decision objective when computing regression coefficients. We also develop a theory that motivates this algorithm.

2 Regression for Decision

Suppose we are given a set of training data pairs O = {(x^(1), y^(1)), …, (x^(N), y^(N))}. The nth data pair is comprised of feature vectors x_1^(n), …, x_K^(n) ∈ R^M and a vector y^(n) ∈ R^M of response variables. We would like to compute regression coefficients r ∈ R^K so that

$$\mathrm{E}[\, y \mid x \,] = \sum_{k=1}^{K} r_k x_k .$$

Consider a setting where, each time we observe feature vectors x_1, …, x_K, we will have to select a decision u ∈ R^L before observing the response vector y. The decision incurs quadratic loss

$$\ell(u, y) = u^\top G_1 u - u^\top G_2 y .$$

We aim to minimize expected loss, assuming that the conditional expectation of y given x is indeed the linear combination described above. As such, we select the decision

$$u_r(x) = \operatorname*{arg\,min}_{u \in \mathbb{R}^L} \ \ell\Big(u, \sum_{k=1}^{K} r_k x_k\Big) = \frac{1}{2}\, G_1^{-1} G_2 \sum_{k=1}^{K} r_k x_k .$$

The question is how best to compute the regression coefficients r for this purpose.

Example. Consider an Internet banner ad campaign that targets M classes of customers. An average revenue of y_m is received per customer of class m that the campaign reaches. This quantity is random and influenced by K observable factors x_{1m}, …, x_{Km}. For each class m, we pay γ_m u_m^2 to reach u_m customers. It is natural to predict the response vector y using a linear combination of the factors x_k. The goal is to maximize expected revenue less advertising costs, which gives rise to the loss function

$$\ell(u, y) = \sum_{m=1}^{M} \big( \gamma_m u_m^2 - u_m y_m \big).$$
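The closed form of the decision rule follows from first-order optimality; a one-line derivation, assuming G_1 is symmetric and positive definite:

$$\nabla_u\, \ell(u, \hat y) = 2 G_1 u - G_2 \hat y = 0 \quad\Longrightarrow\quad u_r(x) = \tfrac{1}{2}\, G_1^{-1} G_2\, \hat y, \qquad \hat y = \sum_{k=1}^{K} r_k x_k .$$

In the banner-ad example, G_1 = diag(γ_1, …, γ_M) and G_2 = I, so each class m is allocated u_m = ŷ_m / (2 γ_m) customers.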

3 Algorithm

• Ordinary least squares (OLS) is a conventional approach to computing regression coefficients. It produces the coefficient vector

$$r^{\mathrm{OLS}} = \operatorname*{arg\,min}_{r \in \mathbb{R}^K} \ \sum_{n=1}^{N} \Big\| \, y^{(n)} - \sum_{k=1}^{K} r_k x_k^{(n)} \Big\|^2 .$$

• Empirical optimization (EO) minimizes empirical loss on the training data:

$$r^{\mathrm{EO}} = \operatorname*{arg\,min}_{r \in \mathbb{R}^K} \ \sum_{n=1}^{N} \ell\big( u_r(x^{(n)}),\, y^{(n)} \big).$$

• Directed regression (DR) takes a convex combination of r^OLS and r^EO,

$$r^{\mathrm{DR}} = (1 - \lambda)\, r^{\mathrm{OLS}} + \lambda\, r^{\mathrm{EO}},$$

where the parameter λ ∈ [0,1] is computed via cross-validation. OLS and EO can be viewed as two extremes, while DR is designed to seek an optimal tradeoff between them.
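The NumPy sketch below shows one way the three estimators could be computed; it is a minimal illustration, not the authors' code. The function names, the 5-fold cross-validation scheme, and the λ grid are our own choices, and G_1 is assumed symmetric positive definite. Each X_n is the M×K matrix whose columns are x_1^(n), …, x_K^(n).

```python
import numpy as np

def fit_ols(Xs, ys):
    """OLS: argmin_r sum_n || y^(n) - X_n r ||^2, with each X_n of shape (M, K)."""
    A = np.vstack(Xs)                      # (N*M, K)
    b = np.concatenate(ys)                 # (N*M,)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def fit_eo(Xs, ys, G1, G2):
    """EO: argmin_r sum_n l(u_r(x^(n)), y^(n)). Since u_r(x^(n)) = A_n r is linear
    in r and the loss is quadratic in u, the minimizer solves a linear system."""
    K = Xs[0].shape[1]
    Q, c = np.zeros((K, K)), np.zeros(K)
    for X, y in zip(Xs, ys):
        A = 0.5 * np.linalg.solve(G1, G2 @ X)   # u_r(x^(n)) = A @ r
        Q += A.T @ G1 @ A
        c += A.T @ G2 @ y
    return 0.5 * np.linalg.solve(Q, c)

def decision_loss(r, Xs, ys, G1, G2):
    """Total decision loss sum_n l(u_r(x^(n)), y^(n))."""
    total = 0.0
    for X, y in zip(Xs, ys):
        u = 0.5 * np.linalg.solve(G1, G2 @ (X @ r))
        total += u @ G1 @ u - u @ G2 @ y
    return total

def fit_dr(Xs, ys, G1, G2, lambdas=np.linspace(0.0, 1.0, 21), n_folds=5):
    """DR: choose lambda by cross-validation, then combine r_OLS and r_EO."""
    N = len(Xs)
    folds = np.array_split(np.arange(N), n_folds)
    cv_loss = np.zeros(len(lambdas))
    for fold in folds:
        train = [i for i in range(N) if i not in set(fold.tolist())]
        r_ols = fit_ols([Xs[i] for i in train], [ys[i] for i in train])
        r_eo = fit_eo([Xs[i] for i in train], [ys[i] for i in train], G1, G2)
        val_X, val_y = [Xs[i] for i in fold], [ys[i] for i in fold]
        for j, lam in enumerate(lambdas):
            cv_loss[j] += decision_loss((1 - lam) * r_ols + lam * r_eo,
                                        val_X, val_y, G1, G2)
    lam = float(lambdas[np.argmin(cv_loss)])
    r_dr = (1 - lam) * fit_ols(Xs, ys) + lam * fit_eo(Xs, ys, G1, G2)
    return r_dr, lam
```

Here Xs is a list of N arrays of shape (M, K) and ys a list of length-M response vectors; fit_dr returns both r^DR and the selected λ.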

4 Computational Results

We sample each data pair (x^(n), y^(n)) by

$$y^{(n)} = \sum_{i=1}^{P} \tilde{r}_i\, C_i \phi^{(n)} + w^{(n)}$$

and x_k^(n) = C_k φ^(n) for k = 1, 2, …, K. The vector φ^(n) can be viewed as a sample from an underlying information space, and the matrices C_1, …, C_P extract feature vectors from φ^(n). Note that although the response variables depend on P feature vectors, only K ≤ P are used in the regression model. We set P = 60 and carried out two sets of experiments. In the first set, we fix K = 50; Figure 1(a) plots the average excess losses for different N. In the second set, we fix N = 20; Figure 1(b) plots the results for different K. Figure 2 further plots the average values of λ selected by cross-validation.

Figure 1: (a) (left) Excess losses delivered by OLS, EO, and DR, for different numbers of training samples. (b) (right) Excess losses for different numbers of features.

Figure 2: (a) (left) The average values of selected λ for different numbers of training samples. (b) (right) The average values of selected λ for different numbers of features.
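For concreteness, here is a sketch of a data-generating routine in the spirit of this setup. The poster does not specify the distributions of φ^(n), the coefficients r̃_i, the noise w^(n), or the matrices C_i, nor the dimensions M and dim(φ); the Gaussian choices and sizes below are assumptions for illustration only.

```python
import numpy as np

def make_dataset(N, K, P=60, M=10, phi_dim=20, noise_std=1.0, seed=0):
    """Sample pairs (x^(n), y^(n)) with y^(n) = sum_{i<=P} r~_i C_i phi^(n) + w^(n)
    and x_k^(n) = C_k phi^(n); only the first K of the P feature vectors are
    returned as observed features. Distributions here are illustrative guesses."""
    rng = np.random.default_rng(seed)
    C = rng.normal(size=(P, M, phi_dim))       # feature-extraction matrices C_1..C_P
    r_tilde = rng.normal(size=P)               # coefficients r~_i
    Xs, ys = [], []
    for _ in range(N):
        phi = rng.normal(size=phi_dim)         # sample from the information space
        feats = C @ phi                        # (P, M): all P feature vectors
        y = feats.T @ r_tilde + noise_std * rng.normal(size=M)
        Xs.append(feats[:K].T)                 # (M, K): observed feature matrix X_n
        ys.append(y)
    return Xs, ys
```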

5 Theoretical Analysis

Consider a generative model

$$y^{(n)} = \sum_{k=1}^{K} r^*_k x_k^{(n)} + \sum_{j=1}^{J} \bar{r}^*_j z_j^{(n)} + w^{(n)},$$

where the samples z^(1), …, z^(N) represent missing features and w^(n) is zero-mean Gaussian noise with covariance σ_w^2 I. The prior distributions of the coefficients are r* ~ N(0, σ_r^2 I_K) and r̄* ~ N(0, σ̄^2 I_J). We define an augmented training set Ō = {(x^(1), z^(1), y^(1)), …, (x^(N), z^(N), y^(N))} and consider the optimal regression coefficients

$$\hat{r} = \operatorname*{arg\,min}_{r \in \mathbb{R}^K} \ \mathrm{E}\big[\, \ell(u_r(x), y) \mid \bar{O} \,\big].$$

Our primary interest is in cases where prior knowledge about the coefficients r* is weak, so we consider the limit σ_r → ∞. Along with some minor assumptions on x and z, we have the following theorems.

Theorem 1. As σ̄ → 0, r̂ → r^OLS; as σ̄ → ∞, r̂ → r^EO.

Theorem 2. r̂ = (1 − λ) r^OLS + λ r^EO, where

$$\lambda = \frac{N \bar{\sigma}^2}{N \bar{\sigma}^2 + \sigma_w^2}.$$
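As a quick illustration of the tradeoff, with N = 20 training pairs (as in the second set of experiments) and σ̄^2 = σ_w^2, the expression in Theorem 2 gives

$$\lambda = \frac{20\, \bar{\sigma}^2}{20\, \bar{\sigma}^2 + \sigma_w^2} = \frac{20}{21} \approx 0.95,$$

so the optimal combination leans heavily toward r^EO; λ stays near 0 (favoring r^OLS) only when σ̄^2 is small relative to σ_w^2 / N.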




6 Extensions

Our work suggests that there can be significant gains in broader problem settings from a tighter coupling between machine learning and decision-making. It will be interesting to explore:
• How to do this with other classes of models and objectives?
• How to synthesize DR with feature subset selection methods?
• How to extend DR to the multi-period case?
• The effectiveness of cross-validation in optimizing λ?

References

[1] J.-Y. Audibert. Aggregated estimators and empirical complexity for least square regression. Annales de l'Institut Henri Poincaré, Probability and Statistics, 2004.
[2] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related Fields, 2006.
[3] D. Bertsimas and A. Thiele. Robust and data-driven optimization: Modern decision-making under uncertainty. In Tutorials on Operations Research, 2006.
[4] O. Besbes, R. Philips, and A. Zeevi. Testing the validity of a demand model: An operations perspective, 2007.
[5] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 2007.
[6] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 1992.
[7] K. Kim and N. Timm. Univariate and Multivariate General Linear Models: Theory and Applications with SAS, 2006.
[8] K. E. Muller and P. W. Stewart. Linear Model Theory: Univariate, Multivariate, and Mixed Models, 2006.
[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, 1998.
[10] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 1996.
