
Directed Regression
Yi-hao Kao, Benjamin Van Roy, and Xiang Yan
Stanford University

1 Introduction

When used to guide decisions, linear regression analysis typically involves estimation of regression coefficients and their subsequent use to make decisions. We propose directed regression, an efficient algorithm that accounts for the decision objective when computing regression coefficients. We also develop a theory that motivates this algorithm.

2 Regression for Decision

Suppose we are given a set of training data pairs O = {(x^(1), y^(1)), …, (x^(N), y^(N))}. The nth data pair is comprised of feature vectors x_1^(n), …, x_K^(n) ∈ R^M and a vector y^(n) ∈ R^M of response variables. We would like to compute regression coefficients r ∈ R^K so that

$$\mathrm{E}[\, y \mid x \,] = \sum_{k=1}^{K} r_k x_k .$$

Consider a setting where, each time we observe feature vectors x_1, …, x_K, we will have to select a decision u ∈ R^L before observing the response vector y. The decision incurs quadratic loss

$$\ell(u, y) = u^\top G_1 u - u^\top G_2 y .$$

We aim to minimize expected loss, assuming that the conditional expectation of y given x is indeed the linear combination described above. As such, we select the decision

$$u_r(x) = \operatorname*{arg\,min}_{u \in \mathbb{R}^L} \ \ell\Big(u, \sum_{k=1}^{K} r_k x_k\Big) = \frac{1}{2}\, G_1^{-1} G_2 \sum_{k=1}^{K} r_k x_k .$$

The question is how best to compute the regression coefficients r for this purpose.

Example. Consider an Internet banner ad campaign that targets M classes of customers. An average revenue of y_m is received per customer of class m that the campaign reaches. This quantity is random and influenced by K observable factors x_{1m}, …, x_{Km}. For each class m, we pay γ_m u_m^2 to reach u_m customers. It is natural to predict the response vector y using a linear combination of the factors x_k. The goal is to maximize expected revenue less advertising costs, which gives rise to the loss function

$$\ell(u, y) = \sum_{m=1}^{M} \big( \gamma_m u_m^2 - u_m y_m \big).$$
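The closed form of the decision rule follows from first-order optimality; a one-line derivation, assuming G_1 is symmetric and positive definite:

$$\nabla_u\, \ell(u, \hat y) = 2 G_1 u - G_2 \hat y = 0 \quad\Longrightarrow\quad u_r(x) = \tfrac{1}{2}\, G_1^{-1} G_2\, \hat y, \qquad \hat y = \sum_{k=1}^{K} r_k x_k .$$

In the banner-ad example, G_1 = diag(γ_1, …, γ_M) and G_2 = I, so each class m is allocated u_m = ŷ_m / (2 γ_m) customers.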

3 Algorithm

• Ordinary least squares (OLS) is a conventional approach to computing regression coefficients. It produces the coefficient vector

$$r^{\mathrm{OLS}} = \operatorname*{arg\,min}_{r \in \mathbb{R}^K} \ \sum_{n=1}^{N} \Big\| \, y^{(n)} - \sum_{k=1}^{K} r_k x_k^{(n)} \Big\|^2 .$$

• Empirical optimization (EO) minimizes empirical loss on the training data:

$$r^{\mathrm{EO}} = \operatorname*{arg\,min}_{r \in \mathbb{R}^K} \ \sum_{n=1}^{N} \ell\big( u_r(x^{(n)}),\, y^{(n)} \big).$$

• Directed regression (DR) takes a convex combination of r^OLS and r^EO,

$$r^{\mathrm{DR}} = (1 - \lambda)\, r^{\mathrm{OLS}} + \lambda\, r^{\mathrm{EO}},$$

where the parameter λ ∈ [0,1] is computed via cross-validation. OLS and EO can be viewed as two extremes, while DR is designed to seek an optimal tradeoff between them.
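The NumPy sketch below shows one way the three estimators could be computed; it is a minimal illustration, not the authors' code. The function names, the 5-fold cross-validation scheme, and the λ grid are our own choices, and G_1 is assumed symmetric positive definite. Each X_n is the M×K matrix whose columns are x_1^(n), …, x_K^(n).

```python
import numpy as np

def fit_ols(Xs, ys):
    """OLS: argmin_r sum_n || y^(n) - X_n r ||^2, with each X_n of shape (M, K)."""
    A = np.vstack(Xs)                      # (N*M, K)
    b = np.concatenate(ys)                 # (N*M,)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def fit_eo(Xs, ys, G1, G2):
    """EO: argmin_r sum_n l(u_r(x^(n)), y^(n)). Since u_r(x^(n)) = A_n r is linear
    in r and the loss is quadratic in u, the minimizer solves a linear system."""
    K = Xs[0].shape[1]
    Q, c = np.zeros((K, K)), np.zeros(K)
    for X, y in zip(Xs, ys):
        A = 0.5 * np.linalg.solve(G1, G2 @ X)   # u_r(x^(n)) = A @ r
        Q += A.T @ G1 @ A
        c += A.T @ G2 @ y
    return 0.5 * np.linalg.solve(Q, c)

def decision_loss(r, Xs, ys, G1, G2):
    """Total decision loss sum_n l(u_r(x^(n)), y^(n))."""
    total = 0.0
    for X, y in zip(Xs, ys):
        u = 0.5 * np.linalg.solve(G1, G2 @ (X @ r))
        total += u @ G1 @ u - u @ G2 @ y
    return total

def fit_dr(Xs, ys, G1, G2, lambdas=np.linspace(0.0, 1.0, 21), n_folds=5):
    """DR: choose lambda by cross-validation, then combine r_OLS and r_EO."""
    N = len(Xs)
    folds = np.array_split(np.arange(N), n_folds)
    cv_loss = np.zeros(len(lambdas))
    for fold in folds:
        train = [i for i in range(N) if i not in set(fold.tolist())]
        r_ols = fit_ols([Xs[i] for i in train], [ys[i] for i in train])
        r_eo = fit_eo([Xs[i] for i in train], [ys[i] for i in train], G1, G2)
        val_X, val_y = [Xs[i] for i in fold], [ys[i] for i in fold]
        for j, lam in enumerate(lambdas):
            cv_loss[j] += decision_loss((1 - lam) * r_ols + lam * r_eo,
                                        val_X, val_y, G1, G2)
    lam = float(lambdas[np.argmin(cv_loss)])
    r_dr = (1 - lam) * fit_ols(Xs, ys) + lam * fit_eo(Xs, ys, G1, G2)
    return r_dr, lam
```

Here Xs is a list of N arrays of shape (M, K) and ys a list of length-M response vectors; fit_dr returns both r^DR and the selected λ.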

4 Computational Results

We sample each data pair (x^(n), y^(n)) by

$$y^{(n)} = \sum_{i=1}^{P} \tilde{r}_i\, C_i \phi^{(n)} + w^{(n)}$$

and x_k^(n) = C_k φ^(n) for k = 1, 2, …, K. The vector φ^(n) can be viewed as a sample from an underlying information space, and the matrices C_1, …, C_P extract feature vectors from φ^(n). Note that although the response variables depend on P feature vectors, only K ≤ P are used in the regression model. We set P = 60 and carried out two sets of experiments. In the first set, we fix K = 50; Figure 1(a) plots the average excess losses for different N. In the second set, we fix N = 20; Figure 1(b) plots the results for different K. Figure 2 further plots the average values of λ selected by cross-validation.

Figure 1: (a) (left) Excess losses delivered by OLS, EO, and DR, for different numbers of training samples. (b) (right) Excess losses for different numbers of features.

Figure 2: (a) (left) The average values of selected λ for different numbers of training samples. (b) (right) The average values of selected λ for different numbers of features.
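For concreteness, here is a sketch of a data-generating routine in the spirit of this setup. The poster does not specify the distributions of φ^(n), the coefficients r̃_i, the noise w^(n), or the matrices C_i, nor the dimensions M and dim(φ); the Gaussian choices and sizes below are assumptions for illustration only.

```python
import numpy as np

def make_dataset(N, K, P=60, M=10, phi_dim=20, noise_std=1.0, seed=0):
    """Sample pairs (x^(n), y^(n)) with y^(n) = sum_{i<=P} r~_i C_i phi^(n) + w^(n)
    and x_k^(n) = C_k phi^(n); only the first K of the P feature vectors are
    returned as observed features. Distributions here are illustrative guesses."""
    rng = np.random.default_rng(seed)
    C = rng.normal(size=(P, M, phi_dim))       # feature-extraction matrices C_1..C_P
    r_tilde = rng.normal(size=P)               # coefficients r~_i
    Xs, ys = [], []
    for _ in range(N):
        phi = rng.normal(size=phi_dim)         # sample from the information space
        feats = C @ phi                        # (P, M): all P feature vectors
        y = feats.T @ r_tilde + noise_std * rng.normal(size=M)
        Xs.append(feats[:K].T)                 # (M, K): observed feature matrix X_n
        ys.append(y)
    return Xs, ys
```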

5 Theoretical Analysis

Consider a generative model

$$y^{(n)} = \sum_{k=1}^{K} r^*_k x_k^{(n)} + \sum_{j=1}^{J} \bar{r}^*_j z_j^{(n)} + w^{(n)},$$

where the samples z^(1), …, z^(N) represent missing features and w^(n) is zero-mean Gaussian noise with covariance σ_w^2 I. The prior distributions of the coefficients are r* ~ N(0, σ_r^2 I_K) and r̄* ~ N(0, σ̄^2 I_J). We define an augmented training set Ō = {(x^(1), z^(1), y^(1)), …, (x^(N), z^(N), y^(N))} and consider the optimal regression coefficients

$$\hat{r} = \operatorname*{arg\,min}_{r \in \mathbb{R}^K} \ \mathrm{E}\big[\, \ell(u_r(x), y) \mid \bar{O} \,\big].$$

Our primary interest is in cases where prior knowledge about the coefficients r* is weak, so we consider the limit σ_r → ∞. Along with some minor assumptions on x and z, we have the following theorems.

Theorem 1. As σ̄ → 0, r̂ → r^OLS; as σ̄ → ∞, r̂ → r^EO.

Theorem 2. r̂ = (1 − λ) r^OLS + λ r^EO, where

$$\lambda = \frac{N \bar{\sigma}^2}{N \bar{\sigma}^2 + \sigma_w^2}.$$
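As a quick illustration of the tradeoff, with N = 20 training pairs (as in the second set of experiments) and σ̄^2 = σ_w^2, the expression in Theorem 2 gives

$$\lambda = \frac{20\, \bar{\sigma}^2}{20\, \bar{\sigma}^2 + \sigma_w^2} = \frac{20}{21} \approx 0.95,$$

so the optimal combination leans heavily toward r^EO; λ stays near 0 (favoring r^OLS) only when σ̄^2 is small relative to σ_w^2 / N.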




6 Extensions

Our work suggests that there can be significant gains in broader problem settings from a tighter coupling between machine learning and decision-making. It will be interesting to explore:
• How to do this with other classes of models and objectives?
• How to synthesize DR with feature subset selection methods?
• How to extend DR to the multi-period case?
• The effectiveness of cross-validation in optimizing λ?

References

[1] J.-Y. Audibert. Aggregated estimators and empirical complexity for least square regression. Annales de l'Institut Henri Poincaré, Probability and Statistics, 2004.
[2] P. L. Bartlett and S. Mendelson. Empirical minimization. Probability Theory and Related Fields, 2006.
[3] D. Bertsimas and A. Thiele. Robust and data-driven optimization: Modern decision-making under uncertainty. In Tutorials on Operations Research, 2006.
[4] O. Besbes, R. Philips, and A. Zeevi. Testing the validity of a demand model: An operations perspective, 2007.
[5] F. Bunea, A. B. Tsybakov, and M. H. Wegkamp. Aggregation for Gaussian regression. The Annals of Statistics, 2007.
[6] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 1992.
[7] K. Kim and N. Timm. Univariate and Multivariate General Linear Models: Theory and Applications with SAS, 2006.
[8] K. E. Muller and P. W. Stewart. Linear Model Theory: Univariate, Multivariate, and Mixed Models, 2006.
[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, 1998.
[10] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 1996.
