When used to guide decisions, linear regression analysis typically involves estimation of regression coefficients and their subsequent use to make decisions. We propose directed regression, an efficient algorithm that accounts for the decision objective when computing regression coefficients. We also develop a theory that motivates this algorithm.
Suppose we are given a set of training data pairs O = {(x^(1), y^(1)), …, (x^(N), y^(N))}. Each nth data pair is comprised of feature vectors x_1^(n), …, x_K^(n) ∈ R^M and a vector y^(n) ∈ R^M of response variables. We would like to compute regression coefficients r ∈ R^K so that

E[y | x] = Σ_{k=1}^K r_k x_k
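As a minimal sketch of this prediction model, the linear combination above can be written with numpy; the dimensions and values here are illustrative, not from the poster:

```python
import numpy as np

M, K = 4, 3                      # response dimension and number of feature vectors
rng = np.random.default_rng(0)

r = rng.normal(size=K)           # regression coefficients r in R^K
x = rng.normal(size=(K, M))      # feature vectors x_1, ..., x_K, each in R^M

# The model posits E[y | x] = sum_k r_k * x_k, an M-dimensional prediction
y_hat = sum(r[k] * x[k] for k in range(K))
assert y_hat.shape == (M,)
```

Stacking the feature vectors as columns of an M x K matrix X makes the same prediction a single matrix-vector product, X @ r.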
Directed RegressionYi-hao Kao, Benjamin Van Roy, and Xiang Yan
Stanford University
• Ordinary least squares (OLS) is a conventional approach to computing regression coefficients. This would produce a coefficient vector

r^OLS = argmin_{r ∈ R^K} Σ_{n=1}^N || y^(n) − Σ_{k=1}^K r_k x_k^(n) ||^2

• Empirical optimization (EO) minimizes empirical loss on the training data:

r^EO = argmin_{r ∈ R^K} Σ_{n=1}^N l(u_r(x^(n)), y^(n))

• Directed regression (DR) takes a convex combination of r^OLS and r^EO:

r^DR = λ r^OLS + (1 − λ) r^EO

where the parameter λ ∈ [0,1] is computed via cross-validation. OLS and EO can be viewed as two extremes, while DR is designed to seek an optimal tradeoff between them.
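The three estimators can be sketched end to end. This is an illustrative implementation, not the authors' code: the problem sizes, the matrices G1 and G2 of the quadratic loss l(u, y) = u'G1u − u'G2y, the noise level, and the 5-fold grid search over λ are all assumptions made for the sketch. The EO fit exploits that, for this loss, the decision u_r(x) has a closed form and the empirical decision loss reduces to a weighted least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not the poster's P = 60 experiments)
M, K, P, N = 5, 3, 6, 40    # response dim, features used, true features, samples
D = 8                       # dimension of the information vector phi

# Quadratic decision loss l(u, y) = u' G1 u - u' G2 y (G1, G2 chosen for the sketch)
G1 = np.diag(rng.uniform(0.5, 2.0, size=M))
G2 = np.eye(M)

def decision(r, X):
    # u_r(x) = argmin_u l(u, X r) = (1/2) G1^{-1} G2 (X r)
    return 0.5 * np.linalg.solve(G1, G2 @ (X @ r))

def loss(u, y):
    return u @ G1 @ u - u @ G2 @ y

# Data: x_k = C_k phi, y = sum_{i<=P} r*_i C_i phi + noise; only K <= P used
C = rng.normal(size=(P, M, D))
r_true = rng.normal(size=P)
Xs, ys = [], []
for _ in range(N):
    phi = rng.normal(size=D)
    Xs.append(np.stack([C[k] @ phi for k in range(K)], axis=1))   # M x K
    ys.append(sum(r_true[i] * (C[i] @ phi) for i in range(P))
              + 0.1 * rng.normal(size=M))

def fit_ols(Xs, ys):
    return np.linalg.lstsq(np.vstack(Xs), np.concatenate(ys), rcond=None)[0]

def fit_eo(Xs, ys):
    # EO reduces to weighted least squares: at u = u_r(x) the loss equals
    # (1/4)(Xr - y)' A (Xr - y) + const, with A = G2' G1^{-1} G2
    W = np.linalg.cholesky(G2.T @ np.linalg.solve(G1, G2)).T
    return np.linalg.lstsq(np.vstack([W @ X for X in Xs]),
                           np.concatenate([W @ y for y in ys]), rcond=None)[0]

def cv_loss(lam, folds=5):
    # Held-out decision loss of r = lam * r_OLS + (1 - lam) * r_EO
    total = 0.0
    for f in range(folds):
        test = set(range(f, N, folds))
        tr = [i for i in range(N) if i not in test]
        Xt, yt = [Xs[i] for i in tr], [ys[i] for i in tr]
        r = lam * fit_ols(Xt, yt) + (1 - lam) * fit_eo(Xt, yt)
        total += sum(loss(decision(r, Xs[i]), ys[i]) for i in test)
    return total

lam_star = min(np.linspace(0, 1, 21), key=cv_loss)
r_dr = lam_star * fit_ols(Xs, ys) + (1 - lam_star) * fit_eo(Xs, ys)
```

Because the λ grid includes 0 and 1, the cross-validated DR coefficients can never do worse than pure OLS or pure EO under the cross-validation criterion itself.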
We sample each data pair (x^(n), y^(n)) by drawing ϕ^(n) and setting

y^(n) = Σ_{i=1}^P r_i^* C_i ϕ^(n) + w^(n)

and x_k^(n) = C_k ϕ^(n) for k = 1, 2, …, K. The vector ϕ^(n) can be viewed as a sample from an underlying information space, and the matrices C_1, …, C_P extract feature vectors from ϕ^(n).
Consider a generative model

y^(n) = Σ_{k=1}^K r_k^* x_k^(n) + Σ_{j=1}^J r̄_j^* z_j^(n) + w^(n)

where the samples z^(1), …, z^(N) represent missing features and w^(n) is zero-mean Gaussian noise with variance σ_w^2. The prior distributions of the coefficients are r^* ~ N(0, σ_r^2 I_K) and r̄^* ~ N(0, σ̄^2 I_J). We define an augmented training set Ō = {(x^(1), z^(1), y^(1)), …, (x^(N), z^(N), y^(N))} and consider optimal regression coefficients

r̂ = argmin_{r ∈ R^K} E[ Σ_{n=1}^N l(u_r(x^(n)), y^(n)) | Ō ]

Our primary interest is in cases where prior knowledge about the coefficients r^* is weak, so σ_r → ∞. Along with some minor assumptions on x and z, we have the following theorems.

Theorem 1: lim_{σ̄→0} r̂ = r^OLS and lim_{σ̄→∞} r̂ = r^EO.

Theorem 2: For any σ̄ ∈ (0, ∞), r̂ = λ r^OLS + (1 − λ) r^EO for a weight λ ∈ [0, 1] that depends on σ_w, σ̄, and N.
1 Introduction
2 Regression for Decision
3 Algorithm
4 Computational Results
5 Theoretical Analysis
Consider a setting where each time we observe feature vectors x_1, …, x_K we will have to select a decision u ∈ R^L before observing the response vector y. The decision incurs quadratic loss

l(u, y) = u^T G_1 u − u^T G_2 y

We aim to minimize expected loss, assuming that the conditional expectation of y given x is indeed the linear combination described above. As such, we select a decision

u_r(x) = argmin_u l(u, Σ_{k=1}^K r_k x_k) = (1/2) G_1^{-1} G_2 Σ_{k=1}^K r_k x_k

The question is how best to compute the regression coefficients r for this purpose.
Example: Consider an Internet banner ad campaign that targets M classes of customers. An average revenue of y_m is received per customer of class m that the campaign reaches. This quantity is random and influenced by K observable factors x_1m, …, x_Km. For each mth class, we pay γ_m u_m^2 to reach u_m customers. It is natural to predict the response vector y using a linear combination of the factors x_k. The goal is to maximize expected revenue less advertising costs. This gives rise to a loss function

l(u, y) = Σ_{m=1}^M (γ_m u_m^2 − u_m y_m)
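For this separable example, the loss is minimized class by class in closed form: setting the derivative of γ_m u_m^2 − u_m y_m to zero gives u_m = y_m / (2 γ_m). A small numeric check, with illustrative costs and predicted revenues:

```python
import numpy as np

gamma = np.array([0.5, 1.0, 2.0])      # per-class cost coefficients (illustrative)
y_pred = np.array([3.0, 1.0, 4.0])     # predicted average revenues (illustrative)

def loss(u):
    # l(u, y) = sum_m (gamma_m u_m^2 - u_m y_m): ad cost minus expected revenue
    return np.sum(gamma * u**2 - u * y_pred)

u_star = y_pred / (2 * gamma)          # closed-form minimizer, class by class

# Being the global minimum of a convex loss, u_star beats any perturbation
rng = np.random.default_rng(0)
for _ in range(100):
    assert loss(u_star) <= loss(u_star + 0.1 * rng.normal(size=3))
```

The allocation scales reach with predicted revenue and inversely with advertising cost, as one would expect.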
Note that although the response variables depend on P feature vectors, only K ≤ P are used in the regression model. We set P = 60 and carried out two sets of experiments. In the first set, we fix K = 50. Figure 1(a) plots the average excess losses for different N. In the second set, we fix N = 20. Figure 1(b) plots the results for different K. Figure 2 further plots the average values of λ selected by cross-validation.
Figure 1: (a) (left) Excess losses delivered by OLS, EO, and DR, for different numbers of training samples. (b) (right) Excess losses for different numbers of features.
Figure 2: (a) (left) The average values of selected λ for different numbers of training samples. (b) (right) The average values of selected λ for different numbers of features.
r̂ = λ r^OLS + (1 − λ) r^EO, where λ = σ_w^2 / (σ_w^2 + N σ̄^2)
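Taking the weight λ = σ_w^2 / (σ_w^2 + N σ̄^2) at face value (this exact expression is reconstructed from the garbled original), the two limits of Theorem 1 follow directly: as σ̄ → 0 the weight on r^OLS tends to 1, and as σ̄ → ∞ it tends to 0. A quick numeric check with illustrative values:

```python
def lam(sigma_w, sigma_bar, N):
    # lambda = sigma_w^2 / (sigma_w^2 + N * sigma_bar^2)
    return sigma_w**2 / (sigma_w**2 + N * sigma_bar**2)

# sigma_bar -> 0: lambda -> 1, so r_hat -> r_OLS (first limit of Theorem 1)
assert abs(lam(1.0, 1e-8, 20) - 1.0) < 1e-6
# sigma_bar -> infinity: lambda -> 0, so r_hat -> r_EO (second limit)
assert lam(1.0, 1e8, 20) < 1e-6
```

Intuitively, the more variance the missing features z carry relative to the observation noise, the less the Bayes-optimal coefficients resemble OLS and the more they resemble EO.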
Our work suggests that there can be significant gains in broader problem settings from a tighter coupling between machine learning and decision-making. It will be interesting to explore:
• How to do this with other classes of models and objectives?
• How to synthesize DR with feature subset selection methods?
• How to extend DR to the multi-period case?
• How effective is cross-validation at optimizing λ?
6 Extensions