Introduction | Part II: Learning Theory for Supervised Learning
Lecture 3: Statistical Decision Theory (Part II)
Hao Helen Zhang
Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27
Outline of This Note
Part I: Statistical Decision Theory (Classical Statistical Perspective - "Estimation")
loss and risk
MSE and bias-variance tradeoff
Bayes risk and minimax risk
Part II: Learning Theory for Supervised Learning (Machine Learning Perspective - "Prediction")
optimal learner
empirical risk minimization
restricted estimators
Loss, Risk, and Optimal Learner | Learning Process: ERM | Restricted Model Class F
Supervised Learning
Any supervised learning problem has three components:
Input vector X ∈ R^d.
Output Y, either discrete or continuous valued.
Probability framework:
(X, Y) ∼ P(X, Y).
The goal is to estimate the relationship between X and Y, described by f(X), for future prediction and decision-making.
For regression, f: R^d → R.
For K-class classification, f: R^d → {1, ..., K}.
Learning Loss Function
As in statistical decision theory, we use a loss function L
to measure the discrepancy between Y and f(X),
to penalize the errors in predicting Y.
Examples:
squared error loss (used in mean regression):
L(Y, f(X)) = (Y − f(X))^2
absolute error loss (used in median regression):
L(Y, f(X)) = |Y − f(X)|
0-1 loss (used in binary classification):
L(Y, f(X)) = I(Y ≠ f(X))
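The three losses above can be written directly in code; a minimal sketch (the function names are our own, not from the lecture):

```python
# Minimal sketch of the three loss functions above (names are our own).

def squared_error(y, fx):
    """Squared error loss: (Y - f(X))^2, used in mean regression."""
    return (y - fx) ** 2

def absolute_error(y, fx):
    """Absolute error loss: |Y - f(X)|, used in median regression."""
    return abs(y - fx)

def zero_one(y, fx):
    """0-1 loss: I(Y != f(X)), used in binary classification."""
    return int(y != fx)
```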
Risk, Expected Prediction Error (EPE)
The risk of f is
R(f) = E_{X,Y} L(Y, f(X)) = ∫ L(y, f(x)) dP(x, y).
In machine learning, R(f) is called the Expected Prediction Error (EPE).
For squared error loss,
R(f) = EPE(f) = E_{X,Y}[Y − f(X)]^2.
For 0-1 loss,
R(f) = EPE(f) = E_{X,Y} I(Y ≠ f(X)) = Pr(Y ≠ f(X)),
which is the probability of making a mistake.
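Since R(f) is an expectation, it can be approximated by Monte Carlo when P(X, Y) is known. A toy sketch under an assumed model (X uniform on [0, 1], P(Y = 1 | X = x) = x; the classifier f is our own choice, not from the lecture):

```python
import random

random.seed(0)

def draw():
    """Sample (X, Y) from the assumed model: X ~ U(0,1), P(Y=1|X=x) = x."""
    x = random.random()
    y = 1 if random.random() < x else 0
    return x, y

def f(x):
    """A candidate classifier (our own choice)."""
    return 1 if x > 0.5 else 0

# Monte Carlo estimate of EPE(f) = Pr(Y != f(X)) under 0-1 loss.
n = 100_000
epe_hat = sum(y != f(x) for x, y in (draw() for _ in range(n))) / n
# For this model the exact EPE works out to 1/4.
```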
Optimal Learner
We seek the f which minimizes the average (long-term) loss over the entire population.
Optimal Learner: the minimizer of R(f) is
f* = argmin_f R(f).
The solution f* depends on the loss L.
f* is not achievable without knowing P(x, y) or the entire population.
Bayes Risk: the optimal risk, i.e., the risk of the optimal learner,
R(f*) = E_{X,Y} L(Y, f*(X)), which equals E_{X,Y}(Y − f*(X))^2 under squared error loss.
How to Find The Optimal Learner
Note that
R(f) = E_{X,Y} L(Y, f(X)) = ∫ L(y, f(x)) dP(x, y) = E_X { E_{Y|X} L(Y, f(X)) }.
One can therefore minimize the risk pointwise:
for each fixed x, find f*(x) by solving
f*(x) = argmin_c E_{Y|X=x} L(Y, c).
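The pointwise recipe can be illustrated numerically: approximate the conditional risk E[L(Y, c) | X = x] by a sample average and search over a grid of constants c. The toy conditional distribution N(2, 1) is our own assumption:

```python
import random

random.seed(1)
# Assumed toy setup: at a fixed x, Y | X = x ~ N(2, 1).
ys = [random.gauss(2.0, 1.0) for _ in range(5000)]

def cond_risk(c, loss):
    """Approximate E[L(Y, c) | X = x] by a sample average."""
    return sum(loss(y, c) for y in ys) / len(ys)

# Grid search for f*(x) = argmin_c E[L(Y, c) | X = x] under squared loss.
grid = [i * 0.05 for i in range(81)]      # candidates c in [0, 4]
f_star_x = min(grid, key=lambda c: cond_risk(c, lambda y, c: (y - c) ** 2))
# Under squared loss the minimizer is the conditional mean, here 2.0.
```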
Example 1: Regression with Squared Error Loss
For regression, consider L(Y, f(X)) = (Y − f(X))^2. The risk is
R(f) = E_{X,Y}(Y − f(X))^2 = E_X E_{Y|X}((Y − f(X))^2 | X).
For each fixed x, the best learner solves the problem
min_c E_{Y|X}[(Y − c)^2 | X = x].
The solution is E(Y | X = x), so
f*(x) = E(Y | X = x) = ∫ y dP(y|x).
So the optimal learner is f*(X) = E(Y | X).
Decomposition of EPE: Under Squared Error Loss
For any learner f, its EPE under the squared error loss decomposes as
EPE(f) = E_{X,Y}[Y − f(X)]^2 = E_X E_{Y|X}([Y − f(X)]^2 | X)
= E_X E_{Y|X}([Y − E(Y|X) + E(Y|X) − f(X)]^2 | X)
= E_X E_{Y|X}([Y − E(Y|X)]^2 | X) + E_X[f(X) − E(Y|X)]^2
= E_{X,Y}[Y − f*(X)]^2 + E_X[f(X) − f*(X)]^2
= EPE(f*) + MSE
= Bayes risk + MSE.
The cross term vanishes because E_{Y|X}[Y − E(Y|X)] = 0.
The Bayes risk (the gold standard) gives a lower bound on the risk of any learner.
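The decomposition can be checked numerically. A sketch under an assumed model Y = X + ε with ε ~ N(0, 0.3²), so f*(x) = x, and an arbitrary candidate f(x) = 0.5x (both choices are ours):

```python
import random

random.seed(2)
# Assumed model: X ~ U(0,1), Y = X + eps, eps ~ N(0, 0.3^2), so f*(x) = x.
n = 100_000
xs = [random.random() for _ in range(n)]
ys = [x + random.gauss(0.0, 0.3) for x in xs]

f = lambda x: 0.5 * x          # an arbitrary candidate learner
f_star = lambda x: x           # the optimal learner E[Y | X = x]

epe_f = sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / n        # EPE(f)
bayes = sum((y - f_star(x)) ** 2 for x, y in zip(xs, ys)) / n   # EPE(f*)
mse = sum((f(x) - f_star(x)) ** 2 for x in xs) / n              # E_X[f(X) - f*(X)]^2
# Numerically, epe_f ≈ bayes + mse (the cross term averages to 0).
```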
Example 2: Classification with 0-1 Loss
In binary classification, assume Y ∈ {0, 1}. The learning rule is f: R^d → {0, 1}.
Using the 0-1 loss
L(Y, f(X)) = I(Y ≠ f(X)),
the expected prediction error is
EPE(f) = E I(Y ≠ f(X))
= Pr(Y ≠ f(X))
= E_X[P(Y = 1|X) I(f(X) = 0) + P(Y = 0|X) I(f(X) = 1)].
Bayes Classification Rule for Binary Problems
For each fixed x, the minimizer of EPE(f) is given by
f*(x) = 1 if P(Y = 1|X = x) > P(Y = 0|X = x),
f*(x) = 0 if P(Y = 1|X = x) < P(Y = 0|X = x).
It is called the Bayes classifier, denoted by φ_B.
The Bayes rule can be written as
φ_B(x) = f*(x) = I[P(Y = 1|X = x) > 0.5].
To obtain the Bayes rule, we need to know P(Y = 1|X = x), or at least whether P(Y = 1|X = x) is larger than 0.5 or not.
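When P(Y = 1 | X = x) is known, the Bayes rule is a one-liner. A sketch with an assumed logistic form for the conditional probability (the particular form is our own illustration):

```python
import math

def p1(x):
    """Assumed conditional probability P(Y = 1 | X = x), logistic in x."""
    return 1.0 / (1.0 + math.exp(-4.0 * (x - 0.5)))

def bayes_rule(x):
    """phi_B(x) = I[P(Y = 1 | X = x) > 0.5]."""
    return 1 if p1(x) > 0.5 else 0
```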
Probability Framework
Assume the training set
(X_1, Y_1), ..., (X_n, Y_n) ∼ i.i.d. P(X, Y).
We aim to find the best function (optimal solution) from the model class F for decision making.
However, the optimal learner f* is not attainable without knowledge of P(x, y).
The training samples (x_1, y_1), ..., (x_n, y_n) generated from P(x, y) carry information about P(x, y).
We approximate the integral over P(x, y) by the average over the training samples.
Empirical Risk Minimization
Key Idea: We cannot compute R(f) = E_{X,Y} L(Y, f(X)), but we can compute its empirical version using the data!
Empirical Risk: also known as the training error,
R_emp(f) = (1/n) ∑_{i=1}^n L(y_i, f(x_i)).
A reasonable estimator f̂ should be close to f*, or converge to f* as more data are collected.
By the law of large numbers, for each fixed f, R_emp(f) → EPE(f) as n → ∞.
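A sketch of the empirical risk and its law-of-large-numbers behavior, under an assumed model Y = 2X + ε with ε ~ N(0, 1), for which the EPE of the true regression function is Var(ε) = 1 (the model is our own toy choice):

```python
import random

random.seed(3)

def remp(f, data):
    """Empirical risk under squared error loss: (1/n) * sum of (y_i - f(x_i))^2."""
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

# Assumed model: Y = 2X + eps, eps ~ N(0,1); f below is the true regression function.
f = lambda x: 2.0 * x
data = [(x, 2.0 * x + random.gauss(0.0, 1.0))
        for x in (random.random() for _ in range(50_000))]
# By the LLN, remp(f, data) -> EPE(f) = Var(eps) = 1 as n grows.
r = remp(f, data)
```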
Learning Based on ERM
We construct the estimator of f* by
f̂ = argmin_f R_emp(f).
This principle is called Empirical Risk Minimization (ERM).
For regression,
the empirical risk is known, up to the factor 1/n, as the Residual Sum of Squares (RSS):
R_emp(f) = (1/n) RSS(f), with RSS(f) = ∑_{i=1}^n [y_i − f(x_i)]^2.
ERM gives the least squares fit:
f̂_ols = argmin_f RSS(f) = argmin_f ∑_{i=1}^n [y_i − f(x_i)]^2.
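For the linear class f(x) = β0 + β1 x in one dimension, ERM under squared loss has the familiar closed form. A minimal sketch (the helper name and toy data are ours):

```python
def ols_line(xs, ys):
    """Least-squares line: minimizes RSS over the class f(x) = b0 + b1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# Noiseless data on the line y = 1 + 2x: ERM recovers it exactly.
b0, b1 = ols_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```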
Issues with ERM
Minimizing R_emp(f) over all functions is not proper.
Any function passing through all (x_i, y_i) has RSS = 0. Example:
f_1(x) = y_i if x = x_i for some i = 1, ..., n, and f_1(x) = 1 otherwise.
It is easy to check that RSS(f_1) = 0.
However, f_1 does not amount to any form of learning: for future points not equal to any x_i, it always predicts y = 1. This phenomenon is called "overfitting".
Overfitting is undesirable, since a small R_emp(f) does not imply a small R(f) = ∫ L(y, f(x)) dP(x, y).
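The memorizing rule f_1 above can be written directly: its training RSS is 0, yet every unseen point gets the useless prediction 1 (the toy training pairs are our own):

```python
# Toy training set (our own numbers): three (x_i, y_i) pairs.
train = {0.1: 5.0, 0.2: 7.0, 0.3: 9.0}

def f1(x):
    """Memorize the training pairs; predict 1 everywhere else."""
    return train.get(x, 1.0)

rss_train = sum((y - f1(x)) ** 2 for x, y in train.items())  # 0: a perfect "fit"
pred_new = f1(0.15)                                          # 1.0: no learning
```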
Class of Restricted Estimators
We need to restrict the estimation process to a set of functions F, to control the complexity of the candidate functions and hence avoid overfitting.
Choose a simple model class F:
f̂ = argmin_{f ∈ F} RSS(f) = argmin_{f ∈ F} ∑_{i=1}^n [y_i − f(x_i)]^2.
Choose a large model class F, but use a roughness penalty to control model complexity: find f ∈ F to minimize the penalized RSS,
f̂ = argmin_{f ∈ F} RSS(f) + λ J(f).
The solution is called the regularized empirical risk minimizer.
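For the one-parameter class f(x) = b·x with the ridge-type penalty J(f) = b², the regularized empirical risk minimizer has a closed form, b = Σ x_i y_i / (Σ x_i² + λ). A sketch (toy data and helper name are ours):

```python
def ridge_slope(xs, ys, lam):
    """Minimize RSS(b) + lam * b^2 over f(x) = b * x.

    Setting the derivative to zero gives b = sum(x*y) / (sum(x^2) + lam).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy data on the line y = 2x
b_erm = ridge_slope(xs, ys, 0.0)    # plain ERM recovers the slope 2.0
b_reg = ridge_slope(xs, ys, 10.0)   # the penalty shrinks the fit toward 0
```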
How should we choose the model class F: a simple one or a complex one?
Parametric Learners
Some model classes can be parametrized as
F = {f(X; α): α ∈ Λ},
where α is the index parameter for the class.
Parametric Model Class: the form of f is known except for finitely many unknown parameters.
Linear models (linear in the parameters): linear regression, linear SVM, LDA
f(X; β) = β_0 + X'β_1, β_1 ∈ R^d,
f(X; β) = β_0 + β_1 sin(X_1) + β_2 X_2^2, X ∈ R^2.
Nonlinear models:
f(x; β) = β_1 x^{β_2},
f(X; β) = β_1 + β_2 exp(β_3 X^{β_4}).
Advantages of Linear Models
Why are linear models in wide use?
If the linear assumptions are correct, a linear fit is more efficient than a nonparametric fit.
convenient and easy to fit
easy to interpret
provides a first-order Taylor approximation to f(X)
Example: In linear regression, F = {all linear functions of x}.
ERM is the same as solving the ordinary least squares problem,
(β̂_0^ols, β̂^ols) = argmin_{(β_0, β)} ∑_{i=1}^n [y_i − β_0 − x_i'β]^2.
Drawback of Parametric Models
We never know whether the linear assumptions are proper or not.
Not flexible in practice: the fit has derivatives of all orders and the same functional form everywhere.
Individual observations can have a large influence on remote parts of the curve.
The polynomial degree cannot be controlled continuously.
More Flexible Learners
Nonparametric Model Class F:
the functional form of f is unspecified, and the parameter set can be infinite-dimensional.
F = {all continuous functions of x}; F = the second-order Sobolev space over [0, 1].
F is the space spanned by a set of basis functions (dictionary functions):
F = {f: f_θ(x) = ∑_{m=1}^∞ θ_m h_m(x)}.
Kernel methods and local regression:
F contains all piecewise constant functions, or F contains all piecewise linear functions.
Example of Nonparametric Methods
For regression problems,
splines, wavelets, kernel estimators, locally weighted polynomial regression, GAM
projection pursuit, regression trees, k-nearest neighbors
For classification problems,
kernel support vector machines, k-nearest neighbors
classification trees ...
For density estimation,
histogram, kernel density estimation
Semiparametric Models: a compromise between linear models and nonparametric models
partially linear models
Nonparametric Models
Motivation: the underlying regression function is so complicated that no reasonable parametric model would be adequate.
Advantages:
does NOT assume any specific functional form
assume less, thus discover more
infinitely many parameters, not "no parameters"
local: relies more on neighboring observations
outliers and influential points have limited effect on the results
let the data speak for themselves and show us the appropriate functional form!
Price we pay: more computational effort, lower efficiency, and a slower convergence rate (if linear models happen to be reasonable).
Nonparametric vs Parametric Models
They are not mutually exclusive competitors.
A nonparametric regression estimate may suggest a simple parametric model.
In practice, parametric models are preferred for simplicity.
A linear regression line is an infinitely smooth fit.

method          # parameters    estimate    outliers
parametric      finite          global      sensitive
nonparametric   infinite        local       robust
[Figure: four fits to the same (x, y) data on [0, 1]: a linear fit, a 2nd-order polynomial fit, a 3rd-order polynomial fit, and a smoothing spline (lambda = 0.85).]
Interpretation of Risk
Let F denote the restricted model space.
Denote f* = argmin EPE(f), the optimal learner; the minimization is taken over all measurable functions.
Denote f̃ = argmin_{f ∈ F} EPE(f), the best learner in the restricted space F.
Denote f̂ = argmin_{f ∈ F} R_emp(f), the learner based on limited data in the restricted space F.
EPE(f̂) − EPE(f*)
= [EPE(f̂) − EPE(f̃)] + [EPE(f̃) − EPE(f*)]
= estimation error + approximation error.
Estimation Error and Approximation Error
The gap EPE(f̂) − EPE(f̃) is called the estimation error, which is due to the limited training data.
The gap EPE(f̃) − EPE(f*) is called the approximation error. It is incurred from approximating f* with a restricted model space or via regularization, since it is possible that f* ∉ F.
This decomposition is similar to the bias-variance trade-off, with
estimation error playing the role of variance,
approximation error playing the role of bias.
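The two gaps can be made concrete in a noiseless toy setup (entirely our own assumption): truth f*(x) = x with X ~ U(0, 1) and Y = X, restricted class F = {constant functions}. Then f̃ ≡ E[X] = 0.5 with EPE(f̃) = Var(X) = 1/12, and the ERM fit f̂ is the sample mean:

```python
import random

random.seed(4)
# Assumed setup: X ~ U(0,1), Y = X (noiseless), F = {constant functions}.
data = [random.random() for _ in range(50)]
f_hat = sum(data) / len(data)        # ERM over constants = sample mean

def epe_const(c):
    """E[(Y - c)^2] with Y = X ~ U(0,1): Var(X) + (E[X] - c)^2 = 1/12 + (0.5 - c)^2."""
    return 1.0 / 12.0 + (0.5 - c) ** 2

approx_err = epe_const(0.5) - 0.0              # EPE(f~) - EPE(f*) = 1/12
est_err = epe_const(f_hat) - epe_const(0.5)    # (0.5 - f_hat)^2, shrinks with n
total_gap = epe_const(f_hat) - 0.0             # EPE(f_hat) - EPE(f*)
# Here total_gap = est_err + approx_err holds exactly.
```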