Introduction to Predictive Learning
Electrical and Computer Engineering
LECTURE SET 4
Statistical Learning Theory
OUTLINE of Set 4
- Objectives and Overview
- Inductive Learning Problem Setting
- Keep-It-Direct Principle
- Analysis of ERM
- VC-dimension
- Generalization Bounds
- Structural Risk Minimization (SRM)
- Summary and Discussion
Objectives
Problems with philosophical approaches:
- lack of quantitative description/characterization of ideas
- no real predictive power (as in the natural sciences)
- no agreement on basic definitions/concepts (as in the natural sciences)
Goal: to introduce Predictive Learning as a scientific discipline.
Characteristics of Scientific Theory
- Problem setting
- Solution approach
- Math proofs (technical analysis)
- Constructive methods
- Applications
Note: the problem setting and the solution approach are independent of each other.
History and Overview
SLT, aka VC-theory (Vapnik-Chervonenkis): a theory for estimating dependencies from finite samples (the predictive learning setting).
- Based on the risk minimization approach.
- All main results were originally developed in the 1970s for classification (pattern recognition) - why? - but remained largely unknown.
- Recent renewed interest is due to the practical success of Support Vector Machines (SVM).
History and Overview (cont'd)
MAIN CONCEPTUAL CONTRIBUTIONS
- Distinction between the problem setting, the inductive principle, and learning algorithms
- Direct approach to estimation with finite data (the KID principle)
- Math analysis of ERM (the standard inductive setting)
- Two factors responsible for generalization:
  - empirical risk (fitting error)
  - complexity (capacity) of the approximating functions
Importance of VC-theory
- Math results addressing the main question: under what general conditions does the ERM approach lead to (good) generalization?
- A new approach to induction: predictive modeling, vs generative modeling (as in classical statistics).
- Connection to the philosophy of science:
  - VC-theory was developed for binary classification (pattern recognition) ~ the simplest generalization problem
  - natural sciences: from observations to scientific law
  VC-theoretical results can be interpreted using general philosophical principles of induction, and vice versa.
Empirical Risk Minimization
ERM principle in model-based learning:
- Model parameterization: f(x, w)
- Loss function: L(f(x, w), y)
- Estimate the risk from data:
$$R_{emp}(w) = \frac{1}{n}\sum_{i=1}^{n} L\big(f(x_i, w),\, y_i\big)$$
- Choose w* that minimizes R_emp(w)
Statistical Learning Theory developed from the theoretical analysis of the ERM principle under finite-sample settings.
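To make the ERM principle concrete, here is a minimal Python sketch (squared loss, linear parameterization; the toy data and helper names are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def empirical_risk(w, X, y):
    """R_emp(w) = (1/n) * sum_i L(f(x_i, w), y_i), with squared loss."""
    predictions = X @ w                      # linear model f(x, w) = <w, x>
    return np.mean((predictions - y) ** 2)

# Toy data: n = 50 noisy samples of a linear target.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0, 1, 50)])   # features [1, x]
y = 2.0 + 3.0 * X[:, 1] + rng.normal(0, 0.1, 50)

# For squared loss and a linear parameterization, the ERM solution w*
# has a closed form (ordinary least squares).
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_star, empirical_risk(w_star, X, y))
```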
Probabilistic Modeling vs ERM
Probabilistic Modeling vs ERM: Example
[Figure: two-class data in the (x1, x2) plane; the known class distributions determine the optimal decision boundary]
Probabilistic Approach
Estimate the parameters of the Gaussian class distributions, and plug them into the quadratic decision boundary.
[Figure: quadratic decision boundary estimated by the probabilistic approach, in the (x1, x2) plane]
ERM Approach
Quadratic and linear decision boundaries estimated via minimization of squared loss.
[Figure: quadratic and linear decision boundaries estimated via squared-loss minimization, in the (x1, x2) plane]
Estimation of multivariate functions
Is it possible to estimate a function from finite data?
- Simplified problem: estimation of an unknown continuous function from noise-free samples.
- Many results from function approximation theory: to estimate a d-dimensional function accurately, one needs O(n^d) data points.
- For example, if 3 points are needed to estimate a second-order polynomial for d = 1, then 3^10 = 59,049 points are needed to estimate a second-order polynomial in 10-dimensional space.
- Similar results hold in signal processing.
- There are never enough data points to estimate multivariate functions in most practical applications (image recognition, genomics, etc.).
- For multivariate function estimation, the number of free parameters increases exponentially with problem dimensionality (the Curse of Dimensionality).
OUTLINE of Set 4
- Objectives and Overview
- Inductive Learning Problem Setting
- Keep-It-Direct Principle
- Analysis of ERM
- VC-dimension
- Generalization Bounds
- Structural Risk Minimization (SRM)
- Summary and Discussion
Keep-It-Direct Principle
The goal of learning is generalization, rather than estimation of the true function (system identification).
Keep-It-Direct Principle (Vapnik, 1995): do not solve an estimation problem of interest by solving a more general (harder) problem as an intermediate step.
A good predictive model reflects some properties of the unknown distribution P(x, y), namely small prediction risk:
$$R(w) = \int L\big(y, f(x, w)\big)\, dP(x, y) \to \min$$
Since model estimation with finite data is ill-posed, one should never try to solve a more general problem than is required by the given application.
Hence the importance of formalizing application requirements as a learning problem.
Learning vs System Identification
Consider the regression problem y = g(x) + noise, where the unknown target function is g(x) = E(y/x).
Goal 1: Prediction
$$R(w) = \int \big(y - f(x, w)\big)^2\, dP(x, y) \to \min$$
Goal 2: Function approximation (system identification)
$$R(w) = \int \big(g(x) - f(x, w)\big)^2\, dx \to \min$$
or, equivalently, closeness of f(x, w) to E(y/x).
Admissible models: algebraic polynomials.
Purpose of the comparison: to contrast goals (1) and (2).
NOTE: most applications assume Goal 2, i.e., Noisy Data ~ true signal + noise.
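The contrast between the two goals can be illustrated with a small simulation. The sketch below is purely illustrative (the target g(x), the non-uniform input density, and all names are assumptions): it fits a cubic polynomial by ERM, then compares the prediction risk under P(x) with the function-approximation error over the whole input domain.

```python
import numpy as np

rng = np.random.default_rng(1)
g = lambda x: np.sin(2 * np.pi * x)            # assumed "true" target g(x)

# Inputs concentrated near 0, i.e., a non-uniform P(x); noisy training data.
x_train = rng.beta(1, 4, 30)
y_train = g(x_train) + rng.normal(0, 0.2, 30)
w = np.polyfit(x_train, y_train, 3)            # ERM with squared loss, cubic model

x_test = rng.beta(1, 4, 10_000)                # Goal 1: risk under the same P(x)
pred_risk = np.mean((g(x_test) - np.polyval(w, x_test)) ** 2)

x_grid = np.linspace(0, 1, 10_000)             # Goal 2: error everywhere in [0, 1]
approx_err = np.mean((g(x_grid) - np.polyval(w, x_grid)) ** 2)

# Typically pred_risk << approx_err: the model predicts well where P(x)
# puts its mass, yet can be a poor estimate of g(x) elsewhere.
print(pred_risk, approx_err)
```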
Regression estimates (2 typical realizations of the data):
Dotted line ~ estimate obtained using the predictive learning setting; dashed line ~ estimate obtained via the function approximation setting.
Conclusion
- The goal of prediction (1) is different from (and less demanding than) the goal of estimating the true target function (2) everywhere in the input space.
- The curse of dimensionality applies to the system identification setting (2), but may not hold under the predictive setting (1).
- Both settings coincide if the input distribution is uniform (as in signal and image denoising applications).
Philosophical Interpretation of KID
Interpretation of predictive models:
- Realism ~ objective truth (hidden in Nature)
- Instrumentalism ~ a creation of the human mind (imposed on the data) - the view favored by KID
Objective evaluation is still possible (via a prediction risk reflecting application needs) → natural science.
Methodological implications:
- Importance of good learning formulations (asking the right question)
- This accounts for ~80% of success in applications.
OUTLINE of Set 4
- Objectives and Overview
- Inductive Learning Problem Setting
- Keep-It-Direct Principle
- Analysis of ERM
- VC-dimension
- Generalization Bounds
- Structural Risk Minimization (SRM)
- Summary and Discussion
VC-theory has 4 parts:
1. Analysis of consistency/convergence of ERM:
$$R_{emp}(\omega) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(x_i, \omega)\big) \to \min$$
2. Generalization bounds
3. Inductive principles (for finite samples)
4. Constructive methods (learning algorithms) for implementing (3)
NOTE: (1) → (2) → (3) → (4)
Consistency/Convergence of ERM
The empirical risk is known, but the expected risk is unknown.
Asymptotic consistency requirement: under what (general) conditions will models providing minimum empirical risk also provide minimum prediction risk, as the number of samples grows large?
Why is asymptotic analysis needed?
- it helps to develop useful concepts
- necessary and sufficient conditions ensure that VC-theory is general and cannot be improved
Conditions for Consistency of ERM
Main insight: consistency is not possible without restricting the set of possible models.
- Example: the 1-nearest-neighbor classification method - is it consistent?
Consider binary decision functions (classification). How do we measure their flexibility, i.e., their ability to explain the training data (for binary classification)?
This complexity index for indicator functions:
- is independent of the unknown data distribution
- measures the capacity of a set of possible models, rather than characteristics of the true model
OUTLINE of Set 4
- Objectives and Overview
- Inductive Learning Problem Setting
- Keep-It-Direct Principle
- Analysis of ERM
- VC-dimension
- Generalization Bounds
- Structural Risk Minimization (SRM)
- Summary and Discussion
SHATTERING
Linear indicator functions can split 3 data points in 2D in all 2^3 = 8 possible binary partitions.
If a set of n samples can be separated by a set of functions in all 2^n possible ways, this sample is said to be shattered (by the set of functions).
Shattering ~ a set of models can explain a given sample of size n (for all possible labelings). A brute-force check is sketched below.
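For small samples, shattering can be checked mechanically: enumerate all 2^n labelings and test each one for linear separability. Below is a sketch (assuming numpy and scipy; posing strict separability as a linear feasibility problem is just one convenient test):

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """Strict linear separability <=> exists (w, b) with y_i*(w.x_i + b) >= 1,
    checked as a linear feasibility problem (zero objective)."""
    n, d = X.shape
    A = -(y[:, None] * np.column_stack([X, np.ones(n)]))   # rows: -y_i*(x_i, 1)
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0

def shattered(X):
    """True if linear indicator functions realize all 2^n labelings of X."""
    return all(separable(X, np.array(labels))
               for labels in product([-1, 1], repeat=len(X)))

three = np.array([[0, 0], [1, 0], [0, 1]])
four = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])   # the XOR labeling fails
print(shattered(three))   # True  -> all 2^3 = 8 dichotomies realized
print(shattered(four))    # False -> consistent with h = d + 1 = 3 in 2D
```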
VC DIMENSION
Definition: a set of functions has VC-dimension h if there exist h samples that can be shattered by this set of functions, but there are no h+1 samples that can be shattered.
For linear indicator functions in 2D, VC-dimension h = 3 (in general, h = d + 1 for linear functions in d dimensions).
The VC-dimension is a positive integer (a combinatorial index). What is the VC-dimension of the 1-nearest-neighbor classifier?
VC-dimension and Consistency of ERM
- The VC-dimension is infinite if a sample of size n can be split in all 2^n possible ways (in this case, no valid generalization is possible).
- Finite VC-dimension gives necessary and sufficient conditions for:
  (1) consistency of ERM-based learning
  (2) a fast rate of convergence
  (these conditions are distribution-independent)
- Interpretation of the VC-dimension via falsifiability: functions with small VC-dimension can be easily falsified.
VC-dimension and Falsifiability
A set of functions has VC-dimension h if:
(a) it can explain (shatter) a set of h samples ~ there exist h samples that cannot falsify it, and
(b) it cannot shatter h+1 samples ~ any h+1 samples falsify this set.
Finiteness of the VC-dimension is a necessary and sufficient condition for generalization (for any learning method based on ERM).
Recall Occam's Razor
Main problem in predictive learning:
- complexity control (model selection)
- how to measure complexity?
Interpretation of Occam's razor (in statistics):
- Entities ~ model parameters
- Complexity ~ degrees of freedom
- Necessity ~ explaining (fitting) the available data
So model complexity = the number of parameters (DoF). This is consistent with the classical statistical view: learning = function approximation / density estimation.
Philosophical Principle of VC-falsifiability
Occam's Razor: select the model that explains the available data and has the smallest number of free parameters (entities).
VC theory: select the model that explains the available data and has low VC-dimension (i.e., can be easily falsified) - a new principle of VC-falsifiability.
Calculating the VC-dimension
How can the VC-dimension be estimated (for a given set of functions)?
- Apply the definition (via shattering) to derive analytic estimates; this works for simple sets of functions.
- Generally, such analytic estimates are not possible for complex nonlinear parameterizations (i.e., for practical machine learning and statistical methods).
Example: VC-dimension of spherical indicator functions
Consider spherical decision surfaces in a d-dimensional x-space, parameterized by a center c and a radius r:
$$f(x, c, r) = I\big(\|x - c\|^2 \le r^2\big)$$
In a 2-dimensional space (d = 2) there exist 3 points that can be shattered, but 4 points cannot be shattered, hence h = 3.
Example: VC-dimension of a linear combination of fixed basis functions (e.g., polynomials, Fourier expansion, etc.). Assuming the basis functions are linearly independent, the VC-dimension equals the number of basis functions (free parameters).
Example: a single parameter but infinite VC-dimension:
$$f(x, w) = I\big(\sin(wx) > 0\big)$$
A numerical demonstration is sketched below.
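This classic example can be checked numerically. For the points x_i = 10^(-i), Vapnik's construction gives a closed-form frequency w that reproduces any desired labeling (a sketch; the specific labeling below is arbitrary):

```python
import numpy as np

# Vapnik's construction: the points x_i = 10^{-i} are shattered by
# f(x, w) = I(sin(wx) > 0), since for ANY labels y_i in {-1, +1} the choice
#   w = pi * (1 + sum_i ((1 - y_i)/2) * 10^i)
# reproduces them exactly.
n = 6
i = np.arange(1, n + 1)
x = 10.0 ** -i
y = np.array([1, -1, -1, 1, -1, 1])             # an arbitrary labeling
w = np.pi * (1 + np.sum((1 - y) / 2 * 10.0 ** i))
y_hat = np.where(np.sin(w * x) > 0, 1, -1)
print(np.array_equal(y_hat, y))                  # True: all labels reproduced
```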
Example: Wide linear decision boundaries
Consider linear decision functions $D(x) = (w \cdot x) + b$ such that the distance between the boundary D(x) = 0 and the closest data sample is larger than a given value $\Delta$. Then the VC-dimension depends on the margin width $\Delta$, rather than on the dimensionality d (as for ordinary linear models):
$$h \le \min\left(\frac{R^2}{\Delta^2},\; d\right) + 1$$
where R is the radius of the smallest sphere containing the training data.
VC-dimension vs number of parameters
- VC-dimension can be equal to the DoF (number of parameters). Example: linear estimators.
- VC-dimension can be smaller than the DoF. Example: penalized estimators.
- VC-dimension can be larger than the DoF. Examples: feature selection; sin(wx).
VC-dimension for Regression Problems
The VC-dimension was defined for indicator functions. It can be extended to real-valued functions, e.g., a third-order polynomial for univariate regression:
$$f(x, w, b) = w_3 x^3 + w_2 x^2 + w_1 x + b$$
This is a linear parameterization, so VC-dim = 4.
Qualitatively, the VC-dimension ~ the ability to fit (or explain) finite training data for regression; this can also be related to falsifiability.
OUTLINE of Set 4
- Objectives and Overview
- Inductive Learning Problem Setting
- Keep-It-Direct Principle
- Analysis of ERM
- VC-dimension
- Generalization Bounds
- Structural Risk Minimization (SRM)
- Summary and Discussion
Recall consistency of ERM
Two types of VC-bounds:
(1) How close is the empirical risk to the true risk, i.e., $R_{emp}(\omega^*)$ vs $R(\omega^*)$?
(2) How close is the achieved risk to the minimal possible risk?
Generalization Bounds
Bounds for learning machines (implementing ERM) evaluate the difference between the (unknown) true risk and the known empirical risk, as a function of the sample size n and general properties of the admissible models (their VC-dimension).
Classification: the following bound holds with probability $1 - \eta$ for all approximating functions:
$$R(\omega) \le R_{emp}(\omega) + \Phi\!\left(n/h,\, \ln\eta/n\right)$$
where $\Phi$ is called the confidence interval.
Regression: the following bound holds with probability $1 - \eta$ for all approximating functions:
$$R(\omega) \le \frac{R_{emp}(\omega)}{\big(1 - c\sqrt{\varepsilon}\big)_{+}}$$
where
$$\varepsilon = a_1\, \frac{h\big[\ln(a_2 n/h) + 1\big] - \ln(\eta/4)}{n}$$
Practical VC Bound for regression
A practical regression bound can be obtained by setting the confidence level to $\eta = \min(4/\sqrt{n},\, 1)$ and choosing practical values of the theoretical constants:
$$R(h) \le R_{emp}(h)\left(1 - \sqrt{p - p\ln p + \frac{\ln n}{2n}}\right)_{+}^{-1}, \qquad p = h/n$$
- This bound can be used for model selection (examples given later; a sketch follows below).
- Compare to the analytic bounds (SC, FPE) in Lecture Set 2.
- Analysis (of the denominator) shows that h < 0.8 n for any estimator; in practice, h < 0.5 n for any estimator.
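A sketch of model selection with this practical bound, for polynomial regression (illustrative only: the noisy sine-squared target anticipates the example on the next slide, and the penalization factor follows the bound above):

```python
import numpy as np

def vc_penalty(h, n):
    """Practical VC penalization factor (1 - sqrt(p - p*ln(p) + ln(n)/(2n)))^{-1},
    with p = h/n; returns +inf when the denominator is non-positive."""
    p = h / n
    denom = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2 * n))
    return np.inf if denom <= 0 else 1.0 / denom

# Model selection for polynomial regression: pick the degree whose
# penalized empirical risk (VC bound) is smallest.
rng = np.random.default_rng(2)
n = 25
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0, 0.1, n)   # assumed noisy target

bounds = {}
for degree in range(1, 11):
    w = np.polyfit(x, y, degree)
    r_emp = np.mean((y - np.polyval(w, x)) ** 2)
    h = degree + 1                    # VC-dim of a linear-in-parameters model
    bounds[degree] = r_emp * vc_penalty(h, n)

print(min(bounds, key=bounds.get))    # degree with the smallest VC bound
```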
VC Regression Bound for model selection
The VC-bound can be used for analytic model selection (if the VC-dimension is known).
Example: polynomial regression for estimating the Sine-Squared target function from 25 noisy samples.
Optimal model found: 6th-degree polynomial (no resampling needed).
[Figure: 6th-degree polynomial estimate of the Sine-Squared target, fitted to 25 noisy samples]
Modeling pure noise with x in [0, 1] via polynomial regression; sample size n = 30, noise level 1.
Comparison of different model selection methods, in terms of:
- prediction risk (MSE)
- selected DoF (~ h)
[Figure: box plots of prediction risk (MSE) and selected degrees of freedom for the fpe, gcv, vc, and cv model selection methods]
Structural Risk Minimization
Analysis of the generalization bound
$$R(\omega) \le R_{emp}(\omega) + \Phi\!\left(n/h,\, \ln\eta/n\right)$$
suggests that when n/h is large, the confidence term $\Phi$ is small, so that $R(\omega) \approx R_{emp}(\omega)$. This leads to the parametric modeling approach (ERM).
When n/h is not large (say, less than 20), both terms on the right-hand side of the VC-bound need to be minimized, i.e., the VC-dimension becomes a controlling variable.
SRM = a formal mechanism for controlling model complexity. The set of admissible models f(x, ω) has a nested structure
$$S_1 \subset S_2 \subset \dots \subset S_k \subset \dots, \qquad h_1 \le h_2 \le \dots \le h_k \le \dots$$
so that the structure formally defines a complexity ordering.
Structural Risk Minimization
An upper bound on the true risk, and the empirical risk, as functions of the VC-dimension h (for a fixed sample size n).
SRM vs ERM modeling
SRM Approach
Use the VC-dimension as a controlling parameter for minimizing the VC bound
$$R(\omega) \le R_{emp}(\omega) + \Phi(n/h)$$
Two general strategies for implementing SRM:
1. Keep $\Phi(n/h)$ fixed and minimize $R_{emp}(\omega)$ (most statistical and neural-network methods).
2. Keep $R_{emp}(\omega)$ fixed and minimize $\Phi(n/h)$ (Support Vector Machines).
Common SRM structures: Dictionary structure
A set of algebraic polynomials
$$f_m(x, w, b) = b + \sum_{i=1}^{m} w_i x^i$$
is a structure, since $f_1 \subset f_2 \subset \dots \subset f_k \subset \dots$
More generally,
$$f_m(x, w, V, b) = b + \sum_{i=1}^{m} w_i\, g(x, v_i)$$
where the $g(x, v_i)$ are taken from a set of basis functions (a dictionary).
The number of terms (basis functions) m specifies an element of the structure. For fixed basis functions, VC-dim ~ the number of parameters.
Common SRM structures: Feature selection (aka subset selection)
Consider sparse polynomials with m terms:
for m = 1: $f(x, b, w_1, k_1) = b + w_1 x^{k_1}$
for m = 2: $f(x, b, w_1, w_2, k_1, k_2) = b + w_1 x^{k_1} + w_2 x^{k_2}$
etc. Each monomial is a feature. The goal is to select the set of m features providing minimum empirical risk (MSE). This is a structure, since $f_{m=1} \subset f_{m=2} \subset \dots$
More generally,
$$f_m(x, w, V) = \sum_{i=1}^{m} w_i\, g(x, v_i) + w_0$$
where the m basis functions are selected from a (large) set of M functions; a brute-force sketch follows below.
Note: this requires nonlinear optimization, and the VC-dimension is unknown.
Common SRM structures: Penalization
Consider algebraic polynomials of fixed degree,
$$f(x, w) = \sum_{i=0}^{10} w_i x^i \quad \text{with} \quad \|w\|^2 \le c_k, \qquad c_1 < c_2 < c_3 < \dots$$
For each (positive) value $c_k$, the set $S_k = \{f(x, w) : \|w\|^2 \le c_k\}$ specifies an element of a structure. Minimization of the empirical risk (MSE) on each element of the structure is a constrained minimization problem.
This optimization problem can be equivalently stated as minimization of the penalized empirical risk functional
$$R_{pen}(\omega, \lambda_k) = R_{emp}(\omega) + \lambda_k \|w\|^2$$
where the choice of $\lambda_k \sim 1/c_k$ indexes the element of the structure; a minimal sketch follows below.
Note: the VC-dimension is unknown.
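A minimal sketch of the penalization structure, assuming a squared-norm (ridge) penalty on a fixed degree-10 polynomial and the regression target used later in this lecture set; decreasing lambda_k corresponds to relaxing the constraint c_k:

```python
import numpy as np

def ridge_poly_fit(x, y, degree=10, lam=1e-5):
    """Penalized ERM: minimize R_emp(w) + lam * ||w||^2 for a fixed-degree
    polynomial (ridge regression in the monomial basis)."""
    X = np.column_stack([x ** i for i in range(degree + 1)])
    # Closed-form minimizer of the penalized empirical risk functional.
    w = np.linalg.solve(X.T @ X + lam * len(x) * np.eye(degree + 1), X.T @ y)
    return w

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 40)
y = 0.8 * np.sin(2 * np.pi * x) + 0.2 * x ** 2 + 0.5 * x + rng.normal(0, 0.05, 40)

# Decreasing lambda ~ increasing c_k: richer elements of the structure.
for lam in [1.0, 1e-2, 1e-5]:
    w = ridge_poly_fit(x, y, lam=lam)
    print(lam, np.sum(w ** 2))   # ||w||^2 grows as the constraint is relaxed
```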
Example: SRM structures for regression
Regression data set:
- x-values ~ uniformly sampled in [0, 1]
- y-values ~ target function $y = 0.8\sin(2\pi x) + 0.2x^2 + 0.5x$ plus additive Gaussian noise with st. dev. 0.05
Experimental set-up:
- training set ~ 40 samples
- validation set ~ 40 samples (for model selection)
SRM structures defined on algebraic polynomials:
- dictionary (polynomial degrees 1 to 10)
- penalization (fixed degree-10 polynomial)
- sparse polynomials (degree 1 to 5)
Estimated models using the different SRM structures:
- dictionary: a fifth-degree polynomial (intercept plus monomials x through x^5)
- penalization: a degree-10 polynomial with lambda = 1.013e-05
- sparse polynomial: intercept plus the selected monomials x^2, x^3, x^4
Visual results: target function ~ red line, feature selection ~ black solid, dictionary ~ green, penalization ~ yellow line.
SRM Summary
- SRM structure ~ a complexity ordering on a set of admissible models (approximating functions).
- Many different structures can be defined on the same set of approximating functions (possible models).
- How to choose the best structure?
  - depends on the application data
  - VC theory cannot provide the answer
- SRM = a mechanism for complexity control:
  - selecting the optimal complexity for a given data set
  - a new measure of complexity: the VC-dimension
  - model selection using analytic VC-bounds
Summary and Discussion: VC-theory
- Methodology:
  - learning problem setting (the KID principle)
  - concepts (risk minimization, VC-dimension, structure)
- Interpretation/evaluation of existing methods
- Model selection using VC-bounds
- New types of inference (TBD later)
- What the theory cannot do:
  - provide a formalization (for a given application)
  - select a good structure
  - there is always a gap between theory and applications