TRANSCRIPT
Machine Learning Concepts
Professor Marie Roch, San Diego State University
Basic definitions & concepts
• Task – What the system is supposed to do, e.g. ASR: f(input speech) → list of words
• Performance measure – How well does it work?
• Experience – How does the machine learn the task?
Types of experiences: how a learner learns…
• Supervised learning – Learn class-conditional distributions $P(x \mid \omega)$; implies class labels are known
• Unsupervised learning – No labels are provided; learn $P(x)$ and possibly group the x's into clusters
• Reinforcement learning – Learner actions are associated with payouts for actions in an environment
Experience data set: design matrix

$$X = \begin{bmatrix}
x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,D} \\
x_{2,1} &         &         &        &         \\
x_{3,1} &         &         &        &         \\
\vdots  &         &         & \ddots &         \\
x_{N,1} & x_{N,2} & x_{N,3} & \cdots & x_{N,D}
\end{bmatrix}
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}$$

Each row of $X$ is a feature vector; the column $y$ holds the labels (supervised problems).
Learning a set of functions may be used if some features are missing, e.g. $f_0$ when all features are present, $f_1$ if feature 1 is missing, etc.
A Cardinal Rule of Machine Learning
THOU SHALT NOT TEST ON THY TRAINING DATA
Performance
• A metric that measures how well a learner is able to accomplish the task
• Metrics can vary significantly (more on these later), examples:
  – loss functions such as squared error
  – cross entropy
Partitioning data
• Training data – experience for the learner
• Test data – performance measurement
• Evaluation data
  – Only used once all adjustments are made
  – It is common to iterate: train → test → adjust → train again
• Adjustments are a form of training (see Goodfellow et al. 5.3)
• Evaluation provides an independent test
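A minimal sketch of this three-way partition with scikit-learn; the split fractions, random seed, and toy data are illustrative assumptions, not part of the slides:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 5 features (illustrative only).
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, size=100)

# Carve off an evaluation set that is touched only once, after all adjustments.
X_rest, X_eval, y_rest, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the remainder into training (experience) and test (adjustment) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_test), len(X_eval))  # 60 20 20
```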
Regression
Given a set of features and observed responses, predict the response.
(IPCC 4th Assessment Report, Climate Change 2007: observed changes in number of warm days/nights with extreme temperatures, normative period 1961–1990)
Linear regression: a simple learning algorithm

Predict a response from data:
• $\hat{y} = w^T x$, where $w, x \in \mathbb{R}^N$ and $\hat{y} \in \mathbb{R}$
• w is the weight vector
• Goal: maximize performance on the test set
• Learn w to minimize some criterion, e.g. mean squared error (MSE)
$$MSE_{test} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_{test}^{(i)} - y_{test}^{(i)}\right)^2 \equiv \frac{1}{m}\left\|\hat{y}_{test} - y_{test}\right\|_2^2$$

$\|x\|_p$ denotes the $L^p$ norm $\left(\sum_i |x_i|^p\right)^{1/p}$ (read Goodfellow et al. 2.5).
Linear regression
• Cannot estimate w from test data
• Use the training data
• Minimize
$$MSE_{train} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_{train}^{(i)} - y_{train}^{(i)}\right)^2 \equiv \frac{1}{m}\left\|\hat{y}_{train} - y_{train}\right\|_2^2$$
For convenience, we will usually omit the variable descriptor train when describing training.
Linear regression
MSE minimized when gradient is zero
$$\nabla_w MSE = 0$$
$$\nabla_w \frac{1}{m}\left\|\hat{y} - y\right\|_2^2 = 0$$
$$\nabla_w \left\|\hat{y} - y\right\|_2^2 = 0 \quad \text{(the factor } 1/m \text{ does not change the minimizer)}$$
$$\nabla_w \left\|Xw - y\right\|_2^2 = 0 \quad \text{as } \hat{y} = Xw$$
Linear regression
$$\nabla_w \left\|Xw - y\right\|_2^2 = 0$$
$$\nabla_w \left(Xw - y\right)^T\left(Xw - y\right) = 0 \quad \text{norm in matrix notation}$$
$$\nabla_w \left(w^T X^T - y^T\right)\left(Xw - y\right) = 0 \quad \text{transpose distributive over addition, } (AB)^T = B^T A^T \text{ (Goodfellow et al. eqn 2.9)}$$
$$\nabla_w \left(w^T X^T X w - w^T X^T y - y^T X w + y^T y\right) = 0$$
$$\nabla_w \left(w^T X^T X w - 2 w^T X^T y + y^T y\right) = 0 \quad \text{as } y^T X w = \left(y^T X w\right)^T = w^T X^T y \text{ (a scalar)}$$
Linear regression
$$\nabla_w \left(w^T X^T X w - 2 w^T X^T y + y^T y\right) = 0$$
$$\nabla_w \left(w^T X^T X w - 2 w^T X^T y\right) = 0 \quad \text{as } y^T y \text{ is independent of } w$$
$$2 X^T X w - 2 X^T y = 0 \quad \text{derivative}$$
$$X^T X w = X^T y$$
$$w = \left(X^T X\right)^{-1} X^T y \quad \text{matrix inverse, } A^{-1} A = I$$
These are referred to as the normal equations
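A minimal numpy sketch of solving the normal equations on toy data; the data, variable names, and use of np.linalg.solve are illustrative assumptions (a pseudo-inverse or lstsq is often preferred for numerical stability):

```python
import numpy as np

# Toy data: m samples, D features, generated from a known weight vector plus noise.
rng = np.random.default_rng(0)
m, D = 50, 3
X = rng.normal(size=(m, D))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=m)

# Normal equations: w = (X^T X)^{-1} X^T y
w_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalent, numerically safer: solve X^T X w = X^T y directly.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(w_hat, w_solve)  # both close to w_true
```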
Linear regression
(Fig. 5.1, Goodfellow et al.: the normal equations optimize w)
Linear regression
The regression formula forces the curve to pass through the origin. Remove the restriction:
• Add a bias term: $\hat{y} = w^T x + b$
• To use the normal equations, use a modified x:
$$x_{mod} = \begin{bmatrix} x_1 & x_2 & \cdots & x_D & 1 \end{bmatrix}^T$$
• The last term of the new weight vector w is the bias
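A small illustration of this trick, appending a constant 1 to each feature vector so the normal equations also learn the bias (toy data and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
m, D = 50, 2
X = rng.normal(size=(m, D))
y = X @ np.array([2.0, -1.0]) + 3.0 + 0.05 * rng.normal(size=m)  # true bias b = 3

# Append a column of ones: x_mod = [x_1, ..., x_D, 1]
X_mod = np.hstack([X, np.ones((m, 1))])

w_mod = np.linalg.solve(X_mod.T @ X_mod, X_mod.T @ y)
print(w_mod)  # last entry is the learned bias, close to 3
```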
Notes on learning
• A learner that performs well on unseen data is said to generalize well.
• When will learning on training data produce good generalization?
  – Training and test data drawn from the same distribution
  – Large enough training set to learn the distribution
Underfitting & Overfitting
• Underfit – Model cannot learn the training data well
• Overfit – Model does not generalize well
• These properties are related to model capacity
Capacity
What kind of functions can we learn?
(Fig. 5.2, Goodfellow et al.)
$$\hat{y} = w_0 x^0 + w_1 x^1$$
$$\hat{y} = w_0 x^0 + w_1 x^1 + w_2 x^2$$
$$\hat{y} = \sum_{i=0}^{D} w_i x^i, \quad D = 9$$
Here, model order affects capacity
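A quick sketch of this effect using numpy polynomial fits of different orders on the same noisy data; the degrees, data, and noise level are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

for degree in (1, 3, 9):
    # np.polyfit returns coefficients of a degree-`degree` least-squares polynomial fit.
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    mse = np.mean((y_hat - y) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")

# Training error shrinks as the degree grows, but the degree-9 fit will
# generalize poorly to new points from the same sine curve (overfitting).
```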
Shattering points
Points are shattered if a classifier can separate them for every possible assignment of binary labels.
Hastie et al., 2009, Fig 7.6
We can shatter three points, but not four, with a linear classifier.
Capacity
• Representational capacity – the best function that can be learned within a set of learnable functions
• Finding it is frequently a difficult optimization problem
  – We might learn a suboptimal solution
  – What is actually achievable is called the effective capacity
Measuring capacity
• Model order is not always a good predictor of capacity
Hastie et al., 2009, Fig. 7.5
Label determined by the sign of the function. Increasing the frequency of the sinusoid enables ever finer distinctions…
Measuring capacity
• Vapnik–Chervonenkis (VC) dimension
  – for binary classifiers
  – the largest number of points that can be shattered by a family of classifiers
• In practice, hard to estimate for deep learners… so why do we care?
Predictions about capacity
• Goal: Minimize the generalization error
• Hard; perhaps minimize the difference Δ = training error − generalization error
• Learning theory
  – Models with higher capacity have a higher upper bound on Δ
  – Increasing the amount of training data decreases Δ's upper bound
Recap on capacity
Hastie et al., 2009, Fig 7.6
The NO FREE LUNCH Theorem
The expected performance of any classifier, averaged across all possible generalization tasks, is no better than that of any other classifier.
A classifier might be better for some tasks, but no classifier is universally better than others.
N-fold cross validation
• Problem:
  – More data yields better training
  – Getting more data can be expensive
• Workaround (see the sketch below):
  – Partition the data into N different groups
  – Train on N−1 groups, test on the last group
  – Rotate to obtain N different evaluations
Scikit-learn's KFold and StratifiedKFold will split example indices for you: sklearn.model_selection.*
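A minimal sketch of N-fold cross validation with KFold; the toy data, the 5-fold choice, and the LinearRegression estimator are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 on the held-out fold

print(np.mean(scores))
```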
Regularization
• Remember: Learners select a solution function from a set of hypothesis functions.
• Optimization picks the function that minimizes some optimization criterion
• Regularization lets us express a preference for certain types of solutions
Example: High-Dimensional Polynomial Fit
• Suppose we want small coefficients. Remember: $\hat{y} = w^T x$
• This happens when $\Omega(w) = w^T w$ is small, and results in shallower slopes
• We can define a new criterion:
$$J(w) = MSE(w) + \lambda\,\Omega(w) = MSE(w) + \lambda\, w^T w$$
where λ controls the regularization's influence
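A minimal sketch of this regularized criterion in closed form (ridge regression); the toy data, λ value, and the placement of the m factor (which comes from the 1/m in MSE) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
m, D = 30, 5
X = rng.normal(size=(m, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=m)

lam = 0.1  # regularization strength lambda (arbitrary choice)

# Minimizing (1/m)*||Xw - y||^2 + lam * w^T w leads to
#   w = (X^T X + m*lam*I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + m * lam * np.eye(D), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(np.linalg.norm(w_ridge), np.linalg.norm(w_ols))  # ridge weights have smaller norm
```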
9th degree polynomial fit with regularization
(Goodfellow et al., Fig. 5.5)
Point estimators
• An approximation of interest
• Examples:
– a statistic of a distribution
– a parameter of a classifiere.g. a mixture model weight or distribution parameter
e.g. $\hat{\mu} = \frac{1}{N}\sum_i x^{(i)}$ is an approximation of $\mu$
Point estimators
In general, it is a function of data
and may even be a classifier function that maps data to a label (function estimation).
$$\hat{\Theta}_m = f\left(x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\right)$$
Bias
How far from the true value is our estimator?
Goodfellow et al. give an example with a Bernoulli distribution that we have not yet covered (read 3.9.1). Bernoulli distributions are good for modeling binary events (e.g., 42 heads in 100 coin tosses).
$$\text{bias}\left(\hat{\theta}_m\right) = E\left[\hat{\theta}_m\right] - \theta$$
Bias of sample mean
$$\text{bias}\left(\hat{\mu}_m\right) = E\left[\hat{\mu}_m\right] - \mu = E\left[\frac{1}{N}\sum_i x^{(i)}\right] - \mu = \frac{1}{N}\sum_i E\left[x^{(i)}\right] - \mu$$
$$= \frac{1}{N}\cdot N\,E[X] - \mu = \mu - \mu = 0$$
The sample mean is an unbiased estimator; each $x^{(i)}$ is a random variable with expectation $\mu$.
Bias
• Read more examples in Goodfellow et al.
• Bias of classifier functions?
  – We are trying to estimate the Bayes classifier.
  – Bias is the amount of error above that.
Variance
• Already defined: $Var(X) = E\left[(X - \mu)^2\right]$
• Variance of classifier functions
  – Variance of a mean classification result, e.g., error rate
$$Var\left(\hat{\mu}_m\right) = Var\left(\frac{1}{m}\left(X_1 + X_2 + \cdots + X_m\right)\right) = \frac{1}{m^2}\,Var\left(X_1 + X_2 + \cdots + X_m\right) \quad \text{as } Var(kX) = k^2\,Var(X)^{\dagger}$$
$$= \frac{1}{m^2}\, m\sigma^2 = \frac{\sigma^2}{m}$$
or equivalently, the standard error $SE\left(\hat{\mu}_m\right) = \frac{\sigma}{\sqrt{m}}$.

† It can be shown that $Var(X) = E[X^2] - E[X]^2$; $Var(kX) = k^2\,Var(X)$ follows from this.
Bias & Variance
• Variance gives us an idea of classifier sensitivity to different data
• Distribution of mean approaches a normal distribution (central limit theorem)
• Together can estimate with 95% confidence that the real mean lies within:
$$\hat{\mu}_m - 1.96\,SE\left(\hat{\mu}_m\right) \;\le\; \mu \;\le\; \hat{\mu}_m + 1.96\,SE\left(\hat{\mu}_m\right)$$
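A small numeric sketch of this interval on synthetic data (the distribution and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
samples = rng.normal(loc=10.0, scale=2.0, size=200)  # true mean 10, true sigma 2

mu_hat = samples.mean()
se = samples.std(ddof=1) / np.sqrt(samples.size)   # standard error of the mean

lo, hi = mu_hat - 1.96 * se, mu_hat + 1.96 * se
print(f"95% CI for the mean: [{lo:.3f}, {hi:.3f}]")  # should usually contain 10
```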
Information Theory
A quick trip down the rabbit hole...
• Details in Goodfellow et al. 3.13
• Needed for maximum likelihood estimators
Quantity of information
• Amount of surprise that one sees when observing an event.
• If an event is rare, we can derive a large quantity of information from it.
$$I(x_i) = \log\frac{1}{P(x_i)}$$
[Plot: $I(x)$ versus $\Pr(x)$]
Quantity of information
• Why use log?
  – Suppose we want to know the information in two independent events:
$$I(x_1, x_2) = \log\frac{1}{P(x_1, x_2)} = \log\frac{1}{P(x_1)\,P(x_2)} \quad \text{independence}$$
$$= \log\frac{1}{P(x_1)} + \log\frac{1}{P(x_2)} = I(x_1) + I(x_2)$$
Entropy
• Entropy is defined as the expected amount of information (average amount of surprise) and is usually denoted by the symbol H.
$$H(X) = E\left[I(X)\right]$$
$$= \sum_{x_i \in S} P(x_i)\, I(x_i) \quad S \text{ is the set of all possible symbols}$$
$$= \sum_{x_i \in S} P(x_i) \log\frac{1}{P(x_i)} \quad \text{definition of } I(x)$$
$$= E\left[-\log P(X)\right]$$
Discrete vs continuous
Discrete
• Shannon entropy
• Use $\log_2$
• Units called bits, or sometimes Shannons

Continuous
• Differential entropy
• Use $\log_e$
• Units called nats
Example
• Assume
  – $X \in \{0, 1\}$
  – $P(X = 1) = p$, $P(X = 0) = 1 - p$
• Then
$$H(X) = E\left[I(X)\right] = -p \log p - (1 - p)\log(1 - p)$$
[Plot: $H(X)$ versus $p$; Goodfellow et al. Fig. 3.5]
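A short sketch computing this Bernoulli entropy for a few values of p, using base-2 logs so the units are bits (the sampled p values are arbitrary):

```python
import numpy as np

def bernoulli_entropy(p):
    """H(X) = -p log2 p - (1-p) log2 (1-p), with H = 0 at p in {0, 1}."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in (0.01, 0.1, 0.5, 0.9):
    print(f"p = {p}: H = {bernoulli_entropy(p):.3f} bits")

# Entropy (average surprise) is maximal at p = 0.5 and falls toward 0
# as the outcome becomes nearly certain.
```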
Comparing distributions
• How similar are distributions P and Q?
• Recall: $X \sim P$ means that X has distribution P; we denote its probability as $P_P$
• How much more information do we need to represent P’s distribution using Q:
$$E_{X \sim P}\left[I_Q - I_P\right] = E_{X \sim P}\left[-\log P_Q - \left(-\log P_P\right)\right]$$
$$= E_{X \sim P}\left[\log\frac{P_P}{P_Q}\right]$$
$$= \sum_{x} P_P(x)\,\log\frac{P_P(x)}{P_Q(x)}$$
Comparing distributions
• This is known as the Kullback-Leibler (KL) divergence from Q to P:
$$D_{KL}(P\,\|\,Q) = E_{X \sim P}\left[\log\frac{P_P(X)}{P_Q(X)}\right] = E_{X \sim P}\left[\log P_P(X) - \log P_Q(X)\right]$$
Note: $D_{KL}(P\,\|\,Q) \ne D_{KL}(Q\,\|\,P)$; KL divergence is not a distance measure.
$D_{KL}(P\,\|\,Q) = 0$ iff $P \equiv Q$; otherwise $D_{KL}(P\,\|\,Q) > 0$.
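A minimal sketch of the discrete KL divergence, also showing the asymmetry noted above (the two distributions are made-up examples with no zero entries):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # D_KL(P || Q)
print(kl_divergence(q, p))  # D_KL(Q || P) -- generally different: not a distance
```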
Cross entropy H(P,Q)
• Calculates total entropy in two distributions
• Can be shown to have the entropy of P plus the KL divergence from Q to P.
• Interesting as we sometimes want to minimize KL divergence…
$$H(P, Q) = H(P) + D_{KL}(P\,\|\,Q)$$
$$H(P, Q) = E_{X \sim P}\left[-\log P_Q(X)\right]$$
Cross entropy
Minimizing H(P, Q) with respect to Q will minimize the KL divergence:
$$H(P, Q) = E_{X \sim P}\left[-\log P_Q(X)\right] = H(P) + D_{KL}(P\,\|\,Q)$$
$$= H(P) + E_{X \sim P}\left[\log P_P(X) - \log P_Q(X)\right]$$
$$= -E_{X \sim P}\left[\log P_P(X)\right] + E_{X \sim P}\left[\log P_P(X)\right] + E_{X \sim P}\left[-\log P_Q(X)\right]$$
$$= E_{X \sim P}\left[-\log P_Q(X)\right]$$
Maximum Likelihood Estimation (MLE)
• Method to estimate model parameters
• Suppose we have m independent samples drawn from a distribution:
$$\mathbb{X} = \left\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\right\}$$
• Can we fit a model with parameters θ?
$$\theta_{ML} = \arg\max_\theta P(\mathbb{X} \mid \theta)$$
MLE
• The x(i)’s are independent, so
• Transform to something easier…
• Traditional to use derivatives and solve for the maximum value.
$$\theta_{ML} = \arg\max_\theta P(\mathbb{X} \mid \theta) = \arg\max_\theta \prod_i P\left(x^{(i)} \mid \theta\right)$$
$$\theta_{ML} = \arg\max_\theta \prod_i P\left(x^{(i)} \mid \theta\right) = \arg\max_\theta \sum_i \log P\left(x^{(i)} \mid \theta\right) \quad \text{log likelihood}$$
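A tiny sketch of this recipe for the mean of a Gaussian with known variance: sum the per-sample log-likelihoods and pick the best θ from a grid. The grid search, toy data, and scipy usage are illustrative assumptions; the closed-form answer here is simply the sample mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
x = rng.normal(loc=3.0, scale=1.0, size=500)  # samples from N(3, 1)

candidates = np.linspace(0, 6, 601)
# Sum of log P(x_i | theta): the log turns the product of likelihoods into a sum.
log_likelihoods = [norm.logpdf(x, loc=theta, scale=1.0).sum() for theta in candidates]

theta_ml = candidates[int(np.argmax(log_likelihoods))]
print(theta_ml, x.mean())  # both close to 3
```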
MLE through optimization
• Let’s reframe this problem:
• Maximized when $P_{model}$ is most like $\hat{P}_{data}$
• What does this remind us of?
$$\theta_{ML} = \arg\max_\theta \sum_i \log P_{model}\left(x^{(i)} \mid \theta\right)$$
$$= \arg\max_\theta E_{X \sim \hat{P}_{data}}\left[\log P_{model}(x \mid \theta)\right]$$
MLE through optimization
• $D_{KL}(\hat{P}_{data}\,\|\,P_{model})$ is minimized as $P_{model}$ becomes more like $\hat{P}_{data}$
• Recall
• and we can minimize one term of cross entropy
$$D_{KL}\left(\hat{P}_{data}\,\|\,P_{model}\right) = E_{X \sim \hat{P}_{data}}\left[\log \hat{P}_{data}(X) - \log P_{model}(X)\right]$$
$$-E_{X \sim \hat{P}_{data}}\left[\log P_{model}(X)\right]$$
Linear regression as MLE
• Instead of thinking of predicting a value, what if we predicted a conditional probability $P(y \mid x)$?
• Regression: only one possible output, $\hat{y} = w^T x$
• MLE estimates a distribution: support for multiple outcomes; we can think of this as a noisy prediction
Linear regression as MLE
• Assume $Y \mid X \sim N(\mu, \sigma^2)$
  – learn μ such that $E\left[Y \mid x^{(i)}\right] = y^{(i)}$
  – $\sigma^2$ fixed (noise)
• Can formulate as an MLE problem
$$\theta_{ML} = \arg\max_\theta P(Y \mid X; \theta)$$
where the parameter θ is our weight vector w.
Linear regression as MLE
$$\theta_{ML} = \arg\max_\theta P(Y \mid X; \theta) = \arg\max_\theta \prod_i P\left(y^{(i)} \mid x^{(i)}; \theta\right) \quad \text{if the } x^{(i)}\text{'s are independent and identically distributed}$$
Taking logs:
$$\theta_{ML} = \arg\max_\theta \sum_i \log P\left(y^{(i)} \mid x^{(i)}; \theta\right)$$
$$= \arg\max_\theta \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(y^{(i)} - \mu\right)^2}{2\sigma^2}} \quad \text{the prediction of } y^{(i)} \text{ is } \hat{y}^{(i)}$$
$$= \arg\max_\theta \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2}} \quad \text{as we want } E\left[Y \mid x^{(i)}\right] = \hat{y}^{(i)}$$
Linear regression as MLE
$$\theta_{ML} = \arg\max_\theta \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2}}$$
$$= \arg\max_\theta \sum_i \left[-\frac{1}{2}\log 2\pi - \log\sigma - \frac{\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2}\right]$$
$$= \arg\max_\theta \left[-\frac{m}{2}\log 2\pi - m\log\sigma - \frac{\sum_i \left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2}\right]$$
This is equivalent to minimizing $\sum_i \left(\hat{y}^{(i)} - y^{(i)}\right)^2$, which has the same form as our MSE optimization:
$$MSE_{train} = \frac{1}{m}\left\|\hat{y}_{train} - y_{train}\right\|_2^2$$
MLE
• MLE can only recover the distribution if the parametric distribution is appropriate for the data.
• If data drawn from multiple distributions with varying parameters, MLE can still estimate distribution, but information about underlying distributions is lost.
Optimization
• We have seen closed form (single equation) optimization with linear regression.
• Not always so lucky… how do we optimize more complicated things?
Optimization
• Select an objective function f(x) to optimize, e.g. MSE, KL divergence
• Without loss of generality, we will always consider minimization.
  Why can we do this? Maximizing f(x) is the same as minimizing −f(x).
Gradient descent
(Goodfellow et al., Fig. 4.1)
Critical points
(Goodfellow et al., Fig. 4.2)
Global vs. local minima
(Goodfellow et al., Fig. 4.3)
Functions on $\mathbb{R}^m \to \mathbb{R}$
• Gradient – vector of partial derivatives:
$$\nabla_x f(x) = \begin{bmatrix} \dfrac{\partial f(x)}{\partial x_1} \\ \dfrac{\partial f(x)}{\partial x_2} \\ \vdots \\ \dfrac{\partial f(x)}{\partial x_m} \end{bmatrix}$$
• Moving in the gradient direction will increase f(x) if we don't go too far…
• To move to an x with a smaller f(x):
$$x' = x - \epsilon \nabla_x f(x)$$
  What we pick for ε will make a difference.
IMPORTANT: f is the function to be optimized, e.g. MSE, not the learner.
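A minimal gradient-descent sketch for the MSE objective from the regression slides; the step size ε, iteration count, and toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
m, D = 100, 3
X = rng.normal(size=(m, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

def mse_gradient(w, X, y):
    """Gradient of (1/m) * ||Xw - y||^2 with respect to w."""
    return (2.0 / len(y)) * X.T @ (X @ w - y)

w = np.zeros(D)
epsilon = 0.05  # step size
for _ in range(500):
    w = w - epsilon * mse_gradient(w, X, y)   # x' = x - eps * grad f(x)

print(w)  # approaches the normal-equation solution
```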
Putting this all together (note the change in notation: the objective is J(θ))
• Want to learn: $f_\theta\left(x^{(i)}\right) = y^{(i)}$
• To improve $f_\theta\left(x^{(i)}\right)$, define an objective J(θ)
• Optimize J(θ)
  – Gradients defined for each sample
  – Average over the data set (e.g. mini-batch) and update θ
Putting this all together
• Sample J(ϴ), a loss function:
• which is like our effort to get a MLE by minimizing KL divergence:
$$J(\theta) = E_{(x,y) \sim \hat{p}_{data}}\left[L(x, y, \theta)\right] = \frac{1}{m}\sum_{i=1}^{m} L\left(x^{(i)}, y^{(i)}, \theta\right)$$
where $L(x, y, \theta) = -\log P(y \mid x, \theta)$, or
$$J(\theta) = -\frac{1}{m}\sum_i \log P\left(y^{(i)} \mid x^{(i)}, \theta\right) \quad \text{remember } D_{KL}?$$
Loss (aka cost) is a measurement of how far we are from the desired result.
$$H(P, Q) = E_{X \sim P}\left[-\log P_Q(X)\right]$$
“There’s the rub” (Shakespeare’s Hamlet)
• Computing J(θ) is expensive. One example is not so bad… but we have massive data sets…
• The sample expectation relies on having enough samples that the 1/m sum estimates P(X).
• What if we only evaluated some of them…
Stochastic gradient descent (SGD)
while not converged:
    pick a minibatch of samples (x, y)
    compute the gradient
    update the estimate

A minibatch is usually up to about 100 samples.
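A sketch of that loop as SGD on the earlier MSE objective; the minibatch size, learning rate, fixed epoch count standing in for a convergence test, and toy data are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
m, D = 1000, 3
X = rng.normal(size=(m, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=m)

w = np.zeros(D)
epsilon, batch_size = 0.01, 32

for epoch in range(20):                       # "while not converged" stand-in
    order = rng.permutation(m)                # shuffle once per epoch
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]                 # pick a minibatch
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / len(yb)) * Xb.T @ (Xb @ w - yb)         # compute gradient
        w = w - epsilon * grad                                # update estimate

print(w)
```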