
Page 1: Machine Learning Concepts

Machine Learning Concepts

Professor Marie Roch, San Diego State University

Page 2: Machine Learning Concepts

Basic definitions & concepts

• Task – What the system is supposed to do, e.g. ASR: f(input speech) → list of words

• Performance measure – How well does it work?
• Experience – How does the machine learn the task?

2

Page 3: Machine Learning Concepts

Types of experiences; how a learner learns…

• Supervised learning – Learn class conditional distributions $P(x \mid \omega)$; implies class labels are known

• Unsupervised learning – No labels are provided, learn P(x) and possibly group x’s into clusters.

• Reinforcement learning – Learner actions in an environment are associated with payoffs.

3


Page 4: Machine Learning Concepts

Experience data set: Design matrix

4

$\begin{bmatrix} x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,D} \\ x_{2,1} & & & & \\ x_{3,1} & & & & \\ \vdots & & & & \\ x_{N,1} & x_{N,2} & x_{N,3} & \cdots & x_{N,D} \end{bmatrix} \qquad \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix}$

Each row of the design matrix is a feature vector; the $y_i$ column holds the labels (supervised problems).

A set of learned functions may be used if some features are missing, e.g., $f_0$ when all features are present, $f_1$ when feature 1 is missing, etc.
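To make the layout concrete, here is a small NumPy sketch of a design matrix; the values and shapes are made up for illustration.

    import numpy as np

    # Hypothetical data set: N = 4 examples, D = 3 features per example.
    X = np.array([[5.1, 3.5, 1.4],     # each row is one feature vector
                  [4.9, 3.0, 1.4],
                  [6.3, 3.3, 6.0],
                  [5.8, 2.7, 5.1]])
    y = np.array([0, 0, 1, 1])         # labels (supervised problems)

    print(X.shape)   # (N, D) = (4, 3)
    print(y.shape)   # (N,)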

Page 5: Machine Learning Concepts

A Cardinal Rule of Machine Learning

THOU SHALT NOT TEST ON THY TRAINING DATA

Page 6: Machine Learning Concepts

Performance

• A metric that measures how well a learner is able to accomplish the task

• Metrics can vary significantly (more on these later), examples:
– loss functions such as squared error
– cross entropy

6

Page 7: Machine Learning Concepts

Partitioning data

• Training data – experience for the learner
• Test data – performance measurement
• Evaluation data

– Only used once all adjustments are made
– It is common to cycle: train → test → adjust → train → …

• Adjustments are a form of training (see 5.3)

• Evaluation provides an independent test
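As a sketch of this partitioning (my own variable names; scikit-learn's train_test_split is one convenient way to do it):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Toy data (values are made up for the illustration).
    X = np.random.randn(100, 5)
    y = np.random.randint(0, 2, size=100)

    # Hold out 20% as an evaluation set, then split the rest into train/test.
    X_rest, X_eval, y_rest, y_eval = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
    # Adjust hyperparameters using train/test only; touch (X_eval, y_eval) once at the end.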

Page 8: Machine Learning Concepts

Regression

Given a set of examples with features and responses, learn to predict the response for new feature vectors.

8

[Figure: IPCC 4th Assessment Report, Climate Change 2007. Observed changes in the number of warm days/nights with extreme temperatures (normative period 1961–1990).]

Page 9: Machine Learning Concepts

Linear regression: a simple learning algorithm

Predict response from data

• w is the weight vector
• Goal: Maximize performance on the test set.

Learn w to minimize some criterion, e.g. mean squared error (MSE)

9

$\hat{y} = w^T x$, where $w, x \in \mathbb{R}^N$ and $\hat{y} \in \mathbb{R}$

$\mathrm{MSE}_{\text{test}} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 \equiv \frac{1}{m}\left\lVert \hat{y}^{\text{test}} - y^{\text{test}} \right\rVert_2^2$

$\lVert x \rVert_p$ denotes the $L^p$ norm $\left(\sum_i |x_i|^p\right)^{1/p}$; read Goodfellow et al. 2.5
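A minimal NumPy sketch of the prediction and test-set MSE above; the arrays and weights are illustrative only.

    import numpy as np

    X_test = np.random.randn(50, 3)          # m = 50 test examples, 3 features
    y_test = np.random.randn(50)
    w = np.array([0.5, -1.0, 2.0])           # some weight vector

    y_hat = X_test @ w                       # y_hat_i = w^T x_i for every row
    mse = np.mean((y_hat - y_test) ** 2)     # (1/m) * sum of squared errors
    same = np.linalg.norm(y_hat - y_test) ** 2 / len(y_test)   # equivalent norm form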

Page 10: Machine Learning Concepts

Linear regression

• Cannot estimate w from test data
• Use the training data
• Minimize

10

$\mathrm{MSE}_{\text{train}} = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i - y_i\right)^2 \equiv \frac{1}{m}\left\lVert \hat{y}^{\text{train}} - y^{\text{train}} \right\rVert_2^2$

For convenience, we will usually omit the variable descriptor train when describing training.

Page 11: Machine Learning Concepts

Linear regression

MSE minimized when gradient is zero

11

$\nabla_w \mathrm{MSE} = 0$
$\nabla_w \frac{1}{m}\left\lVert \hat{y} - y \right\rVert_2^2 = 0$
$\frac{1}{m}\nabla_w \left\lVert Xw - y \right\rVert_2^2 = 0$  (as $\hat{y} = Xw$)
$\nabla_w \left\lVert Xw - y \right\rVert_2^2 = 0$

Page 12: Machine Learning Concepts

Linear regression

12

$\nabla_w \left\lVert Xw - y \right\rVert_2^2 = 0$
$\nabla_w (Xw - y)^T (Xw - y) = 0$  (norm in matrix notation)
$\nabla_w (w^T X^T - y^T)(Xw - y) = 0$  (transpose distributive over addition; $(AB)^T = B^T A^T$, Goodfellow et al. eqn 2.9)
$\nabla_w \left(w^T X^T X w - w^T X^T y - y^T X w + y^T y\right) = 0$
$\nabla_w \left(w^T X^T X w - 2\, w^T X^T y + y^T y\right) = 0$  (as $y^T X w = (w^T X^T y)^T$ is a scalar)

Page 13: Machine Learning Concepts

Linear regression

13

$\nabla_w \left(w^T X^T X w - 2\, w^T X^T y + y^T y\right) = 0$
$\nabla_w \left(w^T X^T X w - 2\, w^T X^T y\right) = 0$  (as $y^T y$ is independent of $w$)
$2 X^T X w - 2 X^T y = 0$  (taking the derivative)
$X^T X w = X^T y$
$w = (X^T X)^{-1} X^T y$  (matrix inverse: $A^{-1}A = I$)

These are referred to as the normal equations
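A hedged NumPy sketch of the normal equations; the data are synthetic, and in practice np.linalg.lstsq is usually preferred to forming X^T X explicitly.

    import numpy as np

    X = np.random.randn(100, 3)                    # training design matrix
    w_true = np.array([1.5, -2.0, 0.7])
    y = X @ w_true + 0.1 * np.random.randn(100)    # noisy responses

    # Normal equations: solve (X^T X) w = X^T y rather than forming the inverse.
    w = np.linalg.solve(X.T @ X, X.T @ y)

    # Equivalent, more numerically stable in practice:
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)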

Page 14: Machine Learning Concepts

Linear regression

14

Fig. 5.1, Goodfellow et al.

Normal equations optimize w

Page 15: Machine Learning Concepts

Linear regression

The regression formula forces the curve to pass through the origin. Remove this restriction:
• Add a bias term
• To use the normal equations, use a modified x
• The last term of the new weight vector w is the bias

15

$\hat{y} = w^T x + b$

$x_{\text{mod}} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \\ 1 \end{bmatrix}$
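A short sketch (my own variable names) of the modified x: append a constant 1 column so the last weight learned by the normal equations is the bias.

    import numpy as np

    X = np.random.randn(100, 3)
    y = np.random.randn(100)

    X_mod = np.hstack([X, np.ones((X.shape[0], 1))])       # append a constant-1 feature
    w_mod = np.linalg.solve(X_mod.T @ X_mod, X_mod.T @ y)  # last entry of w_mod is the bias b
    w, b = w_mod[:-1], w_mod[-1]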

Page 16: Machine Learning Concepts

Notes on learning

• A learner that performs well on unseen data is said to generalize well.

• When will learning on training data produce good generalization?
– Training and test data drawn from the same distribution
– Large enough training set to learn the distribution

16

Page 17: Machine Learning Concepts

Underfitting & Overfitting

• Underfit – Model cannot learn the training data well

• Overfit – Model does not generalize well

• These properties are related to model capacity

17

Page 18: Machine Learning Concepts

Capacity

What kind of functions can we learn?

18

Fig. 5.2, Goodfellow et al.

$\hat{y} = w_0 x^0 + w_1 x^1$  (degree 1)
$\hat{y} = w_0 x^0 + w_1 x^1 + w_2 x^2$  (degree 2)
$\hat{y} = \sum_{i=0}^{D} w_i\, x^{D-i}, \quad D = 9$

Here, model order affects capacity
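A sketch of how model order affects capacity, using synthetic data of my own choosing: higher-degree polynomial fits drive the training error down.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 15)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(15)   # noisy target

    for degree in (1, 2, 9):
        coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
        y_hat = np.polyval(coeffs, x)
        print(degree, np.mean((y_hat - y) ** 2))   # training MSE shrinks as capacity grows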

Page 19: Machine Learning Concepts

Shattering points

Points are shattered if a classifier can separate them regardless of how their binary labels are assigned.

19

Hastie et al., 2009, Fig 7.6

We can shatter three points, but not four, with a linear classifier.

Page 20: Machine Learning Concepts

Capacity

• Representational capacity – best function that can be learned within a set of learnable functions

• Frequently a difficult optimization problem
– We might learn a suboptimal solution
– The capacity actually achieved is called the effective capacity

20

Page 21: Machine Learning Concepts

Measuring capacity

• Model order is not always a good predictor of capacity

21

Hastie et al., 2009, Fig. 7.5

Label determined by the sign of the function. Increasing the frequency of the sinusoid enables ever finer distinctions…

Page 22: Machine Learning Concepts

Measuring capacity

• Vapnik–Chervonenkis (VC) dimension
– for binary classifiers
– the largest number of points that can be shattered by a family of classifiers

• In practice, hard to estimate for deep learners… so why do we care?

22

Page 23: Machine Learning Concepts

Predictions about capacity

• Goal: Minimize the generalization error
• Hard; perhaps minimize the difference Δ = training error − generalization error

• Learning theory
– Models with higher capacity have a higher upper bound on Δ
– Increasing the amount of training data decreases Δ's upper bound

23

Page 24: Machine Learning Concepts

Recap on capacity

24

Hastie et al., 2009, Fig 7.6


Page 25: Machine Learning Concepts

The NO FREE LUNCH Theorem

The expected performance of any classifier across all possible generalization tasks is no better than that of any other classifier.

A classifier might be better for some tasks, but no classifier is universally better than others.

25

http://elsalvavidas.mx/lifehacking/inicia-tu-aventura-para-estudiar-en-el-extranjero-con-estos-tips/1182/

Page 26: Machine Learning Concepts

N-fold cross validation

• Problem:
– More data yields better training
– Getting more data can be expensive

• Workaround
– Partition data into N different groups
– Train on N-1 groups, test on the last group
– Rotate to have N different evaluations

Scikit-learn's KFold and StratifiedKFold will split example indices for you: sklearn.model_selection.*
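A brief sketch of the KFold usage mentioned above; the data are synthetic, and StratifiedKFold is used the same way while preserving class proportions.

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.random.randn(20, 4)
    y = np.random.randint(0, 2, size=20)

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        X_train, X_test = X[train_idx], X[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        # train on the N-1 folds, evaluate on the held-out fold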

Page 27: Machine Learning Concepts

Regularization

• Remember: Learners select a solution function from a set of hypothesis functions.

• Optimization picks the function that minimizes some optimization criterion

• Regularization lets us express a preference for certain types of solutions

27

Page 28: Machine Learning Concepts

Example:High Dimensional Polynomial Fit

• Suppose we want small coefficients. Remember: $\hat{y} = w^T x$

• This happens when $w^T w$ is small, and it results in shallower slopes

• We can define a new criterion:

where λ controls the influence of the regularizer

$\Omega(w) = w^T w$

$J(w) = \mathrm{MSE} + \lambda\,\Omega(w) = \mathrm{MSE} + \lambda\, w^T w$
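With this criterion and a squared-error loss, the minimizer has the ridge-regression (weight decay) form $w = (X^TX + \lambda I)^{-1}X^Ty$, up to how the MSE is scaled; a sketch with made-up data:

    import numpy as np

    X = np.random.randn(100, 5)
    y = np.random.randn(100)
    lam = 0.1                                    # lambda, the regularization weight

    # Closed-form minimizer of ||Xw - y||^2 + lam * w^T w (a scaled version of J(w)).
    D = X.shape[1]
    w_reg = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    w_unreg = np.linalg.solve(X.T @ X, X.T @ y)  # lambda = 0 recovers ordinary least squares
    print(np.linalg.norm(w_reg), np.linalg.norm(w_unreg))   # regularized weights are smaller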

Page 29: Machine Learning Concepts

9th degree polynomial fit with regularization

29

Goodfellow et al., Fig. 5.5

Page 30: Machine Learning Concepts

Point estimators

• An approximation of a quantity of interest
• Examples:

– a statistic of a distribution

– a parameter of a classifiere.g. a mixture model weight or distribution parameter

30

e.g. $\hat{\mu} = \frac{1}{N}\sum_i x^{(i)}$ is an approximation of $\mu$

Page 31: Machine Learning Concepts

Point estimators

In general, it is a function of data

and may even be a classifier function that maps data to a label (function estimation).

31

$\hat{\Theta}_m = f\!\left(x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\right)$

Page 32: Machine Learning Concepts

Bias

How far from the true value is our estimator?

Goodfellow et al. give an example with a Bernoulli distribution that we have not yet covered (read 3.9.1). Bernoulli distributions are good for estimating the number of times that a binary event occurs (e.g., 42 heads in 100 coin tosses).

32

$\mathrm{bias}(\hat{\theta}_m) = E[\hat{\theta}_m] - \theta$

Page 33: Machine Learning Concepts

Bias of sample mean

33

$\mathrm{bias}(\hat{\mu}_m) = E[\hat{\mu}_m] - \mu$
$= E\!\left[\frac{1}{N}\sum_i x^{(i)}\right] - \mu$
$= \frac{1}{N}\sum_i E[x^{(i)}] - \mu$  (each $x^{(i)}$ is a random variable)
$= \frac{1}{N}\cdot N\mu - \mu$
$= \mu - \mu = 0$  → unbiased estimator
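A quick numerical sanity check of this result (entirely my own construction): average many sample means and confirm they center on the true mean.

    import numpy as np

    rng = np.random.default_rng(0)
    mu_true, sigma, N = 3.0, 2.0, 25

    # Draw 10,000 data sets of size N and compute the sample mean of each.
    sample_means = rng.normal(mu_true, sigma, size=(10_000, N)).mean(axis=1)
    print(sample_means.mean() - mu_true)   # estimated bias, close to 0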

Page 34: Machine Learning Concepts

Bias

• Read more examples in Goodfellow et al.
• Bias of classifier functions?
– We are trying to estimate the Bayes classifier.
– Bias is the amount of error above that.

34

Page 35: Machine Learning Concepts

Variance

• Already defined:
• Variance of classifier functions
– Variance of a mean classification result, e.g., error rate

35

$\mathrm{Var}(X) = E[(X - \mu)^2]$

$\mathrm{Var}(\hat{\mu}_m) = \mathrm{Var}\!\left(\frac{1}{m}(X_1 + X_2 + \cdots + X_m)\right)$
$= \frac{1}{m^2}\left(\mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_m)\right)$  (as $\mathrm{Var}(kX) = k^2\,\mathrm{Var}(X)$ †)
$= \frac{m\sigma^2}{m^2} = \frac{\sigma^2}{m}$,  or equivalently, standard error $\mathrm{SE}(\hat{\mu}_m) = \frac{\sigma}{\sqrt{m}}$

† it can be shown from $\mathrm{Var}(X) = E[X^2] - E[X]^2$ that $\mathrm{Var}(kX) = k^2\,\mathrm{Var}(X)$

Page 36: Machine Learning Concepts

Bias & Variance

• Variance gives us an idea of classifier sensitivity to different data

• Distribution of mean approaches a normal distribution (central limit theorem)

• Together can estimate with 95% confidence that the real mean lies within:

36

$\hat{\mu}_m - 1.96\,\mathrm{SE}(\hat{\mu}_m) \le \mu \le \hat{\mu}_m + 1.96\,\mathrm{SE}(\hat{\mu}_m)$
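A small sketch (toy data) of this interval for a mean error rate measured over m test outcomes:

    import numpy as np

    errors = np.random.randint(0, 2, size=200)        # 1 = misclassified, 0 = correct (toy data)
    m = len(errors)

    mu_hat = errors.mean()                            # estimated error rate
    se = errors.std(ddof=1) / np.sqrt(m)              # standard error of the mean
    lo, hi = mu_hat - 1.96 * se, mu_hat + 1.96 * se   # 95% confidence interval
    print(f"error rate {mu_hat:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")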

Page 37: Machine Learning Concepts

Information Theory

A quick trip down the rabbit hole...
• Details in Goodfellow et al. 3.13
• Needed for maximum likelihood estimators

Image: British Postal Service, Graham Baker-Smith 2015

Page 38: Machine Learning Concepts

38

Quantity of information

• Amount of surprise that one sees when observing an event.

• If an event is rare, we can derive a large quantity of information from it.

$I(x_i) = \log\frac{1}{P(x_i)}$

[Plot: $I(x)$ versus $P(x)$; information is large for rare events and falls to 0 as $P(x)$ approaches 1.]

Page 39: Machine Learning Concepts

39

Quantity of information

• Why use log?
– Suppose we want to know the information in two independent events:

$I(x_1, x_2) = \log\frac{1}{P(x_1, x_2)}$
$= \log\frac{1}{P(x_1)\,P(x_2)}$  (independent)
$= \log\frac{1}{P(x_1)} + \log\frac{1}{P(x_2)}$
$= I(x_1) + I(x_2)$

Page 40: Machine Learning Concepts

40

Entropy

• Entropy is defined as the expected amount of information (average amount of surprise) and is usually denoted by the symbol H.

$H(X) = E[I(X)]$
$= \sum_{x_i \in S} P(x_i)\, I(x_i)$  ($S$ is the set of all possible symbols)
$= \sum_{x_i \in S} P(x_i)\,\log\frac{1}{P(x_i)}$  (definition of $I(x_i)$)
$= -E[\log P(X)]$
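A short sketch computing the entropy of a discrete distribution in bits (the distribution is made up):

    import numpy as np

    p = np.array([0.5, 0.25, 0.125, 0.125])     # P(x_i) over four symbols
    H = -np.sum(p * np.log2(p))                 # H(X) = -E[log2 P(X)]
    print(H)                                    # 1.75 bits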

Page 41: Machine Learning Concepts

Discrete vs continuous

Discrete
• Shannon entropy
• Use log₂
• Units called bits, or sometimes Shannons

Continuous
• Differential entropy
• Use logₑ
• Units called nats

41

Page 42: Machine Learning Concepts

42

Example

• Assume
– $X \in \{0, 1\}$ with $P(X) = \begin{cases} p & X = 1 \\ 1 - p & X = 0 \end{cases}$
• Then
$H(X) = E[I(X)] = -p\log p - (1 - p)\log(1 - p)$

[Plot: $H(X)$ versus $p$, Goodfellow et al. Fig. 3.5]

Page 43: Machine Learning Concepts

Comparing distributions
• How similar are distributions P and Q?

Recall – $X \sim P$ means that X has distribution P; we denote its probability as $P_P$

• How much more information do we need to represent P’s distribution using Q:

43

$E_{X\sim P}\left[I_Q - I_P\right] = E_{X\sim P}\left[-\log P_Q - (-\log P_P)\right]$
$= E_{X\sim P}\left[\log\frac{P_P}{P_Q}\right]$
$= \sum_x P_P(x)\,\log\frac{P_P(x)}{P_Q(x)}$

Page 44: Machine Learning Concepts

Comparing distributions

• This is known as the Kullback-Leibler (KL) divergence from Q to P:

44

$D_{KL}(P\,\|\,Q) = E_{X\sim P}\left[\log\frac{P_P(X)}{P_Q(X)}\right]$
$= E_{X\sim P}\left[\log P_P(X) - \log P_Q(X)\right]$

note: $D_{KL}(P\,\|\,Q) \ne D_{KL}(Q\,\|\,P)$, so it is not a distance measure

$D_{KL}(P\,\|\,Q) = 0$ iff $P \equiv Q$; otherwise $D_{KL}(P\,\|\,Q) > 0$

Page 45: Machine Learning Concepts

Cross entropy H(P,Q)

• Calculates total entropy in two distributions

• Can be shown to equal the entropy of P plus the KL divergence from Q to P.

• Interesting as we sometimes want to minimize KL divergence…

45

$H(P, Q) = H(P) + D_{KL}(P\,\|\,Q)$

$H(P, Q) = E_{X\sim P}\left[-\log P_Q(X)\right]$
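A sketch with toy distributions of my own choosing, verifying numerically that $H(P,Q) = H(P) + D_{KL}(P\|Q)$ (in nats):

    import numpy as np

    P = np.array([0.6, 0.3, 0.1])
    Q = np.array([0.4, 0.4, 0.2])

    H_P = -np.sum(P * np.log(P))                 # entropy of P
    D_KL = np.sum(P * np.log(P / Q))             # KL divergence from Q to P
    H_PQ = -np.sum(P * np.log(Q))                # cross entropy H(P, Q)
    print(np.isclose(H_PQ, H_P + D_KL))          # True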

Page 46: Machine Learning Concepts

Cross entropy

Minimizing H(P, Q) with respect to Q also minimizes the KL divergence, since H(P) does not depend on Q.

46

$H(P, Q) = E_{X\sim P}\left[-\log P_Q(X)\right] = H(P) + D_{KL}(P\,\|\,Q)$
$= H(P) + E_{X\sim P}\left[\log P_P(X) - \log P_Q(X)\right]$
$= -E_{X\sim P}\left[\log P_P(X)\right] + E_{X\sim P}\left[\log P_P(X)\right] + E_{X\sim P}\left[-\log P_Q(X)\right]$
$= E_{X\sim P}\left[-\log P_Q(X)\right]$

Page 47: Machine Learning Concepts

Maximum Likelihood Estimation (MLE)

• Method to estimate model parameters
• Suppose we have m independent samples drawn from a distribution
• Can we fit a model with parameters θ?

47

$\mathbb{X} = \left\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\right\}$

$\theta_{ML} = \arg\max_{\theta} P(\mathbb{X} \mid \theta)$

Page 48: Machine Learning Concepts

MLE

• The x(i)’s are independent, so

• Transform to something easier…

• Traditional to use derivatives and solve for the maximum value.

48

$\theta_{ML} = \arg\max_{\theta} P(\mathbb{X} \mid \theta) = \arg\max_{\theta} \prod_i P(x^{(i)} \mid \theta)$

$\theta_{ML} = \arg\max_{\theta} \prod_i P(x^{(i)} \mid \theta) = \arg\max_{\theta} \sum_i \log P(x^{(i)} \mid \theta)$  (log likelihood)
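A sketch of MLE by numerically maximizing the log likelihood for a Gaussian mean and standard deviation; the data and optimizer choice (scipy.optimize.minimize) are my own, and for this model the answer should match the sample mean and standard deviation.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = rng.normal(5.0, 2.0, size=500)          # samples from the "unknown" distribution

    def neg_log_likelihood(theta):
        mu, log_sigma = theta                   # parameterize sigma via log to keep it > 0
        sigma = np.exp(log_sigma)
        return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                       - (x - mu)**2 / (2 * sigma**2))

    result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
    mu_ml, sigma_ml = result.x[0], np.exp(result.x[1])
    print(mu_ml, sigma_ml)                      # close to the sample mean and std of x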

Page 49: Machine Learning Concepts

MLE through optimization

• Let’s reframe this problem:

• Maximized when $P_{model}$ is most like $\hat{P}_{data}$

• What does this remind us of?

49

$\theta_{ML} = \arg\max_{\theta} \sum_i \log P_{model}\!\left(x^{(i)} \mid \theta\right)$
$= \arg\max_{\theta}\, E_{X\sim \hat{P}_{data}}\left[\log P_{model}(x \mid \theta)\right]$

Page 50: Machine Learning Concepts

MLE through optimization

• $D_{KL}(\hat{P}_{data}\,\|\,P_{model})$ is minimized as $P_{model}$ becomes more like $\hat{P}_{data}$

• Recall

• and we can minimize one term of cross entropy

50

$D_{KL}(\hat{P}_{data}\,\|\,P_{model}) = E_{X\sim \hat{P}_{data}}\left[\log \hat{P}_{data}(X) - \log P_{model}(X)\right]$

model-dependent term: $-E_{X\sim \hat{P}_{data}}\left[\log P_{model}(X)\right]$

Page 51: Machine Learning Concepts

Linear regression as MLE

• Instead of predicting a value, what if we predicted a conditional probability?
• Regression: only one possible output.
• MLE estimates a distribution: support for multiple outcomes; can think of this as a noisy prediction.

51

$P(y \mid x)$

$\hat{y} = w^T x$

Page 52: Machine Learning Concepts

Linear regression as MLE

• Assume $Y \mid X \sim N(\mu, \sigma^2)$
– learn μ such that $E[Y \mid x^{(i)}] = \hat{y}^{(i)}$
– $\sigma^2$ fixed (noise)

• Can formulate as an MLE problem

where parameter ϴ is our weights w.

52

$E\!\left[Y \mid x^{(i)}\right] = \hat{y}^{(i)}$

$\theta_{ML} = \arg\max_{\theta} P(Y \mid X; \theta)$

Page 53: Machine Learning Concepts

Linear regression as MLE

53

$\theta_{ML} = \arg\max_{\theta} P(Y \mid X; \theta)$
$= \arg\max_{\theta} \prod_i P\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$  (if the $x^{(i)}$'s are independent and identically distributed)
$= \arg\max_{\theta} \sum_i \log P\!\left(y^{(i)} \mid x^{(i)}; \theta\right)$  (taking the log)
$= \arg\max_{\theta} \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(y^{(i)} - \mu\right)^2}{2\sigma^2}}$
$= \arg\max_{\theta} \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2}}$  (as we want $E[Y \mid x^{(i)}] = \hat{y}^{(i)}$, i.e. the prediction of $y^{(i)}$ is $\hat{y}^{(i)}$)

Page 54: Machine Learning Concepts

Linear regression as MLE

54

$\theta_{ML} = \arg\max_{\theta} \sum_i \log \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2}}$
$= \arg\max_{\theta} \sum_i \left[ -\frac{1}{2}\log 2\pi - \log\sigma - \frac{\left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2} \right]$
$= \arg\max_{\theta} \left[ -\frac{m}{2}\log 2\pi - m\log\sigma - \frac{\sum_i \left(y^{(i)} - \hat{y}^{(i)}\right)^2}{2\sigma^2} \right]$

Equivalent to maximizing $-\sum_i \left(\hat{y}^{(i)} - y^{(i)}\right)^2$, i.e. minimizing $\sum_i \left(\hat{y}^{(i)} - y^{(i)}\right)^2$, which has the same form as our MSE optimization:

$\mathrm{MSE}_{\text{train}} = \frac{1}{m}\left\lVert \hat{y}^{\text{train}} - y^{\text{train}} \right\rVert_2^2$

Page 55: Machine Learning Concepts

MLE

• MLE can only recover the distribution if the parametric distribution is appropriate for the data.

• If data drawn from multiple distributions with varying parameters, MLE can still estimate distribution, but information about underlying distributions is lost.

55

Page 56: Machine Learning Concepts

Optimization

• We have seen closed form (single equation) optimization with linear regression.

• Not always so lucky… how do we optimize more complicated things?

56

Page 57: Machine Learning Concepts

Optimization

• Select an objective function f(x) to optimize, e.g. MSE, KL divergence

• Without loss of generality, we will always consider minimization.

Why can we do this?

57

Page 58: Machine Learning Concepts

Gradient descent

58

Goodfellow et al., Fig. 4.1

Page 59: Machine Learning Concepts

Critical points

59

Goodfellow et al., Fig. 4.2

Page 60: Machine Learning Concepts

Global vs. local minima

60

Goodfellow et al., Fig. 4.3

Page 61: Machine Learning Concepts

Functions on $\mathbb{R}^m \to \mathbb{R}$

• Gradient vector of partial derivatives

• Moving in the gradient direction will increase f(x) if we don't go too far…

• To move to an x with a smaller f(x)

what we pick for ε will make a difference

$\nabla_x f(x) = \left[\frac{\partial f(x)}{\partial x_1},\; \frac{\partial f(x)}{\partial x_2},\; \ldots,\; \frac{\partial f(x)}{\partial x_m}\right]^T$

$x' = x - \epsilon\,\nabla_x f(x)$

IMPORTANT: f is the function to be optimized, e.g. MSE, not the learner
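A minimal gradient-descent sketch on a simple quadratic objective; the function, starting point, and step size ε are my own choices.

    import numpy as np

    def f(x):                       # objective to minimize (not the learner itself)
        return np.sum((x - 3.0) ** 2)

    def grad_f(x):                  # gradient vector of partial derivatives
        return 2.0 * (x - 3.0)

    x = np.zeros(2)
    eps = 0.1                       # step size; too large diverges, too small is slow
    for _ in range(100):
        x = x - eps * grad_f(x)     # x' = x - eps * grad f(x)
    print(x, f(x))                  # x approaches [3, 3], f(x) approaches 0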

Page 62: Machine Learning Concepts

Putting this all together (note change in notation: the objective is J(θ))

• Want to learn: $f_\theta(x^{(i)}) = y^{(i)}$
• To improve $f_\theta(x^{(i)})$, define objective J(θ)
• Optimize J(θ)
– Gradients defined for each sample
– Average over the data set (e.g. mini-batch) and update θ

62

Page 63: Machine Learning Concepts

Putting this all together

• Sample J(ϴ), a loss function:

• which is like our effort to get a MLE by minimizing KL divergence:

63

$J(\theta) = E_{x,y\sim\hat{p}_{data}}\left[L(x, y, \theta)\right] = \frac{1}{m}\sum_{i=1}^{m} L\!\left(x^{(i)}, y^{(i)}, \theta\right)$
where $L(x, y, \theta) = -\log P(y \mid x, \theta)$  (remember $D_{KL}$?)
or $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \log P\!\left(y^{(i)} \mid x^{(i)}, \theta\right)$

Loss (aka cost) is a measurement of how far we are from the desired result.

$H(P, Q) = E_{X\sim P}\left[-\log P_Q(X)\right]$

Page 64: Machine Learning Concepts

“There’s the rub”Shakespeare’s Hamlet

• Computing J(θ) is expensive: one example is not so bad… but massive data sets are…
• The sample expectation relies on having enough samples that the 1/m term estimates P(X).
• What if we only evaluated some of them…

64

Page 65: Machine Learning Concepts

Stochastic gradient descent (SGD)

while not converged:
    pick a minibatch of samples (x, y)
    compute the gradient
    update the estimate

a minibatch is usually up to ~100 samples

65
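A minimal SGD sketch for linear regression with an MSE loss; the data, learning rate, and batch size are illustrative choices rather than anything from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true + 0.1 * rng.standard_normal(1000)

    w = np.zeros(3)                                          # parameter estimate (theta)
    eps, batch_size = 0.05, 32
    for step in range(2000):
        idx = rng.integers(0, len(X), size=batch_size)       # pick a minibatch
        Xb, yb = X[idx], y[idx]
        grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)       # gradient of the minibatch MSE
        w = w - eps * grad                                   # update the estimate
    print(w)                                                 # close to w_true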