
CSCE 633: Machine Learning

Lecture 1: Introduction and Probability

Texas A&M University

8-28-18

Welcome to CSCE 633!

• About this class
  • Who are we?
  • Syllabus
  • Text and other references

• Introduction to Machine Learning


Goals of this lecture

• Course structure

• Understanding the difference between supervised and unsupervised machine learning

• Review of matrix operations

• Understanding differences in notation

• Understanding differences in some supervised models


Who are we?

• Instructor
  • Bobak Mortazavi
  • bobakm@tamu.edu - Please put [CSCE 633] in the subject line
  • Office Hours: W 10-11a, R 11-12p, 328A HRBB

• TA
  • Xiaohan Chen
  • chernxh@tamu.edu
  • Office Hours: M 10:15-11:15a, W 4:30-5:30p, 320 HRBB

• Discussion Board on eCampus


Syllabus, Schedule, and eCampus


Text Book

Textbook for this course


Other useful references


Assignments

• Homework will be compiled in LaTeX - you will turn in PDFs

• Programming for homework will be in R or Python.

• For projects - if you need to use a different language, please explain why

• Project teams will be 2-3 people per team.

• Project proposals will be a 2 page written document

• Project report will be an 8 page IEEE conference-style report


Kahoot

• kahoot.com

• smartphone applications

• ability to interact in class!

• not graded


Topics

• About this class
• Introduction to Machine Learning
  • What is machine learning?
  • Models and accuracy
  • Some general supervised learning


What is machine learning?


What is machine learning?

Big Data! 40 billion web pages, 100 hours of video on YouTube every minute, genomes of 1000s of people.

Murphy defines machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!).


Some cool examples!

Power P, Hobbs J, Ruiz H, Wei X, Lucey P. Mythbusting Set-Pieces in Soccer. MIT Sloan Sports Conference 2018



Some cool examples!

Waymo.com/tech


What is machine learning?

Statistical learning (machine learning) is a vast set of tools to understand data.


a brief review of linear algebra

X =
    x_{1,1}  x_{1,2}  ···  x_{1,p}
    x_{2,1}  x_{2,2}  ···  x_{2,p}
      ⋮        ⋮      ⋱      ⋮
    x_{n,1}  x_{n,2}  ···  x_{n,p}

where X is n × p (sometimes N × p, or in the text book N × D)

X_j is column j, so we can re-write X column-wise as (X_1, X_2, ···, X_p), or row-wise as the stacked transposed rows x_1^T, x_2^T, ···, x_n^T.

Matrix Multiplication

Suppose A ∈ R^{r×d} and B ∈ R^{d×s}. Then (AB)_{ij} = ∑_{k=1}^{d} a_{ik} b_{kj}. For example, consider:

A = ( 1  2 )   and   B = ( 5  6 )
    ( 3  4 )             ( 7  8 ),

then

AB = ( 1×5 + 2×7   1×6 + 2×8 )  =  ( 19  22 )
     ( 3×5 + 4×7   3×6 + 4×8 )     ( 43  50 )
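A quick way to sanity-check a hand computation like this is with NumPy; a minimal sketch (assumes NumPy is installed):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Matrix product: (AB)_ij = sum_k a_ik * b_kj
print(A @ B)
# [[19 22]
#  [43 50]]
```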



Vector Norms

• if x ∈ R^{p×1}, what is the "size" of this vector?
• A norm is a mapping from R^p to R with the following properties:
  • ‖x‖ ≥ 0 for all x
  • ‖x‖ = 0 ↔ x = 0
  • ‖αx‖ = |α| ‖x‖, α ∈ R
  • ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x and y


Vector Norms

• if x ∈ R^{p×1}, what is the "size" of this vector? (see the NumPy sketch below)
• 1-norm: ‖x‖_1 = ∑_{i=1}^{p} |x_i|
• 2-norm: ‖x‖_2 = √(∑_{i=1}^{p} x_i²)
• max-norm: ‖x‖_∞ = max_{1≤i≤p} |x_i|
• L_p-norm (or p-norm): ‖x‖_p = (∑_i |x_i|^p)^{1/p}
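These norms are easy to check numerically; a minimal NumPy sketch with a made-up example vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

one_norm = np.sum(np.abs(x))         # 1-norm  = 8.0
two_norm = np.sqrt(np.sum(x ** 2))   # 2-norm  ≈ 5.099
max_norm = np.max(np.abs(x))         # max-norm = 4.0
p = 3
p_norm = np.sum(np.abs(x) ** p) ** (1.0 / p)   # general L_p norm

# np.linalg.norm gives the same values for ord = 1, 2, np.inf, or p
print(one_norm, two_norm, max_norm, p_norm)
print(np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf), np.linalg.norm(x, p))
```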



What is machine learning?

Statistical learning (machine learning) is a vast set of tools to understand data.

• Supervised Modeling - where we aim to predict or estimate an output based upon our input data

• Unsupervised Modeling - where we aim to learn relationship and structure in input data without the guidance of output data

• Reinforcement Learning - where we aim to learn optimal rules, by rewarding the modeling method when it correctly satisfies some goal



Advertising Example

Assume we are trying to improve sales of a product through advertising in three different markets: TV, newspapers, and online/mobile. We define the advertising budget as:

• input variables X: consist of TV budget X_1, newspaper budget X_2, and mobile budget X_3 (also known as predictors, independent variables, features, covariates, or just variables)

• output variable Y: sales numbers (also known as the response variable, or dependent variable)

• Can generalize input predictors to X_1, X_2, ···, X_p = X

• Model for the relationship is then: Y = f(X) + ε (a small simulation sketch follows below)

• f is the systematic information, ε is the random error
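As a concrete illustration of Y = f(X) + ε, here is a minimal sketch that simulates sales from made-up budgets using a linear f and Gaussian noise; the coefficients, sample size, and noise level are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical advertising budgets (predictors X1, X2, X3)
X = rng.uniform(0, 100, size=(n, 3))   # columns: TV, newspaper, mobile

# Assumed systematic part f(X) and random error eps (irreducible)
beta = np.array([0.05, 0.01, 0.03])    # made-up coefficients
f_X = 2.0 + X @ beta
eps = rng.normal(0, 0.5, size=n)

y = f_X + eps                          # observed sales: Y = f(X) + eps
print(X.shape, y.shape)                # (200, 3) (200,)
```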



Why estimate f ?

• Prediction - In many situations, X is available to us but Y is not; we would like to estimate Y as accurately as possible.

• Inference - We would like to understand how Y is affected by changes in X.


Predicting f ?

Create an estimate f̂(X) that generates a prediction Ŷ that estimates Y as closely as possible, by attempting to minimize the reducible error, accepting that there is some irreducible error generated by ε that is independent of X.

Why is there irreducible error?


Estimating f

Assume we have an estimate f̂ and a set of predictors X, such that Ŷ = f̂(X). Then we can calculate the expected value of the average squared error as:

E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² + Var(ε)

(The cross term vanishes because ε has mean zero and is independent of X.) The first term is the reducible error and Var(ε) is the irreducible error. The focus of this class is to learn techniques to estimate f so as to minimize the reducible error.



How do we estimate f

• Need: Training Data to learn from

• Training data D = {(x_1, y_1), (x_2, y_2), ···, (x_n, y_n)}, where x_i = (x_{i1}, x_{i2}, ···, x_{ip})^T for n subjects and p dimensions (see the array sketch below)
• Note: the text uses the notation N subjects and D dimensions
• y can be:
  • categorical or nominal → the problem we solve is classification or pattern recognition
  • continuous → the problem we solve is regression
  • categorical with order (grades A, B, C, D, F) → the problem we solve is ordinal regression
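In code, D is typically stored as an n × p feature matrix plus a length-n label vector; a minimal sketch (the toy values are assumptions for illustration):

```python
import numpy as np

# n = 4 subjects, p = 3 features each (toy values)
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.3],
              [6.3, 2.5, 5.0],
              [5.8, 2.7, 4.1]])

y_class = np.array(["A", "A", "B", "B"])   # categorical y -> classification
y_cont = np.array([1.2, 0.9, 3.4, 2.8])    # continuous y  -> regression

n, p = X.shape
print(n, p)   # 4 3
```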


Parametric Methods

• 2-step, model-based approach to estimate f

• Step 1 - make an assumption about the form of f. For example, that f is linear:
  • f(X) = β_0 + β_1 x_1 + β_2 x_2 + ··· + β_p x_p
  • Advantage - only have to estimate p + 1 coefficients to model f

• Step 2 - fit or train a model/procedure to estimate β
  • The most common method for this is Ordinary Least Squares (a minimal sketch follows below)
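A minimal ordinary least squares sketch with NumPy, fitting β on simulated data; the data-generating coefficients and noise level here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3

X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 0.7 + X @ beta_true + rng.normal(0, 0.1, size=n)   # linear f plus noise

# Add an intercept column, then solve the least squares problem for beta
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)   # approximately [0.7, 1.5, -2.0, 0.5]
```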


Parametric vs. Non-parametric


Flexibility vs. Interpretability



Model Accuracy

• Find some measure of error
  • MSE = (1/n) ∑_{i=1}^{n} (y_i − f̂(x_i))²

• Methods are trained to reduce MSE on training data D - is that enough?

• Why do we care about previously seen data? Instead we care about unseen test data, to understand how the model predicts.

• So how do we select methods that reduce MSE on test data? (see the train/test sketch below)
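A minimal sketch contrasting training and test MSE, using a very flexible polynomial fit on simulated data; the degree, sample sizes, and noise level are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def f_true(x):
    return np.sin(2 * x)                      # assumed "true" f

x_train = rng.uniform(0, 3, 30)
y_train = f_true(x_train) + rng.normal(0, 0.3, x_train.size)
x_test = rng.uniform(0, 3, 200)
y_test = f_true(x_test) + rng.normal(0, 0.3, x_test.size)

# Very flexible model: degree-15 polynomial fit by least squares
coefs = np.polyfit(x_train, y_train, deg=15)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

print("train MSE:", mse(y_train, np.polyval(coefs, x_train)))   # small
print("test  MSE:", mse(y_test, np.polyval(coefs, x_test)))     # typically much larger
```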


Bias Variance Tradeoff

• Variance - the amount f̂ would change with different training data

• Bias - the error introduced by approximating f with a simpler model

• E(y − f̂(x))² = Var(f̂(x)) + [Bias(f̂(x))]² + Var(ε)

• Error Rate = (1/n) ∑_{i=1}^{n} I(y_i ≠ ŷ_i)

• Test error is minimized by understanding the conditional probability p(Y = j | X = x_0)

• Bayes Error Rate: 1 − E(max_j p(Y = j | X))

• K-Nearest Neighbor to solve this: p(Y = j | X = x_0) = (1/K) ∑_{i∈N_0} I(y_i = j)

• More on this later!


Key Challenge: Avoid Overfitting


U-Shape Tradeoff


Multiclass Classification

• y_i ∈ {1, 2, ···, C}, where, if C = 2, y ∈ {0, 1}, and if C > 2 we have a multi-class classification problem


Multiclass Classification

• Maximum a Posteriori (MAP) estimate - find the most probable label (see the sketch below)

• ŷ = f̂(x) = argmax_{c=1..C} p(y = c | x, D)
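A minimal sketch of the MAP decision rule, assuming the class posteriors p(y = c | x, D) have already been computed by some model; the probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical posterior probabilities p(y = c | x, D) for C = 3 classes
posterior = np.array([0.2, 0.5, 0.3])

# MAP estimate: pick the most probable label
y_hat = int(np.argmax(posterior)) + 1   # +1 if classes are labeled 1..C
print(y_hat)                            # 2
```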


Another Example: Advanced Notation

• Suppose you want to model the cost of a used car

• inputs: x are the car attributes (brand, mileage, year, etc.)

• output: y price of the car

• learn: model parameters w (β in prior slides)

• Deterministic linear model: y = f(x|w) = w^T x

• Deterministic non-linear model: y = f(x|w) = w^T φ(x)

• Non-deterministic non-linear model: y = f(x|w) = w^T φ(x) + ε, where ε ∼ N(µ, σ²) (a small sketch of all three follows below)
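A minimal sketch contrasting the three model forms for the car-price example; the weights w, the basis function φ, and the noise level are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

x = np.array([3.0, 60.0])          # toy car attributes, e.g. age (years), mileage (1000s)
w = np.array([-1.5, -0.05])        # made-up parameters

def phi(x):
    # A simple nonlinear basis expansion (assumed for illustration)
    return np.array([x[0], x[1], x[0] ** 2])

w_phi = np.array([-1.5, -0.05, 0.1])

y_lin = w @ x                                   # deterministic linear:      w^T x
y_nonlin = w_phi @ phi(x)                       # deterministic non-linear:  w^T phi(x)
y_noisy = w_phi @ phi(x) + rng.normal(0, 1.0)   # non-deterministic:         w^T phi(x) + eps
print(y_lin, y_nonlin, y_noisy)
```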


Another Example: Linear Regression

• Parameters θ = (w, σ²)

• Clustering - model cars by p(x_i|θ) and see if there is a natural grouping of cars

• µ(x) = w_0 + w_1 x = w^T x

• Linear regression: p(y|x, θ) = N(y | w^T φ(x), σ²) - the basis function expansion with a Gaussian distribution


Logistic Regression

• Assume y ∈ {0, 1}; then a Bernoulli distribution is more accurate than a Gaussian

• p(y|x, w) = Ber(y|µ(x)), where µ(x) = E[y|x] = p(y = 1|x)

• If we require 0 ≤ µ(x) ≤ 1, say by a sigmoid function:
  • µ(x) = sigm(w^T x) = 1 / (1 + exp(−w^T x)) (sometimes called the logit or logistic function)

• Then we can classify by a decision rule like ŷ = 1 ↔ p(y = 1|x) > 0.5 (see the sketch below)
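A minimal sketch of the sigmoid and the 0.5-threshold decision rule; the weights here are made up, not fitted:

```python
import numpy as np

def sigm(z):
    # Logistic (sigmoid) function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])          # hypothetical weights
x = np.array([2.0, 1.0])           # one feature vector

p1 = sigm(w @ x)                   # p(y = 1 | x, w)
y_hat = 1 if p1 > 0.5 else 0       # decision rule: y_hat = 1  <=>  p(y = 1 | x) > 0.5
print(p1, y_hat)                   # ~0.769, 1
```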


Logistic Regression

• p(y|x, β) = sigm(β^T x) = 1 / (1 + exp(−β^T x))


K Nearest Neighbor

• p(y = c | x, D, k) = (1/k) ∑_{i∈N_k(x,D)} I(y_i = c) (see the sketch below)
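A minimal NumPy sketch of this estimate: find the k nearest training points to x (Euclidean distance is an assumption here) and average the indicator I(y_i = c) over that neighborhood; the toy data are made up for illustration:

```python
import numpy as np

def knn_posterior(x, X_train, y_train, k, classes):
    # Indices N_k(x, D) of the k nearest training points (Euclidean distance)
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:k]
    # p(y = c | x, D, k) = (1/k) * sum over neighbors of I(y_i = c)
    return {c: float(np.mean(y_train[neighbors] == c)) for c in classes}

# Toy training data
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

print(knn_posterior(np.array([0.2, 0.1]), X_train, y_train, k=3, classes=[0, 1]))
# {0: 0.666..., 1: 0.333...}
```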


Curse of Dimensionality


No Free Lunch Theorem

• "All models are wrong but some models are useful" - G. Box, 1987

• No single ML system works for everything

• Principal Component Analysis is one way to address the curse of dimensionality

• Machine learning is not magic: it can't get something out of nothing, but it can get more from less!


Takeaways and Next Time

• What is the difference between supervised and unsupervised learning?

• What are some basics of supervised learning?

• Difference between parametric and non-parametric modeling

• Next Time: Probability, K-Nearest Neighbor, and Naive Bayes Classification

• Sources: Machine Learning: A Probabilistic Perspective - Murphy, and An Introduction to Statistical Learning - James et al.

