
CSCE 633: Machine Learning

Lecture 1: Introduction and Probability

Texas A&M University

8-28-18

Welcome to CSCE 633!

• About this class
  • Who are we?
  • Syllabus
  • Text and other references

• Introduction to Machine Learning


Goals of this lecture

• Course structure

• Understanding the difference between supervised and unsupervised machine learning

• Review of matrix operations

• Understanding differences in notation

• Understanding differences in some supervised models


Who are we?

• Instructor
  • Bobak Mortazavi
  • bobakm@tamu.edu - Please put [CSCE 633] in the subject line
  • Office Hours: W 10-11a, R 11-12p, 328A HRBB

• TA
  • Xiaohan Chen
  • chernxh@tamu.edu
  • Office Hours: M 10:15-11:15a, W 4:30-5:30p, 320 HRBB

• Discussion Board on eCampus


Syllabus, Schedule, and eCampus


Text Book

Textbook for this course


Other useful references


Assignments

• Homework will be compiled in LaTeX - you will turn in PDFs

• Programming for homework will be in R or Python.

• For projects - if you need to use a different language, please explain why

• Project teams will be 2-3 people per team.

• Project proposals will be a 2 page written document

• Project report will be an 8 page IEEE conference-style report


Kahoot

• kahoot.com

• smartphone applications

• ability to interact in class!

• not graded


Topics

• About this class
• Introduction to Machine Learning
  • What is machine learning?
  • Models and accuracy
  • Some general supervised learning


What is machine learning?


What is machine learning?

Big Data! 40 billion web pages, 100 hours of video on YouTube every minute, genomes of 1000s of people.

Murphy defines machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!).


Some cool examples!

Power P, Hobbs J, Ruiz H, Wei X, Lucey P. Mythbusting Set-Pieces in Soccer. MIT Sloan Sports Conference 2018



Some cool examples!

Waymo.com/tech


What is machine learning?

Statistical learning (machine learning) is a vast set of tools to understand data.


a brief review of linear algebra

X =
    x_{1,1}  x_{1,2}  ···  x_{1,p}
    x_{2,1}  x_{2,2}  ···  x_{2,p}
      ⋮        ⋮      ⋱      ⋮
    x_{n,1}  x_{n,2}  ···  x_{n,p}

where X is n × p (sometimes N × p, or in the text book N × D)

X_j is column j, so we can re-write X column-wise as (X_1, X_2, ···, X_p), or row-wise as the stacked transposed rows x_1^T, x_2^T, ···, x_n^T.

Matrix Multiplication

Suppose A ∈ R^{r×d} and B ∈ R^{d×s}. Then (AB)_{ij} = ∑_{k=1}^{d} a_{ik} b_{kj}. For example, consider:

A = ( 1  2 )   and   B = ( 5  6 )
    ( 3  4 )             ( 7  8 ),

then

AB = ( 1×5 + 2×7   1×6 + 2×8 )  =  ( 19  22 )
     ( 3×5 + 4×7   3×6 + 4×8 )     ( 43  50 )
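A quick way to sanity-check a hand computation like this is with NumPy; a minimal sketch (assumes NumPy is installed):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Matrix product: (AB)_ij = sum_k a_ik * b_kj
print(A @ B)
# [[19 22]
#  [43 50]]
```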



Vector Norms

• if x ∈ R^{p×1}, what is the "size" of this vector?
• A norm is a mapping from R^p to R with the following properties:
  • ‖x‖ ≥ 0 for all x
  • ‖x‖ = 0 ↔ x = 0
  • ‖αx‖ = |α| ‖x‖, α ∈ R
  • ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x and y


Vector Norms

• if x ∈ R^{p×1}, what is the "size" of this vector? (see the NumPy sketch below)
• 1-norm: ‖x‖_1 = ∑_{i=1}^{p} |x_i|
• 2-norm: ‖x‖_2 = √(∑_{i=1}^{p} x_i²)
• max-norm: ‖x‖_∞ = max_{1≤i≤p} |x_i|
• L_p-norm (or p-norm): ‖x‖_p = (∑_i |x_i|^p)^{1/p}
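These norms are easy to check numerically; a minimal NumPy sketch with a made-up example vector:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

one_norm = np.sum(np.abs(x))         # 1-norm  = 8.0
two_norm = np.sqrt(np.sum(x ** 2))   # 2-norm  ≈ 5.099
max_norm = np.max(np.abs(x))         # max-norm = 4.0
p = 3
p_norm = np.sum(np.abs(x) ** p) ** (1.0 / p)   # general L_p norm

# np.linalg.norm gives the same values for ord = 1, 2, np.inf, or p
print(one_norm, two_norm, max_norm, p_norm)
print(np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf), np.linalg.norm(x, p))
```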



What is machine learning?

Statistical learning (machine learning) is a vast set of tools to understand data.

• Supervised Modeling - where we aim to predict or estimate an output based upon our input data

• Unsupervised Modeling - where we aim to learn relationship and structure in input data without the guidance of output data

• Reinforcement Learning - where we aim to learn optimal rules, by rewarding the modeling method when it correctly satisfies some goal



Advertising Example

Assume we are trying to improve sales of a product through advertising in three different markets: TV, newspapers, and online/mobile. We define the advertising budget as:

• input variables X: consist of TV budget X_1, newspaper budget X_2, and mobile budget X_3 (also known as predictors, independent variables, features, covariates, or just variables)

• output variable Y: sales numbers (also known as the response variable, or dependent variable)

• Can generalize input predictors to X_1, X_2, ···, X_p = X

• Model for the relationship is then: Y = f(X) + ε (a small simulation sketch follows below)

• f is the systematic information, ε is the random error
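As a concrete illustration of Y = f(X) + ε, here is a minimal sketch that simulates sales from made-up budgets using a linear f and Gaussian noise; the coefficients, sample size, and noise level are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical advertising budgets (predictors X1, X2, X3)
X = rng.uniform(0, 100, size=(n, 3))   # columns: TV, newspaper, mobile

# Assumed systematic part f(X) and random error eps (irreducible)
beta = np.array([0.05, 0.01, 0.03])    # made-up coefficients
f_X = 2.0 + X @ beta
eps = rng.normal(0, 0.5, size=n)

y = f_X + eps                          # observed sales: Y = f(X) + eps
print(X.shape, y.shape)                # (200, 3) (200,)
```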



Why estimate f ?

• Prediction - In many situations, X is available to us but Y is not; we would like to estimate Y as accurately as possible.

• Inference - We would like to understand how Y is affected by changes in X.


Predicting f ?

Create an estimate f̂(X) that generates a prediction Ŷ that estimates Y as closely as possible, by attempting to minimize the reducible error, accepting that there is some irreducible error generated by ε that is independent of X.

Why is there irreducible error?


Estimating f

Assume we have an estimate f̂ and a set of predictors X, such that Ŷ = f̂(X). Then we can calculate the expected value of the average squared error as:

E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² + Var(ε)

(The cross term vanishes because ε has mean zero and is independent of X.) The first term is the reducible error and Var(ε) is the irreducible error. The focus of this class is to learn techniques to estimate f so as to minimize the reducible error.



How do we estimate f

• Need: Training Data to learn from

• Training data D = {(x_1, y_1), (x_2, y_2), ···, (x_n, y_n)}, where x_i = (x_{i1}, x_{i2}, ···, x_{ip})^T for n subjects and p dimensions (see the array sketch below)
• Note: the text uses the notation N subjects and D dimensions
• y can be:
  • categorical or nominal → the problem we solve is classification or pattern recognition
  • continuous → the problem we solve is regression
  • categorical with order (grades A, B, C, D, F) → the problem we solve is ordinal regression
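In code, D is typically stored as an n × p feature matrix plus a length-n label vector; a minimal sketch (the toy values are assumptions for illustration):

```python
import numpy as np

# n = 4 subjects, p = 3 features each (toy values)
X = np.array([[5.1, 3.5, 1.4],
              [4.9, 3.0, 1.3],
              [6.3, 2.5, 5.0],
              [5.8, 2.7, 4.1]])

y_class = np.array(["A", "A", "B", "B"])   # categorical y -> classification
y_cont = np.array([1.2, 0.9, 3.4, 2.8])    # continuous y  -> regression

n, p = X.shape
print(n, p)   # 4 3
```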


Parametric Methods

• 2-step, model-based approach to estimate f

• Step 1 - make an assumption about the form of f. For example, that f is linear:
  • f(X) = β_0 + β_1 x_1 + β_2 x_2 + ··· + β_p x_p
  • Advantage - only have to estimate p + 1 coefficients to model f

• Step 2 - fit or train a model/procedure to estimate β
  • The most common method for this is Ordinary Least Squares (a minimal sketch follows below)
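A minimal ordinary least squares sketch with NumPy, fitting β on simulated data; the data-generating coefficients and noise level here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3

X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = 0.7 + X @ beta_true + rng.normal(0, 0.1, size=n)   # linear f plus noise

# Add an intercept column, then solve the least squares problem for beta
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)   # approximately [0.7, 1.5, -2.0, 0.5]
```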


Parametric vs. Non-parametric


Flexibility vs. Interpretability



Model Accuracy

• Find some measure of error
  • MSE = (1/n) ∑_{i=1}^{n} (y_i − f̂(x_i))²

• Methods are trained to reduce MSE on training data D - is that enough?

• Why do we care about previously seen data? Instead we care about unseen test data, to understand how the model predicts.

• So how do we select methods that reduce MSE on test data? (see the train/test sketch below)
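A minimal sketch contrasting training and test MSE, using a very flexible polynomial fit on simulated data; the degree, sample sizes, and noise level are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def f_true(x):
    return np.sin(2 * x)                      # assumed "true" f

x_train = rng.uniform(0, 3, 30)
y_train = f_true(x_train) + rng.normal(0, 0.3, x_train.size)
x_test = rng.uniform(0, 3, 200)
y_test = f_true(x_test) + rng.normal(0, 0.3, x_test.size)

# Very flexible model: degree-15 polynomial fit by least squares
coefs = np.polyfit(x_train, y_train, deg=15)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

print("train MSE:", mse(y_train, np.polyval(coefs, x_train)))   # small
print("test  MSE:", mse(y_test, np.polyval(coefs, x_test)))     # typically much larger
```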


Bias Variance Tradeoff

• Variance - the amount f̂ would change with different training data

• Bias - the error introduced by approximating f with a simpler model

• E(y − f̂(x))² = Var(f̂(x)) + [Bias(f̂(x))]² + Var(ε)

• Error Rate = (1/n) ∑_{i=1}^{n} I(y_i ≠ ŷ_i)

• Test error is minimized by understanding the conditional probability p(Y = j | X = x_0)

• Bayes Error Rate: 1 − E(max_j p(Y = j | X))

• K-Nearest Neighbor to solve this: p(Y = j | X = x_0) = (1/K) ∑_{i∈N_0} I(y_i = j)

• More on this later!


Key Challenge: Avoid Overfitting


U-Shape Tradeoff


Multiclass Classification

• y_i ∈ {1, 2, ···, C}, where, if C = 2, y ∈ {0, 1}, and if C > 2 we have a multi-class classification problem


Multiclass Classification

• Maximum a Posteriori (MAP) estimate - find the most probable label (see the sketch below)

• ŷ = f̂(x) = argmax_{c=1..C} p(y = c | x, D)
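A minimal sketch of the MAP decision rule, assuming the class posteriors p(y = c | x, D) have already been computed by some model; the probabilities below are made up for illustration:

```python
import numpy as np

# Hypothetical posterior probabilities p(y = c | x, D) for C = 3 classes
posterior = np.array([0.2, 0.5, 0.3])

# MAP estimate: pick the most probable label
y_hat = int(np.argmax(posterior)) + 1   # +1 if classes are labeled 1..C
print(y_hat)                            # 2
```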


Another Example: Advanced Notation

• Suppose you want to model the cost of a used car

• inputs: x are the car attributes (brand, mileage, year, etc.)

• output: y price of the car

• learn: model parameters w (β in prior slides)

• Deterministic linear model: y = f(x|w) = w^T x

• Deterministic non-linear model: y = f(x|w) = w^T φ(x)

• Non-deterministic non-linear model: y = f(x|w) = w^T φ(x) + ε, where ε ∼ N(µ, σ²) (a small sketch of all three follows below)
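A minimal sketch contrasting the three model forms for the car-price example; the weights w, the basis function φ, and the noise level are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

x = np.array([3.0, 60.0])          # toy car attributes, e.g. age (years), mileage (1000s)
w = np.array([-1.5, -0.05])        # made-up parameters

def phi(x):
    # A simple nonlinear basis expansion (assumed for illustration)
    return np.array([x[0], x[1], x[0] ** 2])

w_phi = np.array([-1.5, -0.05, 0.1])

y_lin = w @ x                                   # deterministic linear:      w^T x
y_nonlin = w_phi @ phi(x)                       # deterministic non-linear:  w^T phi(x)
y_noisy = w_phi @ phi(x) + rng.normal(0, 1.0)   # non-deterministic:         w^T phi(x) + eps
print(y_lin, y_nonlin, y_noisy)
```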


Another Example: Linear Regression

• Parameters θ = (w, σ²)

• Clustering - model cars by p(x_i|θ) and see if there is a natural grouping of cars

• µ(x) = w_0 + w_1 x = w^T x

• Linear regression: p(y|x, θ) = N(y | w^T φ(x), σ²) - the basis function expansion with a Gaussian distribution


Logistic Regression

• Assume y ∈ {0, 1}; then a Bernoulli distribution is more accurate than a Gaussian

• p(y|x, w) = Ber(y|µ(x)), where µ(x) = E[y|x] = p(y = 1|x)

• If we require 0 ≤ µ(x) ≤ 1, say by a sigmoid function:
  • µ(x) = sigm(w^T x) = 1 / (1 + exp(−w^T x)) (sometimes called the logit or logistic function)

• Then we can classify by a decision rule like ŷ = 1 ↔ p(y = 1|x) > 0.5 (see the sketch below)
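A minimal sketch of the sigmoid and the 0.5-threshold decision rule; the weights here are made up, not fitted:

```python
import numpy as np

def sigm(z):
    # Logistic (sigmoid) function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.8, -0.4])          # hypothetical weights
x = np.array([2.0, 1.0])           # one feature vector

p1 = sigm(w @ x)                   # p(y = 1 | x, w)
y_hat = 1 if p1 > 0.5 else 0       # decision rule: y_hat = 1  <=>  p(y = 1 | x) > 0.5
print(p1, y_hat)                   # ~0.769, 1
```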


Logistic Regression

• p(y|x, β) = sigm(β^T x) = 1 / (1 + exp(−β^T x))


K Nearest Neighbor

• p(y = c | x, D, k) = (1/k) ∑_{i∈N_k(x,D)} I(y_i = c) (see the sketch below)
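A minimal NumPy sketch of this estimate: find the k nearest training points to x (Euclidean distance is an assumption here) and average the indicator I(y_i = c) over that neighborhood; the toy data are made up for illustration:

```python
import numpy as np

def knn_posterior(x, X_train, y_train, k, classes):
    # Indices N_k(x, D) of the k nearest training points (Euclidean distance)
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors = np.argsort(dists)[:k]
    # p(y = c | x, D, k) = (1/k) * sum over neighbors of I(y_i = c)
    return {c: float(np.mean(y_train[neighbors] == c)) for c in classes}

# Toy training data
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

print(knn_posterior(np.array([0.2, 0.1]), X_train, y_train, k=3, classes=[0, 1]))
# {0: 0.666..., 1: 0.333...}
```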


Curse of Dimensionality


No Free Lunch Theorem

• "All models are wrong but some models are useful" - G. Box, 1987

• No single ML system works for everything

• Principal Component Analysis is one way to address the curse of dimensionality

• Machine learning is not magic: it can't get something out of nothing, but it can get more from less!


Takeaways and Next Time

• What is the difference between supervised and unsupervised learning?

• What are some basics of supervised learning?

• Difference between parametric and non-parametric modeling

• Next Time: Probability, K-Nearest Neighbor, and Naive Bayes Classification

• Sources: Machine Learning: A Probabilistic Perspective - Murphy, and An Introduction to Statistical Learning - James et al.

