CSCE 633: Machine Learning
Lecture 1: Introduction and Probability
Texas A&M University
8-28-18
Welcome to CSCE 633!
• About this class
• Who are we?
• Syllabus
• Text and other references
• Introduction to Machine Learning
B Mortazavi, CSE. Copyright 2018.
Goals of this lecture
• Course structure
• Understanding the difference between supervised and unsupervised machine learning
• Review of matrix operations
• Understanding differences in notation
• Understanding differences in some supervised models
Who are we?
• Instructor
  • Bobak Mortazavi
  • bobakm@tamu.edu (please put [CSCE 633] in the subject line)
  • Office Hours: W 10-11a, R 11a-12p, 328A HRBB
• TA
  • Xiaohan Chen
  • chernxh@tamu.edu
  • Office Hours: M 10:15-11:15a, W 4:30-5:30p, 320 HRBB
• Discussion Board on eCampus
Syllabus, Schedule, and eCampus
Text Book
Textbook for this course
Other useful references
Assignments
• Homework will be compiled in LaTeX; you will turn in PDFs
• Programming for homework will be in R or Python
• For projects, if you need to use a different language, please explain why
• Project teams will have 2-3 people per team
• Project proposals will be a 2-page written document
• Project reports will be an 8-page IEEE conference-style report
Kahoot
• kahoot.com
• smartphone applications
• ability to interact in class!
• not graded
Topics
• About this class
• Introduction to Machine Learning
  • What is machine learning?
  • Models and accuracy
  • Some general supervised learning
What is machine learning?
Big Data! 40 billion web pages, 100 hours of video uploaded to YouTube every minute, genomes of thousands of people.

Murphy defines machine learning as "a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!)."
Some cool examples!
Power P, Hobbs J, Ruiz H, Wei X, Lucey P. MythbustingSet-Pieces in Soccer. MIT Sloan Sports Conference 2018
Some cool examples!
Waymo.com/tech
What is machine learning?
Statistical learning (machine learning) is a vast set of tools tounderstand data.
a brief review of linear algebra
X =
  [ x_{1,1}  x_{1,2}  ⋯  x_{1,p} ]
  [ x_{2,1}  x_{2,2}  ⋯  x_{2,p} ]
  [    ⋮        ⋮     ⋱     ⋮    ]
  [ x_{n,1}  x_{n,2}  ⋯  x_{n,p} ]

where X is n × p (sometimes N × p, or N × D in the textbook)
X_j denotes column j of X, so we can rewrite X column-wise as (X_1, X_2, ⋯, X_p), or row-wise as

X =
  [ x_1ᵀ ]
  [ x_2ᵀ ]
  [  ⋮   ]
  [ x_nᵀ ]

where x_iᵀ is row i of X.
Matrix Multiplication
Suppose A ∈ R^{r×d} and B ∈ R^{d×s}; then (AB)_{ij} = ∑_{k=1}^d a_{ik} b_{kj}. For example, consider:

A =
  [ 1  2 ]
  [ 3  4 ]

and

B =
  [ 5  6 ]
  [ 7  8 ]

Then:

AB =
  [ 1×5 + 2×7   1×6 + 2×8 ]   =   [ 19  22 ]
  [ 3×5 + 4×7   3×6 + 4×8 ]       [ 43  50 ]
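As a quick check (not part of the slides), the same product can be computed with NumPy:

```python
import numpy as np

# The worked example above: (AB)_{ij} = sum_k a_{ik} * b_{kj}
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

AB = A @ B  # matrix product via the @ operator
print(AB)   # [[19 22]
            #  [43 50]]
```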
Vector Norms
• If x ∈ R^{p×1}, what is the "size" of this vector?
• A norm is a mapping from R^p to R with the following properties:
  • ‖x‖ ≥ 0 for all x
  • ‖x‖ = 0 ⟺ x = 0
  • ‖αx‖ = |α|·‖x‖ for α ∈ R
  • ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x and y
Vector Norms
• If x ∈ R^{p×1}, what is the "size" of this vector?
• 1-norm: ‖x‖₁ = ∑_{i=1}^p |x_i|
• 2-norm: ‖x‖₂ = √(∑_{i=1}^p x_i²)
• max-norm: ‖x‖_∞ = max_{1≤i≤p} |x_i|
• L_p-norm (or p-norm): ‖x‖_p = (∑_{i=1}^p |x_i|^p)^{1/p}
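The norms above can be sketched in NumPy; `np.linalg.norm` exposes the same quantities through its `ord` parameter:

```python
import numpy as np

# Computing the vector norms from the slide by hand, then via np.linalg.norm
x = np.array([3.0, -4.0])

one_norm = np.abs(x).sum()          # 1-norm: |3| + |-4| = 7
two_norm = np.sqrt((x ** 2).sum())  # 2-norm: sqrt(9 + 16) = 5
max_norm = np.abs(x).max()          # max-norm: max(3, 4) = 4

# np.linalg.norm computes the same values
assert one_norm == np.linalg.norm(x, 1)
assert two_norm == np.linalg.norm(x, 2)
assert max_norm == np.linalg.norm(x, np.inf)
```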
What is machine learning?
Statistical learning (machine learning) is a vast set of tools tounderstand data.
• Supervised Modeling: where we aim to predict or estimate an output based upon our input data
• Unsupervised Modeling: where we aim to learn relationships and structure in input data without the guidance of output data
• Reinforcement Learning: where we aim to learn optimal rules by rewarding the modeling method when it correctly satisfies some goal
Advertising Example
Assume we are trying to improve sales of a product through advertising in three different markets: TV, newspapers, and online/mobile. We define the advertising budget as:

• Input variables X: TV budget X₁, newspaper budget X₂, and mobile budget X₃ (also known as predictors, independent variables, features, covariates, or just variables)
• Output variable Y: sales numbers (also known as the response variable or dependent variable)
• We can generalize the input predictors to X = (X₁, X₂, ⋯, X_p)
• The model for the relationship is then Y = f(X) + ε
• f is the systematic information; ε is the random error
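A minimal simulation of Y = f(X) + ε, assuming a made-up linear f; the coefficients and budget ranges below are chosen purely for illustration and do not come from the slides:

```python
import numpy as np

# Hypothetical advertising data: a linear "systematic" f plus random error eps
rng = np.random.default_rng(0)

n = 200
X = rng.uniform(0, 100, size=(n, 3))   # TV, newspaper, mobile budgets
beta = np.array([0.05, 0.01, 0.03])    # assumed (made-up) true coefficients
f_X = 5.0 + X @ beta                   # systematic information f(X)
eps = rng.normal(0, 1.0, size=n)       # random error, independent of X
Y = f_X + eps                          # observed sales

# Even a perfect estimate of f cannot remove Var(eps): the irreducible error
print(Y[:3])
```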
Why estimate f ?
• Prediction: in many situations X is available to us but Y is not; we would like to estimate Y as accurately as possible
• Inference: we would like to understand how Y is affected by changes in X
Predicting f ?
Create an estimate f̂(X) that generates a Ŷ estimating Y as closely as possible, by attempting to minimize the reducible error, while accepting that there is some irreducible error generated by ε that is independent of X.
Why is there irreducible error?
Estimating f
Assume we have an estimate f̂ and a set of predictors X, such that Ŷ = f̂(X). Then we can calculate the expected value of the average squared error as:

E(Y − Ŷ)² = E[f(X) + ε − f̂(X)]² = [f(X) − f̂(X)]² + Var(ε)
The focus of this class is to learn techniques to estimate f tominimize reducible error.
How do we estimate f
• Need: training data to learn from
• Training data D = {(x₁, y₁), (x₂, y₂), ⋯, (x_n, y_n)}, where x_i = (x_{i1}, x_{i2}, ⋯, x_{ip})ᵀ for n subjects and p dimensions
• Note: the text uses the notation N subjects and D dimensions
• y can be:
  • categorical or nominal → the problem we solve is classification or pattern recognition
  • continuous → the problem we solve is regression
  • categorical with order (grades A, B, C, D, F) → the problem we solve is ordinal regression
Parametric Methods
• A 2-step, model-based approach to estimate f
• Step 1: make an assumption about the form of f; for example, that f is linear
  • f(X) = β₀ + β₁x₁ + β₂x₂ + ⋯ + β_p x_p
  • Advantage: only p + 1 coefficients have to be estimated to model f
• Step 2: fit or train a model/procedure to estimate β
  • The most common method for this is Ordinary Least Squares
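The two steps above can be sketched with NumPy's least-squares solver; the data and "true" coefficients here are synthetic and purely illustrative:

```python
import numpy as np

# Step 2 by Ordinary Least Squares on synthetic data
rng = np.random.default_rng(1)

n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.5])              # made-up true coefficients
y = 1.0 + X @ beta_true + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the intercept beta_0 is estimated too:
# f(X) = beta_0 + beta_1 x_1 + ... + beta_p x_p  ->  p + 1 coefficients
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)   # OLS solution

print(beta_hat)  # close to [1.0, 2.0, -1.0, 0.5]
```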
Parametric vs. Non-parametric
Flexibility vs. Interpretability
Model Accuracy
• Find some measure of error
• MSE = (1/n) ∑_{i=1}^n (y_i − f̂(x_i))²
• Methods are trained to reduce MSE on training data D. Is that enough?
• Why do we care about previously seen data? Instead we care about unseen test data, to understand how well the model predicts.
• So how do we select methods that reduce MSE on test data?
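One way to see why training MSE alone is not enough, as a sketch with made-up synthetic data: a high-degree polynomial can drive training MSE toward zero while doing worse on unseen test data:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # noisy samples of an underlying nonlinear function
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

x_train, y_train = make_data(20)   # small training set
x_test, y_test = make_data(200)    # unseen test set

results = {}
for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit on training data only
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),  # training MSE
        mse(y_test, np.polyval(coeffs, x_test)),    # test MSE
    )
    print(degree, results[degree])
```

Training MSE can only shrink as the degree grows (the models are nested), but the test MSE tells the real story about generalization.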
Bias Variance Tradeoff
• Variance: the amount f̂ would change with different training data
• Bias: the error introduced by approximating f with a simpler model
• E(y − f̂(x))² = Var(f̂(x)) + [Bias(f̂(x))]² + Var(ε)
• Error rate = (1/n) ∑_{i=1}^n I(y_i ≠ ŷ_i)
Bias Variance Tradeoff
• Error rate = (1/n) ∑_{i=1}^n I(y_i ≠ ŷ_i)
• Test error is minimized by understanding the conditional probability p(Y = j | X = x₀)
• Bayes error rate: 1 − E[max_j p(Y = j | X)]
• K-Nearest Neighbors estimates this by: p(Y = j | X = x₀) = (1/K) ∑_{i∈N₀} I(y_i = j), where N₀ is the set of the K points nearest x₀
Key Challenge: Avoid Overfitting
U-Shape Tradeoff
Multiclass Classification
• y_i ∈ {1, 2, ⋯, C}; if C = 2, then y ∈ {0, 1}, and if C > 2 we have a multi-class classification problem
Multiclass Classification
• Maximum a Posteriori (MAP) estimate: find the most probable label
• ŷ = f̂(x) = argmax_{c=1,…,C} p(y = c | x, D)
Another Example: Advanced Notation
• Suppose you want to model the cost of a used car
• Inputs: x are the car attributes (brand, mileage, year, etc.)
• Output: y is the price of the car
• Learn: model parameters w (β in prior slides)
• Deterministic linear model: y = f(x|w) = wᵀx
• Deterministic non-linear model: y = f(x|w) = wᵀφ(x)
• Non-deterministic non-linear model: y = f(x|w) = wᵀφ(x) + ε, where ε ∼ N(µ, σ²)
Another Example: Linear Regression
• Parameters: θ = (w, σ²)
• Clustering: model cars by p(x_i|θ) and see if there is a natural grouping of cars
• µ(x) = w₀ + w₁x = wᵀx
• Linear regression: p(y|x, θ) = N(y | wᵀφ(x), σ²), a basis function expansion with a Gaussian distribution
Logistic Regression
• Assume y ∈ {0, 1}; then a Bernoulli distribution is more accurate than a Gaussian
• p(y|x, w) = Ber(y | µ(x)), where µ(x) = E[y|x] = p(y = 1|x)
• We require 0 ≤ µ(x) ≤ 1, which we can achieve with a sigmoid function:
• µ(x) = sigm(wᵀx) = 1 / (1 + exp(−wᵀx)) (the logistic function)
• Then we can classify with a decision rule like ŷ = 1 ⟺ p(y = 1|x) > 0.5
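A minimal sketch of this decision rule, assuming hypothetical weights w (not learned from any data):

```python
import numpy as np

def sigm(z):
    # logistic (sigmoid) function: squashes any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0])     # hypothetical learned weights
x = np.array([2.0, 0.5])      # one input vector

mu = sigm(w @ x)              # p(y = 1 | x)
y_hat = 1 if mu > 0.5 else 0  # decision rule: y_hat = 1 iff p(y = 1 | x) > 0.5

print(mu, y_hat)
```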
Logistic Regression
• p(y|x, β) = sigm(βᵀx) = 1 / (1 + exp(−βᵀx))
K Nearest Neighbor
• p(y = c | x, D, K) = (1/K) ∑_{i∈N_K(x,D)} I(y_i = c), where N_K(x, D) is the set of the K points in D nearest x
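A minimal K-nearest-neighbor sketch implementing this formula directly; the helper name and toy data below are made up for illustration:

```python
import numpy as np

def knn_predict(X_train, y_train, x, K=3):
    # p(y = c | x, D, K) = (1/K) * sum over the K nearest i of I(y_i = c)
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:K]               # indices of the K nearest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    probs = {int(c): k / K for c, k in zip(labels, counts)}
    return int(labels[np.argmax(counts)]), probs  # most probable class, estimates

# Toy data: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], float)
y_train = np.array([0, 0, 0, 1, 1, 1])

label, probs = knn_predict(X_train, y_train, np.array([0.2, 0.3]), K=3)
print(label, probs)  # 0 {0: 1.0}
```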
Curse of Dimensionality
No Free Lunch Theorem
• "All models are wrong, but some models are useful" (G. Box, 1987)
• No single ML system works for everything
• Principal Component Analysis is one way to address the curse of dimensionality
• Machine learning is not magic: it can't get something out of nothing, but it can get more from less!
Takeaways and Next Time
• What is the difference between supervised and unsupervised learning?
• What are some basics of supervised learning?
• What is the difference between parametric and non-parametric modeling?
• Next time: Probability, K-Nearest Neighbors, and Naive Bayes classification
• Sources: Machine Learning: A Probabilistic Perspective (Murphy) and An Introduction to Statistical Learning (James et al.)