Machine Learning CSE446
Kevin Jamieson and Anna Karlin, University of Washington
April 1, 2019
Traditional algorithms
[Figure: social media mentions of cats vs. dogs on Reddit, Google, and Twitter]
Graphics courtesy of https://theoutline.com/post/3128/dogs-cats-internet-popularity?zd=1
Write a program that sorts tweets into those containing "cat", "dog", or other.
def sort_tweets(tweets):
    # Hand-coded decision rules: simple keyword tests written by an expert.
    cats, dogs, other = [], [], []
    for tweet in tweets:
        if 'cat' in tweet:
            cats.append(tweet)
        elif 'dog' in tweet:
            dogs.append(tweet)
        else:
            other.append(tweet)
    return cats, dogs, other
Machine learning algorithms
Write a program that sorts images into those containing "birds", "airplanes", or other.

def sort_images(images):
    birds, planes, other = [], [], []
    for image in images:
        if bird in image:        # pseudocode: there is no hand-codable test for "bird in image"
            birds.append(image)
        elif plane in image:
            planes.append(image)
        else:
            other.append(image)
    return birds, planes, other

[Figure: images plotted in a 2-D feature space (feature 1 vs. feature 2), with clusters labeled airplane, bird, and other]
The decision rule "if bird in image:" is LEARNED using DATA.
The decision rule "if 'cat' in tweet:" is hard-coded by an expert.
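To make the contrast concrete, here is a minimal sketch of what "learning the decision rule" can look like, assuming each image has already been reduced to a numeric feature vector (feature 1, feature 2); the toy data, variable names, and the choice of scikit-learn's LogisticRegression are illustrative assumptions, not the method from the slides:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical 2-D features for a toy labeled training set; in reality these
# would be extracted from labeled images.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in (-2.0, 0.0, 2.0)])
y_train = np.array(['airplane'] * 50 + ['other'] * 50 + ['bird'] * 50)

clf = LogisticRegression()
clf.fit(X_train, y_train)              # the decision rule is LEARNED from data

def sort_images(images, features):
    # Sort using the learned rule instead of a hand-coded keyword test.
    labels = clf.predict(np.asarray(features))
    birds = [im for im, lab in zip(images, labels) if lab == 'bird']
    planes = [im for im, lab in zip(images, labels) if lab == 'airplane']
    other = [im for im, lab in zip(images, labels) if lab not in ('bird', 'airplane')]
    return birds, planes, other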
Machine Learning Ingredients
• Data: past observations
• Hypotheses/models: devised to capture the patterns in data
• Prediction: apply the model to forecast future observations
A mix of statistics (theory) and algorithms (programming)
Flavors of ML
• Regression: predict a continuous value, e.g., stock market, credit score, temperature, Netflix rating
• Classification: predict a categorical value, e.g., loan or not? spam or not? what disease is this?
• Unsupervised learning: predict structure, e.g., tree of life from DNA, find similar images, community detection
What this class is:
• Fundamentals of ML: bias/variance tradeoff, overfitting, optimization and computational tradeoffs, supervised learning (e.g., linear models, boosting, deep learning), unsupervised models (e.g., k-means, EM, PCA)
• Preparation for further learning: the field is fast-moving, but you will be able to apply the basics and teach yourself the latest

What this class is not:
• A survey course: a laundry list of algorithms or how to win Kaggle
• An easy course: familiarity with intro linear algebra and probability is assumed, and the homework will be time-consuming
CSE446: Machine Learning
Lecture: Monday, Wednesday 3:30-4:50
Room: GUG 220
Instructors: Kevin Jamieson and Anna Karlin
Contact: [email protected]
Website: https://courses.cs.washington.edu/courses/cse446/19sp/
Prerequisites
• Formally: CSE 312, MATH 308, STAT 390 or equivalent
• Familiarity with:
  - Linear algebra: linear dependence, rank, linear equations
  - Multivariate calculus
  - Probability and statistics: distributions, densities, marginalization, moments
  - Algorithms: basic data structures, complexity
• "Can I learn these topics concurrently?" Use HW0 to judge your skills; see the website for review materials!
Grading
• 5 homeworks (50%). Each contains theoretical questions and programming. Collaboration is okay, but you must write, submit, and understand your own answers and code (which we may run). Do not Google for answers.
• Midterm May 8 (20%) and Final June 13 (30%)
Homeworks
• HW 0 is out (due next Monday at midnight, 37 points): a short review. Work individually and treat it as a barometer for readiness.
• HW 1, 2, 3, 4 (100 points each): they are not easy or short, so start early.
• Submit to Gradescope; regrade requests also go through Gradescope. There is no credit for late work; it receives 0 points.
1. All code must be written in Python.
2. All written work must be typeset (e.g., LaTeX).
See the course website for tutorials and references.
Communication Channels
• Announcements, questions about class, homework help: Piazza (get the code on Canvas), section, and office hours (starting tomorrow)
• Regrade requests: directly on Gradescope
• Personal concerns: email [email protected]
• Anonymous feedback: see the website for the link
Staff
• 13 experienced TAs and lots of office hours (see website): Satvik Agarwal, Kousuke Ariga, Eric Chan, Benjamin Evans, Shobhit Hathi, Zeyu Liu, Andrew Luo, Vardhman Mehta, Cheng Ni, Deric Pang, Robbie Weber, Kyle Zhang, and Michael Zhang
Text Books
• Required textbook: Machine Learning: A Probabilistic Perspective; Kevin Murphy
• Optional book (free PDF): The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Trevor Hastie, Robert Tibshirani, Jerome Friedman
Add code requests
Add codes are given out according to a centralized process organized by CSE. Do not email the instructors; we cannot help.
Resources:
- https://www.cs.washington.edu/academics/ugrad/advising/
- https://www.cs.washington.edu/academics/ugrad/courses/petition
Enjoy!
• ML is becoming ubiquitous in science, engineering, and beyond
• It's one of the hottest topics in industry today
• This class should give you the basic foundation for applying ML and developing new methods
• The fun begins…
Maximum Likelihood Estimation
Machine Learning CSE446, Kevin Jamieson, University of Washington
April 1, 2019
Your first consulting job
Billionaire: I have a special coin. If I flip it, what's the probability it will be heads?
You: Please flip it a few times:
You: The probability is:
Billionaire: Why?
Coin: Binomial Distribution
• Data: sequence D = (HHTHT…), k heads out of n flips
• Hypothesis: P(Heads) = θ, P(Tails) = 1 − θ
• Flips are i.i.d.: independent events, identically distributed, so the number of heads follows a Binomial distribution

P(D|θ) = θ^k (1 − θ)^(n−k)
Maximum Likelihood Estimation
• Data: sequence D = (HHTHT…), k heads out of n flips
• Hypothesis: P(Heads) = θ, P(Tails) = 1 − θ
• Maximum likelihood estimation (MLE): choose the θ that maximizes the probability of the observed data:

P(D|θ) = θ^k (1 − θ)^(n−k)
θ̂_MLE = argmax_θ P(D|θ) = argmax_θ log P(D|θ)
Your first learning algorithm

θ̂_MLE = argmax_θ log P(D|θ) = argmax_θ log[θ^k (1 − θ)^(n−k)]

• Set the derivative to zero: (d/dθ) log P(D|θ) = 0
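Carrying out that derivative gives the closed-form estimator used on the next slide (the intermediate algebra, filled in for completeness):

log P(D|θ) = k log θ + (n − k) log(1 − θ)
(d/dθ) log P(D|θ) = k/θ − (n − k)/(1 − θ) = 0
⇒ k(1 − θ) = (n − k)θ ⇒ θ̂_MLE = k/n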
How many flips do I need?
• You: Flip the coin 5 times. Billionaire: I got 3 heads. θ̂_MLE = 3/5 = 0.6
• You: Flip the coin 50 times. Billionaire: I got 20 heads. θ̂_MLE = 20/50 = 0.4
• Billionaire: Which one is right? Why?

θ̂_MLE = k/n
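Intuitively the 50-flip estimate deserves more trust, even though both are "the MLE". A quick simulation shows how the spread of θ̂_MLE = k/n shrinks with n (a minimal sketch; the true bias θ = 0.5 and the flip counts are illustrative assumptions, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.5                # assumed true bias, for simulation only

for n in (5, 50, 500):
    # Repeat the n-flip experiment 10,000 times and measure the spread of the MLE.
    flips = rng.random((10_000, n)) < theta_true   # Bernoulli(theta_true) flips
    theta_hat = flips.mean(axis=1)                 # MLE = k/n for each experiment
    print(n, round(theta_hat.std(), 4))            # shrinks roughly like 1/sqrt(n)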
Simple bound (based on Hoeffding's inequality)
• For n flips and k heads, the MLE θ̂_MLE = k/n is unbiased for the true θ*:

E[θ̂_MLE] = θ*

• Hoeffding's inequality says that for any ε > 0:

P(|θ̂_MLE − θ*| ≥ ε) ≤ 2e^(−2nε²)
PAC Learning
• PAC: Probably Approximately Correct
• Billionaire: I want to know the parameter θ* to within ε = 0.1, with probability at least 1 − δ = 0.95. How many flips?

P(|θ̂_MLE − θ*| ≥ ε) ≤ 2e^(−2nε²)
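Setting the right-hand side of the bound to δ and solving for n answers the question (a step the slide leaves to be worked out):

2e^(−2nε²) ≤ δ ⇒ n ≥ log(2/δ) / (2ε²)

With ε = 0.1 and δ = 0.05, this gives n ≥ log(40)/0.02 ≈ 184.4, so about 185 flips suffice.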
What about continuous variables?
• Billionaire: What if I am measuring a continuous variable?
• You: Let me tell you about Gaussians…
Some properties of Gaussians
• Affine transformation (multiplying by a scalar and adding a constant):

X ~ N(μ, σ²), Y = aX + b ⇒ Y ~ N(aμ + b, a²σ²)

• Sum of (independent) Gaussians:

X ~ N(μ_X, σ²_X), Y ~ N(μ_Y, σ²_Y), Z = X + Y ⇒ Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
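Both identities are easy to sanity-check numerically (a minimal sketch; the particular parameter values and sample sizes are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
mu, sigma, a, b = 2.0, 3.0, 0.5, -1.0

x = rng.normal(mu, sigma, size=1_000_000)
y = a * x + b
# Affine transformation: expect mean a*mu + b = 0.0 and variance a^2 * sigma^2 = 2.25
print(y.mean(), y.var())

w = rng.normal(-1.0, 2.0, size=1_000_000)   # independent of x
z = x + w
# Sum of independent Gaussians: expect mean 2.0 + (-1.0) = 1.0 and variance 9.0 + 4.0 = 13.0
print(z.mean(), z.var())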
MLE for Gaussian
• Probability of i.i.d. samples D = {x_1, …, x_n} (e.g., exam scores):

P(D|μ, σ) = P(x_1, …, x_n|μ, σ) = (1/(σ√(2π)))^n ∏_{i=1}^n e^(−(x_i − μ)²/(2σ²))

• Log-likelihood of the data:

log P(D|μ, σ) = −n log(σ√(2π)) − Σ_{i=1}^n (x_i − μ)²/(2σ²)
Your second learning algorithm: MLE for the mean of a Gaussian
• What's the MLE for the mean?

(d/dμ) log P(D|μ, σ) = (d/dμ) [−n log(σ√(2π)) − Σ_{i=1}^n (x_i − μ)²/(2σ²)]
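Only the sum depends on μ, so setting the derivative to zero gives (algebra filled in; the result matches the summary slide below):

(d/dμ) log P(D|μ, σ) = Σ_{i=1}^n (x_i − μ)/σ² = 0
⇒ Σ_{i=1}^n x_i = nμ ⇒ μ̂_MLE = (1/n) Σ_{i=1}^n x_i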
MLE for variance
• Again, set the derivative to zero:

(d/dσ) log P(D|μ, σ) = (d/dσ) [−n log(σ√(2π)) − Σ_{i=1}^n (x_i − μ)²/(2σ²)]
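Differentiating with respect to σ and solving (algebra filled in; the result appears on the next slide):

(d/dσ) log P(D|μ, σ) = −n/σ + Σ_{i=1}^n (x_i − μ)²/σ³ = 0
⇒ σ̂²_MLE = (1/n) Σ_{i=1}^n (x_i − μ̂_MLE)²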
Learning Gaussian parameters
• MLE:

μ̂_MLE = (1/n) Σ_{i=1}^n x_i
σ̂²_MLE = (1/n) Σ_{i=1}^n (x_i − μ̂_MLE)²

• The MLE for the variance of a Gaussian is biased: E[σ̂²_MLE] ≠ σ²
• Unbiased variance estimator:

σ̂²_unbiased = (1/(n−1)) Σ_{i=1}^n (x_i − μ̂_MLE)²
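The bias is easy to see empirically; NumPy exposes both estimators via the ddof argument of var (a minimal sketch, with arbitrary true parameters and sample size chosen for illustration):

import numpy as np

rng = np.random.default_rng(2)
n, sigma2 = 5, 4.0

# Draw many size-n datasets from N(0, sigma2) and average each estimator over them.
data = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
print(data.var(axis=1, ddof=0).mean())   # MLE: ~ (n-1)/n * sigma2 = 3.2 (biased low)
print(data.var(axis=1, ddof=1).mean())   # unbiased: ~ sigma2 = 4.0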
Maximum Likelihood Estimation

Observe X_1, X_2, …, X_n drawn i.i.d. from f(x; θ) for some "true" θ = θ*

Likelihood function: L_n(θ) = ∏_{i=1}^n f(X_i; θ)
Log-likelihood function: ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)
Maximum likelihood estimator (MLE): θ̂_MLE = argmax_θ L_n(θ)

Properties (under benign regularity conditions: smoothness, identifiability, etc.):
• Asymptotically consistent and normal: (θ̂_MLE − θ*)/ŝe → N(0, 1)
• Asymptotic optimality: minimum variance (see the Cramér-Rao lower bound)
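When no closed form exists, the MLE can be computed numerically by minimizing the negative log-likelihood. A minimal sketch for the Gaussian case (the choice of scipy.optimize.minimize, the Nelder-Mead method, and the log-σ parameterization are implementation assumptions, not from the slides):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(2.0, 3.0, size=1000)   # synthetic data with "true" mu = 2, sigma = 3

def neg_log_likelihood(params):
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return np.sum(np.log(sigma * np.sqrt(2 * np.pi)) + (x - mu) ** 2 / (2 * sigma ** 2))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method='Nelder-Mead')
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # should match the closed-form MLE: x.mean() and x.std()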
Recap
• Learning is…
- Collect some data, e.g., coin flips
- Choose a hypothesis class or model, e.g., binomial
- Choose a loss function, e.g., data likelihood
- Choose an optimization procedure, e.g., set the derivative to zero to obtain the MLE
- Justify the accuracy of the estimate, e.g., Hoeffding's inequality