MLE/MAP — 10-601 Introduction to Machine Learning, Lecture 20
TRANSCRIPT
MLE/MAP

1

10-601 Introduction to Machine Learning
Matt Gormley
Lecture 20
Oct 29, 2018
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Q&A
9
PROBABILISTIC LEARNING
11
Probabilistic Learning

Function Approximation
Previously, we assumed that our output was generated using a deterministic target function:
y = c*(x)
Our goal was to learn a hypothesis h(x) that best approximates c*(x).

Probabilistic Learning
Today, we assume that our output is sampled from a conditional probability distribution:
y ~ p*(y|x)
Our goal is to learn a probability distribution p(y|x) that best approximates p*(y|x).
12
Robotic Farming

13

|                                 | Deterministic                               | Probabilistic                         |
|---------------------------------|---------------------------------------------|---------------------------------------|
| Classification (binary output)  | Is this a picture of a wheat kernel?        | Is this plant drought resistant?      |
| Regression (continuous output)  | How many wheat kernels are in this picture? | What will the yield of this plant be? |
Oracles and Sampling

Whiteboard
– Sampling from common probability distributions
  • Bernoulli
  • Categorical
  • Uniform
  • Gaussian
– Pretending to be an Oracle (Regression)
  • Case 1: Deterministic outputs
  • Case 2: Probabilistic outputs
– Probabilistic Interpretation of Linear Regression
  • Adding Gaussian noise to linear function
  • Sampling from the noise model
– Pretending to be an Oracle (Classification)
  • Case 1: Deterministic labels
  • Case 2: Probabilistic outputs (Logistic Regression)
  • Case 3: Probabilistic outputs (Gaussian Naïve Bayes)

15
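The whiteboard items on sampling can be sketched in code. A minimal illustration — assuming only a uniform source of randomness (Python's `random.random()` stands in for the `rand()` used later in the exercise) — of drawing from a Bernoulli, a Uniform(a, b), and a Gaussian via the Box-Muller transform:

```python
import math
import random

def bernoulli_sample(phi):
    # Return 1 with probability phi, else 0, using one uniform draw.
    return 1 if random.random() < phi else 0

def uniform_sample(a, b):
    # Rescale a Uniform(0, 1) draw to Uniform(a, b).
    return a + (b - a) * random.random()

def gaussian_sample(mu, sigma):
    # Box-Muller transform: two independent Uniform draws yield one
    # standard normal, which we then shift and scale.
    u1 = 1.0 - random.random()   # in (0, 1], so log(u1) is safe
    u2 = random.random()
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)
    return mu + sigma * z
```

Each sampler reduces "sampling from a named distribution" to transforming uniform draws, which is the recurring trick in this part of the lecture.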
In-Class Exercise

1. With your neighbor, write a function which returns samples from a Categorical
   – Assume access to the rand() function
   – Function signature should be: categorical_sample(theta), where theta is the array of parameters
   – Make your implementation as efficient as possible!
2. What is the expected runtime of your function?
16
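One possible answer to the exercise — a sketch, not the official solution — using the inverse-CDF idea with a single uniform draw (Python's `random.random()` stands in for `rand()`):

```python
import random

def categorical_sample(theta):
    """Draw one sample from Categorical(theta).

    theta: array of K nonnegative parameters summing to 1.
    Returns the sampled index in {0, ..., K-1}.
    """
    u = random.random()      # stands in for rand() ~ Uniform(0, 1)
    cumulative = 0.0
    for k, theta_k in enumerate(theta):
        cumulative += theta_k
        if u < cumulative:
            return k
    return len(theta) - 1    # guard against floating-point round-off
```

The expected runtime is O(K) per draw. Usual efficiency tricks: sort theta in decreasing order so likely outcomes exit the loop early, or precompute the CDF once and binary-search it for O(log K) per draw.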
Generative vs. Discriminative

Whiteboard
– Generative vs. Discriminative Models
  • Chain rule of probability
  • Maximum (Conditional) Likelihood Estimation for Discriminative models
  • Maximum Likelihood Estimation for Generative models
17
Categorical Distribution

Whiteboard
– Categorical distribution details
  • Independent and Identically Distributed (i.i.d.)
  • Example: Dice Rolls
18
Takeaways
• One view of what ML is trying to accomplish is function approximation
• The principle of maximum likelihood estimation provides an alternate view of learning
• Synthetic data can help debug ML algorithms
• Probability distributions can be used to model real data that occurs in the world (don't worry, we'll make our distributions more interesting soon!)
19
Learning Objectives
Oracles, Sampling, Generative vs. Discriminative

You should be able to…
1. Sample from common probability distributions
2. Write a generative story for a generative or discriminative classification or regression model
3. Pretend to be a data generating oracle
4. Provide a probabilistic interpretation of linear regression
5. Use the chain rule of probability to contrast generative vs. discriminative modeling
6. Define maximum likelihood estimation (MLE) and maximum conditional likelihood estimation (MCLE)
20
PROBABILITY
21
Random Variables: Definitions

22

Discrete Random Variable X
A random variable whose values come from a countable set (e.g. the natural numbers or {True, False}).

Probability mass function (pmf) p(x)
A function giving the probability that discrete r.v. X takes value x:
p(x) := P(X = x)
Random Variables: Definitions

23

Continuous Random Variable X
A random variable whose values come from an interval or collection of intervals (e.g. the real numbers or the range (3, 5)).

Probability density function (pdf) f(x)
A function that returns a nonnegative real number indicating the relative likelihood that a continuous r.v. X takes value x.

• For any continuous random variable: P(X = x) = 0
• Non-zero probabilities are only assigned to intervals:
P(a ≤ X ≤ b) = ∫_a^b f(x) dx
Random Variables: Definitions

24

Cumulative distribution function F(x)
A function that returns the probability that a random variable X is less than or equal to x:
F(x) = P(X ≤ x)

• For discrete random variables:
F(x) = P(X ≤ x) = Σ_{x′ ≤ x} P(X = x′) = Σ_{x′ ≤ x} p(x′)

• For continuous random variables:
F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(x′) dx′
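These definitions are easy to check numerically for a discrete r.v. A small sketch — the fair-die pmf is an illustrative choice, not from the slides — using exact fractions to avoid round-off:

```python
from fractions import Fraction

# pmf of a fair six-sided die: p(x) = 1/6 for x in {1, ..., 6}
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(pmf, x):
    # F(x) = P(X <= x) = sum of p(x') over all x' <= x
    return sum(p for v, p in pmf.items() if v <= x)
```

For example, `cdf(pmf, 3)` is 1/2 and `cdf(pmf, 6)` is 1, matching the requirement that F is nondecreasing and reaches 1.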
Notational Shortcuts

25

A convenient shorthand:
P(A|B) = P(A, B) / P(B)
means that, for all values of a and b:
P(A = a | B = b) = P(A = a, B = b) / P(B = b)
Notational Shortcuts

But then how do we tell P(E) apart from P(X)? (E an event, X a random variable)

26

Instead of writing: P(A|B) = P(A, B) / P(B)
We should write: P_{A|B}(A|B) = P_{A,B}(A, B) / P_B(B)
…but only probability theory textbooks go to such lengths.
COMMON PROBABILITY DISTRIBUTIONS
27
Common Probability Distributions

• For Discrete Random Variables:
– Bernoulli
– Binomial
– Multinomial
– Categorical
– Poisson

• For Continuous Random Variables:
– Exponential
– Gamma
– Beta
– Dirichlet
– Laplace
– Gaussian (1D)
– Multivariate Gaussian

28
Common Probability Distributions

Beta Distribution

probability density function:
f(φ|α, β) = (1 / B(α, β)) · φ^(α−1) (1 − φ)^(β−1)

[Plot of the Beta pdf f(φ|α, β) over φ ∈ [0, 1] for (α, β) = (0.1, 0.9), (0.5, 0.5), (1.0, 1.0), (5.0, 5.0), (10.0, 5.0)]
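For reference, the Beta density can be evaluated directly. A sketch using the identity B(α, β) = Γ(α)Γ(β)/Γ(α + β) (the identity is standard, not stated on the slide):

```python
import math

def beta_pdf(x, alpha, beta):
    # f(x | alpha, beta) = x^(alpha-1) * (1-x)^(beta-1) / B(alpha, beta)
    # where B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
    B = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / B
```

A quick sanity check: Beta(1, 1) is the Uniform(0, 1) distribution, so its density is 1 everywhere on (0, 1).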
Common Probability Distributions

Dirichlet Distribution

[Figure residue on this page duplicates the Beta pdf and plot from the previous slide.]
Common Probability Distributions

Dirichlet Distribution

probability density function:
p(θ|α) = (1 / B(α)) · ∏_{k=1}^{K} θ_k^(α_k − 1), where B(α) = (∏_{k=1}^{K} Γ(α_k)) / Γ(∑_{k=1}^{K} α_k)

[3D plots of the Dirichlet density p(θ|α) over the simplex (θ_1, θ_2) for two settings of α]
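The normalizer B(α) above translates directly to code. A sketch evaluating the Dirichlet density at a point on the simplex (requires Python 3.8+ for `math.prod`):

```python
import math

def dirichlet_pdf(theta, alpha):
    # p(theta | alpha) = (1 / B(alpha)) * prod_k theta_k^(alpha_k - 1),
    # where B(alpha) = prod_k Gamma(alpha_k) / Gamma(sum_k alpha_k)
    B = math.prod(math.gamma(a) for a in alpha) / math.gamma(sum(alpha))
    return math.prod(t ** (a - 1) for t, a in zip(theta, alpha)) / B
```

Sanity checks: with K = 2 the Dirichlet reduces to the Beta, and Dirichlet(1, …, 1) is uniform over the simplex (constant density Γ(K)).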
EXPECTATION AND VARIANCE
32
Expectation and Variance

33

The expected value of X is E[X]. Also called the mean. Suppose X can take any value in the set 𝒳.

• Discrete random variables:
E[X] = Σ_{x ∈ 𝒳} x p(x)

• Continuous random variables:
E[X] = ∫_{−∞}^{+∞} x f(x) dx
Expectation and Variance

34

The variance of X is Var(X):
Var(X) = E[(X − E[X])²]

Let µ = E[X].

• Discrete random variables:
Var(X) = Σ_{x ∈ 𝒳} (x − µ)² p(x)

• Continuous random variables:
Var(X) = ∫_{−∞}^{+∞} (x − µ)² f(x) dx
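Both definitions translate directly to code for a discrete r.v. A sketch using a fair die (an illustrative example, not from the slides):

```python
def expectation(pmf):
    # E[X] = sum over x of x * p(x)
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    # Var(X) = sum over x of (x - mu)^2 * p(x), with mu = E[X]
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# pmf of a fair six-sided die
die = {x: 1.0 / 6.0 for x in range(1, 7)}
```

For the fair die this gives E[X] = 3.5 and Var(X) = 35/12 ≈ 2.92.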
MULTIPLE RANDOM VARIABLES
Joint probabilityMarginal probabilityConditional probability
35
Joint Probability

36

Means, Variances and Covariances
• The mean and covariance of a vector random variable:
E[x] = ∫ x p(x) dx = m
Cov[x] = E[(x − m)(x − m)ᵀ] = ∫ (x − m)(x − m)ᵀ p(x) dx = V
which is the expected value of the outer product of the variable with itself, after subtracting the mean.
• Also, the covariance between two variables:
Cov[x, y] = E[(x − m_x)(y − m_y)ᵀ] = ∫∫ (x − m_x)(y − m_y)ᵀ p(x, y) dx dy = C
which is the expected value of the outer product of one variable with another, after subtracting their means. Note: C is not symmetric.

Joint Probability
• Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking.
• We call this a joint ensemble and write p(x, y) = prob(X = x and Y = y)

Marginal Probabilities
• We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
p(x) = Σ_y p(x, y)
• This is like adding slices of the table together.
• Another equivalent definition: p(x) = Σ_y p(x|y) p(y)

Conditional Probability
• If we know that some event has occurred, it changes our belief about the probability of other events.
• This is like taking a "slice" through the joint table:
p(x|y) = p(x, y) / p(y)

Slide from Sam Roweis (MLSS, 2005)
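The joint/marginal/conditional operations on the Roweis slide can be demonstrated on a tiny joint table (the weather/ground example is made up for illustration):

```python
# A joint distribution p(x, y) over X in {rain, sun} and Y in {wet, dry}.
joint = {
    ("rain", "wet"): 0.30, ("rain", "dry"): 0.10,
    ("sun",  "wet"): 0.05, ("sun",  "dry"): 0.55,
}

def marginal_x(joint, x):
    # p(x) = sum over y of p(x, y)  ("summing out" y)
    return sum(p for (xi, y), p in joint.items() if xi == x)

def conditional(joint, x, y):
    # p(x | y) = p(x, y) / p(y)  (a "slice" through the joint table)
    p_y = sum(p for (xi, yi), p in joint.items() if yi == y)
    return joint[(x, y)] / p_y
```

Here `marginal_x(joint, "rain")` is 0.4, and `conditional(joint, "rain", "wet")` is 0.30/0.35 ≈ 0.857 — conditioning renormalizes the slice so it sums to one.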
Marginal Probabilities

37

(This page repeats the same Roweis slide; the relevant panel is "Marginal Probabilities": p(x) = Σ_y p(x, y).)

Slide from Sam Roweis (MLSS, 2005)
Conditional Probability

38

(This page repeats the same Roweis slide; the relevant panel is "Conditional Probability": p(x|y) = p(x, y) / p(y).)

Slide from Sam Roweis (MLSS, 2005)
Independence and Conditional Independence

39

Bayes' Rule
• Manipulating the basic definition of conditional probability gives one of the most important formulas in probability theory:
p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_{x′} p(y|x′) p(x′)
• This gives us a way of "reversing" conditional probabilities.
• Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
p(x, y, z, …) = p(x) p(y|x) p(z|x, y) p(…|x, y, z)

Independence & Conditional Independence
• Two variables are independent iff their joint factors:
p(x, y) = p(x) p(y)
• Two variables are conditionally independent given a third one if for all values of the conditioning variable, the resulting slice factors:
p(x, y|z) = p(x|z) p(y|z)  ∀z

Entropy
• Measures the amount of ambiguity or uncertainty in a distribution:
H(p) = −Σ_x p(x) log p(x)
• Expected value of −log p(x) (a function which depends on p(x)!).
• H(p) > 0 unless only one outcome is possible, in which case H(p) = 0.
• Maximal value when p is uniform.
• Tells you the expected "cost" if each event costs −log p(event).

Cross Entropy (KL Divergence)
• An asymmetric measure of the distance between two distributions:
KL[p‖q] = Σ_x p(x) [log p(x) − log q(x)]
• KL > 0 unless p = q, then KL = 0.
• Tells you the extra cost if events were generated by p(x) but instead of charging under p(x) you charged under q(x).

Slide from Sam Roweis (MLSS, 2005)
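Entropy and KL divergence as defined above, in a short sketch (natural log; distributions given as lists of probabilities):

```python
import math

def entropy(p):
    # H(p) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute 0
    return -sum(px * math.log(px) for px in p if px > 0)

def kl_divergence(p, q):
    # KL[p || q] = sum_x p(x) * (log p(x) - log q(x))
    return sum(px * (math.log(px) - math.log(qx))
               for px, qx in zip(p, q) if px > 0)
```

A fair coin attains the maximal entropy log 2, a deterministic outcome has entropy 0, KL[p‖p] = 0, and KL is not symmetric in its arguments.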
MLE AND MAP
40
MLE

41

Suppose we have data D = {x^(i)}_{i=1}^{N}

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

Maximum Likelihood Estimate (MLE):
θ_MLE = argmax_θ ∏_{i=1}^{N} p(x^(i) | θ)
MLE

What does maximizing likelihood accomplish?
• There is only a finite amount of probability mass (i.e. sum-to-one constraint)
• MLE tries to allocate as much probability mass as possible to the things we have observed…
…at the expense of the things we have not observed

42
MLE

Example: MLE of Exponential Distribution

43

• pdf of Exponential(λ): f(x) = λ e^(−λx)
• Suppose X_i ~ Exponential(λ) for 1 ≤ i ≤ N.
• Find the MLE for data D = {x^(i)}_{i=1}^{N}
• First write down the log-likelihood of the sample.
• Compute the first derivative, set it to zero, and solve for λ.
• Compute the second derivative and check that the log-likelihood is concave down at λ_MLE.
MLE

Example: MLE of Exponential Distribution

44

• First write down the log-likelihood of the sample.

ℓ(λ) = Σ_{i=1}^{N} log f(x^(i))           (1)
     = Σ_{i=1}^{N} log(λ exp(−λ x^(i)))   (2)
     = Σ_{i=1}^{N} (log(λ) − λ x^(i))     (3)
     = N log(λ) − λ Σ_{i=1}^{N} x^(i)     (4)
MLE

Example: MLE of Exponential Distribution

45

• Compute the first derivative, set it to zero, and solve for λ.

dℓ(λ)/dλ = d/dλ [ N log(λ) − λ Σ_{i=1}^{N} x^(i) ]   (1)
         = N/λ − Σ_{i=1}^{N} x^(i) = 0                (2)
⇒ λ_MLE = N / Σ_{i=1}^{N} x^(i)                       (3)
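The closed form λ_MLE = N / Σ x^(i) can be sanity-checked on synthetic data — echoing the lecture's point that synthetic data helps debug ML algorithms. The inverse-CDF sampler below is an assumption for generating that data, not part of the slides:

```python
import math
import random

def exp_log_likelihood(lam, data):
    # l(lambda) = N log(lambda) - lambda * sum(x)
    return len(data) * math.log(lam) - lam * sum(data)

random.seed(0)
true_lam = 2.0
# Inverse-CDF sampling: if U ~ Uniform(0,1), then -log(1-U)/lambda ~ Exponential(lambda)
data = [-math.log(1.0 - random.random()) / true_lam for _ in range(100000)]

lam_mle = len(data) / sum(data)   # closed-form MLE: N / sum_i x^(i)
```

On this synthetic sample, `lam_mle` lands close to the true λ = 2.0, and the log-likelihood at `lam_mle` beats nearby values of λ, consistent with the maximization.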
MLE

Example: MLE of Exponential Distribution

46

• pdf of Exponential(λ): f(x) = λ e^(−λx)
• Suppose X_i ~ Exponential(λ) for 1 ≤ i ≤ N.
• Find the MLE for data D = {x^(i)}_{i=1}^{N}
• First write down the log-likelihood of the sample.
• Compute the first derivative, set it to zero, and solve for λ.
• Compute the second derivative and check that the log-likelihood is concave down at λ_MLE.
MLE

In-Class Exercise
Show that the MLE of parameter ɸ for N samples drawn from Bernoulli(ɸ) is:

47

Steps to answer:
1. Write the log-likelihood of the sample
2. Compute the derivative w.r.t. ɸ
3. Set the derivative to zero and solve for ɸ
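A numerical check on this exercise: the derivation yields the sample mean (count of 1s over N). The brute-force grid search below is an illustrative device for confirming that, not part of the slides:

```python
import math

def bernoulli_log_likelihood(phi, samples):
    # l(phi) = k log(phi) + (N - k) log(1 - phi), where k = number of 1s
    k, n = sum(samples), len(samples)
    return k * math.log(phi) + (n - k) * math.log(1 - phi)

def bernoulli_mle(samples):
    # Closed form from setting the derivative to zero: phi_MLE = k / N
    return sum(samples) / len(samples)
```

For samples [1, 1, 0, 1] the closed form gives 0.75, and no grid point over (0, 1) achieves a higher log-likelihood.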
Learning from Data (Frequentist)

Whiteboard
– Optimization for MLE
– Examples: 1D and 2D optimization
– Example: MLE of Bernoulli
– Example: MLE of Categorical
– Aside: Method of Lagrange Multipliers

48
MLE vs. MAP

49

Suppose we have data D = {x^(i)}_{i=1}^{N}

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

Maximum Likelihood Estimate (MLE):
θ_MLE = argmax_θ ∏_{i=1}^{N} p(x^(i) | θ)

Principle of Maximum a posteriori (MAP) Estimation:
Choose the parameters that maximize the posterior of the parameters given the data.
MLE vs. MAP

50

Suppose we have data D = {x^(i)}_{i=1}^{N}

Principle of Maximum Likelihood Estimation:
Choose the parameters that maximize the likelihood of the data.

Maximum Likelihood Estimate (MLE):
θ_MLE = argmax_θ ∏_{i=1}^{N} p(x^(i) | θ)

Principle of Maximum a posteriori (MAP) Estimation:
Choose the parameters that maximize the posterior of the parameters given the data.

Maximum a posteriori (MAP) estimate:
θ_MAP = argmax_θ ∏_{i=1}^{N} p(x^(i) | θ) · p(θ), where p(θ) is the prior
Learning from Data (Bayesian)

Whiteboard
– Maximum a posteriori (MAP) estimation
– Optimization for MAP
– Example: MAP of Bernoulli—Beta

51
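The whiteboard example (MAP of a Bernoulli parameter under a Beta prior) has a well-known closed form: the mode of the Beta posterior. A sketch, valid when the posterior mode is interior (e.g. α, β > 1, or enough observed counts of each outcome):

```python
def bernoulli_map(samples, alpha, beta):
    # Beta(alpha, beta) prior on phi; the posterior is
    # Beta(alpha + k, beta + N - k), whose mode gives the MAP estimate:
    # phi_MAP = (k + alpha - 1) / (N + alpha + beta - 2)
    k, n = sum(samples), len(samples)
    return (k + alpha - 1.0) / (n + alpha + beta - 2.0)
```

With a uniform Beta(1, 1) prior the MAP estimate equals the MLE; a Beta(2, 2) prior acts like one pseudo-observation of each outcome and pulls the estimate toward 1/2 — which is why MAP avoids the all-mass-on-observed-data behavior of MLE noted earlier.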
Takeaways
• One view of what ML is trying to accomplish is function approximation
• The principle of maximum likelihood estimation provides an alternate view of learning
• Synthetic data can help debug ML algorithms
• Probability distributions can be used to model real data that occurs in the world (don't worry, we'll make our distributions more interesting soon!)
52
Learning Objectives
MLE / MAP

You should be able to…
1. Recall probability basics, including but not limited to: discrete and continuous random variables, probability mass functions, probability density functions, events vs. random variables, expectation and variance, joint probability distributions, marginal probabilities, conditional probabilities, independence, conditional independence
2. Describe common probability distributions such as the Beta, Dirichlet, Multinomial, Categorical, Gaussian, Exponential, etc.
3. State the principle of maximum likelihood estimation and explain what it tries to accomplish
4. State the principle of maximum a posteriori estimation and explain why we use it
5. Derive the MLE or MAP parameters of a simple model in closed form
53