iid samples in supervised learning, we usually assume that data points are sampled independently and...
IID Samples
•In supervised learning, we usually assume that data points are sampled independently and from the same distribution
•IID assumption: data are independent and identically distributed
•⇒ joint PDF can be written as product of individual (marginal) PDFs: $f(x_1, \dots, x_N; \Theta) = \prod_{i=1}^{N} f(x_i; \Theta)$
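A minimal sketch of the IID factorization, assuming a univariate Gaussian as the per-point model. The function names and the sample values are illustrative and do not come from the slides. It shows that the joint likelihood is the product of the per-point PDFs, or equivalently that the joint log-likelihood is the sum of their logs.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density; stands in for the per-point model f(x; Θ)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def joint_likelihood(xs, mu, sigma):
    """Product of per-point PDFs; valid only under the IID assumption."""
    return math.prod(gaussian_pdf(x, mu, sigma) for x in xs)

def joint_log_likelihood(xs, mu, sigma):
    """Sum of log-PDFs; numerically safer than the raw product for large N."""
    return sum(math.log(gaussian_pdf(x, mu, sigma)) for x in xs)

data = [4.8, 5.1, 5.3, 4.9]  # made-up sample for illustration
```

Working with the sum of logs rather than the product avoids underflow once N grows beyond a few hundred points.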
The max likelihood recipe
•Start with IID data
•Assume model for individual data point, f(X;Θ)
•Construct joint likelihood function (PDF): $L(\Theta) = \prod_{i=1}^{N} f(x_i; \Theta)$
•Find the params Θ that maximize L
•(If you’re lucky): Differentiate L w.r.t. Θ, set = 0, and solve
•Repeat for each class
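The recipe can be sketched for the "lucky" case where the maximizer has a closed form. This assumes a univariate Gaussian model per class; the helper names, the toy data, and the inclusion of the variance estimate are illustrative additions, not part of the slides.

```python
from collections import defaultdict

def fit_gaussian_mle(xs):
    """Closed-form ML estimates (mean, variance) for a univariate Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # ML estimate divides by N, not N - 1
    return mu, var

def fit_per_class(xs, ys):
    """Repeat the recipe once per class label."""
    by_class = defaultdict(list)
    for x, y in zip(xs, ys):
        by_class[y].append(x)
    return {label: fit_gaussian_mle(values) for label, values in by_class.items()}

# Made-up 1-D training data with two classes, "a" and "b"
X = [1.0, 1.2, 0.8, 3.9, 4.1, 4.0]
Y = ["a", "a", "a", "b", "b", "b"]
params = fit_per_class(X, Y)  # {label: (mean, variance)}
```

When no closed form exists, the same structure applies with a numerical optimizer in place of `fit_gaussian_mle`.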
Exercise
•Find the maximum likelihood estimator of μ for the univariate Gaussian: $f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
•Find the maximum likelihood estimator of β for the degenerate gamma distribution:
•Hint: consider the log of the likelihood functions in both cases
Solutions
•PDF for one data point: $f(x_i; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$
•Joint likelihood of N data points: $L(\mu) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$
•Log-likelihood: $\log L(\mu) = -N \log\left(\sqrt{2\pi}\,\sigma\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \mu)^2$
•Differentiate w.r.t. μ: $\frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{N} (x_i - \mu)$; setting this to 0 gives $\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i$, the sample mean
•What about for the gamma PDF?
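A quick numerical check of the Gaussian result, namely that the sample mean maximizes the log-likelihood in μ. This is a sketch on made-up data with σ fixed at 1; the grid search is only a sanity check, not how one would estimate μ in practice.

```python
import math

def log_likelihood(xs, mu, sigma=1.0):
    """Gaussian log-likelihood of IID data as a function of mu (sigma held fixed)."""
    n = len(xs)
    return (-n * math.log(math.sqrt(2.0 * math.pi) * sigma)
            - sum((x - mu) ** 2 for x in xs) / (2.0 * sigma ** 2))

data = [2.1, 1.9, 2.4, 2.0, 1.6]   # made-up sample
mu_hat = sum(data) / len(data)      # closed-form estimate: the sample mean

# Brute-force check: scan candidate values of mu on a grid with step 0.001
grid = [i / 1000.0 for i in range(0, 4001)]
mu_grid = max(grid, key=lambda m: log_likelihood(data, m))
```

The grid maximizer and the closed-form estimate should agree to within the grid spacing.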
Putting the parts together
[Diagram, built up over four slides. Recoverable labels: [X,Y], the complete training data; assumed distribution family (hyp. space) w/ parameters Θ; parameters for class a; specific PDF for class a]
Gaussian Distributions
5 minutes of math...
•Recall your friend the Gaussian PDF: $f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
•I asserted that the d-dimensional form is: $f(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$
•Let’s look at the parts...
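The d-dimensional Gaussian density can be sketched for d = 2 using only the standard library. The function name and test values are illustrative; the 2×2 inverse and determinant are written out by hand, so this sketch does not generalize to higher d without a linear algebra library.

```python
import math

def gaussian_pdf_2d(x, mu, cov):
    """Density of a 2-D Gaussian at point x, with mean mu and 2x2 covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c                      # |Σ|
    inv = [[d / det, -b / det],              # Σ^{-1} for a 2x2 matrix
           [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]        # x - μ
    # Quadratic form (x - μ)^T Σ^{-1} (x - μ)
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    norm = 2.0 * math.pi * math.sqrt(det)    # (2π)^{d/2} |Σ|^{1/2} with d = 2
    return math.exp(-0.5 * quad) / norm

mu = [0.0, 0.0]
cov = [[1.0, 0.0], [0.0, 1.0]]               # identity covariance
peak = gaussian_pdf_2d([0.0, 0.0], mu, cov)  # density at the mean
```

With an identity covariance the 2-D density factors into a product of two univariate standard Gaussians, which makes a convenient correctness check.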
5 minutes of math...
•Ok, but what do the parts mean?
•Mean vector, $\boldsymbol{\mu}$: mean of data along each dimension
5 minutes of math...
•Covariance matrix, $\Sigma$
•Like variance, but describes spread of data
5 minutes of math...
•Note: covariances on the diagonal of $\Sigma$ are the same as the standard variances along that dimension of the data
•But what about skewed data?
5 minutes of math...
•Off-diagonal covariances ($\sigma_{ij}$, $i \neq j$) describe the pairwise variance
•How much $x_i$ changes as $x_j$ changes (on avg)
5 minutes of math...
•Calculating $\Sigma$ from data:
•In practice: you want to measure the covariance between every pair of random variables (dimensions):
•Or, in linear algebra:
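A minimal sketch of estimating the mean vector and covariance matrix from data, looping over every pair of dimensions. The normalization by N (the maximum likelihood convention) is an assumption, since the slide's formula is not preserved; the unbiased estimator divides by N - 1 instead. Function names and data are illustrative.

```python
def mean_vector(rows):
    """Mean of the data along each dimension."""
    n, d = len(rows), len(rows[0])
    return [sum(r[j] for r in rows) / n for j in range(d)]

def covariance_matrix(rows):
    """Covariance between every pair of dimensions, normalized by N (assumed)."""
    n, d = len(rows), len(rows[0])
    mu = mean_vector(rows)
    return [[sum((r[j] - mu[j]) * (r[k] - mu[k]) for r in rows) / n
             for k in range(d)]
            for j in range(d)]

# Made-up 2-D data in which the second dimension rises with the first
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
sigma = covariance_matrix(data)
```

The diagonal entries are the per-dimension variances, and the positive off-diagonal entry reflects that the two dimensions increase together.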
Bayesian Wrap-Up (probably)
5 minutes of math...
•Marginal probabilities
•If you have a joint PDF: $f(H, W)$
•... and want to know about the probability of just one RV (regardless of what happens to the others)
•Marginal PDF of H or W: $f(H) = \int f(H, W)\, dW$ or $f(W) = \int f(H, W)\, dH$
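Marginalization is easiest to see in a discrete setting, where the integral becomes a sum. The sketch below assumes a hypothetical joint PMF over two binned variables; the table values and the names `joint` and `marginal` are made up for illustration.

```python
# Hypothetical joint PMF over height H and weight W, discretized into two bins each
joint = {
    ("short", "light"): 0.30, ("short", "heavy"): 0.10,
    ("tall", "light"): 0.15,  ("tall", "heavy"): 0.45,
}

def marginal(joint, axis):
    """Sum the joint over the other variable; axis 0 keeps H, axis 1 keeps W."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

p_h = marginal(joint, 0)   # distribution of H regardless of W
p_w = marginal(joint, 1)   # distribution of W regardless of H
```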
5 minutes of math...
•Conditional probabilities
•Suppose you have a joint PDF, f(H,W)
•Now you get to see one of the values, e.g., H=“183cm”
•What’s your probability estimate of W, given this new knowledge?
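Conditioning can be sketched in the discrete case as "slice the joint at the observed value, then renormalize", which is the rule f(W | H) = f(H, W) / f(H). The joint table and names below are hypothetical.

```python
# Hypothetical joint PMF over height H and weight W, discretized into two bins each
joint = {
    ("short", "light"): 0.30, ("short", "heavy"): 0.10,
    ("tall", "light"): 0.15,  ("tall", "heavy"): 0.45,
}

def conditional_w_given_h(joint, h):
    """f(W | H = h): slice the joint at H = h, then divide by the marginal f(h)."""
    p_h = sum(p for (hh, _), p in joint.items() if hh == h)
    return {w: p / p_h for (hh, w), p in joint.items() if hh == h}

w_given_tall = conditional_w_given_h(joint, "tall")
```

Seeing H = "tall" shifts the estimate for W toward "heavy" relative to the unconditional marginal.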
5 minutes of math...
•From the conditional probability rule, it’s 2 steps to Bayes’ rule:
•(Often helps algebraically to think of “given that” operator, “|”, as a division operation)
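The two steps can be written out as follows. This is a sketch using the H and W variables from the conditional probability example; the slide's own notation is not preserved, so the symbols are an assumption.

```latex
\begin{align*}
f(W \mid H) &= \frac{f(H, W)}{f(H)}
  && \text{(conditional probability rule)} \\
f(H, W) &= f(W \mid H)\, f(H) = f(H \mid W)\, f(W)
  && \text{(step 1: write the joint both ways)} \\
f(W \mid H) &= \frac{f(H \mid W)\, f(W)}{f(H)}
  && \text{(step 2: divide by } f(H) \text{)}
\end{align*}
```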
Everything’s random...
•Basic Bayesian viewpoint:
•Treat (almost) everything as a random variable
•Data/independent var: X vector
•Class/dependent var: Y
•Parameters: Θ
•E.g., mean, variance, correlations, multinomial params, etc.
•Use Bayes’ Rule to assess probabilities of classes
•Allows us to say: “It is very unlikely that the mean height is 2 light years”
Uncertainty over params
•Maximum likelihood treats parameters as (unknown) constants
•Job is just to pick the constants so as to maximize data likelihood
•Full-blown Bayesian modeling treats params as random variables
•PDF over parameter variables tells us how certain/uncertain we are about the location of each parameter
•Also allows us to express prior beliefs (probabilities) about params
Example: Coin flipping
•Have a “weighted” coin -- want to figure out θ=Pr[heads]
•Maximum likelihood:
•Flip coin a bunch of times, measure #heads; #tails
•Use estimator to return a single value for θ
•Bayesian (MAP):
•Start w/ distribution over what θ might be
•Flip coin a bunch of times, measure #heads; #tails
•Update distribution, but never reduce to a single number
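The contrast between the two approaches can be sketched in a few lines. The Bayesian half assumes a Beta prior over θ, which is the standard conjugate choice for coin flips but is not named in the slides; the flip counts and function names are made up.

```python
def update_beta(alpha, beta, heads, tails):
    """Conjugate update: Beta(alpha, beta) prior -> Beta posterior after the flips."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution; shrinks as counts grow."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))

heads, tails = 7, 3                      # made-up flip counts
theta_ml = heads / (heads + tails)       # maximum likelihood: a single number

prior = (1.0, 1.0)                       # Beta(1, 1) is uniform on [0, 1]
posterior = update_beta(*prior, heads, tails)   # a whole distribution over θ
```

The posterior here is Beta(8, 4): it still carries a mean and a spread, whereas the ML estimate discards all information about uncertainty.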
Example: Coin flipping
[Figure sequence, one slide each: 0, 1, 5, 10, 20, 50, and 100 flips total]
How does it work?
•Think of parameters as just another kind of random variable
•Now your data distribution is $f(X \mid \Theta)$
•This is the generative distribution
•A.k.a. observation distribution, sensor model, etc.
•What we want is some model of the parameters as a function of the data
•Get there with Bayes’ rule: $f(\Theta \mid X) = \frac{f(X \mid \Theta)\, f(\Theta)}{f(X)}$
What does that mean?
•Let’s look at the parts:
•Generative distribution, $f(X \mid \Theta)$
•Describes how data is generated by the underlying process
•Usually easy to write down (well, easier than the other parts, anyway)
•Same old PDF/PMF we’ve been working with
•Can be used to “generate” new samples of data that “look like” your training data
What does that mean?
•The parameter prior or a priori distribution, $f(\Theta)$:
•Allows you to say “this value of Θ is more likely than that one is...”
•Allows you to express beliefs/assumptions/preferences about the parameters of the system
•Also takes over when the data is sparse (small N)
•In the limit of large data, prior should “wash out”, letting the data dominate the estimate of the parameter
•Can let $f(\Theta)$ be “uniform” (a.k.a., “uninformative”) to minimize its impact
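The "wash out" behavior can be illustrated with a coin-flip model. The sketch assumes a Beta prior (a conjugate choice not specified in the slides) that is deliberately opinionated, and idealized data in which exactly half the flips are heads.

```python
def posterior_mean(prior_alpha, prior_beta, heads, tails):
    """Mean of the Beta posterior for a coin's heads probability."""
    return (prior_alpha + heads) / (prior_alpha + prior_beta + heads + tails)

# A deliberately opinionated (hypothetical) prior: Beta(20, 5), with mean 0.8
prior_alpha, prior_beta = 20.0, 5.0
true_ratio = 0.5   # pretend exactly half of all flips come up heads

gaps = []          # distance between the posterior mean and the data's ratio
for n in (0, 10, 100, 10_000):
    heads = int(n * true_ratio)
    tails = n - heads
    gaps.append(abs(posterior_mean(prior_alpha, prior_beta, heads, tails) - true_ratio))
```

With no data the posterior mean is just the prior mean, and the gap shrinks steadily as N grows, so the data ends up dominating the estimate.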
What does that mean?
•The data prior, $f(X)$:
•Expresses the probability of seeing data set X independent of any particular model
•Can get it from the joint data/parameter model: $f(X) = \int f(X, \Theta)\, d\Theta = \int f(X \mid \Theta)\, f(\Theta)\, d\Theta$
•In practice, often don’t need it explicitly (why?)
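One way to see why the data prior is often unnecessary: it does not depend on the parameter, so dividing by it rescales the posterior without moving its peak. The sketch below assumes a coin-flip likelihood with a uniform prior and approximates the integral over θ with a sum on a grid; all names and counts are illustrative.

```python
def unnormalized_posterior(theta, heads, tails):
    """Likelihood times a uniform prior; the data prior f(X) is left out."""
    prior = 1.0
    return theta ** heads * (1.0 - theta) ** tails * prior

heads, tails = 7, 3                                   # made-up data
step = 0.01
grid = [i / 100.0 for i in range(1, 100)]             # candidate values of θ
scores = [unnormalized_posterior(t, heads, tails) for t in grid]

# f(X) approximated by a Riemann sum over the grid
evidence = sum(scores) * step
posterior = [s / evidence for s in scores]

best_unnormalized = grid[scores.index(max(scores))]
best_normalized = grid[posterior.index(max(posterior))]
```

Both versions pick the same θ, so a search for the most probable parameter can skip the normalizer entirely; it matters only when actual probability values are needed.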
What does that mean?
•Finally, the posterior (or a posteriori) distribution, $f(\Theta \mid X)$:
•Lit., “from what comes after” or “after the fact” (Latin)
•Essentially, “What we believe about the parameter after we look at the data”
•As compared to the “prior” or “a priori” (lit., “from what is before” or “before the fact”) parameter distribution, $f(\Theta)$
Exercise
•Suppose you want to estimate the average air speed of an unladen (African) swallow
•Let’s say that airspeeds of individual swallows, x, are Gaussianly distributed with mean μ and variance 1: $x \sim \mathcal{N}(\mu, 1)$
•Let’s say, also, that we think the mean is “around” 50 kph, but we’re not sure exactly what it is. Our uncertainty (variance) about it is 10 kph.
•Derive the posterior estimate of the mean airspeed.
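For checking a derivation numerically, here is a sketch of the standard conjugate result for a Gaussian mean with known data variance and a Gaussian prior: precisions (inverse variances) add, and the posterior mean is a precision-weighted average of the prior mean and the data. The prior values come from the exercise; the airspeed measurements and the function name are made up.

```python
def posterior_for_mean(xs, prior_mean, prior_var, noise_var):
    """Posterior (mean, variance) of μ for Gaussian data with known variance."""
    n = len(xs)
    precision = 1.0 / prior_var + n / noise_var        # precisions add
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + sum(xs) / noise_var)
    return post_mean, post_var

# Prior from the exercise: mean 50, variance 10; data variance 1.
# The airspeed measurements below are hypothetical.
speeds = [47.0, 48.5, 46.5, 48.0]
post_mean, post_var = posterior_for_mean(speeds, 50.0, 10.0, 1.0)
```

The posterior mean lands between the sample mean and the prior mean, pulled strongly toward the data because the prior is much less certain than a single measurement.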