the Beta distribution
Beta distribution
- x 2 [0, 1], parameters: ⇥ = (↵,�) 2 R+ (‘# ones’+1,‘# zeros’+1)
- pdf: p(x) / x↵�1(1� x)��1
- normalizing constant: 1
B(↵,�) =�(↵+�)�(↵)�(�)
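A quick numerical sanity check of this density using only the standard library (the parameter values are arbitrary):

```python
from math import gamma

def beta_pdf(x, a, b):
    # Normalizing constant 1/B(a, b) = Gamma(a + b) / (Gamma(a) * Gamma(b))
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)

# Sanity check: the density integrates to 1 over [0, 1] (midpoint rule)
n = 10_000
total = sum(beta_pdf((i + 0.5) / n, 3.0, 2.0) for i in range(n)) / n
print(round(total, 4))
```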
Beta-Bernoulli prior and updates
• data $D = \{X_1, X_2, \ldots, X_n\} \in \{0, 1\}^n$, containing $N_1$ ones and $N_0$ zeros
• model $M$: the $X_i$ are generated i.i.d. from a $\mathrm{Ber}(\theta)$ distribution

Beta-Bernoulli model
- prior parameters: $\Theta_0 = (\alpha, \beta) \in \mathbb{R}_+^2$ (hyperparameters)
- Beta prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$, i.e. $p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$
- likelihood: $p(D \mid \theta) = \theta^{N_1}(1-\theta)^{N_0}$

Then, via a Bayesian update, we get
- posterior: $p(\theta \mid D) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta^{N_1}(1-\theta)^{N_0}$, i.e. $\theta \mid D \sim \mathrm{Beta}(\alpha + N_1, \beta + N_0)$
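By conjugacy, the update amounts to adding the observed counts to the hyperparameters; a minimal sketch (the prior and data values are arbitrary):

```python
def beta_bernoulli_update(alpha, beta, data):
    # Posterior is Beta(alpha + N1, beta + N0)
    n1 = sum(data)          # N1: number of ones
    n0 = len(data) - n1     # N0: number of zeros
    return alpha + n1, beta + n0

print(beta_bernoulli_update(1, 1, [1, 0, 1, 1]))  # Beta(1, 1) prior -> (4, 2)
```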
the Beta distribution: getting familiar

$\mathrm{Beta}(\alpha, \beta)$ distribution
$$p(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$$

properties of $\Gamma(\alpha)$: $\Gamma(\alpha + 1) = \alpha\,\Gamma(\alpha)$, and $\Gamma(n) = (n-1)!$ for integers $n \geq 1$
the Beta distribution: mean and mode

$\mathrm{Beta}(\alpha, \beta)$ distribution
$$p(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1}$$

mean: $\mathbb{E}[x] = \dfrac{\alpha}{\alpha+\beta}$; mode (for $\alpha, \beta > 1$): $\dfrac{\alpha-1}{\alpha+\beta-2}$
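A numerical check of the mean formula, comparing $\alpha/(\alpha+\beta)$ to the integral of $x\,p(x)$ over $[0,1]$ (midpoint rule; the parameter values are arbitrary):

```python
from math import gamma

def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

a, b = 3.0, 2.0
n = 10_000
# E[x] approximated as the integral of x * p(x) on a midpoint grid
num_mean = sum(((i + 0.5) / n) * beta_pdf((i + 0.5) / n, a, b) for i in range(n)) / n
print(a / (a + b), round(num_mean, 4))
```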
Beta-Bernoulli model: what should we report?
• data $D = \{X_1, X_2, \ldots, X_n\} \in \{0, 1\}^n$, containing $N_1$ ones and $N_0$ zeros
• model $M$: the $X_i$ are generated i.i.d. from a $\mathrm{Ber}(\theta)$ distribution
• prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$; posterior: $\theta \mid D \sim \mathrm{Beta}(\alpha + N_1, \beta + N_0)$
decision theory
decision theory in a nutshell

Bayesian decision theory in learning
given a prior $F$ on $\theta$, choose an 'action' $\hat{\theta}$ to minimize the expected loss $\mathbb{E}_F[L(\theta, \hat{\theta})]$

examples
- $L_0$ loss: $L(\theta, \hat{\theta}) = \mathbf{1}\{\theta \neq \hat{\theta}\} \;\Rightarrow\; \hat{\theta}_{L_0} =$ mode of $F$
- $L_1$ loss: $L(\theta, \hat{\theta}) = \|\theta - \hat{\theta}\|_1 \;\Rightarrow\; \hat{\theta}_{L_1} =$ median of $\theta$ under $F$
- $L_2$ loss: $L(\theta, \hat{\theta}) = \|\theta - \hat{\theta}\|_2^2 \;\Rightarrow\; \hat{\theta}_{L_2} = \mathbb{E}_F[\theta]$ (the mean minimizes the *squared* $L_2$ loss)

in general 'decision-making'
given a prior $F$ on $X$, choose an 'action' $a \in \mathcal{A}$ to minimize the expected loss, i.e.
$$a^* = \operatorname*{argmin}_{a \in \mathcal{A}} \mathbb{E}_{X \sim F}[L(a, X)]$$
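These correspondences can be checked by brute force on a small discrete distribution: minimize each expected loss over a grid of actions and compare with the mean and median (the support and probabilities below are arbitrary):

```python
# Discrete toy distribution (arbitrary choice)
support = [0.0, 0.25, 0.5, 0.75, 1.0]
probs = [0.1, 0.4, 0.2, 0.2, 0.1]

def best_action(loss):
    # Brute-force argmin of the expected loss over a grid of actions
    grid = [i / 1000 for i in range(1001)]
    return min(grid, key=lambda a: sum(p * loss(a, x) for p, x in zip(probs, support)))

mean_hat = best_action(lambda a, x: (a - x) ** 2)   # squared (L2) loss -> mean
median_hat = best_action(lambda a, x: abs(a - x))   # absolute (L1) loss -> a median
print(mean_hat, median_hat)
```

Note that the $L_1$ minimizer need not be unique: any point between the 50% crossing of the CDF minimizes the expected absolute loss.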
Beta-Bernoulli model: posterior mean
• data $D = \{X_1, X_2, \ldots, X_n\} \in \{0, 1\}^n$, containing $N_1$ ones and $N_0$ zeros
• model $M$: the $X_i$ are generated i.i.d. from a $\mathrm{Ber}(\theta)$ distribution
• prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$; posterior: $\theta \mid D \sim \mathrm{Beta}(\alpha + N_1, \beta + N_0)$

posterior mean: $\mathbb{E}[\theta \mid \alpha, \beta, N_0, N_1] = \dfrac{\alpha + N_1}{\alpha + \beta + N_0 + N_1}$
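The closed form in code (the prior and counts below are arbitrary):

```python
def posterior_mean(alpha, beta, n1, n0):
    # Mean of Beta(alpha + N1, beta + N0)
    return (alpha + n1) / (alpha + beta + n1 + n0)

# With a uniform Beta(1, 1) prior and 7 ones out of 10 draws, the estimate
# shrinks the empirical frequency 0.7 toward the prior mean 0.5.
print(posterior_mean(1, 1, 7, 3))
```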
Beta-Bernoulli model: posterior mode (MAP estimation)
• data $D = \{X_1, X_2, \ldots, X_n\} \in \{0, 1\}^n$, containing $N_1$ ones and $N_0$ zeros
• model $M$: the $X_i$ are generated i.i.d. from a $\mathrm{Ber}(\theta)$ distribution
• prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$; posterior: $\theta \mid D \sim \mathrm{Beta}(\alpha + N_1, \beta + N_0)$

posterior mode: $\operatorname*{argmax}_{\theta \in [0,1]} p(\theta \mid \alpha, \beta, N_0, N_1) = \dfrac{\alpha + N_1 - 1}{\alpha + \beta + N_0 + N_1 - 2}$ (for $\alpha + N_1 > 1$ and $\beta + N_0 > 1$)
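A check of the closed-form MAP estimate against a grid argmax of the posterior density (the parameter values are arbitrary):

```python
from math import gamma

def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x ** (a - 1) * (1 - x) ** (b - 1)

alpha, beta, n1, n0 = 2.0, 2.0, 7, 3
a, b = alpha + n1, beta + n0              # posterior parameters

map_closed = (a - 1) / (a + b - 2)        # (alpha + N1 - 1) / (alpha + beta + n - 2)
grid = [i / 10_000 for i in range(1, 10_000)]
map_grid = max(grid, key=lambda t: beta_pdf(t, a, b))
print(map_closed, round(map_grid, 4))
```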
Beta-Bernoulli model: posterior prediction (marginalization)
• data $D = \{X_1, X_2, \ldots, X_n\} \in \{0, 1\}^n$, containing $N_1$ ones and $N_0$ zeros
• model $M$: the $X_i$ are generated i.i.d. from a $\mathrm{Ber}(\theta)$ distribution
• prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$; posterior: $\theta \mid D \sim \mathrm{Beta}(\alpha + N_1, \beta + N_0)$

posterior prediction: $\mathbb{P}[X = 1 \mid D] = \int_0^1 \theta\, p(\theta \mid D)\, d\theta = \mathbb{E}[\theta \mid D] = \dfrac{\alpha + N_1}{\alpha + \beta + N_0 + N_1}$
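The marginalization can be checked by Monte Carlo: draw $\theta$ from the posterior, then a new $X$ from $\mathrm{Ber}(\theta)$, and compare the hit rate with the closed form (the prior and counts below are arbitrary):

```python
from random import Random

def predictive_mc(a, b, n1, n0, trials=200_000, seed=0):
    # Sample theta ~ Beta(a + N1, b + N0), then X ~ Ber(theta)
    rng = Random(seed)
    apost, bpost = a + n1, b + n0
    hits = sum(rng.random() < rng.betavariate(apost, bpost) for _ in range(trials))
    return hits / trials

closed = (1 + 7) / (1 + 1 + 7 + 3)  # (alpha + N1) / (alpha + beta + n) = 2/3
print(closed, predictive_mc(1, 1, 7, 3))
```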
the black swan
summarizing the posterior

model $M$ + prior $p(\Theta)$ + data $D$ $\Rightarrow$ posterior $p(\Theta \mid D)$

summarizing $p(\Theta \mid D)$
• posterior mean $\hat{\theta}_{\mathrm{mean}} = \mathbb{E}[\Theta \mid D]$
• posterior mode (or MAP estimate) $\hat{\theta}_{\mathrm{MAP}} = \operatorname*{argmax}_{\Theta} p(\Theta \mid D)$
• posterior median $\hat{\theta}_{\mathrm{median}} = \min\{\theta : \mathbb{P}[\Theta \leq \theta \mid D] \geq 0.5\}$
• Bayesian credible intervals: given $\delta > 0$, we want $(\ell_\Theta, u_\Theta)$ s.t. $\mathbb{P}[\ell_\Theta \leq \Theta \leq u_\Theta \mid D] > 1 - \delta$
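A minimal sketch of an equal-tailed credible interval for a Beta posterior, via Monte Carlo with the standard library's `random.betavariate` sampler (the posterior Beta(8, 4), i.e. a uniform prior with $N_1 = 7$, $N_0 = 3$, is an arbitrary example):

```python
import random

def beta_credible_interval(a, b, delta=0.05, n=200_000, seed=0):
    # Equal-tailed interval: cut delta/2 probability from each side
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n))
    lo = draws[int(n * delta / 2)]
    hi = draws[int(n * (1 - delta / 2))]
    return lo, hi

lo, hi = beta_credible_interval(8, 4)
print(round(lo, 2), round(hi, 2))
```

An exact interval would use the inverse of the regularized incomplete beta function (e.g. a library quantile function); sampling keeps the sketch dependency-free.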
marginal likelihood (evidence)
• data $D = \{X_1, X_2, \ldots, X_n\} \in \{0, 1\}^n$, containing $N_1$ ones and $N_0$ zeros
• model $M$: the $X_i$ are generated i.i.d. from a $\mathrm{Ber}(\theta)$ distribution

marginal likelihood
$$p(D) = \frac{p(\theta)\, p(D \mid \theta)}{p(\theta \mid D)} = \frac{\text{prior} \times \text{likelihood}}{\text{posterior}}$$
(this ratio is the same for every $\theta$; for the Beta-Bernoulli model it evaluates to $p(D) = \frac{B(\alpha + N_1, \beta + N_0)}{B(\alpha, \beta)}$)
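The evidence computed both ways, as a consistency check; the closed form $B(\alpha + N_1, \beta + N_0)/B(\alpha, \beta)$ is the standard Beta-Bernoulli result, and the parameter values below are arbitrary:

```python
from math import gamma

def B(a, b):
    # Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)
    return gamma(a) * gamma(b) / gamma(a + b)

a, b, n1, n0 = 1.0, 1.0, 7, 3

# Closed form for the Beta-Bernoulli evidence
evidence = B(a + n1, b + n0) / B(a, b)

# Same number from prior * likelihood / posterior, evaluated at any theta
theta = 0.3
prior = theta ** (a - 1) * (1 - theta) ** (b - 1) / B(a, b)
likelihood = theta ** n1 * (1 - theta) ** n0
posterior = theta ** (a + n1 - 1) * (1 - theta) ** (b + n0 - 1) / B(a + n1, b + n0)
print(evidence, prior * likelihood / posterior)
```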
multiclass data
• data $D = \{X_1, X_2, \ldots, X_n\} \in \{1, 2, \ldots, K\}^n$
• for $i \in [K]$, the data $D$ contains $N_i$ copies of type $i$
• model $M$: the $X_i$ are generated i.i.d. from a $\mathrm{Multinomial}(\theta_1, \theta_2, \ldots, \theta_K)$ distribution
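A sketch of the multiclass analogue of the Beta-Bernoulli update. The transcript stops at the model statement; the Dirichlet prior assumed here is the standard conjugate choice for this model, not something stated on the slide:

```python
from collections import Counter

def dirichlet_update(alphas, data):
    # With a Dirichlet(alpha_1, ..., alpha_K) prior (assumed, standard
    # conjugate), the posterior concentration for class i is alpha_i + N_i
    counts = Counter(data)
    return {i: a + counts.get(i, 0) for i, a in alphas.items()}

# Uniform Dirichlet(1, 1, 1) prior, K = 3, arbitrary data
post = dirichlet_update({1: 1, 2: 1, 3: 1}, [1, 3, 3, 2, 3])
print(post)
```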