Introduction to Machine Learning
Lecture 2

Mehryar Mohri
Courant Institute and Google Research
Basic Probability Notions
Probabilistic Model
Sample space: $\Omega$, the set of all outcomes or elementary events possible in a trial, e.g., casting a die or tossing a coin.

Event: subset $A \subseteq \Omega$ of the sample space. The set of all events must be closed under complementation and countable union and intersection.

Probability distribution: a mapping $\Pr$ from the set of all events to $[0, 1]$ such that $\Pr[\Omega] = 1$ and, for all mutually exclusive events $A_1, \ldots, A_n$,

$\Pr[A_1 \cup \cdots \cup A_n] = \sum_{i=1}^{n} \Pr[A_i].$
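To make the axioms concrete, here is a small Python sketch (not part of the original slides) that models one cast of a fair die and checks normalization and additivity for two disjoint events; the particular events are arbitrary choices.

```python
from fractions import Fraction

# Sample space for one cast of a fair die; each outcome has probability 1/6.
omega = {1, 2, 3, 4, 5, 6}
pr = {outcome: Fraction(1, 6) for outcome in omega}

def prob(event):
    """Probability of an event, i.e., a subset of the sample space."""
    return sum(pr[o] for o in event)

A = {1, 2}   # "roll at most 2"
B = {5, 6}   # "roll at least 5" -- disjoint from A
assert prob(omega) == 1                   # Pr[Omega] = 1
assert prob(A | B) == prob(A) + prob(B)   # additivity for disjoint events
print(prob(A | B))                        # 2/3
```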
Random Variables
Definition: a random variable $X$ is a function $X\colon \Omega \to \mathbb{R}$ such that, for any interval $I$, the subset $\{A : X(A) \in I\}$ of the sample space is an event. Such a function is said to be measurable.

Example: the sum of the values obtained when casting a die.

Probability mass function of random variable $X$: the function $f\colon x \mapsto f(x) = \Pr[X = x]$.

Joint probability mass function of $X$ and $Y$: the function $f\colon (x, y) \mapsto f(x, y) = \Pr[X = x \wedge Y = y]$.
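As an added illustration, the following sketch computes the probability mass function of the sum of two fair dice by exhaustive enumeration; the dictionary `pmf` and the enumeration are my own choices, not notation from the lecture.

```python
from fractions import Fraction
from itertools import product

# X = sum of the values of a pair of fair dice: a random variable mapping
# each outcome (d1, d2) of the sample space to the real number d1 + d2.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    x = d1 + d2
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, 36)

print(pmf[7])                  # 1/6, the most likely sum
assert sum(pmf.values()) == 1  # f sums to one over all values of X
```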
Conditional Probability andIndependence
Conditional probability of event $A$ given $B$:

$\Pr[A \mid B] = \dfrac{\Pr[A \cap B]}{\Pr[B]},$ when $\Pr[B] \neq 0$.

Independence: two events $A$ and $B$ are independent when

$\Pr[A \cap B] = \Pr[A] \Pr[B].$

Equivalently, when $\Pr[A \mid B] = \Pr[A]$, with $\Pr[B] \neq 0$.
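A brief numerical sketch (my own addition) checking the conditional-probability formula and the product characterization of independence on a pair of fair dice; the helper `prob` and the chosen events are illustrative only.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # pair of fair dice, 36 outcomes

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] % 2 == 0          # first die is even
B = lambda o: o[1] == 6              # second die shows 6
A_and_B = lambda o: A(o) and B(o)

# Conditional probability: Pr[A | B] = Pr[A and B] / Pr[B]  (Pr[B] != 0)
print(prob(A_and_B) / prob(B))       # 1/2
# Independence: Pr[A and B] = Pr[A] Pr[B]
assert prob(A_and_B) == prob(A) * prob(B)
```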
Some Probability Formulae
Sum rule:

$\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B].$

Union bound:

$\Pr\Big[\bigcup_{i=1}^{n} A_i\Big] \leq \sum_{i=1}^{n} \Pr[A_i].$

Bayes formula:

$\Pr[X \mid Y] = \dfrac{\Pr[Y \mid X] \Pr[X]}{\Pr[Y]} \quad (\Pr[Y] \neq 0).$
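The same exhaustive-enumeration idea verifies all three formulae; this added sketch uses arbitrary dice events, so the specific events carry no special meaning.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # pair of fair dice

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] + o[1] == 7       # sum is 7
B = lambda o: o[0] == 3              # first die shows 3
A_and_B = lambda o: A(o) and B(o)

# Sum rule: Pr[A or B] = Pr[A] + Pr[B] - Pr[A and B]
lhs = prob(lambda o: A(o) or B(o))
assert lhs == prob(A) + prob(B) - prob(A_and_B)
# Union bound: Pr[A or B] <= Pr[A] + Pr[B]
assert lhs <= prob(A) + prob(B)
# Bayes formula: Pr[A | B] = Pr[B | A] Pr[A] / Pr[B]
assert prob(A_and_B) / prob(B) == (prob(A_and_B) / prob(A)) * prob(A) / prob(B)
```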
Some Probability Formulae
Chain rule:

$\Pr\Big[\bigcap_{i=1}^{n} X_i\Big] = \Pr[X_1]\,\Pr[X_2 \mid X_1]\,\Pr[X_3 \mid X_1 \cap X_2] \cdots \Pr\Big[X_n \,\Big|\, \bigcap_{i=1}^{n-1} X_i\Big].$

Theorem of total probability: assume that $\Omega = A_1 \cup A_2 \cup \cdots \cup A_n$, with $A_i \cap A_j = \emptyset$ for $i \neq j$; then, for any event $B$,

$\Pr[B] = \sum_{i=1}^{n} \Pr[B \mid A_i] \Pr[A_i].$
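As a small added example of the theorem of total probability, partitioning on the value of the first die recovers Pr[sum = 7] = 1/6:

```python
from fractions import Fraction

# Partition A_i = "first die shows i":
# Pr[sum = 7] = sum_i Pr[sum = 7 | first die = i] * Pr[first die = i].
total = Fraction(0)
for i in range(1, 7):
    pr_Ai = Fraction(1, 6)           # Pr[first die = i]
    pr_B_given_Ai = Fraction(1, 6)   # second die must show exactly 7 - i
    total += pr_B_given_Ai * pr_Ai
print(total)   # 1/6, as expected for the sum of two fair dice being 7
```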
Expectation
Definition: the expectation (or mean) of a random variable $X$ is

$E[X] = \sum_{x} x \Pr[X = x].$

Properties:

• linearity: $E[aX + bY] = a E[X] + b E[Y]$.

• if $X$ and $Y$ are independent, $E[XY] = E[X] E[Y]$.
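An added sketch checking the definition and both properties exactly on fair dice (the names `die` and `joint` are mine):

```python
from fractions import Fraction

die = {v: Fraction(1, 6) for v in range(1, 7)}

# E[X] = sum_x x * Pr[X = x]
E_X = sum(x * p for x, p in die.items())
print(E_X)   # 7/2

# Linearity and the product rule, checked on two independent dice X and Y.
joint = {(x, y): px * py for x, px in die.items() for y, py in die.items()}
E_sum = sum((x + y) * p for (x, y), p in joint.items())
E_prod = sum(x * y * p for (x, y), p in joint.items())
assert E_sum == 2 * E_X      # E[X + Y] = E[X] + E[Y]
assert E_prod == E_X * E_X   # E[XY] = E[X] E[Y] (independence)
```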
Expectation
Theorem (Markov's inequality): let $X$ be a non-negative random variable with $E[X] < \infty$; then, for all $t > 0$,

$\Pr[X \geq t E[X]] \leq \dfrac{1}{t}.$

Proof:

$\Pr[X \geq t E[X]] = \sum_{x \geq t E[X]} \Pr[X = x] \leq \sum_{x \geq t E[X]} \Pr[X = x]\,\dfrac{x}{t E[X]} \leq \sum_{x} \Pr[X = x]\,\dfrac{x}{t E[X]} = E\Big[\dfrac{X}{t E[X]}\Big] = \dfrac{1}{t}.$
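For intuition, here is an added simulation comparing the empirical tail probability of die rolls with the 1/t bound; Markov's inequality holds exactly for the empirical distribution and its empirical mean, so the comparison always comes out as claimed.

```python
import random

random.seed(0)
# X is the value of a single fair die, so E[X] = 3.5.
samples = [random.randint(1, 6) for _ in range(100_000)]
mean = sum(samples) / len(samples)

for t in (1.2, 1.5, 1.7):
    tail = sum(1 for x in samples if x >= t * mean) / len(samples)
    print(f"t={t}: Pr[X >= t E[X]] ~ {tail:.3f} <= 1/t = {1/t:.3f}")
```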
Variance
Definition: the variance of a random variable $X$ is

$\mathrm{Var}[X] = \sigma_X^2 = E[(X - E[X])^2].$

$\sigma_X$ is called the standard deviation of the random variable $X$.

Properties:

• $\mathrm{Var}[aX] = a^2 \mathrm{Var}[X].$

• if $X$ and $Y$ are independent, $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y].$
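An added exact check of the definition and the two properties, again with fair dice:

```python
from fractions import Fraction

die = {v: Fraction(1, 6) for v in range(1, 7)}
E = lambda pmf: sum(x * p for x, p in pmf.items())
Var = lambda pmf: sum((x - E(pmf)) ** 2 * p for x, p in pmf.items())

print(Var(die))   # 35/12 for a single fair die

# Var[aX] = a^2 Var[X], illustrated with a = 3.
scaled = {3 * x: p for x, p in die.items()}
assert Var(scaled) == 9 * Var(die)

# Var[X + Y] = Var[X] + Var[Y] for independent X, Y (sum of two dice).
pair_sum = {}
for x, px in die.items():
    for y, py in die.items():
        pair_sum[x + y] = pair_sum.get(x + y, Fraction(0)) + px * py
assert Var(pair_sum) == 2 * Var(die)   # 35/6
```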
Variance
Theorem (Chebyshev's inequality): let $X$ be a random variable with $\mathrm{Var}[X] < \infty$; then, for all $t > 0$,

$\Pr[|X - E[X]| \geq t \sigma_X] \leq \dfrac{1}{t^2}.$

Proof: Observe that

$\Pr[|X - E[X]| \geq t \sigma_X] = \Pr[(X - E[X])^2 \geq t^2 \sigma_X^2].$

The result follows from Markov's inequality applied to the non-negative random variable $(X - E[X])^2$, whose expectation is $\sigma_X^2$.
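A small added check of Chebyshev's inequality against the exact tail probabilities of a single fair die:

```python
import math
from fractions import Fraction

die = {v: Fraction(1, 6) for v in range(1, 7)}
mean = sum(x * p for x, p in die.items())                # 7/2
var = sum((x - mean) ** 2 * p for x, p in die.items())   # 35/12
sigma = math.sqrt(var)

# Exact tail probability Pr[|X - E[X]| >= t * sigma] versus the 1/t^2 bound.
for t in (1.0, 1.3, 1.5):
    tail = sum(p for x, p in die.items() if abs(x - mean) >= t * sigma)
    print(f"t={t}: Pr[|X - E[X]| >= t sigma] = {float(tail):.3f} <= {1/t**2:.3f}")
```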
Application
Experiment: roll a pair of fair dice $n$ times. Can we give a good estimate of the sum of the values after $n$ rolls?

Mean: $7n$, variance: $\frac{35}{6} n$; thus, by Chebyshev's inequality, the final sum will lie between

$7n - 10\sqrt{\tfrac{35}{6} n}$ and $7n + 10\sqrt{\tfrac{35}{6} n}$

in at least 99% of all experiments. The odds are better than 99 to 1 that the sum be roughly between 6.976M and 7.024M after 1M rolls.
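The interval above can be evaluated directly; this added snippet reproduces the 6.976M and 7.024M figures for n = 10^6:

```python
import math

# Chebyshev-based interval from the slide: mean 7n, variance (35/6) n, and
# the choice t = 10 gives failure probability at most 1/t^2 = 1%.
n = 1_000_000
mean = 7 * n
half_width = 10 * math.sqrt(35 * n / 6)
print(mean - half_width, mean + half_width)   # roughly 6.976e6 and 7.024e6
```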
Weak Law of Large Numbers
Theorem: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$; then, for any $\epsilon > 0$,

$\lim_{n \to \infty} \Pr[|\overline{X}_n - \mu| \geq \epsilon] = 0.$

Proof: Since the variables are independent,

$\mathrm{Var}[\overline{X}_n] = \sum_{i=1}^{n} \mathrm{Var}\Big[\dfrac{X_i}{n}\Big] = \dfrac{n \sigma^2}{n^2} = \dfrac{\sigma^2}{n}.$

Thus, by Chebyshev's inequality,

$\Pr[|\overline{X}_n - \mu| \geq \epsilon] \leq \dfrac{\sigma^2}{n \epsilon^2}.$
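An added simulation of the statement: the empirical deviation probability of the sample mean of fair die rolls shrinks with n, consistent with the sigma^2 / (n eps^2) bound from the proof (the bound exceeds 1, hence is vacuous, for small n).

```python
import random

random.seed(0)
# Fair die rolls: mu = 3.5, sigma^2 = 35/12.
mu, var, eps, trials = 3.5, 35 / 12, 0.1, 200
for n in (100, 1_000, 10_000):
    deviations = sum(
        1 for _ in range(trials)
        if abs(sum(random.randint(1, 6) for _ in range(n)) / n - mu) >= eps
    )
    print(f"n={n}: Pr[|mean - mu| >= {eps}] ~ {deviations / trials:.3f}"
          f" <= {var / (n * eps ** 2):.3f}")
```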
Hoeffding's Theorem

Theorem: let $X_1, \ldots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$. Then, for $\epsilon > 0$, the following inequalities hold for $S_m = \sum_{i=1}^{m} X_i$:

$\Pr[S_m - E[S_m] \geq \epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2}$

$\Pr[S_m - E[S_m] \leq -\epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2}.$

• Proof: The proof is based on Chernoff's bounding technique: for any random variable $X$ and $t > 0$, apply Markov's inequality and select $t$ to minimize

$\Pr[X \geq \epsilon] = \Pr[e^{tX} \geq e^{t\epsilon}] \leq \dfrac{E[e^{tX}]}{e^{t\epsilon}}.$
• Using this scheme and the independence of the random variables gives

$\Pr[S_m - E[S_m] \geq \epsilon] \leq e^{-t\epsilon}\, E[e^{t(S_m - E[S_m])}]$
$= e^{-t\epsilon} \prod_{i=1}^{m} E[e^{t(X_i - E[X_i])}]$
$\leq e^{-t\epsilon} \prod_{i=1}^{m} e^{t^2 (b_i - a_i)^2 / 8}$  (lemma applied to $X_i - E[X_i]$)
$= e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^{m} (b_i - a_i)^2 / 8}$
$\leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2},$

choosing $t = 4\epsilon / \sum_{i=1}^{m} (b_i - a_i)^2$.

• The second inequality is proved in a similar way.
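As an added sanity check, the following simulation compares the empirical upper tail of a sum of m fair coin flips (so each X_i lies in [0, 1] and the sum of (b_i - a_i)^2 equals m) with the Hoeffding bound; m and epsilon are arbitrary values for the example.

```python
import math
import random

random.seed(0)
m, eps, trials = 100, 10, 20_000
bound = math.exp(-2 * eps**2 / m)   # e^{-2 eps^2 / sum (b_i - a_i)^2} with the sum equal to m
hits = 0
for _ in range(trials):
    s = sum(random.randint(0, 1) for _ in range(m))   # S_m, with E[S_m] = m/2
    if s - m / 2 >= eps:
        hits += 1
print(f"Pr[S_m - E[S_m] >= {eps}] ~ {hits/trials:.4f} <= {bound:.4f}")
```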
Hoeffding’s Lemma
Lemma: let $X$ be a random variable with $E[X] = 0$ and $a \leq X \leq b$ with $b \neq a$. Then, for $t > 0$,

$E[e^{tX}] \leq e^{t^2 (b - a)^2 / 8}.$

Proof: by convexity of $x \mapsto e^{tx}$, for all $a \leq x \leq b$,

$e^{tx} \leq \dfrac{b - x}{b - a} e^{ta} + \dfrac{x - a}{b - a} e^{tb}.$

Thus,

$E[e^{tX}] \leq E\Big[\dfrac{b - X}{b - a} e^{ta} + \dfrac{X - a}{b - a} e^{tb}\Big] = \dfrac{b}{b - a} e^{ta} + \dfrac{-a}{b - a} e^{tb} = e^{\phi(t)},$

with $\phi(t) = \log\Big(\dfrac{b}{b - a} e^{ta} + \dfrac{-a}{b - a} e^{tb}\Big) = ta + \log\Big(\dfrac{b}{b - a} + \dfrac{-a}{b - a} e^{t(b - a)}\Big).$
• Taking the derivative gives:

$\phi'(t) = a - \dfrac{a\, e^{t(b - a)}}{\frac{b}{b - a} - \frac{a}{b - a} e^{t(b - a)}} = a - \dfrac{a}{\frac{b}{b - a} e^{-t(b - a)} - \frac{a}{b - a}}.$

• Note that: $\phi(0) = 0$ and $\phi'(0) = 0$. Furthermore,

$\phi''(t) = \dfrac{-a b\, e^{-t(b - a)}}{\big[\frac{b}{b - a} e^{-t(b - a)} - \frac{a}{b - a}\big]^2} = \dfrac{\alpha (1 - \alpha)\, e^{-t(b - a)} (b - a)^2}{\big[(1 - \alpha) e^{-t(b - a)} + \alpha\big]^2} = \dfrac{\alpha}{(1 - \alpha) e^{-t(b - a)} + \alpha} \cdot \dfrac{(1 - \alpha) e^{-t(b - a)}}{(1 - \alpha) e^{-t(b - a)} + \alpha}\, (b - a)^2 = u(1 - u)(b - a)^2 \leq \dfrac{(b - a)^2}{4},$

with $\alpha = \dfrac{-a}{b - a}$. There exists $0 \leq \theta \leq t$ such that:

$\phi(t) = \phi(0) + t \phi'(0) + \dfrac{t^2}{2} \phi''(\theta) \leq t^2 \dfrac{(b - a)^2}{8}.$
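A quick added numerical check of the lemma on one zero-mean two-point distribution (the values and probabilities are arbitrary choices for illustration):

```python
import math

# X takes value -1 with probability 3/4 and value 3 with probability 1/4,
# so a = -1, b = 3 and E[X] = 0.
a, b = -1, 3
for t in (0.1, 0.5, 1.0):
    mgf = 0.75 * math.exp(t * a) + 0.25 * math.exp(t * b)   # E[e^{tX}]
    bound = math.exp(t**2 * (b - a)**2 / 8)                 # e^{t^2 (b-a)^2 / 8}
    assert mgf <= bound
    print(f"t={t}: E[e^(tX)] = {mgf:.4f} <= {bound:.4f}")
```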
Example: Tossing a Coin
Problem: estimate bias of a coin.
$H, T, T, H, T, H, H, T, H, H, H, T, T, \ldots, H.$

Let $h = 1_H$. Then $p = R(h)$ and $\widehat{p} = \widehat{R}(h)$ is the percentage of tails in the sample. Thus, by Hoeffding's inequality, with probability at least $1 - \delta$,

$|p - \widehat{p}| \leq \sqrt{\dfrac{\log \frac{2}{\delta}}{2m}}.$

Thus, choosing $\delta = .02$ and $m = 1000$ implies that, with probability at least 98%,

$|p - \widehat{p}| \leq \sqrt{\log(10)/1000} \approx .048.$
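The confidence width on the slide is straightforward to evaluate; this added snippet reproduces the approximately .048 figure:

```python
import math

# With probability at least 1 - delta, |p - p_hat| <= sqrt(log(2/delta) / (2m)).
delta, m = 0.02, 1000
width = math.sqrt(math.log(2 / delta) / (2 * m))
print(width)   # ~0.048, matching the slide's sqrt(log(10)/1000)
```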
McDiarmid’s Inequality
Theorem (McDiarmid, 1989): let $X_1, \ldots, X_m$ be independent random variables taking values in $U$ and $f\colon U^m \to \mathbb{R}$ a function verifying, for all $i \in [1, m]$,

$\sup_{x_1, \ldots, x_m, x_i'} |f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m)| \leq c.$

Then, for all $\epsilon > 0$,

$\Pr\big[|f(X_1, \ldots, X_m) - E[f(X_1, \ldots, X_m)]| > \epsilon\big] \leq 2 \exp\Big(\dfrac{-2\epsilon^2}{m c^2}\Big).$
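As an added illustration, taking f to be the empirical mean of m variables with values in [0, 1] gives c = 1/m, and the resulting bound can be evaluated numerically (the values of m and epsilon are arbitrary):

```python
import math

# f = empirical mean of m variables in [0, 1]: changing one coordinate changes
# f by at most c = 1/m, so the bound reads 2 exp(-2 eps^2 / (m c^2)) = 2 exp(-2 m eps^2).
m, eps = 1000, 0.05
c = 1 / m
bound = 2 * math.exp(-2 * eps**2 / (m * c**2))
print(bound)   # ~0.0135
```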