

Introduction to Machine Learning
Lecture 2

Mehryar Mohri
Courant Institute and Google Research

[email protected]


Basic Probability Notions


Probabilistic Model

Sample space: $\Omega$, the set of all outcomes or elementary events possible in a trial, e.g., casting a die or tossing a coin.

Event: a subset $A \subseteq \Omega$ of the sample space. The set of all events must be closed under complementation and countable union and intersection.

Probability distribution: a mapping $\Pr$ from the set of all events to $[0, 1]$ such that $\Pr[\Omega] = 1$ and, for all mutually exclusive events,
$$\Pr[A_1 \cup \dots \cup A_n] = \sum_{i=1}^{n} \Pr[A_i].$$
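As an illustration (a minimal Python sketch, not part of the original slides), the sample space of a fair die can be represented explicitly, with the uniform distribution assigning each event the fraction of outcomes it contains:

```python
from fractions import Fraction

# Sample space for casting a fair die: the six elementary outcomes.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of the sample space) under the uniform distribution."""
    assert event <= omega, "an event must be a subset of the sample space"
    return Fraction(len(event), len(omega))

even = {2, 4, 6}   # event "the outcome is even"
high = {5, 6}      # event "the outcome is at least 5"

print(prob(omega))              # Pr[Omega] = 1
print(prob(even), prob(high))   # 1/2 and 1/3
# Additivity over mutually exclusive events: {1} and {2} are disjoint.
assert prob({1} | {2}) == prob({1}) + prob({2})
```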


Random Variables

Definition: a random variable $X$ is a function $X\colon \Omega \to \mathbb{R}$ such that for any interval $I$, the subset $\{A \in \Omega : X(A) \in I\}$ of the sample space is an event. Such a function is said to be measurable.

Example: the sum of the values obtained when casting a die.

Probability mass function of a random variable $X$: the function
$$f\colon x \mapsto f(x) = \Pr[X = x].$$

Joint probability mass function of $X$ and $Y$: the function
$$f\colon (x, y) \mapsto f(x, y) = \Pr[X = x \wedge Y = y].$$
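A short sketch of these definitions (not from the slides, and using a pair of dice rather than a single cast): the random variable maps each outcome to the sum of the two values, and its probability mass function is obtained by accumulating the probabilities of the underlying outcomes.

```python
from itertools import product
from fractions import Fraction
from collections import defaultdict

# Sample space: ordered pairs of values for two fair dice, each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))

def X(outcome):
    """Random variable: the sum of the two values (a function from the sample space to R)."""
    return outcome[0] + outcome[1]

# Probability mass function f(x) = Pr[X = x].
pmf = defaultdict(Fraction)
for outcome in omega:
    pmf[X(outcome)] += Fraction(1, 36)

print(pmf[7])             # 1/6: six of the 36 outcomes sum to 7
print(sum(pmf.values()))  # 1
```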


Conditional Probability and Independence

Conditional probability of event $A$ given $B$:
$$\Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]}, \quad \text{when } \Pr[B] \neq 0.$$

Independence: two events $A$ and $B$ are independent when
$$\Pr[A \cap B] = \Pr[A]\,\Pr[B].$$

Equivalently, when $\Pr[A \mid B] = \Pr[A]$, with $\Pr[B] \neq 0$.
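A minimal check on the two-dice space (a sketch, not from the slides; the event choices are mine): the values of the two dice define independent events, while an event about their sum is generally not independent of either.

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(omega))

def cond_prob(a, b):
    """Pr[A | B] = Pr[A ∩ B] / Pr[B], defined when Pr[B] != 0."""
    return prob(a & b) / prob(b)

A = {w for w in omega if w[0] == 6}      # first die shows 6
B = {w for w in omega if w[1] == 6}      # second die shows 6
C = {w for w in omega if sum(w) == 12}   # the sum is 12

# A and B are independent: Pr[A ∩ B] = Pr[A] Pr[B], equivalently Pr[A | B] = Pr[A].
assert prob(A & B) == prob(A) * prob(B)
assert cond_prob(A, B) == prob(A)

# A and C are not independent: knowing the sum is 12 forces the first die to be 6.
print(cond_prob(A, C), prob(A))   # 1 vs 1/6
```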


Some Probability Formulae

Sum rule:
$$\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B].$$

Union bound:
$$\Pr\Big[\bigcup_{i=1}^{n} A_i\Big] \leq \sum_{i=1}^{n} \Pr[A_i].$$

Bayes formula:
$$\Pr[X \mid Y] = \frac{\Pr[Y \mid X]\,\Pr[X]}{\Pr[Y]} \quad (\Pr[Y] \neq 0).$$
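These identities can be verified exactly on the same finite space; the sketch below (not from the slides, events chosen only for illustration) checks the sum rule, the union bound, and Bayes' formula with exact rational arithmetic.

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] % 2 == 0}   # first die even
B = {w for w in omega if sum(w) >= 10}    # sum at least 10

# Sum rule: Pr[A ∪ B] = Pr[A] + Pr[B] − Pr[A ∩ B].
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)

# Union bound: Pr[A ∪ B] ≤ Pr[A] + Pr[B].
assert prob(A | B) <= prob(A) + prob(B)

# Bayes formula: Pr[A | B] = Pr[B | A] Pr[A] / Pr[B]   (Pr[B] != 0).
pr_A_given_B = prob(A & B) / prob(B)
pr_B_given_A = prob(A & B) / prob(A)
assert pr_A_given_B == pr_B_given_A * prob(A) / prob(B)
print(pr_A_given_B)   # 2/3 for these events
```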


Some Probability Formulae

Chain rule:
$$\Pr\Big[\bigcap_{i=1}^{n} X_i\Big] = \Pr[X_1]\,\Pr[X_2 \mid X_1]\,\Pr[X_3 \mid X_1 \cap X_2] \cdots \Pr\Big[X_n \,\Big|\, \bigcap_{i=1}^{n-1} X_i\Big].$$

Theorem of total probability: assume that
$$\Omega = A_1 \cup A_2 \cup \dots \cup A_n, \quad \text{with } A_i \cap A_j = \emptyset \text{ for } i \neq j;$$
then, for any event $B$,
$$\Pr[B] = \sum_{i=1}^{n} \Pr[B \mid A_i]\,\Pr[A_i].$$
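A small exact check of the theorem of total probability (a sketch, not from the slides): partition the two-dice space by the value of the first die and recover the probability that the sum equals 7.

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(omega))

# Partition of the sample space: A_i = "the first die shows i"; disjoint, union = Omega.
partition = [{w for w in omega if w[0] == i} for i in range(1, 7)]

B = {w for w in omega if sum(w) == 7}

# Theorem of total probability: Pr[B] = sum_i Pr[B | A_i] Pr[A_i].
total = sum((prob(B & A) / prob(A)) * prob(A) for A in partition)
assert total == prob(B)
print(prob(B))   # 1/6
```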


Expectation

Definition: the expectation (or mean) of a random variable $X$ is
$$E[X] = \sum_{x} x \Pr[X = x].$$

Properties:

• linearity:
$$E[aX + bY] = a\,E[X] + b\,E[Y].$$

• if $X$ and $Y$ are independent:
$$E[XY] = E[X]\,E[Y].$$
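The sketch below (not from the slides; the constants a and b are arbitrary choices) verifies linearity of expectation and the product rule for independent variables exactly, with X and Y the values of two independent fair dice.

```python
from itertools import product
from fractions import Fraction

# Joint distribution of two independent fair dice: each pair has probability 1/36.
outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

def E(g):
    """Expectation of g(x, y) under the joint distribution."""
    return sum(p * g(x, y) for x, y in outcomes)

a, b = 2, 3
# Linearity: E[aX + bY] = a E[X] + b E[Y].
assert E(lambda x, y: a * x + b * y) == a * E(lambda x, y: x) + b * E(lambda x, y: y)

# Independence: E[XY] = E[X] E[Y].
assert E(lambda x, y: x * y) == E(lambda x, y: x) * E(lambda x, y: y)

print(E(lambda x, y: x))       # 7/2
print(E(lambda x, y: x + y))   # 7
```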


Expectation

Theorem (Markov's inequality): let $X$ be a non-negative random variable with $E[X] < \infty$; then, for all $t > 0$,
$$\Pr[X \geq t\,E[X]] \leq \frac{1}{t}.$$

Proof:
$$
\begin{aligned}
\Pr[X \geq t\,E[X]] &= \sum_{x \geq t E[X]} \Pr[X = x]
\leq \sum_{x \geq t E[X]} \Pr[X = x]\,\frac{x}{t\,E[X]} \\
&\leq \sum_{x} \Pr[X = x]\,\frac{x}{t\,E[X]}
= E\!\left[\frac{X}{t\,E[X]}\right] = \frac{1}{t}.
\end{aligned}
$$
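A quick exact check of Markov's inequality (a sketch, not from the slides; the distribution is an arbitrary choice): for a non-negative random variable, the probability of exceeding t times the mean never exceeds 1/t.

```python
from fractions import Fraction

# A non-negative random variable given by its probability mass function.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 8), 8: Fraction(1, 8)}
mean = sum(x * p for x, p in pmf.items())   # E[X] = 7/4

for t in (1, 2, 3, 4):
    # Markov's inequality: Pr[X >= t E[X]] <= 1/t.
    tail = sum(p for x, p in pmf.items() if x >= t * mean)
    assert tail <= Fraction(1, t)
    print(t, tail, Fraction(1, t))
```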


Variance

Definition: the variance of a random variable $X$ is
$$\mathrm{Var}[X] = \sigma_X^2 = E[(X - E[X])^2].$$
$\sigma_X$ is called the standard deviation of the random variable $X$.

Properties:

• $\mathrm{Var}[aX] = a^2\,\mathrm{Var}[X].$

• if $X$ and $Y$ are independent:
$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y].$$


Variance

Theorem (Chebyshev's inequality): let $X$ be a random variable with $\mathrm{Var}[X] < \infty$; then, for all $t > 0$,
$$\Pr[|X - E[X]| \geq t\,\sigma_X] \leq \frac{1}{t^2}.$$

Proof: observe that
$$\Pr[|X - E[X]| \geq t\,\sigma_X] = \Pr[(X - E[X])^2 \geq t^2 \sigma_X^2].$$
The result follows from Markov's inequality applied to the non-negative random variable $(X - E[X])^2$, whose expectation is $\sigma_X^2$.
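The sketch below (not from the slides) verifies Chebyshev's inequality exactly for the sum of two dice, comparing the true tail probability with the $1/t^2$ bound for a few values of t.

```python
from itertools import product
from fractions import Fraction
from math import sqrt

outcomes = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

values = [x + y for x, y in outcomes]            # X = sum of two dice
mean = sum(p * v for v in values)                # E[X] = 7
var = sum(p * (v - mean) ** 2 for v in values)   # Var[X] = 35/6
sigma = sqrt(var)

for t in (1.0, 1.5, 2.0, 3.0):
    # Chebyshev's inequality: Pr[|X - E[X]| >= t * sigma] <= 1 / t^2.
    tail = sum(p for v in values if abs(v - mean) >= t * sigma)
    assert tail <= 1 / t ** 2 + 1e-12
    print(t, float(tail), 1 / t ** 2)
```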


Application

Experiment: roll a pair of fair dice $n$ times. Can we give a good estimate of the sum of the values after $n$ rolls?

Mean: $7n$; variance: $\frac{35}{6}n$. Thus, by Chebyshev's inequality, the final sum will lie between
$$7n - 10\sqrt{\tfrac{35}{6}n} \quad \text{and} \quad 7n + 10\sqrt{\tfrac{35}{6}n}$$
in at least 99% of all experiments. The odds are better than 99 to 1 that the sum will be roughly between 6.976M and 7.024M after 1M rolls.
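The claim can be checked by simulation; the following sketch (not from the slides; the number of repetitions is my choice, kept small for speed) performs a few experiments of $10^6$ rolls each and counts how often the final sum leaves the interval $7n \pm 10\sqrt{\frac{35}{6}n}$.

```python
import random
from math import sqrt

n = 10**6            # rolls of a pair of dice per experiment
trials = 10          # number of repeated experiments (kept small for speed)
half_width = 10 * sqrt(35 / 6 * n)   # 10 standard deviations, about 24152 for n = 10^6

outside = 0
for _ in range(trials):
    rolls = random.choices(range(1, 7), k=2 * n)   # 2n die values = n rolls of a pair
    s = sum(rolls)
    if abs(s - 7 * n) > half_width:
        outside += 1

# Chebyshev guarantees staying inside the interval in at least 99% of experiments;
# in practice essentially every run lands well inside it.
print(f"interval: [{7 * n - half_width:.0f}, {7 * n + half_width:.0f}]")
print(f"{outside} of {trials} runs fell outside")
```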


Weak Law of Large Numbers

Theorem: let $(X_n)_{n \in \mathbb{N}}$ be a sequence of independent random variables with the same mean $\mu$ and variance $\sigma^2 < \infty$, and let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$; then, for any $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr[|\bar{X}_n - \mu| \geq \epsilon] = 0.$$

Proof: since the variables are independent,
$$\mathrm{Var}[\bar{X}_n] = \sum_{i=1}^{n} \mathrm{Var}\Big[\frac{X_i}{n}\Big] = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$
Thus, by Chebyshev's inequality,
$$\Pr[|\bar{X}_n - \mu| \geq \epsilon] \leq \frac{\sigma^2}{n\epsilon^2}.$$
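A simulation illustrating the statement (a sketch, not from the slides; ε and the number of runs are arbitrary choices): as n grows, the empirical mean of i.i.d. die rolls concentrates around µ = 3.5, and the observed deviation frequency stays below the Chebyshev bound σ²/(nε²).

```python
import random

mu, var = 3.5, 35 / 12   # mean and variance of a single fair die
eps = 0.5
runs = 1000

for n in (10, 100, 1000):
    deviations = 0
    for _ in range(runs):
        xbar = sum(random.randint(1, 6) for _ in range(n)) / n
        if abs(xbar - mu) >= eps:
            deviations += 1
    bound = var / (n * eps ** 2)   # Chebyshev bound sigma^2 / (n eps^2)
    print(n, deviations / runs, min(bound, 1.0))
```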


Hoeffding's Theorem

Theorem: let $X_1, \dots, X_m$ be independent random variables with $X_i \in [a_i, b_i]$. Then, for $\epsilon > 0$, the following inequalities hold for $S_m = \sum_{i=1}^{m} X_i$:
$$\Pr[S_m - E[S_m] \geq \epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m}(b_i - a_i)^2}$$
$$\Pr[S_m - E[S_m] \leq -\epsilon] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m}(b_i - a_i)^2}.$$

• Proof: the proof is based on Chernoff's bounding technique: for any random variable $X$ and $t > 0$, apply Markov's inequality and select $t$ to minimize
$$\Pr[X \geq \epsilon] = \Pr[e^{tX} \geq e^{t\epsilon}] \leq \frac{E[e^{tX}]}{e^{t\epsilon}}.$$


• Using this scheme and the independence of the random variables gives
$$
\begin{aligned}
\Pr[S_m - E[S_m] \geq \epsilon]
&\leq e^{-t\epsilon}\, E\big[e^{t(S_m - E[S_m])}\big] \\
&= e^{-t\epsilon} \prod_{i=1}^{m} E\big[e^{t(X_i - E[X_i])}\big] \\
&\leq e^{-t\epsilon} \prod_{i=1}^{m} e^{t^2 (b_i - a_i)^2 / 8} \qquad \text{(lemma applied to } X_i - E[X_i]) \\
&= e^{-t\epsilon}\, e^{t^2 \sum_{i=1}^{m} (b_i - a_i)^2 / 8} \\
&\leq e^{-2\epsilon^2 / \sum_{i=1}^{m} (b_i - a_i)^2},
\end{aligned}
$$
choosing $t = 4\epsilon / \sum_{i=1}^{m} (b_i - a_i)^2$.

• The second inequality is proved in a similar way.
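As a sanity check (a sketch, not from the slides; the Bernoulli parameter and ε are arbitrary choices), the empirical tail probability of a sum of m independent variables in [0, 1] can be compared with the Hoeffding bound $e^{-2\epsilon^2/m}$:

```python
import random
from math import exp

m = 100      # number of independent variables, each bounded in [0, 1]
p = 0.5      # Bernoulli parameter (an arbitrary choice for this illustration)
eps = 10.0
runs = 20000

exceed = 0
for _ in range(runs):
    s = sum(random.random() < p for _ in range(m))   # S_m, a sum of m Bernoulli variables
    if s - m * p >= eps:                             # event S_m - E[S_m] >= eps
        exceed += 1

# Hoeffding: Pr[S_m - E[S_m] >= eps] <= exp(-2 eps^2 / sum_i (b_i - a_i)^2) = exp(-2 eps^2 / m).
bound = exp(-2 * eps ** 2 / m)
print(exceed / runs, bound)
```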


Hoeffding’s Lemma

Lemma: let $X$ be a random variable with $E[X] = 0$ and $a \leq X \leq b$ with $b \neq a$. Then, for $t > 0$,
$$E[e^{tX}] \leq e^{t^2 (b - a)^2 / 8}.$$

Proof: by convexity of $x \mapsto e^{tx}$, for all $a \leq x \leq b$,
$$e^{tx} \leq \frac{b - x}{b - a}\, e^{ta} + \frac{x - a}{b - a}\, e^{tb}.$$
Thus,
$$E[e^{tX}] \leq E\Big[\frac{b - X}{b - a}\, e^{ta} + \frac{X - a}{b - a}\, e^{tb}\Big] = \frac{b}{b - a}\, e^{ta} + \frac{-a}{b - a}\, e^{tb} = e^{\phi(t)},$$
with
$$\phi(t) = \log\Big(\frac{b}{b - a}\, e^{ta} + \frac{-a}{b - a}\, e^{tb}\Big) = ta + \log\Big(\frac{b}{b - a} + \frac{-a}{b - a}\, e^{t(b - a)}\Big).$$


• Taking the derivative gives:
$$\phi'(t) = a - \frac{a\, e^{t(b-a)}}{\frac{b}{b-a} - \frac{a}{b-a}\, e^{t(b-a)}} = a - \frac{a}{\frac{b}{b-a}\, e^{-t(b-a)} - \frac{a}{b-a}}.$$

• Note that $\phi(0) = 0$ and $\phi'(0) = 0$. Furthermore,
$$
\begin{aligned}
\phi''(t) &= \frac{-ab\, e^{-t(b-a)}}{\big[\frac{b}{b-a}\, e^{-t(b-a)} - \frac{a}{b-a}\big]^2}
= \frac{\alpha(1-\alpha)\, e^{-t(b-a)}\, (b-a)^2}{\big[(1-\alpha)\, e^{-t(b-a)} + \alpha\big]^2} \\
&= \frac{\alpha}{(1-\alpha)\, e^{-t(b-a)} + \alpha} \cdot \frac{(1-\alpha)\, e^{-t(b-a)}}{(1-\alpha)\, e^{-t(b-a)} + \alpha}\, (b-a)^2
= u(1-u)(b-a)^2 \leq \frac{(b-a)^2}{4},
\end{aligned}
$$
with $\alpha = \frac{-a}{b-a}$ and $u = \frac{\alpha}{(1-\alpha)e^{-t(b-a)} + \alpha}$. There exists $0 \leq \theta \leq t$ such that:
$$\phi(t) = \phi(0) + t\,\phi'(0) + \frac{t^2}{2}\,\phi''(\theta) \leq t^2\, \frac{(b-a)^2}{8}.$$
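A numeric check of the lemma (a sketch, not from the slides; a and b are arbitrary): for a centered two-point random variable on [a, b], the moment generating function $E[e^{tX}]$ indeed stays below $e^{t^2(b-a)^2/8}$.

```python
from math import exp

# Centered two-point random variable: X = a with probability q, X = b with probability 1 - q,
# where q = b / (b - a) makes E[X] = 0, as required by the lemma.
a, b = -1.0, 3.0
q = b / (b - a)
assert abs(q * a + (1 - q) * b) < 1e-12   # E[X] = 0

for t in (0.1, 0.5, 1.0, 2.0):
    mgf = q * exp(t * a) + (1 - q) * exp(t * b)   # E[e^{tX}]
    bound = exp(t ** 2 * (b - a) ** 2 / 8)        # bound of Hoeffding's lemma
    assert mgf <= bound
    print(t, mgf, bound)
```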


Example: Tossing a Coin

Problem: estimate the bias of a coin.

H, T, T, H, T, H, H, T, H, H, H, T, T, . . . , H.

Let $h = 1_H$. Then $p = R(h)$ and $\hat{p} = \hat{R}(h)$ is the percentage of tails in the sample. Thus, with probability at least $1 - \delta$,
$$|p - \hat{p}| \leq \sqrt{\frac{\log\frac{2}{\delta}}{2m}}.$$

Thus, choosing $\delta = .02$ and $m = 1000$ implies that, with probability at least 98%,
$$|p - \hat{p}| \leq \sqrt{\log(10)/1000} \approx .048.$$
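The bound on the slide can be computed directly; the sketch below (not from the slides; the true bias 0.6 is assumed only for the simulation) reproduces the ≈ .048 radius for m = 1000 and δ = .02, and checks it on one simulated sample.

```python
import random
from math import log, sqrt

def hoeffding_radius(m, delta):
    """Radius r such that |p - p_hat| <= r holds with probability at least 1 - delta."""
    return sqrt(log(2 / delta) / (2 * m))

m, delta = 1000, 0.02
print(hoeffding_radius(m, delta))   # about .048, as on the slide

# Simulated check with an assumed true bias p = 0.6 (not from the slides).
p = 0.6
p_hat = sum(random.random() < p for _ in range(m)) / m
print(p_hat, abs(p - p_hat) <= hoeffding_radius(m, delta))
```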


McDiarmid’s Inequality

Theorem (McDiarmid, 1989): let $X_1, \dots, X_m$ be independent random variables taking values in $U$ and $f\colon U^m \to \mathbb{R}$ a function verifying, for all $i \in [1, m]$,
$$\sup_{x_1, \dots, x_m,\, x_i'} \big|f(x_1, \dots, x_i, \dots, x_m) - f(x_1, \dots, x_i', \dots, x_m)\big| \leq c.$$

Then, for all $\epsilon > 0$,
$$\Pr\big[\,|f(X_1, \dots, X_m) - E[f(X_1, \dots, X_m)]| > \epsilon\,\big] \leq 2 \exp\Big(-\frac{2\epsilon^2}{m c^2}\Big).$$
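As an illustration (a sketch, not from the slides; all parameters are arbitrary choices), take f to be the empirical mean of m variables in [0, 1]: changing one coordinate changes f by at most c = 1/m, so McDiarmid's inequality gives the bound $2\exp(-2\epsilon^2/(mc^2)) = 2\exp(-2m\epsilon^2)$, which the simulation below compares with the observed deviation frequency.

```python
import random
from math import exp

m = 200       # number of independent coordinates, each uniform in [0, 1]
eps = 0.05
runs = 10000

# f(x_1, ..., x_m) = mean of the coordinates; changing one x_i changes f by at most c = 1/m.
c = 1.0 / m
expected_f = 0.5   # E[f] for uniform [0, 1] coordinates

deviations = 0
for _ in range(runs):
    f = sum(random.random() for _ in range(m)) / m
    if abs(f - expected_f) > eps:
        deviations += 1

bound = 2 * exp(-2 * eps ** 2 / (m * c ** 2))   # McDiarmid bound 2 exp(-2 eps^2 / (m c^2))
print(deviations / runs, bound)
```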