
Page 1:

© Stanley Chan 2020. All Rights Reserved.

ECE595 / STAT598: Machine Learning I
Lecture 23: Probability Inequality

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering
Purdue University

Page 2:

Outline

Lecture 22 Is Learning Feasible?

Lecture 23 Probability Inequality

Lecture 24 Probably Approximately Correct

Today’s Lecture:

Basic Inequalities

Markov and Chebyshev
Interpreting the results

Advanced Inequalities

Chernoff inequality
Hoeffding inequality

Page 3:

Empirical Average

We want to take a detour to talk about probability inequalities

These inequalities will become useful when studying learning theory

Let us look at the 1D case.

You have random variables X1,X2, . . . ,XN .

Assume they are independent and identically distributed (i.i.d.).

This implies E[X1] = E[X2] = · · · = E[XN] = µ

You compute the empirical average

ν = (1/N) Σ_{n=1}^N Xn

How close is ν to µ?

Page 4:

As N grows ...

Page 5:

As N grows ...

[Figure: estimated probability versus number of trials (10² to 10⁵); the vertical axis spans 0.45 to 0.55.]
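A curve like this can be reproduced with a short simulation. Below is a minimal sketch; it assumes the experiment is a fair-coin flip (Bernoulli with mean µ = 0.5), which is an assumption on our part since the slide does not state the underlying distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5                                  # true mean (assumed fair coin)
flips = rng.binomial(1, mu, size=10**5)   # X1, ..., XN

# Empirical average nu = (1/N) * sum of the first N samples, for growing N
for N in [10**2, 10**3, 10**4, 10**5]:
    nu = flips[:N].mean()
    print(f"N = {N:>6d}   nu = {nu:.4f}   |nu - mu| = {abs(nu - mu):.4f}")
```

As N grows, the running estimate settles near µ, which is what the plotted curve shows.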

Page 6:

Interpreting the Empirical Average

ν = (1/N) Σ_{n=1}^N Xn

ν is a random variable

ν has CDF and PDF

ν has mean

E[ν] = E[(1/N) Σ_{n=1}^N Xn] = (1/N) Σ_{n=1}^N E[Xn] = (1/N) · Nµ = µ.

Note that “E[ν] = µ” is not the same as “ν = µ”.

What is the probability that ν deviates from µ?

Page 7:

Probability of Bad Event

P [|ν − µ| > ε] = ?

B = {|ν − µ| > ε}: The Bad event: ν deviates from µ by more than ε

P[B] = probability that this bad event happens.

Want P[B] small. So upper bound it by δ.

P [|ν − µ| > ε] ≤ δ.

With probability no greater than δ, the Bad event happens.

Rearrange the inequality:

P[|ν − µ| ≤ ε] ≥ 1 − δ.

With probability at least 1− δ, the Bad event will not happen.
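To make the ε–δ pairing concrete, the bad-event probability can be estimated by repeating the experiment many times. A minimal sketch, again assuming fair-coin data (an assumption) and arbitrary choices of N and ε:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, N, eps = 0.5, 1000, 0.05     # assumed mean, sample size, tolerance
trials = 5000                    # number of repeated experiments

# Draw `trials` independent datasets of size N and compute nu for each
nu = rng.binomial(1, mu, size=(trials, N)).mean(axis=1)
delta_hat = np.mean(np.abs(nu - mu) > eps)   # fraction of trials where the Bad event occurs
print(f"Estimated P[|nu - mu| > {eps}] = {delta_hat:.4f}")
```

The estimated fraction is an empirical stand-in for δ.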

Page 8:

Markov Inequality

Theorem (Markov Inequality)

For any random variable X > 0 and any ε > 0,

P[X ≥ ε] ≤ E[X] / ε.

εP[X ≥ ε] = ε ∫_ε^∞ p(x) dx
          = ∫_ε^∞ ε p(x) dx
          ≤ ∫_ε^∞ x p(x) dx
          ≤ ∫_0^∞ x p(x) dx = E[X].
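As a quick sanity check (not from the slides), the bound can be compared with a Monte Carlo estimate for a simple nonnegative random variable; the exponential distribution with mean 1 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=10**6)   # X >= 0 with E[X] = 1

for eps in [1.0, 2.0, 4.0]:
    lhs = np.mean(x >= eps)     # estimated P[X >= eps]
    rhs = x.mean() / eps        # Markov bound E[X]/eps
    print(f"eps = {eps}:  P[X >= eps] ~ {lhs:.4f}  <=  E[X]/eps ~ {rhs:.4f}")
```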

Page 9:

Chebyshev Inequality

Theorem (Chebyshev Inequality)

Let X1, . . . , XN be i.i.d. with E[Xn] = µ and Var[Xn] = σ². Define

ν = (1/N) Σ_{n=1}^N Xn.

Then,

P[|ν − µ| > ε] ≤ σ² / (Nε²)

P[|ν − µ| > ε] = P[|ν − µ|² > ε²]
              ≤ E[|ν − µ|²] / ε²     (Markov)
              = Var[ν] / ε²          (since E[(ν − µ)²] = Var[ν])
              = σ² / (Nε²).          (since Var[ν] = σ²/N)
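The same kind of check works for Chebyshev (not from the slides); here the data are standard normal, an arbitrary choice with σ = 1, and the empirical deviation probability is compared against σ²/(Nε²).

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, eps = 0.0, 1.0, 0.1
trials = 5000

for N in [100, 400, 1600]:
    nu = rng.normal(mu, sigma, size=(trials, N)).mean(axis=1)
    p_hat = np.mean(np.abs(nu - mu) > eps)   # estimated P[|nu - mu| > eps]
    bound = sigma**2 / (N * eps**2)          # Chebyshev bound
    print(f"N = {N:4d}:  P ~ {p_hat:.4f}   Chebyshev bound = {bound:.4f}")
```

The bound holds but can be quite loose, which motivates the sharper inequalities later in this lecture.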

Page 10:

How Good is Chebyshev Inequality?

Page 11:

Weak Law of Large Numbers

Theorem (WLLN)

Let X1, . . . , XN be a sequence of i.i.d. random variables with common mean µ. Let MN = (1/N) Σ_{n=1}^N Xn. Then, for any ε > 0,

lim_{N→∞} P[|MN − µ| > ε] = 0.   (1)

Remark:

The limit is outside the probability.
This means that the probability of the event |MN − µ| > ε is diminishing as N → ∞.
But a diminishing probability can still have occasions where |MN − µ| > ε.
It just means that these occasions do not happen often.

Page 12:

Strong Law of Large Numbers

Theorem (SLLN)

Let X1, . . . , XN be a sequence of i.i.d. random variables with common mean µ. Let MN = (1/N) Σ_{n=1}^N Xn. Then, for any ε > 0,

P[ lim_{N→∞} |MN − µ| > ε ] = 0.   (2)

Remark:

The limit is inside the probability.
We need to analyze the limiting object lim_{N→∞} |MN − µ|.
This object may or may not exist. It is another random variable.
The probability measures the event that this limiting object exceeds ε.
There are no “occasional” outliers.

Page 13:

Outline

Lecture 22 Is Learning Feasible?

Lecture 23 Probability Inequality

Lecture 24 Probably Approximately Correct

Today’s Lecture:

Basic Inequalities

Markov and Chebyshev
Interpreting the results

Advanced Inequalities

Chernoff inequality
Hoeffding inequality

Page 14:

Hoeffding Inequality

Let us revisit the Bad event:

P[|ν − µ| ≥ ε] = P[ν − µ ≥ ε or ν − µ ≤ −ε]
              ≤ P[ν − µ ≥ ε] + P[ν − µ ≤ −ε]     (union bound; each term will be bounded by some A)
              ≤ 2A.    (What is A? To be discussed.)

Theorem (Hoeffding Inequality)

Let X1, . . . , XN be i.i.d. random variables with 0 ≤ Xn ≤ 1. Then

P[|ν − µ| > ε] ≤ 2e^{−2ε²N}    (i.e., A = e^{−2ε²N})

Page 15:

The e-trick + Markov Inequality

Let us check one side:

P[ν − µ ≥ ε] = P[(1/N) Σ_{n=1}^N Xn − µ ≥ ε]
             = P[Σ_{n=1}^N (Xn − µ) ≥ εN]
             = P[e^{s Σ_{n=1}^N (Xn − µ)} ≥ e^{sεN}],   for any s > 0
             ≤ E[e^{s Σ_{n=1}^N (Xn − µ)}] / e^{sεN}     (Markov inequality)
             = ( E[e^{s(Xn − µ)}] / e^{sε} )^N.           (independence)

If we let Zn = Xn − µ, then

E[e^{s(Xn−µ)}] = M_{Zn}(s), the MGF of Zn.

Page 16:

Hoeffding Lemma

So now we have

P[ν − µ ≥ ε] ≤ ( E[e^{s(Xn−µ)}] / e^{sε} )^N

Lemma (Hoeffding Lemma)

If a ≤ Xn ≤ b, then

E[e^{s(Xn−µ)}] ≤ e^{s²(b−a)²/8}

This leads to

P[ν − µ ≥ ε] ≤ ( E[e^{s(Xn−µ)}] / e^{sε} )^N ≤ ( e^{s²/8} / e^{sε} )^N = e^{s²N/8 − sεN},   for all s > 0,

where the second inequality applies the lemma with a = 0 and b = 1, so that (b − a)² = 1.
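A quick numerical check of the lemma (not from the slides): for X ~ Bernoulli(p), which lies in [0, 1], the MGF of X − µ has a closed form and can be compared with e^{s²/8}; the values of p and s below are arbitrary.

```python
import numpy as np

# For X ~ Bernoulli(p):  E[e^{s(X - mu)}] = p*e^{s(1-p)} + (1-p)*e^{-s*p}
for p in [0.1, 0.5, 0.9]:
    for s in [0.5, 1.0, 2.0, 4.0]:
        mgf = p * np.exp(s * (1 - p)) + (1 - p) * np.exp(-s * p)
        bound = np.exp(s**2 / 8)              # Hoeffding lemma with a = 0, b = 1
        assert mgf <= bound + 1e-12           # the lemma guarantees this
        print(f"p={p:.1f}, s={s:.1f}:  MGF = {mgf:.4f}  <=  e^(s^2/8) = {bound:.4f}")
```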

Page 17:

Minimization

Finally, we arrive at:

P[ν − µ ≥ ε] ≤ e^{s²N/8 − sεN}.

Since this holds for all s > 0, it holds in particular at the minimizer:

P[ν − µ ≥ ε] ≤ e^{s_min²N/8 − s_min εN} = min_{s>0} { e^{s²N/8 − sεN} }

Minimizing the exponent gives d/ds { s²N/8 − sεN } = sN/4 − εN = 0, so s = 4ε. Therefore

P[ν − µ ≥ ε] ≤ e^{(4ε)²N/8 − (4ε)εN} = e^{−2ε²N}.
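The calculus step can be double-checked symbolically; this is an optional verification using sympy, not part of the lecture.

```python
import sympy as sp

s, N, eps = sp.symbols('s N epsilon', positive=True)
exponent = s**2 * N / 8 - s * eps * N          # exponent of e^{s^2 N/8 - s*eps*N}

s_min = sp.solve(sp.diff(exponent, s), s)[0]   # stationary point of the exponent
print(s_min)                                   # 4*epsilon
print(sp.simplify(exponent.subs(s, s_min)))    # -2*N*epsilon**2, giving e^{-2 eps^2 N}
```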

Page 18:

Hoeffding Inequality

Theorem (Hoeffding Inequality)

Let X1, . . . , XN be i.i.d. random variables with 0 ≤ Xn ≤ 1. Then

P[|ν − µ| > ε] ≤ 2e^{−2ε²N}

Page 19:

Compare Hoeffding and Chebyshev

Chebyshev:

P[|ν − µ| ≥ ε] ≤ σ² / (Nε²).

Hoeffding:

P[|ν − µ| ≥ ε] ≤ 2e^{−2ε²N}.

Both are in the form of

P [|ν − µ| ≥ ε] ≤ δ.

Equivalently: with probability at least 1 − δ, we have

µ − ε ≤ ν ≤ µ + ε.

Error bar / Confidence interval of ν.

δ = σ²/(Nε²)  ⇒  ε = σ/√(δN)

δ = 2e^{−2ε²N}  ⇒  ε = √( (1/(2N)) log(2/δ) )

Page 20:

Example

Chebyshev: with probability at least 1 − δ, we have

µ − σ/√(δN) ≤ ν ≤ µ + σ/√(δN).

Hoeffding: with probability at least 1 − δ, we have

µ − √( (1/(2N)) log(2/δ) ) ≤ ν ≤ µ + √( (1/(2N)) log(2/δ) ).

Example:

Alex: I have data X1, . . . , XN. I want to estimate µ. How many data points N do I need?

Bob: How much δ can you tolerate?

Alex: Alright. I only have a limited number of data points. How good is my estimate? (ε)

Bob: How many data points N do you have?

Page 21:

Example

Chebyshev: with probability at least 1 − δ, we have

µ − σ/√(δN) ≤ ν ≤ µ + σ/√(δN).

Hoeffding: with probability at least 1 − δ, we have

µ − √( (1/(2N)) log(2/δ) ) ≤ ν ≤ µ + √( (1/(2N)) log(2/δ) ).

Let δ = 0.01, N = 10000, σ = 1.

ε = σ/√(δN) = 0.1        ε = √( (1/(2N)) log(2/δ) ) ≈ 0.016

Let δ = 0.01, ε = 0.01, σ = 1.

N ≥ σ²/(ε²δ) = 1,000,000.        N ≥ log(2/δ)/(2ε²) ≈ 26,500.
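The numbers above are easy to reproduce; the short script below simply re-evaluates the four expressions from this slide.

```python
import numpy as np

sigma = 1.0

# Case 1: delta = 0.01, N = 10000 -> solve for eps
delta, N = 0.01, 10000
eps_cheb = sigma / np.sqrt(delta * N)             # Chebyshev: sigma/sqrt(delta*N)
eps_hoef = np.sqrt(np.log(2 / delta) / (2 * N))   # Hoeffding: sqrt(log(2/delta)/(2N))
print(eps_cheb, eps_hoef)                         # 0.1 and about 0.0163

# Case 2: delta = 0.01, eps = 0.01 -> solve for N
delta, eps = 0.01, 0.01
N_cheb = sigma**2 / (eps**2 * delta)              # Chebyshev: sigma^2/(eps^2 * delta)
N_hoef = np.log(2 / delta) / (2 * eps**2)         # Hoeffding: log(2/delta)/(2 eps^2)
print(N_cheb, N_hoef)                             # 1,000,000 and about 26,492
```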

Page 22:

Reading List

Abu-Mostafa, Learning from Data, Chapter 2.

Martin Wainwright, High-Dimensional Statistics, Cambridge University Press, 2019. (Chapter 2)

Cornell Note,https://www.cs.cornell.edu/~sridharan/concentration.pdf

CMU Note,http://www.stat.cmu.edu/~larry/=sml/Concentration.pdf

Stanford Note,http://cs229.stanford.edu/extra-notes/hoeffding.pdf
