
Page 1: Jeffrey C. Jackson

An Efficient Membership-Query Algorithm for Learning DNF with Respect to the Uniform Distribution

Jeffrey C. Jackson

Presented By: Eitan Yaakobi, Tamar Aizikowitz

Page 2: Jeffrey C. Jackson

Presentation Outline
- Introduction
- Algorithms We Use
  - Estimating Expected Values
  - Hypothesis Boosting
  - Finding Weak-approximating Parity Functions
- Learning DNF with Respect to Uniform
  - Existence of Weak Approximating Parity Functions for every f, D
  - Nonuniform Weak DNF Learning
  - Strongly Learning DNF

Page 3: Jeffrey C. Jackson

Introduction
- DNF is weakly learnable with respect to the uniform distribution, as shown by Kushilevitz and Mansour.
- We show that DNF is weakly learnable with respect to a certain class of nonuniform distributions.
- We then use a method based on Freund's boosting algorithm to produce a strong learner with respect to the uniform distribution.

Page 4: Jeffrey C. Jackson

Algorithms We Use
- Our learning algorithm makes use of several previous algorithms.
- Following is a short reminder of these algorithms.

Page 5: Jeffrey C. Jackson

Estimating Expected Values
- The AMEAN algorithm: efficiently estimates the expectation of a random variable.
- Based on Hoeffding's inequality: let X_1, ..., X_m be independent random variables such that X_i ∈ [a, b] and E[X_i] = μ. Then:
  Pr[ |(1/m)·Σ_{i=1..m} X_i − μ| ≥ λ ] ≤ 2e^(−2λ²m/(b−a)²)

Page 6: Jeffrey C. Jackson

The AMEAN Algorithm
Input:
- a random variable X ∈ [a, b]
- b − a
- λ, δ > 0
Output:
- μ' such that Pr[ |E[X] − μ'| ≤ λ ] ≥ 1 − δ
Running time:
- O( (b−a)²·log(δ⁻¹) / λ² )
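A minimal sketch of an AMEAN-style estimator, assuming a caller-supplied sampling function for X (the function and parameter names are illustrative, not from the paper):

```python
import math
import random

def amean(sample_x, a, b, lam, delta):
    """Estimate E[X] to within lam, with probability at least 1 - delta.

    sample_x() returns one independent draw of X, with X in [a, b].
    The sample size m is chosen so that Hoeffding's inequality gives
    Pr[|mean - E[X]| >= lam] <= 2*exp(-2*lam**2*m/(b-a)**2) <= delta.
    """
    m = math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * lam ** 2))
    return sum(sample_x() for _ in range(m)) / m

# Usage: estimate the mean of a uniform draw from [0, 1] to within 0.05.
print(amean(random.random, 0.0, 1.0, 0.05, 0.01))
```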

Page 7: Jeffrey C. Jackson

Hypothesis Boosting
- Our algorithm is based on boosting weak hypotheses into a final strong hypothesis.
- We use a boosting method very similar to Freund's boosting algorithm.
- We refer to Freund's original algorithm as F1.

Page 8: Jeffrey C. Jackson

The F1 Boosting Algorithm
Input:
- positive ε, δ, and γ
- a (½ − γ)-approximate PAC learner for the representation class
- EX( f, D ) for some f in the class and any distribution D
Output:
- an ε-approximation for f with respect to D, with probability at least 1 − δ
Running time:
- polynomial in n, s, γ⁻¹, ε⁻¹, and log(δ⁻¹)

Page 9: Jeffrey C. Jackson

The Idea Behind F1 (1)
- The algorithm generates a series of weak hypotheses h_i.
- h_0 is a weak approximator for f with respect to the distribution D.
- Each subsequent h_i is a weak approximator for f with respect to a distribution D_i.

Page 10: Jeffrey C. Jackson

The Idea Behind F1 (2)
- Each distribution D_i focuses weight on those areas where slightly more than half of the hypotheses already generated were incorrect.
- The final hypothesis h is a majority vote over all the h_i's.

Page 11: Jeffrey C. Jackson

The Idea Behind F1 (3)
- If a sufficient number of weak hypotheses is generated, then h will be an ε-approximator for f with respect to the distribution D.
- Freund showed that (1/2)·γ⁻²·ln(ε⁻¹) weak hypotheses suffice.
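As a quick sanity check of that bound (a worked example, not from the slides), take γ = 0.1 and ε = 0.05:

```python
import math

gamma, epsilon = 0.1, 0.05
stages = 0.5 * gamma ** -2 * math.log(1.0 / epsilon)
print(math.ceil(stages))  # about 150 weak hypotheses suffice
```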

Page 12: Jeffrey C. Jackson

Finding Weak-approximating Parity Functions
- In order to use the boosting algorithm, we need to be able to generate weak approximators for our DNF f with respect to the distributions D_i.
- Our algorithm is based on the Weak Parity (WP) algorithm of Kushilevitz and Mansour.

Page 13: Jeffrey C. Jackson

The WP Algorithm
- Finds the large Fourier coefficients of a Boolean function f on {0,1}^n using a membership oracle for f.
- f = Σ_A f̂(A)·χ_A, where f̂(A) = E[ f·χ_A ].
- Each coefficient f̂(A) represents the correlation between f and the parity χ_A.
- For each A that has the large-coefficient property, χ_A is a weak approximator for f with respect to the uniform distribution.
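To illustrate what a single coefficient means, here is a small sketch that estimates f̂(A) = E[ f·χ_A ] under the uniform distribution by sampling and querying a membership oracle (the oracle and helper names are illustrative; WP itself searches for the large coefficients far more cleverly):

```python
import random

def chi(A, x):
    """Parity chi_A(x): +1 if the bits of x indexed by A have even sum, else -1."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def estimate_coefficient(mem_f, A, n, samples=10000):
    """Estimate f_hat(A) = E[f(x) * chi_A(x)] over uniformly random x in {0,1}^n.

    mem_f(x) is a membership oracle returning f(x) in {-1, +1}.
    """
    total = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        total += mem_f(x) * chi(A, x)
    return total / samples

# Usage: if f is itself the parity on bits {0, 2}, its coefficient at A = {0, 2} is 1.
f = lambda x: chi({0, 2}, x)
print(estimate_coefficient(f, {0, 2}, 4))
```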

Page 14: Jeffrey C. Jackson

The WP' Algorithm (1)
- Our learning algorithm will need to find the large coefficients of a non-Boolean function.
- The basic WP algorithm can be extended to the WP' algorithm, which works for non-Boolean f as well.
- WP' gives us a weak approximator for a non-Boolean f with respect to the uniform distribution.

Page 15: Jeffrey C. Jackson

The WP' Algorithm (2)
Input:
- MEM( f ) for f: {0,1}^n → ℝ
- θ, δ > 0, n, L( f )
Output:
- With probability at least 1 − δ, WP' outputs a set S such that for all A:
  |f̂(A)| ≥ θ ⟹ A ∈ S, and A ∈ S ⟹ |f̂(A)| ≥ θ/2
Running time:
- polynomial in n, L( f ), θ⁻¹, and log(δ⁻¹)

Page 16: Jeffrey C. Jackson

Learning DNF with Respect to Uniform
- We now show the main result: DNF is learnable with respect to the uniform distribution.
- We begin by showing that for every DNF f and distribution D there exists a parity function that weakly approximates f with respect to D.
- We use this to produce an algorithm for weakly learning DNF with respect to certain nonuniform distributions.
- Finally, we show that this weak learner can be boosted into a strong learner with respect to the uniform distribution.

Page 17: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (1)
- For every DNF f and every distribution D there exists a parity function that weakly approximates f with respect to D.
- The more difficult case is when E_D[ f ] ≈ 0.
- If E_D[ f ] is noticeably different from 0, then the constant parity χ_∅ (≡ 1) or its negation is a weak approximator.

Page 18: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (2)
- Let f be a DNF such that E_D[ f ] ≈ 0.
- Let s be the number of terms in f.
- Let T(x) be the {−1, +1}-valued function equivalent to the term in f best correlated with f with respect to D.

Page 19: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (3)
- E_D[ f ] ≈ 0 ⟹ Pr_D[ f = 1 ] ≈ Pr_D[ f = −1 ] ≈ 1/2
- Pr_D[ T(x) = f(x) ]
  = Pr_D[ T(x) = f(x) | f(x) = 1 ]·Pr_D[ f(x) = 1 ] + Pr_D[ T(x) = f(x) | f(x) = −1 ]·Pr_D[ f(x) = −1 ]
  ≈ (1/2)·( Pr_D[ T(x) = f(x) | f(x) = 1 ] + Pr_D[ T(x) = f(x) | f(x) = −1 ] )

Page 20: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (4)
- T is a term of f ⟹ Pr_D[ T(x) = f(x) | f(x) = −1 ] = 1
- There are s terms in f and T is the one best correlated with f ⟹ Pr_D[ T(x) = f(x) | f(x) = 1 ] ≥ 1/s
- Therefore:
  Pr_D[ T(x) = f(x) ] ≈ (1/2)·( Pr_D[ T(x) = f(x) | f(x) = 1 ] + Pr_D[ T(x) = f(x) | f(x) = −1 ] ) ≥ (1/2)·( 1/s + 1 )
- Pr_D[ T(x) = f(x) ] ≥ (1/2)·(1 + 1/s) ⟹ E_D[ fT ] ≥ 1/s

Page 21: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (5)
- T can be represented using the Fourier transform. Define: χ_A ∈ T if A is a subset of the variables in T.
- Replace T with its Fourier representation in E_D[ fT ]:
  ∃ χ_A ∈ T s.t. |E_D[ f·χ_A ]| ≥ 1/(2s+1)

Page 22: Jeffrey C. Jackson

Nonuniform Weak DNF Learning (1)
- We have shown that for every DNF f and every distribution D there exists a parity function that is a weak approximator for f with respect to D.
- How can we find such a parity function?
- We want an algorithm that, when given a threshold θ and a distribution D, finds a parity χ_A such that, say: |E_D[ f·χ_A ]| ≥ θ/2

Page 23: Jeffrey C. Jackson

Nonuniform Weak DNF Learning (2)
E_D[ f·χ_A ] = Σ_x f(x)·χ_A(x)·D(x)
            = (1/2^n)·Σ_x 2^n·f(x)·D(x)·χ_A(x)
Define g(x) = 2^n·f(x)·D(x). Then:
(1/2^n)·Σ_x g(x)·χ_A(x) = E_Unif[ g·χ_A ] = ĝ(A)
Therefore ĝ(A) = E_D[ f·χ_A ].

Page 24: Jeffrey C. Jackson

Nonuniform Weak DNF Learning (3)
- We have reduced the problem of finding a well-correlated parity to finding a large Fourier coefficient of g.
- g is not Boolean, therefore we use WP'.
- Invocation: WP'( n, MEM(g), θ, L(g), δ )
- MEM(g)(x) = 2^n · MEM( f )(x) · D(x)
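A minimal sketch of how the oracle for g could be wrapped around a membership oracle for f and an evaluator for D (the names and interfaces here are illustrative assumptions):

```python
def make_mem_g(mem_f, dist_d, n):
    """Build MEM(g) for g(x) = 2^n * f(x) * D(x).

    mem_f(x) returns f(x) in {-1, +1}; dist_d(x) returns the probability D(x).
    Each query to g costs one membership query to f plus one evaluation of D.
    """
    def mem_g(x):
        return (2 ** n) * mem_f(x) * dist_d(x)
    return mem_g

# Usage: under the uniform distribution D(x) = 2^(-n), g coincides with f.
n = 4
mem_f = lambda x: 1 if sum(x) % 2 == 0 else -1   # some Boolean f
mem_g = make_mem_g(mem_f, lambda x: 2.0 ** -n, n)
print(mem_g([0, 1, 1, 0]))  # equals f([0, 1, 1, 0]) = 1 here
```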

Page 25: Jeffrey C. Jackson

The WDNF Algorithm (1)
- We define a new algorithm: Weak DNF (WDNF).
- WDNF finds the large Fourier coefficients of g(x) = 2^n·f(x)·D(x), thereby finding a parity that is well correlated with f with respect to the distribution D.
- WDNF makes use of the WP' algorithm for finding the Fourier coefficients of the non-Boolean g.

Page 26: Jeffrey C. Jackson

The WDNF Algorithm (2)
- Proof of existence: let g(x) = 2^n·f(x)·D(x). Then ∃A s.t. ĝ(A) = E_D[ f·χ_A ] ≥ 1/(2s+1).
- Invocation: WP'( n, MEM(g), 1/(2s+1), L(2^n·D), δ )
- Output, with probability 1 − δ: a parity χ_A s.t. |E_D[ f·χ_A ]| = Ω(1/s)
- Running time: polynomial in n, s, log(δ⁻¹), and L(2^n·D)

Page 27: Jeffrey C. Jackson

The WDNF Algorithm (3)
Input:
- EX( f, D )
- MEM( f )
- D
- δ > 0
Output:
- With probability at least 1 − δ: a parity function h (possibly negated) s.t. E_D[ fh ] = Ω(s⁻¹)
Running time:
- polynomial in n, s, log(δ⁻¹), and L(2^n·D)

Page 28: Jeffrey C. Jackson

The WDNF Algorithm (4)
- WDNF is polynomial in L(g) = L(2^n·D).
- If D is at most poly(n, s, ε⁻¹) / 2^n, then WDNF runs in time polynomial in the normal parameters.
- Such a D is referred to as polynomially-near uniform.
- WDNF weakly learns DNF with respect to any polynomially-near uniform distribution D.

Page 29: Jeffrey C. Jackson

Strongly Learning DNF
- We define the Harmonic Sieve algorithm (HS).
- HS is an application of the F1 boosting algorithm to the weak learner given by WDNF.
- The main difference between HS and F1 is the need to supply WDNF with an oracle for the distribution D_i at each stage of boosting.

Page 30: Jeffrey C. Jackson

The HS Algorithm (1)
Input:
- EX( f, D )
- MEM( f )
- D
- s
- ε, δ > 0
Output:
- With probability at least 1 − δ: h s.t. h is an ε-approximator of f with respect to D
Running time:
- polynomial in n, s, ε⁻¹, log(δ⁻¹), and L(2^n·D)
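A rough structural sketch of how the pieces fit together; wdnf and simulate_distribution stand for the components on the surrounding slides and are supplied by the caller (they are placeholders, not the paper's code):

```python
import math

def harmonic_sieve(wdnf, simulate_distribution, dist_d, s, epsilon, delta):
    """Skeleton of an HS-style learner: boost WDNF weak hypotheses.

    wdnf(distribution_oracle, confidence) -> weak hypothesis h: x -> {-1, +1}
    simulate_distribution(dist_d, hypotheses) -> stage-i distribution oracle D_i'
    """
    gamma = 1.0 / (2 * s + 1)                  # weak advantage supplied by WDNF
    k = math.ceil(0.5 * gamma ** -2 * math.log(1.0 / epsilon))  # boosting stages
    hypotheses = []
    for _ in range(k):
        d_i = simulate_distribution(dist_d, hypotheses)
        hypotheses.append(wdnf(d_i, delta / k))
    # Final hypothesis: majority vote over all the weak hypotheses.
    return lambda x: 1 if sum(h(x) for h in hypotheses) >= 0 else -1
```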

Page 31: Jeffrey C. Jackson

The HS Algorithm (2)
- For WDNF to work, and work efficiently, two requirements must be met:
  - An oracle for the distribution must be provided to the learner.
  - The distribution must be polynomially-near uniform.
- We show how to simulate an approximate oracle D_i' that can be provided to the weak learner instead of an exact one.
- We then show that the distributions D_i are in fact polynomially-near uniform.

Page 32: Jeffrey C. Jackson

Simulating D_i (1)
- Define:
  D_i(x) = D(x)·r_i(x) / Σ_y D(y)·r_i(y)
- To provide an exact oracle we need to compute the denominator Σ_y D(y)·r_i(y), which could potentially take an exponentially long time.
- Instead we will estimate the value of Σ_y D(y)·r_i(y) using AMEAN.

Page 33: Jeffrey C. Jackson

Simulating D_i (2)
- Define a random variable X: draw an example (x, f(x)) from EX( f, D ) and compute r_i(x). Then E[X] = Σ_y D(y)·r_i(y).
- Estimate the denominator: Ê ← AMEAN( X, b − a, λ, δ' ).
- The algorithm guarantees (2/3)·Σ_y D(y)·r_i(y) ≤ Ê ≤ 2·Σ_y D(y)·r_i(y).
- Define D_i'(x) = D(x)·r_i(x) / Ê. Then (1/2)·D_i(x) ≤ D_i'(x) ≤ (3/2)·D_i(x), i.e., D_i'(x) = c_i·D_i(x) for some c_i ∈ [1/2, 3/2].
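A small sketch of this oracle simulation, with all names and the sampling interface assumed for illustration (a fixed sample size stands in for the AMEAN call):

```python
def make_approx_di(dist_d, r_i, sample_x, samples=10000):
    """Build an approximate oracle D_i'(x) ~ D(x) * r_i(x) / (sum_y D(y) * r_i(y)).

    dist_d(x) returns D(x); r_i(x) is the stage-i weighting factor;
    sample_x() draws x from D (e.g., via EX(f, D)).
    The denominator equals E_{x~D}[r_i(x)], so it is estimated from samples
    once, and every later query to D_i' is then cheap.
    """
    denom_estimate = sum(r_i(sample_x()) for _ in range(samples)) / samples

    def d_i_prime(x):
        return dist_d(x) * r_i(x) / denom_estimate
    return d_i_prime
```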

Page 34: Jeffrey C. Jackson

Implications of Using D_i'
- Note that: g_i' = 2^n·f·D_i' = 2^n·f·c_i·D_i = c_i·g_i, hence ĝ_i'(A) = c_i·ĝ_i(A) for every A.
- Multiplying the distribution oracle by a constant is like multiplying all the coefficients of g_i by the same constant.
- The relative sizes of the coefficients stay the same.
- WDNF will still be able to find the large coefficients.
- The running time is not adversely affected.

Page 35: Jeffrey C. Jackson

Bound on Distributions D_i
- It can be shown that for each i: L(D_i) ≤ 3·L(D) / ε
- Thus D_i is bounded by a polynomial in L(D) and ε⁻¹.
- If D is polynomially-near uniform, then D_i is also polynomially-near uniform.
- HS strongly learns DNF with respect to the uniform distribution.

Page 36: Jeffrey C. Jackson

Summary
- DNF can be weakly learned with respect to polynomially-near uniform distributions using the WDNF algorithm.
- The HS algorithm strongly learns DNF with respect to the uniform distribution by boosting the WDNF weak learner.