
Page 1: Jeffrey C. Jackson

An Efficient Membership-Query Algorithm for Learning DNF with Respect to the Uniform Distribution

Jeffrey C. Jackson

Presented By: Eitan Yaakobi, Tamar Aizikowitz

Page 2: Jeffrey C. Jackson

Presentation Outline
- Introduction
- Algorithms We Use
  - Estimating Expected Values
  - Hypothesis Boosting
  - Finding Weak-approximating Parity Functions
- Learning DNF with Respect to Uniform
  - Existence of Weak Approximating Parity Functions for every f, D
  - Nonuniform Weak DNF Learning
  - Strongly Learning DNF

Page 3: Jeffrey C. Jackson

Introduction
- DNF is weakly learnable with respect to the uniform distribution, as shown by Kushilevitz and Mansour.
- We show that DNF is weakly learnable with respect to a certain class of nonuniform distributions.
- We then use a method based on Freund's boosting algorithm to produce a strong learner with respect to the uniform distribution.

Page 4: Jeffrey C. Jackson

Algorithms We Use
- Our learning algorithm makes use of several previous algorithms.
- Following is a short reminder of these algorithms.

Page 5: Jeffrey C. Jackson

Estimating Expected Values
- The AMEAN algorithm: efficiently estimates the expectation of a random variable.
- Based on Hoeffding's inequality: let X_1, ..., X_m be independent random variables such that X_i ∈ [a, b] and E[X_i] = μ. Then:
  Pr[ |(1/m)·Σ_{i=1..m} X_i − μ| ≥ λ ] ≤ 2e^(−2λ²m/(b−a)²)

Page 6: Jeffrey C. Jackson

The AMEAN Algorithm
Input:
- a random variable X ∈ [a, b]
- b − a
- λ, δ > 0
Output:
- μ' such that Pr[ |E[X] − μ'| ≤ λ ] ≥ 1 − δ
Running time:
- O( (b−a)²·log(δ⁻¹) / λ² )
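A minimal sketch of an AMEAN-style estimator, assuming a caller-supplied sampling function for X (the function and parameter names are illustrative, not from the paper):

```python
import math
import random

def amean(sample_x, a, b, lam, delta):
    """Estimate E[X] to within lam, with probability at least 1 - delta.

    sample_x() returns one independent draw of X, with X in [a, b].
    The sample size m is chosen so that Hoeffding's inequality gives
    Pr[|mean - E[X]| >= lam] <= 2*exp(-2*lam**2*m/(b-a)**2) <= delta.
    """
    m = math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * lam ** 2))
    return sum(sample_x() for _ in range(m)) / m

# Usage: estimate the mean of a uniform draw from [0, 1] to within 0.05.
print(amean(random.random, 0.0, 1.0, 0.05, 0.01))
```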

Page 7: Jeffrey C. Jackson

Hypothesis Boosting
- Our algorithm is based on boosting weak hypotheses into a final strong hypothesis.
- We use a boosting method very similar to Freund's boosting algorithm.
- We refer to Freund's original algorithm as F1.

Page 8: Jeffrey C. Jackson

The F1 Boosting Algorithm
Input:
- positive ε, δ, and γ
- a (½ − γ)-approximate PAC learner for the representation class
- EX( f, D ) for some f in the class and any distribution D
Output:
- an ε-approximation for f with respect to D, with probability at least 1 − δ
Running time:
- polynomial in n, s, γ⁻¹, ε⁻¹, and log(δ⁻¹)

Page 9: Jeffrey C. Jackson

The Idea Behind F1 (1)
- The algorithm generates a series of weak hypotheses h_i.
- h_0 is a weak approximator for f with respect to the distribution D.
- Each subsequent h_i is a weak approximator for f with respect to a distribution D_i.

Page 10: Jeffrey C. Jackson

The Idea Behind F1 (2)
- Each distribution D_i focuses weight on those areas where slightly more than half of the hypotheses already generated were incorrect.
- The final hypothesis h is a majority vote over all the h_i's.

Page 11: Jeffrey C. Jackson

The Idea Behind F1 (3)
- If a sufficient number of weak hypotheses is generated, then h will be an ε-approximator for f with respect to the distribution D.
- Freund showed that (1/2)·γ⁻²·ln(ε⁻¹) weak hypotheses suffice.
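As a quick sanity check of that bound (a worked example, not from the slides), take γ = 0.1 and ε = 0.05:

```python
import math

gamma, epsilon = 0.1, 0.05
stages = 0.5 * gamma ** -2 * math.log(1.0 / epsilon)
print(math.ceil(stages))  # about 150 weak hypotheses suffice
```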

Page 12: Jeffrey C. Jackson

Finding Weak-approximating Parity Functions
- In order to use the boosting algorithm, we need to be able to generate weak approximators for our DNF f with respect to the distributions D_i.
- Our algorithm is based on the Weak Parity (WP) algorithm of Kushilevitz and Mansour.

Page 13: Jeffrey C. Jackson

The WP Algorithm
- Finds the large Fourier coefficients of a Boolean function f on {0,1}^n using a membership oracle for f.
- f = Σ_A f̂(A)·χ_A, where f̂(A) = E[ f·χ_A ].
- Each coefficient f̂(A) represents the correlation between f and the parity χ_A.
- For each A that has the large-coefficient property, χ_A is a weak approximator for f with respect to the uniform distribution.
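To illustrate what a single coefficient means, here is a small sketch that estimates f̂(A) = E[ f·χ_A ] under the uniform distribution by sampling and querying a membership oracle (the oracle and helper names are illustrative; WP itself searches for the large coefficients far more cleverly):

```python
import random

def chi(A, x):
    """Parity chi_A(x): +1 if the bits of x indexed by A have even sum, else -1."""
    return -1 if sum(x[i] for i in A) % 2 else 1

def estimate_coefficient(mem_f, A, n, samples=10000):
    """Estimate f_hat(A) = E[f(x) * chi_A(x)] over uniformly random x in {0,1}^n.

    mem_f(x) is a membership oracle returning f(x) in {-1, +1}.
    """
    total = 0
    for _ in range(samples):
        x = [random.randint(0, 1) for _ in range(n)]
        total += mem_f(x) * chi(A, x)
    return total / samples

# Usage: if f is itself the parity on bits {0, 2}, its coefficient at A = {0, 2} is 1.
f = lambda x: chi({0, 2}, x)
print(estimate_coefficient(f, {0, 2}, 4))
```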

Page 14: Jeffrey C. Jackson

The WP' Algorithm (1)
- Our learning algorithm will need to find the large coefficients of a non-Boolean function.
- The basic WP algorithm can be extended to the WP' algorithm, which works for non-Boolean f as well.
- WP' gives us a weak approximator for a non-Boolean f with respect to the uniform distribution.

Page 15: Jeffrey C. Jackson

The WP' Algorithm (2)
Input:
- MEM( f ) for f: {0,1}^n → ℝ
- θ, δ > 0, n, L( f )
Output:
- With probability at least 1 − δ, WP' outputs a set S such that for all A:
  |f̂(A)| ≥ θ ⟹ A ∈ S, and A ∈ S ⟹ |f̂(A)| ≥ θ/2
Running time:
- polynomial in n, L( f ), θ⁻¹, and log(δ⁻¹)

Page 16: Jeffrey C. Jackson

Learning DNF with Respect to Uniform
- We now show the main result: DNF is learnable with respect to the uniform distribution.
- We begin by showing that for every DNF f and distribution D there exists a parity function that weakly approximates f with respect to D.
- We use this to produce an algorithm for weakly learning DNF with respect to certain nonuniform distributions.
- Finally, we show that this weak learner can be boosted into a strong learner with respect to the uniform distribution.

Page 17: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (1)
- For every DNF f and every distribution D there exists a parity function that weakly approximates f with respect to D.
- The more difficult case is when E_D[ f ] ≈ 0.
- If E_D[ f ] is noticeably different from 0, then the constant parity χ_∅ (≡ 1) or its negation is a weak approximator.

Page 18: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (2)
- Let f be a DNF such that E_D[ f ] ≈ 0.
- Let s be the number of terms in f.
- Let T(x) be the {−1, +1}-valued function equivalent to the term in f best correlated with f with respect to D.

Page 19: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (3)
- E_D[ f ] ≈ 0 ⟹ Pr_D[ f = 1 ] ≈ Pr_D[ f = −1 ] ≈ 1/2
- Pr_D[ T(x) = f(x) ]
  = Pr_D[ T(x) = f(x) | f(x) = 1 ]·Pr_D[ f(x) = 1 ] + Pr_D[ T(x) = f(x) | f(x) = −1 ]·Pr_D[ f(x) = −1 ]
  ≈ (1/2)·( Pr_D[ T(x) = f(x) | f(x) = 1 ] + Pr_D[ T(x) = f(x) | f(x) = −1 ] )

Page 20: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (4)
- T is a term of f ⟹ Pr_D[ T(x) = f(x) | f(x) = −1 ] = 1
- There are s terms in f and T is the one best correlated with f ⟹ Pr_D[ T(x) = f(x) | f(x) = 1 ] ≥ 1/s
- Therefore:
  Pr_D[ T(x) = f(x) ] ≈ (1/2)·( Pr_D[ T(x) = f(x) | f(x) = 1 ] + Pr_D[ T(x) = f(x) | f(x) = −1 ] ) ≥ (1/2)·( 1/s + 1 )
- Pr_D[ T(x) = f(x) ] ≥ (1/2)·(1 + 1/s) ⟹ E_D[ fT ] ≥ 1/s

Page 21: Jeffrey C. Jackson

Existence of Weak Approximating Parity Functions for every f, D (5)
- T can be represented using the Fourier transform. Define: χ_A ∈ T if A is a subset of the variables in T.
- Replace T with its Fourier representation in E_D[ fT ]:
  ∃ χ_A ∈ T s.t. |E_D[ f·χ_A ]| ≥ 1/(2s+1)

Page 22: Jeffrey C. Jackson

Nonuniform Weak DNF Learning (1)
- We have shown that for every DNF f and every distribution D there exists a parity function that is a weak approximator for f with respect to D.
- How can we find such a parity function?
- We want an algorithm that, when given a threshold θ and a distribution D, finds a parity χ_A such that, say: |E_D[ f·χ_A ]| ≥ θ/2

Page 23: Jeffrey C. Jackson

Nonuniform Weak DNF Learning (2)
E_D[ f·χ_A ] = Σ_x f(x)·χ_A(x)·D(x)
            = (1/2^n)·Σ_x 2^n·f(x)·D(x)·χ_A(x)
Define g(x) = 2^n·f(x)·D(x). Then:
(1/2^n)·Σ_x g(x)·χ_A(x) = E_Unif[ g·χ_A ] = ĝ(A)
Therefore ĝ(A) = E_D[ f·χ_A ].

Page 24: Jeffrey C. Jackson

Nonuniform Weak DNF Learning (3)
- We have reduced the problem of finding a well-correlated parity to finding a large Fourier coefficient of g.
- g is not Boolean, therefore we use WP'.
- Invocation: WP'( n, MEM(g), θ, L(g), δ )
- MEM(g)(x) = 2^n · MEM( f )(x) · D(x)
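A minimal sketch of how the oracle for g could be wrapped around a membership oracle for f and an evaluator for D (the names and interfaces here are illustrative assumptions):

```python
def make_mem_g(mem_f, dist_d, n):
    """Build MEM(g) for g(x) = 2^n * f(x) * D(x).

    mem_f(x) returns f(x) in {-1, +1}; dist_d(x) returns the probability D(x).
    Each query to g costs one membership query to f plus one evaluation of D.
    """
    def mem_g(x):
        return (2 ** n) * mem_f(x) * dist_d(x)
    return mem_g

# Usage: under the uniform distribution D(x) = 2^(-n), g coincides with f.
n = 4
mem_f = lambda x: 1 if sum(x) % 2 == 0 else -1   # some Boolean f
mem_g = make_mem_g(mem_f, lambda x: 2.0 ** -n, n)
print(mem_g([0, 1, 1, 0]))  # equals f([0, 1, 1, 0]) = 1 here
```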

Page 25: Jeffrey C. Jackson

The WDNF Algorithm (1)
- We define a new algorithm: Weak DNF (WDNF).
- WDNF finds the large Fourier coefficients of g(x) = 2^n·f(x)·D(x), thereby finding a parity that is well correlated with f with respect to the distribution D.
- WDNF makes use of the WP' algorithm for finding the Fourier coefficients of the non-Boolean g.

Page 26: Jeffrey C. Jackson

The WDNF Algorithm (2)
- Proof of existence: let g(x) = 2^n·f(x)·D(x). Then ∃A s.t. ĝ(A) = E_D[ f·χ_A ] ≥ 1/(2s+1).
- Invocation: WP'( n, MEM(g), 1/(2s+1), L(2^n·D), δ )
- Output, with probability 1 − δ: a parity χ_A s.t. |E_D[ f·χ_A ]| = Ω(1/s)
- Running time: polynomial in n, s, log(δ⁻¹), and L(2^n·D)

Page 27: Jeffrey C. Jackson

The WDNF Algorithm (3)
Input:
- EX( f, D )
- MEM( f )
- D
- δ > 0
Output:
- With probability at least 1 − δ: a parity function h (possibly negated) s.t. E_D[ fh ] = Ω(s⁻¹)
Running time:
- polynomial in n, s, log(δ⁻¹), and L(2^n·D)

Page 28: Jeffrey C. Jackson

The WDNF Algorithm (4)
- WDNF is polynomial in L(g) = L(2^n·D).
- If D is at most poly(n, s, ε⁻¹) / 2^n, then WDNF runs in time polynomial in the normal parameters.
- Such a D is referred to as polynomially-near uniform.
- WDNF weakly learns DNF with respect to any polynomially-near uniform distribution D.

Page 29: Jeffrey C. Jackson

Strongly Learning DNF
- We define the Harmonic Sieve algorithm (HS).
- HS is an application of the F1 boosting algorithm to the weak learner given by WDNF.
- The main difference between HS and F1 is the need to supply WDNF with an oracle for the distribution D_i at each stage of boosting.

Page 30: Jeffrey C. Jackson

The HS Algorithm (1)
Input:
- EX( f, D )
- MEM( f )
- D
- s
- ε, δ > 0
Output:
- With probability at least 1 − δ: h s.t. h is an ε-approximator of f with respect to D
Running time:
- polynomial in n, s, ε⁻¹, log(δ⁻¹), and L(2^n·D)
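A rough structural sketch of how the pieces fit together; wdnf and simulate_distribution stand for the components on the surrounding slides and are supplied by the caller (they are placeholders, not the paper's code):

```python
import math

def harmonic_sieve(wdnf, simulate_distribution, dist_d, s, epsilon, delta):
    """Skeleton of an HS-style learner: boost WDNF weak hypotheses.

    wdnf(distribution_oracle, confidence) -> weak hypothesis h: x -> {-1, +1}
    simulate_distribution(dist_d, hypotheses) -> stage-i distribution oracle D_i'
    """
    gamma = 1.0 / (2 * s + 1)                  # weak advantage supplied by WDNF
    k = math.ceil(0.5 * gamma ** -2 * math.log(1.0 / epsilon))  # boosting stages
    hypotheses = []
    for _ in range(k):
        d_i = simulate_distribution(dist_d, hypotheses)
        hypotheses.append(wdnf(d_i, delta / k))
    # Final hypothesis: majority vote over all the weak hypotheses.
    return lambda x: 1 if sum(h(x) for h in hypotheses) >= 0 else -1
```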

Page 31: Jeffrey C. Jackson

The HS Algorithm (2)
- For WDNF to work, and work efficiently, two requirements must be met:
  - An oracle for the distribution must be provided to the learner.
  - The distribution must be polynomially-near uniform.
- We show how to simulate an approximate oracle D_i' that can be provided to the weak learner instead of an exact one.
- We then show that the distributions D_i are in fact polynomially-near uniform.

Page 32: Jeffrey C. Jackson

Simulating D_i (1)
- Define:
  D_i(x) = D(x)·r_i(x) / Σ_y D(y)·r_i(y)
- To provide an exact oracle we need to compute the denominator Σ_y D(y)·r_i(y), which could potentially take an exponentially long time.
- Instead we will estimate the value of Σ_y D(y)·r_i(y) using AMEAN.

Page 33: Jeffrey C. Jackson

Simulating D_i (2)
- Define a random variable X: draw an example (x, f(x)) from EX( f, D ) and compute r_i(x). Then E[X] = Σ_y D(y)·r_i(y).
- Estimate the denominator: Ê ← AMEAN( X, b − a, λ, δ' ).
- The algorithm guarantees (2/3)·Σ_y D(y)·r_i(y) ≤ Ê ≤ 2·Σ_y D(y)·r_i(y).
- Define D_i'(x) = D(x)·r_i(x) / Ê. Then (1/2)·D_i(x) ≤ D_i'(x) ≤ (3/2)·D_i(x), i.e., D_i'(x) = c_i·D_i(x) for some c_i ∈ [1/2, 3/2].
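A small sketch of this oracle simulation, with all names and the sampling interface assumed for illustration (a fixed sample size stands in for the AMEAN call):

```python
def make_approx_di(dist_d, r_i, sample_x, samples=10000):
    """Build an approximate oracle D_i'(x) ~ D(x) * r_i(x) / (sum_y D(y) * r_i(y)).

    dist_d(x) returns D(x); r_i(x) is the stage-i weighting factor;
    sample_x() draws x from D (e.g., via EX(f, D)).
    The denominator equals E_{x~D}[r_i(x)], so it is estimated from samples
    once, and every later query to D_i' is then cheap.
    """
    denom_estimate = sum(r_i(sample_x()) for _ in range(samples)) / samples

    def d_i_prime(x):
        return dist_d(x) * r_i(x) / denom_estimate
    return d_i_prime
```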

Page 34: Jeffrey C. Jackson

Implications of Using D_i'
- Note that: g_i' = 2^n·f·D_i' = 2^n·f·c_i·D_i = c_i·g_i, hence ĝ_i'(A) = c_i·ĝ_i(A) for every A.
- Multiplying the distribution oracle by a constant is like multiplying all the coefficients of g_i by the same constant.
- The relative sizes of the coefficients stay the same.
- WDNF will still be able to find the large coefficients.
- The running time is not adversely affected.

Page 35: Jeffrey C. Jackson

Bound on Distributions D_i
- It can be shown that for each i: L(D_i) ≤ 3·L(D) / ε
- Thus D_i is bounded by a polynomial in L(D) and ε⁻¹.
- If D is polynomially-near uniform, then D_i is also polynomially-near uniform.
- HS strongly learns DNF with respect to the uniform distribution.

Page 36: Jeffrey C. Jackson

Summary
- DNF can be weakly learned with respect to polynomially-near uniform distributions using the WDNF algorithm.
- The HS algorithm strongly learns DNF with respect to the uniform distribution by boosting the WDNF weak learner.