TRANSCRIPT
Page 1

Transductive Rademacher Complexity and its Applications

Ran El-Yaniv and Dmitry Pechyony
Technion – Israel Institute of Technology, Haifa, Israel
24.08.2007
Page 2: Induction vs. Transduction

Inductive learning: an unknown distribution $D$ over examples $(x, y)$ generates a training set $S_m \triangleq \{(x_i, y_i)\}_{i=1}^{m}$; the learning algorithm outputs a hypothesis $h$, which is used to label unlabeled examples. Goal: minimize $\mathbb{E}_{(x,y)\sim D}\{\ell(h(x), y)\}$.

Transductive learning (Vapnik '74, '98): the training set $S_m \triangleq \{(x_i, y_i)\}_{i=1}^{m}$ and the test set $X_u \triangleq \{x_i\}_{i=m+1}^{m+u}$ are given to the learning algorithm, which outputs labels of the test set. Goal: minimize $L_u \triangleq \mathbb{E}_{(x,y)\in X_u}\{\ell(h(x), y)\}$.
Page 3: Distribution-free Model [Vapnik '74, '98]

[Figure: a cloud of points, each marked X — the full sample.]

Given: "full sample" of $m+u$ unlabeled examples, each with its true (unknown) label.
Page 4: Distribution-free Model [Vapnik '74, '98]

[Figure: the same point cloud, now split into two groups.]

Given: "full sample" of $m+u$ unlabeled examples, each with its true (unknown) label.

Full sample is partitioned: training set ($m$ points), test set ($u$ points).
Page 5: Distribution-free Model [Vapnik '74, '98]

[Figure: the partitioned point cloud, with the training points now labeled.]

Given: "full sample" of $m+u$ unlabeled examples, each with its true (unknown) label.

Full sample is partitioned: training set ($m$ points), test set ($u$ points).

Labels of the training examples are revealed.
Page 6: Distribution-free Model [Vapnik '74, '98]

[Figure: the same cloud with labeled training points marked X and unlabeled test points marked "?".]

Given: "full sample" of $m+u$ unlabeled examples, each with its true (unknown) label.

Full sample is partitioned: training set ($m$ points), test set ($u$ points).

Labels of the training points are revealed. Goal: label the test examples.
Page 7: Rademacher complexity

Induction. Hypothesis space $F$: a set of functions $f : D \to \mathbb{R}$. $X_m$ — the training points. $\sigma = \{\sigma_i\}_{i=1}^{m}$ — i.i.d. random variables, $\Pr\{\sigma_i = 1\} = \Pr\{\sigma_i = -1\} = \tfrac{1}{2}$. Rademacher complexity:
$$R_m(F) = \frac{1}{m}\, \mathbb{E}_{X_m} \mathbb{E}_{\sigma} \Big\{ \sup_{f \in F} \sum_{i=1}^{m} \sigma_i f(x_i) \Big\}.$$

Transduction (version 1). Hypothesis space $H$: a set of vectors $h$, $H \subseteq \mathbb{R}^{m+u}$. $X_{m+u}$ — the full sample with $m$ training and $u$ test points. $\sigma = \{\sigma_i\}_{i=1}^{m+u}$ — distributed as in induction. Rademacher complexity:
$$R_{m+u}(H) = \Big(\frac{1}{m} + \frac{1}{u}\Big) \cdot \mathbb{E}_{\sigma} \Big\{ \sup_{h \in H} \sum_{i=1}^{m+u} \sigma_i h_i \Big\}.$$
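For a small finite hypothesis space, the version-1 transductive complexity can be computed exactly by enumerating all $2^{m+u}$ sign vectors. A minimal sketch; the vectors in `H` are hypothetical values chosen only for illustration:

```python
import itertools

import numpy as np

m, u = 2, 2
n = m + u

# Toy finite hypothesis space: a few fixed vectors in R^{m+u}
# (hypothetical values, for illustration only).
H = np.array([
    [ 1.0, -1.0,  1.0,  1.0],
    [-1.0,  1.0,  1.0, -1.0],
    [ 0.5,  0.5, -0.5,  0.5],
])

# Version 1: sigma_i = +/-1, each with probability 1/2.
# Enumerate all 2^n sign vectors to get the exact expectation.
total = 0.0
for signs in itertools.product([-1.0, 1.0], repeat=n):
    sigma = np.array(signs)
    total += max(sigma @ h for h in H)      # sup over the finite H
R = (1.0 / m + 1.0 / u) * total / 2 ** n    # (1/m + 1/u) * E_sigma sup ...
print(f"R_(m+u)(H) = {R:.4f}")
```

For larger samples, enumeration is replaced by Monte Carlo sampling over $\sigma$, as discussed later in the talk.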
Page 8: Transductive Rademacher complexity

Version 1: $X_{m+u}$ — the full sample with $m$ training and $u$ test points. $H$ — the transductive hypothesis space. $\sigma = \{\sigma_i\}_{i=1}^{m+u}$ — i.i.d. random variables distributed by $D_1$: $\Pr\{\sigma_i = 1\} = \Pr\{\sigma_i = -1\} = \tfrac{1}{2}$. Rademacher complexity:
$$R_{m+u}(H, D_1) = \Big(\frac{1}{m} + \frac{1}{u}\Big) \cdot \mathbb{E}_{\sigma \sim D_1} \Big\{ \sup_{h \in H} \sum_{i=1}^{m+u} \sigma_i h_i \Big\}.$$

Version 2: a sparse distribution, $D_s$, of Rademacher variables:
$$\Pr\{\sigma_i = 1\} = \Pr\{\sigma_i = -1\} = \frac{mu}{(m+u)^2}, \qquad \Pr\{\sigma_i = 0\} = 1 - \frac{2mu}{(m+u)^2}.$$

Lemma 1: $R_{m+u}(H, D_s) \le R_{m+u}(H, D_1)$.

We develop risk bounds with $R_{m+u}(H) \triangleq R_{m+u}(H, D_s)$.
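The direction of Lemma 1 can be checked numerically by sampling $\sigma$ from both $D_1$ and the sparse $D_s$ over the same toy hypothesis set. A sketch, with `H` a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
m, u = 3, 2
n = m + u
p = m * u / (m + u) ** 2          # Pr{sigma_i = +1} = Pr{sigma_i = -1} under D_s

# Toy hypothesis space (hypothetical vectors in R^{m+u}, illustration only).
H = np.array([
    [ 1.0,  1.0, -1.0,  1.0, -1.0],
    [-1.0,  1.0,  1.0,  1.0,  1.0],
    [ 0.5, -0.5,  0.5,  0.5, -0.5],
])

def complexity(sigmas):
    # (1/m + 1/u) * E_sigma sup_{h in H} sigma . h, estimated over the draws
    return (1.0 / m + 1.0 / u) * (sigmas @ H.T).max(axis=1).mean()

N = 200_000
s1 = rng.choice([-1.0, 1.0], size=(N, n))                            # D_1
ss = rng.choice([-1.0, 0.0, 1.0], size=(N, n), p=[p, 1 - 2 * p, p])  # D_s

R1, Rs = complexity(s1), complexity(ss)
print(f"R(H, D_1) ~ {R1:.3f}, R(H, D_s) ~ {Rs:.3f}")
```

With enough draws, the $D_s$ estimate comes out below the $D_1$ estimate, consistent with Lemma 1.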
Page 9: Risk bound

Notation: $L_u(h)$ — the 0/1 error of $h$ on the test examples $X_u$; $L^{\gamma}_m(h)$ — the empirical $\gamma$-margin error of $h$ on the training examples $S_m$.

Theorem: For any $\delta > 0$, $\gamma > 0$, with probability at least $1 - \delta$ over the random partition of the full sample $S_{m+u}$ into $(S_m, X_u)$, for all hypotheses $h \in H$ it holds that
$$L_u(h) \le L^{\gamma}_m(h) + \frac{1}{\gamma} R_{m+u}(H) + O\Big( \sqrt{ \Big(\frac{1}{m} + \frac{1}{u}\Big) \ln \frac{1}{\delta} } \Big).$$

Proof: based on and inspired by the results of [McDiarmid '89], [Bartlett and Mendelson '02] and [Meir and Zhang '03].

Previous results: [Lanckriet et al. '04] — the case of $m = u$.
Page 10: Inductive vs. transductive hypothesis spaces

Induction: to use the risk bounds, the hypothesis space must be defined before observing the training set.

Transduction: the hypothesis space can be defined after observing $X_{m+u}$, but before observing the actual partition $(S_m, X_u)$.

Conclusion: transduction allows choosing a data-dependent hypothesis space; for example, it can be optimized to have low Rademacher complexity. This cannot be done in induction!
Page 11: Another view on transductive algorithms

Unlabeled-Labeled Decomposition (ULD): given $(S_m, X_u)$ and $X_{m+u}$, the learner computes an $(m+u) \times r$ matrix $K$ and an $r \times 1$ vector $\alpha$, and outputs $h = K\alpha$.

Example: $K$ — the inverse of the graph Laplacian; $\alpha_i = y_i$ if $x_i \in S_m$; $\alpha_i = 0$ otherwise.
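The ULD example can be sketched numerically on a small graph. The slide's $K$ is the inverse of the graph Laplacian; since the Laplacian is singular, this sketch inverts a regularized version $L + \epsilon I$ (an assumption made here, not stated on the slide):

```python
import numpy as np

# Hypothetical 5-node chain graph; W is a similarity (adjacency) matrix.
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian

# The Laplacian is singular, so invert a regularized version (assumption
# made for this sketch; the slide only says "inverse of graph Laplacian").
eps = 1e-2
K = np.linalg.inv(L + eps * np.eye(n))  # here r = m + u, so K is square

# ULD: alpha_i = y_i for training points, alpha_i = 0 for test points.
alpha = np.zeros(n)
alpha[0], alpha[4] = 1.0, -1.0          # two labeled endpoints, y = +1 and -1

h = K @ alpha                           # soft labels h = K alpha for all points
print(h)
```

The labels diffuse along the chain: points near the $+1$ endpoint get positive soft labels, points near the $-1$ endpoint get negative ones, and the middle point sits at zero by symmetry.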
Page 12: Bounding Rademacher complexity

Hypothesis space $H_A$: the set of all $h = K\alpha$ obtained by operating a transductive algorithm $A$ on all possible partitions $(S_m, X_u)$.

Notation: $T$ — the set of $\alpha$'s generated by $A$; $\mu = \sup_{\alpha \in T} \{\|\alpha\|_2\}$; $\{\omega_i\}_{i=1}^{r}$ — all singular values of $K$.

Lemma 2: $R_{m+u}(H_A) \le \mu \sqrt{ \dfrac{2}{mu} \sum_{i=1}^{r} \omega_i^2 }$.

Lemma 2 justifies the spectral transformations performed to improve the performance of transductive algorithms ([Chapelle et al. '02], [Joachims '03], [Zhang and Ando '05]).
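Given the singular values of $K$ and the coefficient vectors in $T$, the bound of Lemma 2 is straightforward to evaluate. A sketch with hypothetical stand-ins for $K$ and $T$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, u = 6, 4
n = m + u

# Hypothetical stand-ins: a kernel-like matrix K and a set T of coefficient
# vectors as some transductive algorithm A might produce them.
K = rng.standard_normal((n, n))
T = [rng.choice([-1.0, 0.0, 1.0], size=n) for _ in range(10)]

mu_T = max(np.linalg.norm(a) for a in T)      # mu = sup_{alpha in T} ||alpha||_2
omega = np.linalg.svd(K, compute_uv=False)    # all singular values of K

# Lemma 2: R_{m+u}(H_A) <= mu * sqrt( (2 / (m*u)) * sum_i omega_i^2 )
# (note: sum of squared singular values equals ||K||_F^2)
bound = mu_T * np.sqrt(2.0 / (m * u) * np.sum(omega ** 2))
print(f"upper bound on R_(m+u)(H_A): {bound:.3f}")
```

Shrinking the spectrum of $K$ (a spectral transformation) directly shrinks $\sum_i \omega_i^2$ and hence this bound, which is the point of Lemma 2.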
Page 13: Bounds for graph-based algorithms

Consistency Method [Zhou, Bousquet, Lal, Weston, Scholkopf '03]:
$$R_{m+u}(H_A) \le \sqrt{ \frac{2}{u} \sum_{i=1}^{m+u} \omega_i^2 },$$
where $\{\omega_i\}_{i=1}^{m+u}$ are the singular values of $K$.

Similar bounds hold for the algorithms of [Joachims '03], [Belkin et al. '04], etc.
Page 14: Topics not covered

- Bounding the Rademacher complexity when $K$ is a kernel matrix.
- For some algorithms: a data-dependent method of computing probabilistic upper and lower bounds on the Rademacher complexity.
- A risk bound for transductive mixtures.
Page 15: Directions for future research

- Tighten the risk bound to allow effective model selection: a bound depending on the 0/1 empirical error.
- Use variance information to obtain better convergence rates.
- Local transductive Rademacher complexity.
- Clever data-dependent choice of low-Rademacher hypothesis spaces.
Page 17: Monte Carlo estimation of transductive Rademacher complexity

Rademacher complexity:
$$R_{m+u}(H) = \Big(\frac{1}{m} + \frac{1}{u}\Big) \cdot \mathbb{E}_{\sigma} \Big\{ \sup_{h \in H} \sigma \cdot h \Big\}.$$

Draw uniformly $n$ vectors of Rademacher variables, $\sigma^{(1)}, \ldots, \sigma^{(n)}$. By the Hoeffding inequality: for any $\delta > 0$, with probability at least $1 - \delta$,
$$R_{m+u}(H) \le \Big(\frac{1}{m} + \frac{1}{u}\Big) \cdot \frac{1}{n} \sum_{i=1}^{n} \sup_{h \in H} \sigma^{(i)} \cdot h + O\Big( \sqrt{ \frac{1}{n} \ln \frac{1}{\delta} } \Big).$$

How to compute the supremum? For the Consistency Method of [Zhou et al. '03], $\sup_{h \in H} \sigma^{(i)} \cdot h$ can be computed in $O\big((m+u)^2\big)$ time.

The symmetric Hoeffding inequality yields a probabilistic lower bound on the transductive Rademacher complexity.
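The estimation procedure above can be sketched as follows; `H` is a hypothetical finite hypothesis set (on the slide it would come from an actual algorithm such as the Consistency Method), and the $O(\cdot)$ constant in the Hoeffding slack is taken as 1 purely for illustration:

```python
import math

import numpy as np

rng = np.random.default_rng(2)
m, u = 4, 4
N = m + u

# Hypothetical finite hypothesis set, a stand-in for illustration.
H = rng.choice([-1.0, 1.0], size=(6, N))

n = 5000        # number of Monte Carlo draws of sigma
delta = 0.05

sigmas = rng.choice([-1.0, 1.0], size=(n, N))
sups = (sigmas @ H.T).max(axis=1)       # sup_{h in H} sigma^(i) . h per draw

estimate = (1.0 / m + 1.0 / u) * sups.mean()
# Hoeffding slack; the O(.) constant is set to 1 for illustration only.
slack = math.sqrt(math.log(1.0 / delta) / n)
print(f"R_(m+u)(H) <= {estimate + slack:.3f} with prob. >= {1 - delta}")
```

For a finite `H` the supremum per draw is a simple max over rows; the slide's point is that for the Consistency Method the same supremum is computable in $O((m+u)^2)$ time even though $H$ is not enumerated.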
Page 18: Induction vs. Transduction: differences

| Induction | Transduction |
|---|---|
| Unknown underlying distribution. | No unknown distribution; each example has a unique label. |
| Test examples not known; will be sampled from the same distribution. | Test examples are known. |
| Generate a general hypothesis; want generalization! | Only classify the given examples; no generalization! |
| Independent training examples. | Dependent training and test examples. |
Page 19: Justification of spectral transformations

$T$ — the set of $\alpha$'s generated by $A$; $\mu = \sup_{\alpha \in T} \{\|\alpha\|_2\}$; $\{\omega_i\}_{i=1}^{r}$ — all singular values of $K$.

Lemma 2: $R_{m+u}(H_A) \le \mu \sqrt{ \dfrac{2}{mu} \sum_{i=1}^{r} \omega_i^2 }$.

Lemma 2 justifies the spectral transformations performed to improve the performance of transductive algorithms ([Chapelle et al. '02], [Joachims '03], [Zhang and Ando '05]).