
1

Statistical Mechanics of Online Learning for Ensemble Teachers

Seiji Miyoshi (Kobe City College of Tech.), Masato Okada (Univ. of Tokyo, RIKEN BSI)

2

SUMMARY

We analyze the generalization performance of a student in a model composed of linear perceptrons: a true teacher, K teachers, and the student. Calculating the generalization error of the student analytically using statistical mechanics in the framework of on-line learning, we prove that when the learning rate satisfies η<1, the larger the number K is and the more variety the teachers have, the smaller the generalization error is. On the other hand, when η>1, the properties are completely reversed. If the variety of the K teachers is rich enough, the direction cosine between the true teacher and the student becomes unity in the limit of η→0 and K→∞.

3

BACKGROUND (1/2)

• Batch learning
  – given examples are used more than once
  – the student comes to give correct answers for all examples
  – requires a long time and a large memory

• On-line learning
  – examples once used are discarded
  – the student cannot give correct answers for all examples used in training
  – a large memory is not necessary
  – it is possible to follow a time-variant teacher

4

BACKGROUND (2/2)

PURPOSE

In most cases in an actual human society, a student can observe examples from two or more teachers who differ from each other.

To analyze the generalization performance of a model composed of a student, a true teacher, and K teachers (ensemble teachers) who exist around the true teacher

To discuss the relationship between the number and the variety of the ensemble teachers and the generalization error

5

MODEL (1/4)

[Figure: the true teacher A, the K ensemble teachers B_1, …, B_K, and the student J, with direction cosines R_B between A and each B_k, R_J between A and J, R_{B_k J} between B_k and J, and q_{kk'} between teachers B_k and B_{k'}.]

• J learns B_1, B_2, … in turn.
• J cannot learn A directly.
• A, B_1, B_2, …, J are linear perceptrons with noise.

6

MODEL (2/4)

• Output of true teacher (linear perceptron with Gaussian noise):
  y^m = A · x^m + n_A^m
• Outputs of ensemble teachers (linear perceptrons with Gaussian noises):
  v_k^m = B_k · x^m + n_{B_k}^m
• Output of student (linear perceptron with Gaussian noise):
  u^m = J^m · x^m + n_J^m

7

MODEL (3/4)

• Inputs: x^m
• Initial value of student: J^0
• True teacher: A
• Ensemble teachers: B_k
• N → ∞ (thermodynamic limit)

• Order parameters
  – Length of student: l
  – Direction cosines: R_J (between A and J), R_B (between A and B_k), q_{kk'} (between B_k and B_{k'})
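As a concrete illustration, the order parameters listed above can be measured directly for finite vectors. This is a minimal sketch; the function name and dictionary layout are illustrative, and the definitions (direction cosines and length l = |J|/√N) are the standard ones assumed from this kind of setup.

```python
import numpy as np

def order_parameters(A, B_list, J):
    """Direction cosines and student length used as order parameters.

    Definitions assumed from the standard statistical-mechanics setup:
    l = |J|/sqrt(N), and each R or q is a cosine between two vectors.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    N = J.size
    return {
        "l": float(np.linalg.norm(J) / np.sqrt(N)),              # length of student
        "R_J": cos(A, J),                                        # true teacher vs student
        "R_B": [cos(A, Bk) for Bk in B_list],                    # true teacher vs teachers
        "q": [[cos(Bk, Bl) for Bl in B_list] for Bk in B_list],  # teacher vs teacher
        "R_BJ": [cos(Bk, J) for Bk in B_list],                   # teachers vs student
    }
```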

8

MODEL (4/4)

• Gradient method on the squared error: f_k^m = η ( v_k^m − u^m )
• Update: J^{m+1} = J^m + f_k^m x^m
• The student learns the K ensemble teachers in turn.
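The learning rule above can be sketched as a direct microscopic simulation. This is a hedged illustration, not the paper's exact code: the normalizations (input components of variance 1/N so outputs are O(1)) and the way the teachers are generated around A are assumptions chosen so that, e.g., R_B ≈ 1/√2 ≈ 0.7 as in the later slides.

```python
import numpy as np

def simulate(N=1000, K=3, eta=0.3, steps=20, seed=0,
             sigma_B=0.1 ** 0.5, sigma_J=0.2 ** 0.5):
    """Online learning of a student J from K ensemble teachers B_k, in turn.

    Illustrative conventions: inputs x have components of variance 1/N
    (so ||x|| ~ 1 and all outputs are O(1)); each B_k = A + unit-scale
    noise, which gives a direction cosine R_B ~ 1/sqrt(2) ~ 0.7 with A.
    Returns the final direction cosine between A and J.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal(N)                                  # true teacher
    B = [A + rng.standard_normal(N) for _ in range(K)]          # ensemble teachers
    J = rng.standard_normal(N)                                  # student
    for m in range(steps * N):                                  # t = m/N runs to `steps`
        x = rng.standard_normal(N) / np.sqrt(N)
        k = m % K                                               # learn the K teachers in turn
        v = B[k] @ x + sigma_B * rng.standard_normal()          # teacher output
        u = J @ x + sigma_J * rng.standard_normal()             # student output
        J += eta * (v - u) * x                                  # gradient step on (v - u)^2 / 2
    return float(A @ J / (np.linalg.norm(A) * np.linalg.norm(J)))
```

With these conventions the returned direction cosine should approach the overlap between A and the teachers' mean, illustrating the "student learns the ensemble in turn" dynamics.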

9

GENERALIZATION ERROR

• A goal of statistical learning theory is to obtain the generalization error theoretically.
• Generalization error = mean of the error over the distribution of new inputs.

[Figure: the error is averaged over the multivariate Gaussian distribution of the outputs y, v_k and u.]
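The averaging over new inputs can be checked numerically. The sketch below assumes the squared-error measure ε_g = E[(y − u)²]/2 and the same illustrative normalizations as before (input components of variance 1/N); under those assumptions ε_g reduces to a closed form in the overlaps, which the Monte Carlo estimate should reproduce.

```python
import numpy as np

def eg_monte_carlo(A, J, sigma_A, sigma_J, n_samples=100_000, seed=0):
    """Estimate the generalization error E[(y - u)^2] / 2 over fresh inputs."""
    rng = np.random.default_rng(seed)
    N = A.size
    x = rng.standard_normal((n_samples, N)) / np.sqrt(N)    # new inputs
    y = x @ A + sigma_A * rng.standard_normal(n_samples)    # true teacher outputs
    u = x @ J + sigma_J * rng.standard_normal(n_samples)    # student outputs
    return float(0.5 * np.mean((y - u) ** 2))

def eg_closed_form(A, J, sigma_A, sigma_J):
    """Closed form under the assumed conventions:
    eps_g = (|A|^2/N - 2 A.J/N + |J|^2/N + sigma_A^2 + sigma_J^2) / 2."""
    N = A.size
    return float(0.5 * (A @ A / N - 2 * A @ J / N + J @ J / N
                        + sigma_A ** 2 + sigma_J ** 2))
```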

10

Differential equations that describe the dynamical behaviors of the order parameters have been obtained based on self-averaging in the thermodynamic limit, as follows:

1. To simplify the analysis, the auxiliary order parameters r_J = R_J l (and similarly for the other overlaps) are introduced.
2. A is multiplied to both sides of the update J^{m+1} = J^m + f_k^m x^m, giving
   N r_J^{m+1} = N r_J^m + f_k^m y^m.
3. Summing over N dt inputs,
   N r_J^{m+N dt} = N r_J^m + f_k^m y^m + f_k^{m+1} y^{m+1} + … + f_k^{m+N dt−1} y^{m+N dt−1},
   and letting N → ∞ with t = m/N, self-averaging yields the deterministic equation dr_J/dt = ⟨ f_k y ⟩.
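As one representative example of the resulting deterministic dynamics: assuming f_k = η(v_k − u), independent noises, and a common teacher overlap R_B, the average ⟨f_k y⟩ gives dr_J/dt = η(R_B − r_J), with solution r_J(t) = R_B + (r_J(0) − R_B) e^{−ηt}. This is a reconstruction under those stated assumptions, not the paper's full system; the sketch below checks the Euler-integrated equation against the analytic solution.

```python
import numpy as np

def integrate_rJ(eta=0.3, R_B=0.7, rJ0=0.0, T=20.0, dt=1e-3):
    """Euler integration of d r_J / dt = eta * (R_B - r_J), the deterministic
    self-averaged equation for the student/true-teacher overlap (assuming
    f_k = eta * (v_k - u) and a common teacher overlap R_B)."""
    rJ = rJ0
    for _ in range(int(round(T / dt))):
        rJ += dt * eta * (R_B - rJ)
    return rJ

def rJ_analytic(t, eta=0.3, R_B=0.7, rJ0=0.0):
    """Analytic solution: r_J(t) = R_B + (rJ0 - R_B) * exp(-eta * t)."""
    return R_B + (rJ0 - R_B) * np.exp(-eta * t)
```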

11

Simultaneous differential equations in deterministic forms, which describe the dynamical behaviors of the order parameters

12

Analytical solutions of the order parameters

13

Dynamical behaviors of the generalization error, R and l

( η = 0.3, K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )

• The student becomes cleverer than any single member of the ensemble teachers.
• The larger the variety of the ensemble teachers is, the nearer the student and the true teacher are.

[Figure: left, order parameters l and R vs. t = m/N for q = 1.00, 0.80, 0.60, 0.49; right, generalization error vs. t = m/N for the same q values, with the student's error falling below the ensemble teachers' level.]

14

Steady state analysis ( t → ∞ )

• If η < 0 or η > 2: the generalization error and the length of the student diverge.
• If 0 < η < 2:
  – If η < 1, the more teachers there are, or the richer the variety of the teachers is, the cleverer the student can become.
  – If η > 1, the fewer teachers there are, or the poorer the variety of the teachers is, the cleverer the student can become.
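The divergence for η > 2 can be illustrated with a reconstructed special case of the deterministic equations: a single noisy teacher (K = 1), a noiseless student output, and normalized teachers, under which r = J·B/N and Q = l² obey dr/dt = η(1 − r) and dQ/dt = 2η(r − Q) + η²(1 − 2r + Q + σ_B²). The Q equation has growth rate η(η − 2), so the student's length settles for 0 < η < 2 and blows up for η > 2. These equations are an assumption-laden sketch, not the paper's full K-teacher system.

```python
def length_dynamics(eta, sigma_B2=0.1, T=30.0, dt=1e-3):
    """Euler integration of a reconstructed K=1, noiseless-student special case:
        dr/dt = eta * (1 - r)
        dQ/dt = 2*eta*(r - Q) + eta**2 * (1 - 2*r + Q + sigma_B2)
    with r = J.B/N and Q = l^2 = J.J/N.  The coefficient of Q in dQ/dt is
    eta*(eta - 2): negative (stable) for 0 < eta < 2, positive (divergent)
    for eta > 2.  Returns the final Q."""
    r, Q = 0.0, 1.0
    for _ in range(int(round(T / dt))):
        dr = eta * (1.0 - r)
        dQ = 2.0 * eta * (r - Q) + eta ** 2 * (1.0 - 2.0 * r + Q + sigma_B2)
        r += dt * dr
        Q += dt * dQ
    return Q
```

In the stable regime the steady state of this special case is Q* = 1 + η σ_B² / (2 − η), which itself diverges as η → 2, mirroring the boundary on the slide.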

15

Steady value of the generalization error, R and l

( K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )

[Figure: steady-state R (left) and generalization error (right, log scale) vs. η ∈ (0, 2) for q = 1.00, 0.80, 0.60, 0.49. For η < 1, rich variety (small q) is good; for η > 1, poor variety is good.]

16

Steady value of the generalization error, R and l

( q = 0.49, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )

[Figure: steady-state R (left) and generalization error (right, log scale) vs. η ∈ (0, 2) for K = 1, 3, 10, 30. For η < 1, many teachers are good; for η > 1, few teachers are good.]