1
Statistical Mechanics of Online Learning for Ensemble Teachers
Seiji Miyoshi (Kobe City College of Tech.), Masato Okada (Univ. of Tokyo, RIKEN BSI)
2
SUMMARY
We analyze the generalization performance of a student in a model composed of linear perceptrons: a true teacher, K ensemble teachers, and the student. By calculating the generalization error of the student analytically using statistical mechanics in the framework of on-line learning, we prove that when the learning rate satisfies η < 1, the generalization error decreases as the number K of teachers grows and as their variety increases. When η > 1, these properties are completely reversed. If the variety of the K teachers is rich enough, the direction cosine between the true teacher and the student approaches unity in the limits η → 0 and K → ∞.
3
BACKGROUND (1/2)
• Batch learning
  – given examples are used more than once
  – the student comes to give correct answers for all examples
  – requires a long time and a large memory
• On-line learning
  – examples are discarded once used
  – the student cannot give correct answers for all the examples used in training
  – a large memory is not necessary
  – it is possible to follow a time-variant teacher
4
BACKGROUND (2/2)
In most cases in actual human society, a student can observe examples from two or more teachers who differ from each other.

PURPOSE
• To analyze the generalization performance of a model composed of a student, a true teacher, and K teachers (ensemble teachers) who exist around the true teacher
• To discuss the relationship between the number and the variety of the ensemble teachers and the generalization error
5
MODEL (1/4)
• J learns B1, B2, …, BK in turn.
• J cannot learn A directly.
• A, B1, B2, …, BK, and J are linear perceptrons with noise.

[Figure: true teacher A, ensemble teachers B_1, …, B_K, and student J, with direction cosines R_B (between A and B_k), q (between B_k and B_k'), R_J (between A and J), and R_{B_k J} (between B_k and J)]
6
MODEL (2/4)
• Output of true teacher: a linear perceptron with Gaussian noise
• Outputs of ensemble teachers: linear perceptrons with Gaussian noises
• Output of student: a linear perceptron with Gaussian noise
7
MODEL (3/4)
• Inputs:
• Initial value of student:
• True teacher:
• Ensemble teachers:
• N → ∞ (thermodynamic limit)
• Order parameters
  – Length of student
  – Direction cosines
8
MODEL (4/4)
• Learning rule: gradient method on the squared error, with update weight f_k^m
• The student learns the K ensemble teachers in turn.
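As a concrete illustration of this learning rule, here is a minimal simulation sketch (not the authors' code; the update f_k^m = η(v_k^m − u^m), the input scaling, and all parameter values are assumptions chosen to match the settings shown on later slides):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000        # input dimension (thermodynamic limit is N -> infinity)
K = 3           # number of ensemble teachers
eta = 0.3       # learning rate
sigma_B2 = 0.1  # variance of the teachers' output noise (assumed)

# True teacher A, ensemble teachers B_k scattered around A (R_B ~ 0.7),
# and the student J, all N-dimensional weight vectors.
A = rng.standard_normal(N)
B = [0.7 * A + np.sqrt(1 - 0.7**2) * rng.standard_normal(N) for _ in range(K)]
J = rng.standard_normal(N)

for m in range(20 * N):                       # t = m / N runs up to 20
    x = rng.standard_normal(N) / np.sqrt(N)   # input with <x_i^2> = 1/N
    k = m % K                                 # ensemble teachers used in turn
    v = B[k] @ x + rng.normal(0.0, np.sqrt(sigma_B2))  # noisy teacher output
    u = J @ x                                 # student output
    J += eta * (v - u) * x                    # gradient step on (v - u)^2 / 2

# Direction cosine R_J between the true teacher and the student
R_J = A @ J / (np.linalg.norm(A) * np.linalg.norm(J))
print(R_J)
```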
9
GENERALIZATION ERROR
• A goal of statistical learning theory is to obtain the generalization error theoretically.
• Generalization error = mean of the error over the distribution of new inputs
• Here the error is averaged over a multiple Gaussian distribution.
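For linear perceptrons the error is quadratic in jointly Gaussian variables, so this average can be computed in closed form. A small illustrative sketch (noise-free case; the input scaling ⟨x_i²⟩ = 1/N and the particular A and J are assumptions) compares a Monte Carlo average with the Gaussian average |A − J|² / (2N):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500

A = rng.standard_normal(N)            # true teacher (noise-free here)
J = 0.5 * A + rng.standard_normal(N)  # some student, for illustration

# Monte Carlo: mean of the squared error over fresh Gaussian inputs.
X = rng.standard_normal((20000, N)) / np.sqrt(N)  # 20000 new inputs
eps_mc = np.mean(0.5 * (X @ (A - J)) ** 2)

# Closed form: for x with <x_i^2> = 1/N, E[(A.x - J.x)^2]/2 = |A - J|^2 / (2N).
eps_exact = np.sum((A - J) ** 2) / (2 * N)
print(eps_mc, eps_exact)
```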
10
Differential equations that describe the dynamical behaviors of the order parameters have been obtained based on self-averaging in the thermodynamic limit, as follows:

1. To simplify the analysis, auxiliary order parameters are introduced, e.g. r_J ≡ R_J l.
2. Both sides of the update rule
     J^{m+1} = J^m + f_k^m x^m
   are multiplied by A, giving
     N r_J^{m+1} = N r_J^m + f_k^m y^m.
3. Summing N dt consecutive updates,
     N r_J^{m+N dt} = N r_J^m + Σ_{i=0}^{N dt − 1} f_k^{m+i} y^{m+i},
   and in the limit N → ∞ the sum self-averages, yielding a deterministic differential equation for r_J.
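The self-averaging step can be checked numerically: sample-to-sample fluctuations of r_J = A·J / N shrink as N grows, so its dynamics become deterministic in the thermodynamic limit. A minimal sketch (a single noisy teacher and the update f = η(v − u) for simplicity; all parameter values are assumptions):

```python
import numpy as np

def r_J_at(N, t_max, eta, seed):
    """Run on-line learning from one noisy teacher and return r_J = A.J / N."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal(N)                                  # true teacher
    B = 0.7 * A + np.sqrt(1 - 0.7**2) * rng.standard_normal(N)  # teacher, R_B ~ 0.7
    J = rng.standard_normal(N)                                  # student
    for _ in range(int(t_max * N)):
        x = rng.standard_normal(N) / np.sqrt(N)
        v = B @ x + rng.normal(0.0, np.sqrt(0.1))               # noisy output
        J += eta * (v - J @ x) * x
    return A @ J / N

# Fluctuations across runs shrink roughly like 1/sqrt(N): self-averaging.
small = [r_J_at(250, 8.0, 0.3, s) for s in range(6)]
large = [r_J_at(2000, 8.0, 0.3, s) for s in range(6)]
print(np.std(small), np.std(large))
```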
11
Simultaneous differential equations in deterministic form, which describe the dynamical behaviors of the order parameters
13
Dynamical behaviors of the generalization error, R and l
( η = 0.3, K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )

• The student becomes cleverer than a member of the ensemble teachers.
• The larger the variety of the ensemble teachers is, the nearer the student and the true teacher are.

[Figures: order parameters l, R and the generalization error versus t = m/N, for q = 1.00, 0.80, 0.60, 0.49; the generalization-error plot also marks the levels of the student and of the ensemble teachers]
14
Steady state analysis ( t → ∞ )
• If η < 0 or η > 2: the generalization error and the length of the student diverge.
• If 0 < η < 2:
  – If η < 1, the more teachers exist or the richer the variety of the teachers is, the cleverer the student can become.
  – If η > 1, the fewer teachers exist or the poorer the variety of the teachers is, the cleverer the student can become.
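The η boundaries can be checked numerically. A sketch (single noisy teacher, assumed update f = η(v − u) and assumed parameter values) showing that the length of the student stays finite for 0 < η < 2 but blows up for η > 2:

```python
import numpy as np

def student_length(eta, N=500, t_max=10.0, seed=0):
    """Return l = |J| / sqrt(N) after on-line learning from one noisy teacher."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal(N)                    # teacher
    J = rng.standard_normal(N)                    # student
    for _ in range(int(t_max * N)):
        x = rng.standard_normal(N) / np.sqrt(N)
        v = B @ x + rng.normal(0.0, np.sqrt(0.1)) # noisy teacher output
        J += eta * (v - J @ x) * x
    return np.linalg.norm(J) / np.sqrt(N)

l_stable = student_length(0.5)   # inside 0 < eta < 2: l stays near 1
l_diverge = student_length(2.5)  # eta > 2: l grows exponentially
print(l_stable, l_diverge)
```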
15
Steady values of the generalization error, R and l
( K = 3, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )

• For η < 1, rich variety is good! For η > 1, poor variety is good!

[Figures: steady R and generalization error versus η (0 to 2), for q = 1.00, 0.80, 0.60, 0.49]
16
Steady values of the generalization error, R and l
( q = 0.49, R_B = 0.7, σ_A² = 0.0, σ_B² = 0.1, σ_J² = 0.2 )

• For η < 1, many teachers are good! For η > 1, few teachers are good!

[Figures: steady R and generalization error versus η (0 to 2), for K = 1, 3, 10, 30]