
Page 1:

Efficient Training in high-dimensional weight space

Michael Biehl, Christoph Bunzmann, Robert Urbanczik

Theoretische Physik und Astrophysik, Computational Physics
Julius-Maximilians-Universität Würzburg
Am Hubland, D-97074 Würzburg, Germany
http://theorie.physik.uni-wuerzburg.de/~biehl

Wiskunde & Informatica, Intelligent Systems
Rijksuniversiteit Groningen, Postbus 800,
NL-9718 DD Groningen, The Netherlands
[email protected], www.cs.rug.nl/~biehl

Page 2:

Efficient training in high-dimensional weight space

Learning from examples

A model situation: layered neural networks, the student-teacher scenario

The dynamics of on-line learning: on-line gradient descent; delayed learning, plateau states

Efficient training of multilayer networks: learning by Principal Component Analysis (idea, analysis, results)

Summary, Outlook: selected further topics, prospective projects

Page 3:

Learning from examples

choice of the adjustable parameters in adaptive information processing systems (supervised learning):

· based on example data, e.g. input/output pairs in classification tasks, time series prediction, regression problems

· parameterizes a hypothesis, e.g. for an unknown classification or regression task

· guided by the optimization of an appropriate objective or cost function, e.g. the performance with respect to the example data

· results in generalization ability, e.g. the successful classification of novel data

Page 4:

Theory of learning processes

· description of specific applications, e.g. handwritten digit recognition:
  - particular training scheme
  - given real-world problem
  - special set of example data ...

· typical properties of model scenarios, e.g. learning curves:
  - network architecture
  - statistics of data, noise
  - learning algorithm
  → understanding / prediction of relevant phenomena, algorithm design

· general results, e.g. performance bounds, independent of:
  - statistical properties of the data
  - specific task
  - details of the training procedure ...

trade-off: general validity vs. applicability

Page 5:

A two-layered network: the soft committee machine

input data $\xi \in \mathbb{R}^N$

adaptive weights $w_k \in \mathbb{R}^N$, hidden units $k = 1, 2, \dots, K$

sigmoidal hidden activation, e.g. $g(x) = \mathrm{erf}(a\,x)$, acting on the local fields $x_k = w_k \cdot \xi$

input/output relation $\mathbb{R}^N \to \mathbb{R}$ (fixed hidden-to-output weights):

$$\sigma(\xi) = \sum_{k=1}^{K} g(w_k \cdot \xi)$$

SCM + adaptive thresholds: universal approximator
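A minimal numerical sketch may help fix the notation. The following NumPy code is our illustration, not an implementation from the talk; it assumes the activation scale a = 1 and hidden-to-output weights fixed to one:

```python
# Soft committee machine: sigma(xi) = sum_k g(w_k . xi), with g = erf
# (i.e. the scale parameter a = 1, an assumption of this sketch).
import numpy as np
from scipy.special import erf

def scm_output(W, xi):
    """SCM input/output relation for weights W of shape (K, N);
    xi may be a single input (N,) or a batch of inputs (P, N)."""
    x = xi @ W.T                 # local fields x_k = w_k . xi
    return erf(x).sum(axis=-1)   # fixed hidden-to-output weights = 1

rng = np.random.default_rng(0)
K, N = 3, 100
W = rng.normal(size=(K, N)) / np.sqrt(N)   # adaptive weights w_k
xi = rng.normal(size=N)                    # isotropic input
print(scm_output(W, xi))
```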

Page 6:

Student-teacher scenario

adaptive student with K hidden units:

$$\sigma(\xi) = \sum_{k=1}^{K} g(w_k \cdot \xi)$$

teacher, the (best) parameterization of the rule, with M hidden units:

$$\tau(\xi) = \sum_{m=1}^{M} g(w_m^* \cdot \xi)$$

M = K: ideal situation, perfectly matching complexity
M > K: unlearnable rule
M < K: over-sophisticated student

relevant cases, interesting effects
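To make the scenario concrete, here is a small self-contained sketch with the same conventions as the previous block (g = erf, unit-length teacher vectors); the teacher SCM supplies the example labels:

```python
# Student-teacher scenario: the rule tau(xi) is itself an SCM with
# M hidden units and fixed weight vectors w*_m (our sketch).
import numpy as np
from scipy.special import erf

def scm_output(W, xi):
    return erf(xi @ W.T).sum(axis=-1)

rng = np.random.default_rng(1)
N, M, P = 100, 2, 1000
W_star = rng.normal(size=(M, N))
W_star /= np.linalg.norm(W_star, axis=1, keepdims=True)  # |w*_m| = 1;
# random directions are nearly orthogonal for large N (T_mn ~ delta_mn)

xis = rng.normal(size=(P, N))     # isotropic inputs
taus = scm_output(W_star, xis)    # reliable teacher outputs tau(xi^mu)
# a student with K = M matches the complexity; M > K is unlearnable,
# M < K gives an over-sophisticated student
```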

Page 7:

training is based on the performance w.r.t. the example data, e.g.

$$E = \frac{1}{P}\sum_{\mu=1}^{P} e^{\mu}, \qquad e^{\mu} = \frac{1}{2}\left(\sigma(\xi^{\mu}) - \tau(\xi^{\mu})\right)^2$$

input/output pairs $\mathbb{D} = \left\{\xi^{\mu}, \tau(\xi^{\mu})\right\}_{\mu=1}^{P}$: (reliable) examples for the unknown function or rule $\tau(\xi)$

evaluation after training: the generalization error

$$e_G = \left\langle e(\xi) \right\rangle_{\xi},$$

the expected error for a novel input $\xi \notin \mathbb{D}$, w.r.t. the density of inputs / a set of test inputs
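In a simulation both quantities are straightforward to estimate; a sketch under the same assumptions as above, with e_G approximated by an average over fresh test inputs:

```python
# Training error E on the example set and a Monte Carlo estimate of
# the generalization error e_G from novel test inputs (our sketch).
import numpy as np
from scipy.special import erf

def scm_output(W, xi):
    return erf(xi @ W.T).sum(axis=-1)

def mean_quadratic_error(W_student, W_teacher, xis):
    """average of e = (1/2) (sigma(xi) - tau(xi))^2 over the inputs"""
    diff = scm_output(W_student, xis) - scm_output(W_teacher, xis)
    return 0.5 * np.mean(diff ** 2)

rng = np.random.default_rng(2)
N, K, P = 100, 2, 1000
W_star = rng.normal(size=(K, N)) / np.sqrt(N)   # teacher, M = K
W = rng.normal(size=(K, N)) / np.sqrt(N)        # untrained student

E = mean_quadratic_error(W, W_star, rng.normal(size=(P, N)))        # on D
e_G = mean_quadratic_error(W, W_star, rng.normal(size=(10000, N)))  # novel
print(E, e_G)
```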

Page 8:

Statistical Physics approach

· consider large systems, in the thermodynamic limit $N \to \infty$ ($K, M \ll N$), where N is the dimension of the input data and sets the number of adjustable parameters

· perform averages over the stochastic training process $\langle \cdot \rangle_T$ and over the randomized example data $\langle \cdot \rangle_{\mathbb{D}}$ (quenched disorder)

(technically) simplest case: reliable teacher outputs; isotropic input density, independent components with zero mean / unit variance

· evaluate typical properties, e.g. the learning curve $\langle e_G \rangle_{\mathbb{D},T}$ vs. P

· description in terms of macroscopic quantities, e.g. the overlap parameters

$$R_{jm} = w_j \cdot w_m^*, \qquad Q_{ij} = w_i \cdot w_j,$$

student/teacher similarity measures

next: $e_G$

Page 9:

The generalization error

$$e_G = \left\langle e(\xi) \right\rangle_{\xi} = \frac{1}{2}\left\langle \Big( \sum_{k=1}^{K} g(x_k) - \sum_{m=1}^{M} g(x_m^*) \Big)^{2} \right\rangle_{\xi}, \qquad x_k = w_k \cdot \xi, \quad x_m^* = w_m^* \cdot \xi$$

the local fields are sums of many random numbers; by the Central Limit Theorem they become correlated Gaussians for large N, with first and second moments

$$\langle x_k \rangle = \langle x_m^* \rangle = 0$$
$$\langle x_j x_k \rangle = w_j \cdot w_k = Q_{jk}, \qquad \langle x_j x_m^* \rangle = w_j \cdot w_m^* = R_{jm}, \qquad \langle x_m^* x_n^* \rangle = w_m^* \cdot w_n^* = \delta_{mn}$$

averages over $\xi$ become integrals over the $x_k, x_m^*$:

$$e_G\left(\xi, \{w_k\}\right) \;\to\; e_G\left(R_{jm}, Q_{jk}\right)$$

microscopic: $K\,N$ weights → macroscopic: $\tfrac{1}{2}(K^2+K) + K\,M$ order parameters
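The reduction is easy to verify numerically: instead of averaging over inputs, sample the local fields directly from a Gaussian whose covariance is assembled from Q, R, and T. The following check is our own construction, with g = erf as before:

```python
# CLT check: e_G computed from Gaussian local fields with moments
# Q_jk, R_jm, T_mn, instead of a direct average over inputs xi.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(3)
N, K, M = 1000, 2, 2
W_star = rng.normal(size=(M, N)) / np.sqrt(N)   # ~orthonormal for large N
W = rng.normal(size=(K, N)) / np.sqrt(N)

Q = W @ W.T                       # Q_jk = w_j . w_k
R = W @ W_star.T                  # R_jm = w_j . w*_m
T = W_star @ W_star.T             # ~ delta_mn

cov = np.block([[Q, R], [R.T, T]])                 # moments of (x, x*)
fields = rng.multivariate_normal(np.zeros(K + M), cov, size=200_000)
x, x_star = fields[:, :K], fields[:, K:]
e_G = 0.5 * np.mean((erf(x).sum(axis=1) - erf(x_star).sum(axis=1)) ** 2)
print(e_G)    # agrees with a direct average over inputs xi for large N
```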

Page 10:

Dynamics of on-line gradient descent

presentation of single examples; weights $w^{\mu-1}$ after presentation of $\mu-1$ examples

on-line learning step, using a novel, random example $\left(\xi^{\mu}, \tau(\xi^{\mu})\right)$:

$$w_k^{\mu} = w_k^{\mu-1} - \frac{\eta}{N}\, \nabla_{w_k}\, e^{\mu}, \qquad e^{\mu} = \frac{1}{2}\left(\sigma(\xi^{\mu}) - \tau(\xi^{\mu})\right)^2$$

the number of examples $\mu$ plays the role of a discrete learning time

practical advantages:
· no explicit storage of all examples $\mathbb{D}$ required
· little computational effort per example

mathematical ease: the typical dynamics of learning can be evaluated on average over a randomized sequence of examples → coupled ODEs for $\{R_{jm}, Q_{ij}\}$ in the time $\alpha = P/(KN)$

Page 11:

projections $Q_{jk}(\mu) = w_j^{\mu} \cdot w_k^{\mu}$, $R_{km}(\mu) = w_k^{\mu} \cdot w_m^*$

recursions, e.g.

$$R_{km}(\mu) - R_{km}(\mu-1) = \frac{\eta}{N}\left(\tau - \sigma\right) g'(x_k^{\mu})\; x_m^{*\,\mu}$$

large N:
• average over the latest example $\xi^{\mu}$, with Gaussian $x_k^{\mu}, x_m^{*\,\mu}$
• mean recursions → coupled ODEs in the continuous time

$$\alpha = \frac{\mu}{K\,N} \;\sim\; \text{training time} \;\sim\; \text{examples per weight}$$

learning curve: $Q_{jk}(\alpha),\ R_{km}(\alpha) \to e_G(\alpha)$ (a simulation sketch follows below)
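For comparison with the ODE picture one can simulate the microscopic process directly. A self-contained sketch of ours, with g = erf so that g'(x) = (2/√π) e^{−x²}, recording Q_jk(α) and R_km(α) along the trajectory:

```python
# On-line gradient descent for the SCM, tracking the overlap
# parameters (our sketch: g = erf, isotropic inputs, rate eta/N).
import numpy as np
from scipy.special import erf

def online_step(W, xi, tau, eta):
    """w_k <- w_k - (eta/N) (sigma - tau) g'(x_k) xi"""
    x = W @ xi                                    # local fields x_k
    sigma = erf(x).sum()
    gprime = (2 / np.sqrt(np.pi)) * np.exp(-x ** 2)
    return W - (eta / len(xi)) * np.outer((sigma - tau) * gprime, xi)

rng = np.random.default_rng(4)
N, K = 100, 2
W_star = rng.normal(size=(K, N)) / np.sqrt(N)     # teacher, M = K
W = rng.normal(size=(K, N)) * 1e-3                # R_ij(0) ~ 0

history = []
for mu in range(300 * K * N):                     # alpha up to 300
    xi = rng.normal(size=N)                       # novel random example
    tau = erf(W_star @ xi).sum()
    W = online_step(W, xi, tau, eta=1.5)
    if mu % (K * N) == 0:
        history.append((mu / (K * N),             # alpha = mu/(K N)
                        W @ W.T,                  # Q_jk(alpha)
                        W @ W_star.T))            # R_km(alpha)
```

Plotting the recorded overlaps against α reproduces the plateau behaviour discussed on the next slides.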

Page 12:

learning curve

[figure: learning curve $e_G$ vs. $\alpha = P/(KN)$; $e_G$ axis 0 to 0.05, $\alpha$ axis 0 to 300]

example: K = M = 2, $\eta = 1.5$, $R_{ij}(0) \approx 0$

fast initial decrease; quasi-stationary plateau states with all $R_{ij} \approx R$ (unspecialized student weights, each $w_i$ with equal overlap to all $w_j^*$) dominate the learning process; finally perfect generalization, $e_G \to 0$

Biehl, Riegler, Wöhler, J. Phys. A 29 (1996) 4769

Page 13:

evolution of the overlap parameters

example: K = M = 2, $T_{mn} = \delta_{mn}$, $\eta = 1$, $R_{ij}(0) \approx 0$

[figure: $R_{11}, R_{22}, Q_{11}, Q_{22}$ and $R_{12}, R_{21}, Q_{12} = Q_{21}$ vs. $\alpha$ from 0 to 300, values between 0.0 and 1.0; sketch of the student vectors $w_1, w_2$ and teacher vectors $w_1^*, w_2^*$]

plateau: permutation symmetry of the branches in the student network

Page 14:

Monte Carlo simulations: self-averaging

[figure: mean and standard deviation of the quantity $Q_{jm}$ plotted vs. 1/N; the fluctuations vanish in the thermodynamic limit $N \to \infty$]

Page 15:

Plateau length

$\alpha_{\text{plat}} \to \infty$ if all $R_{jk}(0) = R(0)$ exactly (self-averaging)

assume randomized initialization of the weight vectors, $R_{jk}(0) = O(1/\sqrt{N})$:

$$\alpha_{\text{plat}} \propto \ln N, \qquad P \propto K\,N\,\ln N \ \text{ examples needed for successful learning!}$$

hidden unit specialization ($R_{jj} \gg R_{jm}$) only for $P \gg K N$; avoiding the plateau requires a priori knowledge (initial macroscopic overlaps)

property of the learning scenario (a necessary phase of training) or artifact of the training prescription???

Page 16:

[figure: test error $E_{\text{test}}$ vs. training time t]

S.J. Hanson, in Y. Chauvin & D. Rumelhart (eds.), Backpropagation: Theory, Architectures, and Applications

Page 17:

Training by Principal Component Analysis

problem: delayed specialization in the $(K \cdot N)$-dimensional weight space

idea:
A) identification (approximation) of the subspace spanned by the teacher vectors $w_m^*$
B) actual training within this low-dimensional space

example: soft committee teacher (K = M), isotropic input density

modified correlation matrix

$$C = \left\langle \tau(\xi)^2\, \xi\,\xi^{T} \right\rangle, \qquad C_{ij} = \left\langle \tau(\xi)^2\, \xi_i\, \xi_j \right\rangle$$

eigenvalues and eigenvectors, $\lambda_{\Sigma} > \lambda_{o} > \lambda_{\Delta}$:
· 1 eigenvector $w_{\Sigma} \propto \sum_{m=1}^{M} w_m^*$ with eigenvalue $\lambda_{\Sigma}$
· (K−1) eigenvectors $\Delta_m = w_1^* - w_m^*$ $(m = 2, 3, \dots, M)$ with eigenvalue $\lambda_{\Delta}$
· (N−K) eigenvectors $u \perp w_m^*$ with eigenvalue $\lambda_{o}$
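This spectral structure can be probed numerically by estimating C from a large labelled sample, anticipating the empirical estimator of the next slide. The following NumPy sketch is our own construction:

```python
# Estimate the modified correlation matrix from teacher-labelled
# data and compare its top eigenvector with sum_m w*_m (our check).
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(5)
N, M = 200, 3
P = 20 * M * N                                   # P = alpha K N examples
W_star = rng.normal(size=(M, N)) / np.sqrt(N)    # soft committee teacher

xis = rng.normal(size=(P, N))
taus = erf(xis @ W_star.T).sum(axis=1)           # tau(xi^mu)

# C_ij ~ (1/P) sum_mu tau(xi^mu)^2 xi^mu_i xi^mu_j
C = (xis * taus[:, None] ** 2).T @ xis / P
eigvals, eigvecs = np.linalg.eigh(C)             # ascending order

w_sum = W_star.sum(axis=0)                       # direction of lambda_Sigma
top = eigvecs[:, -1]                             # largest eigenvalue e.v.
print(abs(top @ w_sum) / np.linalg.norm(w_sum))  # -> close to 1
```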

Page 18:

empirical estimate from a limited data set:

$$C^{P}_{ij} = \frac{1}{P}\sum_{\mu=1}^{P} \tau(\xi^{\mu})^{2}\, \xi_i^{\mu}\, \xi_j^{\mu}$$

note: the required memory $\propto N^2$ does not increase with P

· determine the largest eigenvalue with eigenvector $\Delta_1^P$ ($\approx w_{\Sigma}$) and the (K−1) smallest eigenvalues with eigenvectors $\Delta_k^P$ $(k = 2, \dots, K)$

· representation of the student weights: $w_j = \sum_{k=1}^{K} a_{kj}\, \Delta_k^P$ $(j = 1, 2, \dots, K)$

· optimization of E w.r.t. the $a_{kj}$ ($K^2 \ll K\,N$ coefficients; # of examples $P = \alpha K N \gg K^2$)

B) specialization in the K-dimensional space of the $\Delta_k^P$
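Stage B can then be sketched on top of the previous block, continuing directly from it (reusing xis, taus, eigvecs, M, rng, erf, np); the batch gradient descent on E over the K·K coefficients is our stand-in for the optimization, not the original procedure:

```python
# Stage B: student weights restricted to the PCA subspace,
# w_j = sum_k a_kj Delta_k, fitted by gradient descent on E.
K = M
basis = np.column_stack([eigvecs[:, -1],          # Delta_1 ~ w_Sigma
                         eigvecs[:, :K - 1]])     # K-1 smallest e.v.
z = xis @ basis                                   # (P, K) projected inputs
P = len(xis)

a = 0.1 * rng.normal(size=(K, K))                 # coefficients a_kj
eta = 0.05
for _ in range(2000):
    x = z @ a                                     # fields x_j = w_j . xi
    sigma = erf(x).sum(axis=1)
    delta = (sigma - taus)[:, None] * (2 / np.sqrt(np.pi)) * np.exp(-x**2)
    a -= eta * (z.T @ delta) / P                  # dE/da: K*K numbers only
W_student = (basis @ a).T                         # back to (K, N) weights
```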

Page 19:

typical properties: given a random set $\mathbb{D}$ of $P = \alpha N K$ examples

formal partition sum

$$Z = \int d^{N}\Delta\; \exp\!\left(-\beta\, P\, C^{P}(\Delta; \mathbb{D})\right)$$

quenched free energy $\sim \left\langle \ln Z \right\rangle_{\mathbb{D}}$: replica trick, saddle point integration in the limit $N \to \infty$

typical overlap with the teacher weights,

$$\rho^{2} = \left\langle \left(\Delta \cdot w_i^{*}\right)^{2} \right\rangle_{\mathbb{D},T},$$

measures the success of the teacher space identification A)

B) given $\rho$, determine the optimal $e_G$ achievable by a linear combination of the $\Delta_i$


Page 21:

K = 3, Statistical Physics theory and Monte Carlo simulations, N = 400 and N = 1600 (•)

[figure: A) teacher space identification and B) generalization error vs. $\alpha$, with $P = \alpha K N$ examples; transition from the unspecialized to the specialized state at $\alpha_c$]

$\alpha_c(K{=}2) = 4.49$, $\alpha_c(K{=}3) = 8.70$; large-K theory: $\alpha_c(K) \approx 2.94\,K$ (N-independent!)

specialization without a priori knowledge ($\alpha_c$ independent of N)

Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)

Page 22:

spectrum of the matrix $C^P$, teacher with M = 7 hidden units

[figure: eigenvalue spectrum of $C^P$ with the K−1 = 6 smallest eigenvalues split off from the bulk]

the algorithm requires no prior knowledge of M: the PCA spectrum hints at the required model complexity

potential application: model selection
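Since the M−1 eigenvalues λ_Δ split off below the bulk, M can be read off from the largest gap in the sorted spectrum. A heuristic sketch of ours, not the original procedure:

```python
# Model selection from the spectrum of C^P: locate the largest gap
# among the small eigenvalues (heuristic sketch).
import numpy as np

def estimate_M(eigvals, max_units=20):
    """eigvals: ascending spectrum of C^P; returns an estimate of M."""
    gaps = np.diff(eigvals[:max_units])
    i = int(np.argmax(gaps))     # largest gap after the (i+1)-th value
    return i + 2                 # i+1 = M-1 small eigenvalues -> M

# with eigvals from the C^P sketch above (M = 3 teacher):
# estimate_M(eigvals) -> 3
```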

Page 23:

Summary

· model situation, supervised learning
  - the soft committee machine
  - student-teacher scenario
  - randomized training data

· dynamics of on-line gradient descent
  - delayed learning due to symmetry breaking
  - necessary specialization processes

· statistical physics inspired approach
  - large systems
  - thermal (training) and disorder (data) averages
  - typical, macroscopic properties

· efficient training
  - PCA-based learning algorithm reduces the dimensionality of the problem
  - specialization without a priori knowledge

Page 24:

Further topics

· perceptron training (single layer): optimal stability classification, dynamics of learning
· unsupervised learning: principal component analysis; competitive learning, clustered data
· specialization processes: discontinuous learning curves; delayed learning, plateau states
· dynamics of on-line training: perceptron, unsupervised learning, two-layered feed-forward networks
· algorithm design: variational method, optimal algorithms, construction algorithm
· non-trivial statistics of data: learning from noisy data, time-dependent rules

Page 25:

Selected Prospective Projects

· unsupervised learning: density estimation, feature detection, clustering, (Learning) Vector Quantization, compression, self-organizing maps

· application relevant architectures and algorithms: Local Linear Model Trees, Learning Vector Quantization, Support Vector Machines

· model selection: estimate the complexity of a rule or mixture density

· algorithm design: variational optimization, e.g. of an alternative correlation matrix

$$C_{ij} = \frac{1}{P} \sum_{\mu=1}^{P} F\!\left(\tau(\xi^{\mu})\right) \xi_i^{\mu}\, \xi_j^{\mu}$$