
Connectionist Time and Dynamic Systems Time in One Architecture?

Modeling Word Learning at Two Timescales

Jessica S. Horst ([email protected]), Bob McMurray, Larissa K. Samuelson

Dept. of Psychology, University of Iowa

Two Time Scales in Neural Networks

Connectionist and dynamical systems accounts:

• stress change over time

• complement each other in timescale

Dynamic Systems: online processes

Connectionist Networks: long-term learning

Many domains of development require both timescales:

Example: language development requires

• sensitivity to the brief and sequential nature of the input

• slower developmental processes.

Two Time Scales in Language Acquisition

Word learning is often attributed to fast mapping

- quick link between a novel name and a novel object (e.g., Carey, 1978).

But recent empirical data suggest that fast mapping and word learning may represent two distinct time scales (Horst & Samuelson, 2005).

- Fast Mapping: quick process emerging in the moment.

- Word Learning: gradual process over the course of development

We capture both timescales in a recurrent network:

• Activation feeds from input layers to decision layers.

• Decision units compete via inhibition.

• Activation feeds back to input layers.

• Cycle continues until system settles.
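The four steps above can be sketched as a small settling loop. This is a minimal illustration, not the poster's implementation: the poster gives no update equations, so the function names (`normalize`, `feed_back`, `settle`), the squared-and-normalized form of competition (a divisive stand-in for inhibition), the multiplicative feedback, and the stopping tolerance are all assumptions.

```python
def normalize(values):
    """Scale non-negative values to sum to 1 (all-zero input is returned as is)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def feed_back(inputs, weights, dec):
    """Rescale each input unit by the support it gets from the decision layer."""
    support = [sum(w * d for w, d in zip(row, dec)) for row in weights]
    return normalize([x * s for x, s in zip(inputs, support)])


def settle(aud, vis, w_aud, w_vis, tol=1e-4, max_cycles=200):
    """Cycle activation between input and decision layers until it settles.

    w_aud[i][j] and w_vis[i][j] are hypothetical weights from input unit i
    to decision unit j. Returns (decision activations, visual activations,
    number of cycles run).
    """
    n_dec = len(w_aud[0])
    aud, vis = normalize(aud), normalize(vis)
    dec = [1.0 / n_dec] * n_dec
    for cycle in range(1, max_cycles + 1):
        # 1. Feedforward: inputs drive decision units through the weights.
        net = [
            sum(a * w_aud[i][j] for i, a in enumerate(aud))
            + sum(v * w_vis[i][j] for i, v in enumerate(vis))
            for j in range(n_dec)
        ]
        # 2. Competition (assumed form): squaring sharpens differences and
        #    normalizing makes one unit's gain another unit's loss.
        new_dec = normalize([n * n for n in net])
        # 3. Feedback: decision activation flows back to both input layers.
        aud = feed_back(aud, w_aud, new_dec)
        vis = feed_back(vis, w_vis, new_dec)
        # 4. Stop once decision activation has stopped changing.
        change = max(abs(x - y) for x, y in zip(new_dec, dec))
        dec = new_dec
        if change < tol:
            break
    return dec, vis, cycle
```

For example, with a toy two-item weight matrix `w = [[0.6, 0.4], [0.4, 0.6]]`, calling `settle([1, 0], [1, 1], w, w)` presents name 0 with both objects visible; the decision unit favored by the name wins, and feedback gradually shifts visual activation toward the matching object.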

The Architecture (McMurray & Spivey, 2000)

[Figure: Initial State (Before Learning). Auditory inputs and visual inputs connect to the decision units (hidden) layer.]

• Unsupervised Hebbian learning occurs on every cycle.

• Online decision dynamics reflect auditory and visual competitors.
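A single unsupervised Hebbian update of the kind applied on every cycle might look like the sketch below. The exact rule is not given on the poster; the form used here (growth where input and decision units are co-active, with decay gated by the input) and the names `hebbian_step` and `rate` are assumptions chosen for illustration.

```python
def hebbian_step(weights, pre, post, rate=0.05):
    """Return updated weights after one Hebbian step.

    weights[i][j] links input unit i to decision unit j.
    Assumed rule: delta_w = rate * pre_i * (post_j - w_ij), so weights from
    an active input move toward the current decision activations, and
    weights from silent inputs are left alone.
    """
    return [
        [w + rate * pre_i * (post_j - w) for w, post_j in zip(row, post)]
        for row, pre_i in zip(weights, pre)
    ]


# Name 0 is heard while decision unit 0 is winning the competition.
weights = [[0.5, 0.5], [0.5, 0.5]]
updated = hebbian_step(weights, pre=[1.0, 0.0], post=[0.9, 0.1])
```

In this toy case the link from the heard name to the winning unit grows, its link to the losing unit shrinks, and the weights of the unheard name do not change. No teacher signal is involved, which is what makes the rule unsupervised.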

[Figure: activation (0 to 1) plotted over cycles (0 to 18).]

The Model

End State (Post Learning)

Intermediate State (During Learning)

• 15 Auditory & 15 Visual units

• 90 Decision units

• Names presented singly with a variable number of objects

• Name-Decision & Object-Decision associations strengthened via learning

• After 4000 training trials, the network forms localist representations

• Learns name-object links and to ignore visual competitors
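One way to picture a training trial (a single name presented with a variable number of objects) is the sketch below. The poster does not describe the input coding, so the one-unit-per-item coding, the guarantee that the named object is present, the `max_competitors` limit, and the function name `make_trial` are all assumptions.

```python
import random


def make_trial(target, n_items=15, max_competitors=4, rng=None):
    """Build (auditory, visual) input vectors for one training trial.

    The name of `target` is the only active auditory unit. The visual layer
    shows the target object plus a random number of competitor objects.
    """
    rng = rng or random.Random()
    aud = [0.0] * n_items
    aud[target] = 1.0
    others = [i for i in range(n_items) if i != target]
    competitors = rng.sample(others, rng.randint(0, max_competitors))
    vis = [0.0] * n_items
    for item in [target] + competitors:
        vis[item] = 1.0
    return aud, vis


aud, vis = make_trial(target=3, rng=random.Random(0))
```

Because the competitors change from trial to trial while the named object stays constant, only the name-object pairing is consistently co-active, which is the regularity a Hebbian learner can pick up.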

[Figure: two panels showing connection strength (color scale 0.05 to 0.2) between auditory input units (1 to 10) and decision units (1 to 90). One panel labels decision units 9, 16, 26, 30, 32, 39, 41, 49, 65, and 67.]

Fast: Moment by Moment

• Online information integration and constraint satisfaction (e.g., McClelland & Elman, 1986; Dell, 1986)

• Settles into a stable pattern of activation based on auditory and visual inputs and stored knowledge (weights)

• Model makes correct name-object links based on the latest input

Slow: Over the Long-Term

• Unsupervised Hebbian learning

• Associates words with visual targets

• Learns to ignore visual competitors

Two Time Scales

The two time scales are not independent

Long-term learning depends critically on the dynamics of the fast time scales

• Competition between decision units ensures pseudo-localist representations, which are critical for Hebbian learning (e.g., Rumelhart & Zipser, 1986)

• Learning occurs on each cycle

- Influences processing cycle-by-cycle & trial-by-trial

• Accumulated learning across trials leads to learning on long-term time scale (i.e., word learning)
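The dependence between the two time scales can be illustrated with a toy simulation in which a small weight change is made on every cycle and the result is tracked across trials. This is a sketch under stated assumptions, not the poster's model: it follows one name and two competing decision units, and the competition rule, learning rule, starting weights, and parameter values (`cycles_per_trial`, `rate`) are all hypothetical.

```python
def run_trials(n_trials, cycles_per_trial=10, rate=0.002):
    """Track the weight gap between two decision units across trials.

    Returns the winner-minus-loser weight gap measured after each trial.
    """
    w = [0.55, 0.45]  # weights from one name to two competing decision units
    pre = 1.0         # the name is heard on every cycle
    gap_after_each_trial = []
    for _ in range(n_trials):
        for _ in range(cycles_per_trial):
            # Fast time scale: decision units compete (assumed form:
            # squared, normalized activation).
            squared = [x * x for x in w]
            total = sum(squared)
            act = [s / total for s in squared]
            # Slow time scale: a small Hebbian change on every cycle, driven
            # by the activations the competition just produced.
            w = [x + rate * pre * (a - x) for x, a in zip(w, act)]
        gap_after_each_trial.append(w[0] - w[1])
    return gap_after_each_trial


gaps = run_trials(200)
```

In this sketch, one trial barely moves the weights, yet the gap widens on every trial, because each small change slightly sharpens the next round of competition. That is the sense in which long-term learning is built out of the fast dynamics.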

Dependent Time Scales

Empirical Results

[Figure: proportion of correct choices (0 to 1) for Familiar Name and Novel Name trials, with chance level marked.]

• 24-month-old children

• Saw 2 familiar & 1 novel objects

• Asked to get familiar and novel objects (e.g., “get the cow!” or “get the yok!”)

Fast Time Scale

Example objects: cow (familiar), block (familiar), yok (novel)

• Children were excellent at fast mapping (finding the referent of novel and familiar words in the moment).


Slow Time Scale

After a 5-minute delay, children were asked to pick the referent of a newly fast-mapped name (e.g., “get the yok!”).

Test objects: yok (target), fode (named foil), unnamed foil (previously seen)

• Children were unable to retain mappings after a 5-minute delay

[Figure: proportion of correct choices (0 to 1) for Familiar Name, Novel Name, and Retention trials, with chance level marked. Two *** markers appear in the figure.]

• Initial findings replicated with simpler tasks:

– Is there an effect of the number of names or trials?

• Children’s difficulty in retaining newly fast-mapped names is not related to the number of names or trials

Replication

Replication #1 (N = 12)

• 1 Novel Name

• 8 Familiar Names

• 7 Preference Trials

• Fast Mapping: 9/12 **

• Retention: 4/9 n.s.

Replication #2 (N = 12)

• 1 Novel Name

• 2 Familiar Names

• Fast Mapping: 7/12 *

• Retention: 4/7 n.s.

* Binomial, p < .05; ** Binomial, p < .01
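The stars in the replication results come from binomial tests. As a rough illustration of that arithmetic, the sketch below computes the one-tailed probability of at least k successes out of n. The function name is hypothetical, and both the chance level of 1/3 (one target among three objects) and the one-tailed form are assumptions; the poster states neither for the replications, so values from this sketch need not match the reported stars.

```python
from math import comb


def binomial_p_at_least(successes, n, chance):
    """Probability of `successes` or more hits in `n` trials at `chance`."""
    return sum(
        comb(n, k) * chance**k * (1 - chance) ** (n - k)
        for k in range(successes, n + 1)
    )


# Assumed chance level: one correct object out of three.
p_fast_mapping = binomial_p_at_least(9, 12, chance=1 / 3)
```

Passing a different `chance` value shows how strongly the conclusion depends on the number of objects assumed to be in the array.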

Simulations

• 20 networks initialized with random weights

• 15-word lexicon (names & objects):

– 5 familiar words

– 5 novel words

– 5 held out

• Trained on 5 familiar items for 5000 epochs

• Items presented in random order

• Run in the Fast Mapping Experiment:

– 10 fast mapping trials (5 familiar, 5 novel)

– 5 retention trials

• Learning was not turned off during experiment.
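The simulation design above can be laid out as a small data structure, one per network. This sketch only builds the word groups and trial lists; it does not train anything. The function name `design_experiment`, the random assignment of words to groups, the shuffled order of test trials, and the choice to test retention on the five novel words are assumptions.

```python
import random


def design_experiment(seed, n_words=15, group_size=5):
    """Split the lexicon into word groups and build the test trial lists."""
    rng = random.Random(seed)
    words = list(range(n_words))
    rng.shuffle(words)
    familiar = words[:group_size]                  # trained before the test
    novel = words[group_size:2 * group_size]       # first met during the test
    held_out = words[2 * group_size:3 * group_size]
    fast_mapping = [("familiar", w) for w in familiar]
    fast_mapping += [("novel", w) for w in novel]
    rng.shuffle(fast_mapping)
    retention = [("retention", w) for w in novel]  # assumed: novel words retested
    rng.shuffle(retention)
    return {
        "familiar": familiar,
        "novel": novel,
        "held_out": held_out,
        "fast_mapping_trials": fast_mapping,
        "retention_trials": retention,
    }


# One design per simulated network.
designs = [design_experiment(seed) for seed in range(20)]
```

Seeding each design separately keeps every simulated network reproducible while still giving each one its own word assignment and trial order.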

How The Model Behaves

Fast Time Scale:

• Model succeeded on both types of fast-mapping trials

• Model behavior patterned with empirical results

[Figure: model proportion of correct choices (0 to 1) for Familiar Name and Novel Name trials, with chance level marked. Two *** markers appear in the figure.]

Slow Time Scale:

• The model fails to “retain” the newly learned words after a “delay”

[Figure: model proportion of correct choices (0 to 1) for Familiar Name, Novel Name, and Retention trials, with chance level marked. Two *** markers appear in the figure.]

How The Model “Thinks”

• Analyses of weight matrices revealed that relatively little learning occurred during the test phase.
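A weight-matrix analysis of this kind can be sketched as a root-mean-square (RMS) change computed over selected rows of the matrix. This is an assumed reading of "portions of weight matrix": the function name `rms_change`, the row-per-word layout, and the toy matrices are illustrative, not taken from the poster.

```python
from math import sqrt


def rms_change(before, after, rows):
    """RMS difference between two weight matrices over the given rows."""
    squared_diffs = [
        (after[i][j] - before[i][j]) ** 2
        for i in rows
        for j in range(len(before[i]))
    ]
    return sqrt(sum(squared_diffs) / len(squared_diffs))


# Toy matrices: one row of weights per word (hypothetical layout).
before = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
after = [[0.1, 0.2], [0.3, 0.5], [0.5, 0.6]]
change_familiar = rms_change(before, after, rows=[0])
change_novel = rms_change(before, after, rows=[1])
change_control = rms_change(before, after, rows=[2])
```

Comparing the value for novel-word rows with the value for untouched control rows separates learning caused by the test trials from any background drift in the weights.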

[Figure: activation (0 to 1) over cycles (0 to 20), in two panels: novel words and familiar words. The end of each trace is marked "End".]

[Figure: Change (RMS) in portions of weight matrix. Bars are labeled Familiar Words, Novel Words, and Control Words, grouped as After Learning and After Test; y-axis: squared deviations (0 to 2). A second panel shows After Test values for Familiar Words, Novel Words, and Control Words on an axis from 0 to 0.000005.]

Temporal dynamics of processing

[Figure: connection strength (color scale 0.05 to 0.2) for input units (2 to 14) by decision units (10 to 90), in two panels: Prior to Experiment and After Experiment.]

• Two time scales captured in a single architecture:

– Fast, online: fast mapping

– Slow, long-term: word learning

• The model replicated the empirical findings:

– Excellent word learning and fast mapping

– Poor “retention”

• The model has sufficient knowledge to select the referent at a given moment in time, given auditory and visual input and stored knowledge (weights).

• But not enough to subsequently “know” the word.

Conclusions

• In-the-moment learning:

– Subtly biases behavior

– Combined with activation dynamics, yields correct response.

– Does not provide robust, context-independent word knowledge (in the short term)

• Continued training on fast-mapped words (i.e., 5000 epochs) makes them familiar words.

• Accumulation of this learning provides robust context-independent word knowledge over development.

Conclusions

Take-Home Messages

1) A fast-mapped word is not a known word…

…but a known word is known because it has been fast-mapped many, many times.

2) Understanding development requires models that integrate both short-term dynamic processes and long-term learning.

References

Carey, S. (1978). The child as word learner. In M. Halle, J. Bresnan & A. Miller (Eds.), Linguistic Theory and Psychological Reality (pp. 264-293). Cambridge, MA: MIT Press.

Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93(3), 283-321.

Horst, J. S. & Samuelson, L. K. (2005, April). Slow Down: Understanding the Time Course Behind Fast Mapping. Poster session presented at the 2005 Biennial Meeting of the Society for Research in Child Development, Atlanta, GA.

McClelland, J. & Elman, J. (1986). The TRACE Model of Speech Perception. Cognitive Psychology, 18(1), 1-86.

McMurray, B. & Spivey, M. (2000). The Categorical Perception of Consonants: The Interaction of Learning and Processing. The Proceedings of the Chicago Linguistics Society, 34(2), 205-220.

Rumelhart, D. & Zipser, D. (1986). Feature Discovery by Competitive Learning. In D. Rumelhart & J. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vol. 1). Cambridge, MA: MIT Press.

Acknowledgements

The authors would like to thank Joseph Toscano for programming assistance and support.

This work was supported by NICHD Grant R01-HD045713 to LKS.
