Connectionist Time and Dynamic Systems Time in One Architecture?
Modeling Word Learning at Two Timescales
Jessica S. Horst ([email protected])
Bob McMurray
Larissa K. Samuelson
Dept. of Psychology
University of Iowa
Two Time Scales in Neural Networks
Connectionist and dynamical systems accounts:
• stress change over time
• complement each other in timescale
Dynamic Systems: online processes
Connectionist Networks: long-term learning
Many domains of development require both timescales:
Example: language development requires
• sensitivity to the brief and sequential nature of the input
• slower developmental processes.
Two Time Scales in Language Acquisition
Word learning is often attributed to fast mapping
- quick link between a novel name and a novel object (e.g., Carey, 1978).
But recent empirical data suggest that fast mapping and word learning may represent two distinct time scales (Horst & Samuelson, 2005, April).
- Fast Mapping: quick process emerging in the moment.
- Word Learning: gradual process over the course of development
We capture both timescales in a recurrent network….
• Activation feeds from input layers to decision layers.
• Decision units compete via inhibition.
• Activation feeds back to input layers.
• Cycle continues until system settles.
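The four steps above can be sketched as a settling loop. This is a minimal illustration, not the authors' implementation: the function names, the squaring-and-normalizing form of competition, the multiplicative feedback, and the stopping tolerance are all assumptions made for the example.

```python
def normalize(values):
    """Scale non-negative activations so they sum to 1."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def feed_back(inputs, weights, decision):
    """Boost each input unit by the support it gets from active decision units."""
    boosted = [
        a * (1.0 + sum(w * d for w, d in zip(weights[i], decision)))
        for i, a in enumerate(inputs)
    ]
    return normalize(boosted)


def settle(auditory, visual, w_aud, w_vis, max_cycles=100, tol=1e-6):
    """Cycle feedforward, competition, and feedback until activations settle.

    w_aud[i][j] links auditory unit i to decision unit j; w_vis likewise.
    ASSUMPTION: competition is approximated by squaring and renormalizing.
    Returns (auditory, visual, decision, cycles_used).
    """
    n_dec = len(w_aud[0])
    decision = [1.0 / n_dec] * n_dec
    auditory, visual = normalize(auditory), normalize(visual)
    for cycle in range(1, max_cycles + 1):
        # 1) Activation feeds from both input layers to the decision layer.
        net = [
            sum(a * w_aud[i][j] for i, a in enumerate(auditory))
            + sum(v * w_vis[k][j] for k, v in enumerate(visual))
            for j in range(n_dec)
        ]
        # 2) Decision units compete: strong units suppress weak ones.
        new_decision = normalize([(d + n) ** 2 for d, n in zip(decision, net)])
        # 3) Activation feeds back to the input layers.
        auditory = feed_back(auditory, w_aud, new_decision)
        visual = feed_back(visual, w_vis, new_decision)
        # 4) Stop once the decision layer has stopped changing.
        change = max(abs(x - y) for x, y in zip(new_decision, decision))
        decision = new_decision
        if change < tol:
            break
    return auditory, visual, decision, cycle
```

With one name active and two objects in view, the decision unit linked to the name should come to dominate, and feedback should favor the matching object over the visual competitor.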
[Figure: Initial State (Before Learning). Layers shown: Auditory Inputs, Visual Inputs, Decision Units (Hidden) Layer.]
The Architecture
(McMurray & Spivey, 2000)
• Unsupervised Hebbian learning occurs on every cycle.
• Online decision dynamics reflect auditory and visual competitors.
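A per-cycle Hebbian update might look like the sketch below. The bounded form of the rule (the `1 - w` term that keeps weights below 1), the learning rate, and the function name are assumptions for illustration; the slides only state that unsupervised Hebbian learning occurs on every cycle.

```python
def hebbian_cycle(weights, pre, post, rate=0.01):
    """Apply one Hebbian step in place and return the weights.

    weights[i][j] links input unit i to decision unit j.
    ASSUMPTION: dw = rate * pre[i] * post[j] * (1 - w), so weights stay below 1.
    """
    for i, p in enumerate(pre):
        for j, q in enumerate(post):
            weights[i][j] += rate * p * q * (1.0 - weights[i][j])
    return weights


# One trial of 20 processing cycles: input unit 0 is active while
# decision unit 0 is winning the competition.
weights = [[0.1, 0.1], [0.1, 0.1]]
for _ in range(20):
    hebbian_cycle(weights, pre=[1.0, 0.0], post=[0.9, 0.1])
```

Because the update runs on every cycle, the link between co-active units grows a little within a single trial, while links from inactive input units do not change.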
[Figure: activation (y-axis, 0 to 1) over processing cycles (x-axis, 0 to 18).]
The Model
Intermediate State (During Learning)
End State (Post Learning)
• 15 Auditory & 15 Visual units
• 90 Decision units
• Names presented singly with a variable number of objects
• Name-Decision & Object-Decision associations strengthened via learning
• After 4000 training trials, the network forms localist representations
• Learns name-object links and learns to ignore visual competitors
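A training trial of this kind (one name, plus its object among a variable number of other objects) could be generated as follows. This is a sketch under stated assumptions: one unit per name and per object, name i referring to object i, and the cap on competitors are all hypothetical choices, not details given in the slides.

```python
import random

N_NAMES = 15    # auditory units, one per name (assumed localist coding)
N_OBJECTS = 15  # visual units, one per object (assumed localist coding)


def make_trial(rng, max_competitors=3):
    """Build one training trial: a single name and a variable set of objects.

    ASSUMPTION: name i refers to object i, and up to `max_competitors`
    other objects are shown alongside the target.
    Returns (target, auditory, visual).
    """
    target = rng.randrange(N_NAMES)
    others = [i for i in range(N_OBJECTS) if i != target]
    competitors = set(rng.sample(others, rng.randint(0, max_competitors)))
    auditory = [1.0 if i == target else 0.0 for i in range(N_NAMES)]
    visual = [
        1.0 if i == target or i in competitors else 0.0
        for i in range(N_OBJECTS)
    ]
    return target, auditory, visual


rng = random.Random(0)
trials = [make_trial(rng) for _ in range(4000)]
```

Across trials the target object always co-occurs with its name while competitors vary, which is the regularity that lets associative learning separate targets from competitors.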
[Figure: weight matrices from auditory input units (y-axis, 1 to 10) to decision units (x-axis, 10 to 90), one panel per learning state; color scale: connection strength (0.05 to 0.2).]
Fast: Moment by Moment
• Online information integration and constraint satisfaction (e.g., McClelland & Elman, 1986; Dell, 1986)
• Reaches a stable pattern of activation based on auditory and visual inputs and stored knowledge (weights)
• Model makes correct name-object links based on the latest input
Slow: Over the Long-Term
• Unsupervised Hebbian Learning
• Associates words with visual targets
• Learns to ignore visual competitors
Two Time Scales
The two time scales are not independent
Long-term learning depends critically on the dynamics of the fast time scales
• Competition between decision units ensures pseudo-localist representations—critical for Hebbian learning (e.g. Rumelhart & Zipser, 1986)
• Learning occurs on each cycle
- Influences processing cycle-by-cycle & trial-by-trial
• Accumulated learning across trials leads to learning on long-term time scale (i.e., word learning)
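Why competition matters for Hebbian learning can be seen in a toy comparison. The sketch below uses a plain Hebbian rule and hand-picked decision activations; the network size, learning rate, and function names are illustrative assumptions rather than the model's actual parameters.

```python
def hebbian(weights, pre, post, rate=0.1):
    """Return new weights after one plain Hebbian step: dw = rate * pre * post."""
    return [
        [w + rate * p * q for w, q in zip(row, post)]
        for row, p in zip(weights, pre)
    ]


def train(decision_for_name, sweeps=20):
    """Present each name in turn, pairing it with the given decision pattern."""
    n_names = len(decision_for_name)
    n_decision = len(decision_for_name[0])
    weights = [[0.0] * n_decision for _ in range(n_names)]
    for _ in range(sweeps):
        for name in range(n_names):
            pre = [1.0 if i == name else 0.0 for i in range(n_names)]
            weights = hebbian(weights, pre, decision_for_name[name])
    return weights


# Without competition, every name activates all decision units equally.
diffuse = train([[0.5, 0.5], [0.5, 0.5]])
# With competition, each name ends up with its own winning decision unit.
localist = train([[1.0, 0.0], [0.0, 1.0]])
```

In the diffuse case both names acquire identical weight rows, so nothing distinguishes them; in the localist case each name is tied to a different decision unit.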
Dependent Time Scales
[Figure: proportion of correct choices (y-axis, 0 to 1) for Familiar Name and Novel Name trials, with chance level marked.]
• 24-month-old children
• Saw 2 familiar & 1 novel objects
• Asked to get familiar and novel objects (e.g., “get the cow!” or “get the yok!”)
Fast Time Scale
[Example array: cow (familiar), block (familiar), yok (novel)]
• Children were excellent at fast mapping (finding the referent of novel and familiar words in the moment).
[Figure annotation: both bars marked ***.]
Slow Time Scale
After a 5-minute delay, children were asked to pick the referent of a newly fast-mapped name (e.g., “get the yok!”).
[Example array: yok (target), fode (named foil), unnamed foil (previously seen)]
• Children were unable to retain mappings after a 5-minute delay
[Figure: proportion of correct choices (y-axis, 0 to 1) for Familiar Name, Novel Name, and Retention trials, with chance level marked; two bars marked ***.]
• Initial findings replicated with simpler tasks:
  • effect of number of names or trials?
• Children’s difficulty in retaining newly fast-mapped names is not related to the number of names or trials
Replication
Replication #1 (N = 12): 1 Novel Name, 8 Familiar Names, 7 Preference Trials
  Fast Mapping: 9/12 **    Retention: 4/9 n.s.
Replication #2 (N = 12): 1 Novel Name, 2 Familiar Names
  Fast Mapping: 7/12 *     Retention: 4/7 n.s.
* Binomial, p < .05, ** Binomial, p < .01
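A binomial test of this kind asks how likely it is that at least k of n children would choose correctly if all were guessing. The sketch below computes that one-tailed probability. The chance level of 1/3 (three objects per array) and the one-tailed form are assumptions for illustration; the slides do not state the chance level or the test direction used in the replications.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of k or more successes in n independent trials at rate p."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


# ASSUMPTION: chance = 1/3, i.e. one correct object among three.
p_value = binomial_tail(9, 12, 1 / 3)
```

Under that assumed chance level, 9 of 12 correct gives a probability of about 0.0039.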
• 20 networks initialized with random weights
• 15 word lexicon (names & objects):
  • 5 familiar words
  • 5 novel words
  • 5 held out
• Trained on 5 familiar items for 5000 epochs
• Items presented in random order
• Run in the Fast Mapping Experiment:
  • 10 fast mapping trials (5 familiar, 5 novel)
  • 5 retention trials
• Learning was not turned off during experiment.
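The simulation design above can be laid out as data before any network is run. This is a hypothetical sketch: the lexicon split and trial counts follow the slides, but the composition of each object array (two familiar objects plus one novel object on fast mapping trials; target, named foil, and held-out foil on retention trials) is an assumption modeled on the child task.

```python
import random


def build_experiment(seed=0):
    """Split a 15-word lexicon and lay out fast mapping and retention trials."""
    rng = random.Random(seed)
    lexicon = list(range(15))
    rng.shuffle(lexicon)
    familiar, novel, held_out = lexicon[:5], lexicon[5:10], lexicon[10:]

    fast_mapping = []
    for i in range(5):
        # ASSUMPTION: each array holds 2 familiar objects and 1 novel object.
        array = [familiar[i], familiar[(i + 1) % 5], novel[i]]
        fast_mapping.append({"kind": "familiar", "target": familiar[i], "array": array})
        fast_mapping.append({"kind": "novel", "target": novel[i], "array": array})
    rng.shuffle(fast_mapping)

    # ASSUMPTION: retention arrays pair the target with another fast-mapped
    # item (named foil) and a held-out item (unnamed foil).
    retention = [
        {"target": novel[i], "array": [novel[i], novel[(i + 1) % 5], held_out[i]]}
        for i in range(5)
    ]
    return {
        "familiar": familiar,
        "novel": novel,
        "held_out": held_out,
        "fast_mapping": fast_mapping,
        "retention": retention,
    }


# One design per simulated network, 20 networks in all.
designs = [build_experiment(seed) for seed in range(20)]
```

Because learning stays on during the experiment, each trial in these lists would also update the weights as it is processed.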
How The Model Behaves
Fast Time Scale:
• Model succeeded on both types of fast-mapping trials
• Model behavior patterned with empirical results
[Figure: proportion of correct choices (y-axis, 0 to 1) for Familiar Name and Novel Name trials, with chance level marked; both bars marked ***.]
Slow Time Scale:
• The model fails to “retain” the newly learned words after a “delay”
[Figure: proportion of correct choices (y-axis, 0 to 1) for Familiar Name, Novel Name, and Retention trials, with chance level marked; two bars marked ***.]
How The Model “Thinks”
• Analyses of weight matrices revealed that relatively little learning occurred during the test phase.
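One way to quantify how much a portion of a weight matrix changed is the root-mean-square difference between two snapshots. The sketch below is an assumed formulation; the slides report change (RMS) in portions of the weight matrix but do not give the exact computation, and the function and argument names are hypothetical.

```python
import math


def rms_change(before, after, rows, cols):
    """Root-mean-square difference between two weight matrices.

    Only the entries at the given row and column indices are compared,
    e.g. the rows for novel words against all decision units.
    """
    squared = [
        (after[r][c] - before[r][c]) ** 2
        for r in rows
        for c in cols
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy snapshots: row 0 changed during the test phase, row 1 did not.
before = [[0.0, 0.0], [0.5, 0.5]]
after = [[3.0, 4.0], [0.5, 0.5]]
changed = rms_change(before, after, rows=[0], cols=[0, 1])
unchanged = rms_change(before, after, rows=[1], cols=[0, 1])
```

Comparing the value for rows trained before the experiment with the value for rows touched only during the test gives a rough measure of how little the test phase altered the weights.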
[Figure: activation (y-axis, 0 to 1) over processing cycles (x-axis, 0 to 20), one panel for novel words and one for familiar words.]
[Figure: Change (RMS) in portions of weight matrix. Squared deviations (y-axis, 0 to 2) for familiar, novel, and control words, labeled After Learning and After Test; a second panel shows the After Test values for familiar, novel, and control words on a finer scale (0 to 0.000005).]
Temporal dynamics of processing
[Figure: weight matrices prior to the experiment and after the experiment; x-axis: decision units (10 to 90), y-axis: rows 2 to 14; color scale: connection strength (0.05 to 0.2).]
• Two time scales captured in a single architecture:
– Fast, online: fast mapping
– Slow, long-term: word learning
• The model replicated the empirical findings:
– Excellent word learning and fast mapping
– Poor “retention”
• Has sufficient knowledge to select the referent at a given moment in time, given auditory and visual input and stored knowledge (weights).
• But not enough to subsequently “know” the word.
Conclusions
• In-the-moment learning:
– Subtly biases behavior
– Combined with activation dynamics, yields correct response.
– Does not provide robust, context-independent word knowledge (in the short term)
• Continued training on fast-mapped words (i.e., 5000 epochs) makes them familiar words.
• Accumulation of this learning provides robust context-independent word knowledge over development.
Conclusions
Take-Home Messages
1) A fast-mapped word is not a known word…
…but a known word is known, because it has been fast-mapped many, many times.
2) Understanding development requires models that integrate both short-term dynamic processes and long-term learning.
Carey, S. (1978). The child as word learner. In M. Halle, J. Bresnan, & A. Miller (Eds.), Linguistic Theory and Psychological Reality (pp. 264-293). Cambridge, MA: MIT Press.
Dell, G. S. (1986). A spreading-activation theory of retrieval in sentence production. Psychological Review, 93(3), 283-321.
Horst, J. S., & Samuelson, L. K. (2005, April). Slow Down: Understanding the Time Course Behind Fast Mapping. Poster session presented at the 2005 Biennial Meeting of the Society for Research in Child Development, Atlanta, GA.
McClelland, J., & Elman, J. (1986). The TRACE Model of Speech Perception. Cognitive Psychology, 18(1), 1-86.
McMurray, B., & Spivey, M. (2000). The Categorical Perception of Consonants: The Interaction of Learning and Processing. The Proceedings of the Chicago Linguistics Society, 34(2), 205-220.
Rumelhart, D., & Zipser, D. (1986). Feature Discovery By Competitive Learning. In D. Rumelhart & J. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Vol. 1). Cambridge, MA: MIT Press.
References
Acknowledgements
The authors would like to thank Joseph Toscano for programming assistance and support.
This work was supported by NICHD Grant R01-HD045713 to LKS.