Bayesian Connections: An Approach to Modeling Aspects of the Reading Process
David A. Medler
Center for the Neural Basis of Cognition
Carnegie Mellon University
Bayesian Connections
• The Bayesian Approach to Psychology
– How do we represent the world?
– Bayesian Connectionist Framework.
• Bayesian Generative Networks
– Learning letters.
– How does context affect learning?
– Empirical and Simulation Results.
• Symmetric Diffusion Networks
– The Ambiguity Advantage/Disadvantage.
• Closing Remarks
Representing the World
• Problem: How do we form meaningful internal representations, P(H), given our observations of the external world, P(D)?
[Figure: the external world, P(D), and internal representations, P(H)]
Bayesian Theory
• For a given hypothesis, H, and observed data, D, the posterior probability of H given D is computed as:
$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$
where
– P(H) = prior probability of the hypothesis, H
– P(D) = probability of the data, D
– P(D|H) = probability of D given H
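As a minimal sketch of the rule above, the following Python snippet computes the posterior over a small set of competing hypotheses. The hypothesis names, priors, and likelihoods are made-up illustration values, not taken from the talk, and P(D) is assumed to be obtained by summing P(D|H)P(H) over the listed hypotheses.

```python
def posterior(priors, likelihoods):
    """Return P(H|D) for every hypothesis H using Bayes' rule.

    priors:      dict mapping hypothesis -> P(H)
    likelihoods: dict mapping hypothesis -> P(D|H)
    """
    # Numerator of Bayes' rule for each hypothesis: P(D|H) * P(H)
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    # P(D), assuming the listed hypotheses are exhaustive
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Hypothetical example: two candidate letters, equally likely a priori.
priors = {"letter_E": 0.5, "letter_F": 0.5}
# Hypothetical likelihoods of the observed features under each letter.
likelihoods = {"letter_E": 0.9, "letter_F": 0.3}

post = posterior(priors, likelihoods)
print(post)
```

With equal priors, the hypothesis under which the data are more likely ends up with the larger posterior.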
Bayesian Connectionism
[Figure: three-layer network with a Representation Layer, P(H); a Mediating Layer; and a Surface Layer, P(D)]
It was 20 years ago today...
An Interactive Activation Model of Context Effects in Letter Perception
James L. McClelland & David E. Rumelhart (1981; 1982)
• Word superiority effect
– words > pseudowords > nonwords
• The model accounted for the time course of perceptual identification.
Interactive Activation Model
[Figure: Interactive Activation Model with a Feature Level, a Letter Level, and a Word Level]
20 Years Later...
• Interactive Activation (IA) Model has been influential.
• Many positives, but 20 years of negatives.
• Internal representations are hard-coded:
The Interactive Activation Model does not learn!
Bayesian Generative Networks
• Initial work is an expansion of the Bayesian Generative Network framework of Lewicki & Sejnowski, 1997.
• It is an unsupervised learning paradigm for multilayered architectures.
• Simplified network equations, added sparse coding constraints, & included a “supervised” component.
Bayesian Generative Networks
[Figure: three-layer network with a Surface Layer, P(D); a Mediating Layer; and a Representation Layer, P(H)]
Sparse Coding Constraints
• Modified the basic framework to include “sparse coding” constraints.
• These act as a Bayesian prior that constrains the types of representations learned.
• Sparse coding encourages the network to represent any given input pattern with relatively few units.
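The talk does not say which prior was used, so the following is only a sketch under an assumed Laplacian-style prior, a common choice in the sparse coding literature. It shows how such a prior scores a code with few active units above a dense code of the same Euclidean length. The function name `sparse_log_prior` and the weight `lam` are hypothetical.

```python
def sparse_log_prior(activations, lam=1.0):
    """Unnormalised log prior: -lam * sum(|a_j|).

    ASSUMPTION: a Laplacian-style prior; the talk does not specify the
    form of its sparse coding constraint.
    """
    return -lam * sum(abs(a) for a in activations)


# Two hypothetical codes with the same Euclidean length (1.0):
sparse_code = [0.0, 0.0, 1.0, 0.0]   # one unit carries the pattern
dense_code = [0.5, 0.5, 0.5, 0.5]    # activity spread over all units

print(sparse_log_prior(sparse_code))  # higher (less negative) log prior
print(sparse_log_prior(dense_code))
```

During learning, a term like this would be added to the log likelihood, so representations that use relatively few units are preferred when they explain the input equally well.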
Step 1: Learning the Alphabet
• First stage of the IA model is the mapping between features and letters.
• We use the Rumelhart & Siple (1974) character features.
Network Learning
• 16 surface units (corresponding to 16 line segments)
• 30 representation units
• Trained for 50 epochs (evaluated at 1, 10, 25 & 50)
• Evaluated:
– Generative capability of the network
– Internal representations formed
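One way the generative capability of a network with 16 surface units could be scored is by counting the line segments that differ between a generated character and its target. The sketch below makes that assumption explicit: the scoring rule, the 0.5 threshold, and the example patterns are illustrative guesses, not the measure actually used in the talk.

```python
def segment_error(generated, target, threshold=0.5):
    """Count segments whose on/off state differs from the target.

    ASSUMPTION: generated activations are thresholded at 0.5 and
    compared with a binary target; the talk does not define its measure.
    """
    return sum((g >= threshold) != bool(t) for g, t in zip(generated, target))


def average_segment_error(generated_set, target_set):
    """Mean number of mismatched segments across a set of characters."""
    errors = [segment_error(g, t) for g, t in zip(generated_set, target_set)]
    return sum(errors) / len(errors)


# Hypothetical 16-segment target and two hypothetical generated patterns.
target = [1, 1, 0, 0] * 4
good = [0.9, 0.8, 0.1, 0.2] * 4   # every segment on the correct side of 0.5
poor = [0.9, 0.2, 0.1, 0.7] * 4   # two of every four segments are wrong

print(segment_error(good, target))
print(segment_error(poor, target))
print(average_segment_error([good, poor], [target, target]))
```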
Generating the Alphabet
[Figure: average “segment” error (y-axis, 0 to 6) by epoch (1, 10, 25, 50) for networks with No Sparse Coding and with Sparse Coding]
Interpreting Weight Structure
Network Weights
[Figure: weights of representation Units 1 to 30 at epochs 1, 10, 25 and 50, shown for No Sparse Coding and for Sparse Coding]
What We Have Learned
• In the unsupervised framework, the Bayesian Generative Network is able to learn the alphabet.
• Representations are not necessarily the same as the IA model.
– distributed (not localist)
– redundant (features are coded several times)
• Having learned the letters, can we now learn words?
Step 2: Learning Words
• The second stage of the IA model is the mapping from letters to words.
• The IA model is able to account for the “word superiority” effect using orthographic information only.
• We are interested in how the Bayesian framework accounts for the development of the word superiority effect.
• To that end, we look at participants’ learning of context.
Experimental Motivation
• Our motivation for the current experiments is the word-superiority effect.
• Specifically, we draw inspiration from the Reicher-Wheeler paradigm.
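[Figure: Reicher-Wheeler example trials. Each target string is paired with two alternatives for one letter position: nonword KQZW (--Z- vs. --S-), pseudoword GLUR (---P vs. ---R), word READ (-E-- vs. -O--)]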
The Task
• The current set of studies was designed to simulate how the word superiority effect may develop. Specifically, we were interested in:
– the learning of novel, letter-like stimuli
– whether stimuli were learned in parts or wholes
– the effects of context on learning.
• Consequently, we created an artificial environment in which we tightly controlled context.
Experimental Design: Training
• The Reicher-Wheeler task is based on the discrimination between two characters.
• We wanted a similar task in which context would interact with a character pair.

Set A    p1   p2   p3
o1       a    b    c
o2       d    e    f

Set B    p1   p2   p3
o1       g    h    i
o2       j    k    l
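The training design above, with two options (o1, o2) at each of three positions (p1, p2, p3) in each of two sets, can be enumerated in a few lines. This is a sketch that assumes every combination within a set is used, which would give 8 stimuli per set and 16 in total. The names `SETS` and `training_stimuli` are hypothetical.

```python
from itertools import product

# Options (o1, o2) available at positions p1, p2, p3 for each context set.
SETS = {
    "A": [("a", "d"), ("b", "e"), ("c", "f")],
    "B": [("g", "j"), ("h", "k"), ("i", "l")],
}


def training_stimuli(positions):
    """Return every three-character string formed by picking one option per position."""
    return ["".join(chars) for chars in product(*positions)]


stimuli = {name: training_stimuli(positions) for name, positions in SETS.items()}
total = sum(len(items) for items in stimuli.values())

print(stimuli["A"])
print(stimuli["B"])
print(total)
```

Because the two sets share no characters, a stimulus that mixes characters from both sets, or that contains an unseen character, can be identified by simple set membership.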
Experimental Design: Testing
• Total of 16 stimuli
– Detect change
Set A: a b c, d b c, a e c, d e c, a b f, d b f, a e f, d e f
Set B: g h i, j h i, g k i, j k i, g h l, j h l, g k l, j k l
• Testing: 288 Stimuli
– 96 Familiar Stimuli: a e c, g k l
– 96 Crossed Stimuli: j e c, g k f
– 96 Novel Stimuli: a e r, g n l
[Figure: context patterns AAA and BBB (Familiar), BAA and ABB (Crossed), CAA and CBB (Novel)]
Stimuli
• Characters were constructed from the Rumelhart & Siple (RS) features.
• Each character had six line segments with the following constraints:
– characters were continuous
– no two segments formed a straight line
– no character was a mirror image or rotation of another.
[Figure: example characters for Set A and Set B, arranged by position (p1, p2, p3) and option (o1, o2)]
Initial Simulations
[Figure: network architecture for a three-character input (Character 1, Character 2, Character 3); layer labels: 48, P(D); 18; 16, P(H)]
[Equation: differentiation value, diff, computed from P(T_i) and P(G_i) for i = 1, ..., n]
Performance was measured by computing a “differentiation value” based on the difference between the generated surface layer representation (G_i) and the target representation (T_i).
Initial Simulation Results
[Figure: differentiation value (y-axis, log scale from 1.00E-24 to 1.00E+00) by network architecture (2, 2wt, 3, 3wt, 3sp, 3sp/wt) for Familiar, Crossed, and Novel stimuli]
Simulation Conclusions
• Regardless of the network architecture, all simulations showed a (slight) difference between the familiar and crossed stimuli.
• No simulation performed well on the novel stimuli in comparison to the other stimuli.
• These results are somewhat counter to what we expected.
• Is the model broken?
• How do participants perform on this task?
Stimulus Presentation
[Figure: trial timeline with labelled display durations of 500 ms, 250 ms, 200 ms, 250 ms, 200 ms and 50 ms]
Data Analysis
• Each participant’s reaction time and proportion of “hits” and “correct rejections” were recorded.
• To correct for potential response biases, the scores were converted to d’ scores using:
d’ = ni(Hit) + ni(CR)
where ni(p) is the normal inverse, i.e. the inverse of the standard normal cumulative distribution function, applied to the proportion p.

Detect Change?     “Yes”    “No”
Stimuli differ     Hit      Miss
Stimuli same       FA       CR
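A minimal sketch of the d’ conversion, reading ni(·) as the inverse of the standard normal cumulative distribution function. The clipping of proportions away from 0 and 1 is an added assumption (the inverse is undefined at those values), and the function name `d_prime` and the example rates are hypothetical.

```python
from statistics import NormalDist


def _clip(p, eps):
    """Keep a proportion strictly inside (0, 1) so the inverse CDF is defined."""
    return min(max(p, eps), 1.0 - eps)


def d_prime(hit_rate, cr_rate, eps=1e-3):
    """d' = ni(Hit) + ni(CR), with ni the inverse standard normal CDF.

    ASSUMPTION: rates of exactly 0 or 1 are clipped by eps; the talk
    does not say how such cases were handled.
    """
    ni = NormalDist().inv_cdf
    return ni(_clip(hit_rate, eps)) + ni(_clip(cr_rate, eps))


# Hypothetical participants:
print(d_prime(0.8, 0.8))   # sensitive responder: d' well above zero
print(d_prime(0.9, 0.1))   # says "change" to almost everything: d' near zero
```

The second case shows why the correction matters: a high hit rate obtained by always answering “change” is cancelled by the correspondingly low correct-rejection rate.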
Experiment 1: One Novel
• 4 Participants, 10 days each
• 1440 trials per day:
– 288 test trials intermixed with 1152 training trials.
• Three conditions:
– Familiar (AAA or BBB)
– Crossed (BAA or ABB)
– Novel (CAA or CBB)
d’ Scores
[Figure: d’ (y-axis, -1 to 2) across Days 1 to 10 for the Familiar, Crossed, and Novel conditions]
Do They Report a Change?
[Figure: proportion of “Change” responses (y-axis, 0 to 1) across Days 1 to 10 for Fam-Hit, Fam-FA, Cro-Hit, Cro-FA, Nov-Hit, and Nov-FA]
Reaction Times
[Figure: reaction time in ms (y-axis, 0 to 500) across Days 1 to 10 for Familiar-C, Familiar-S, Cross-C, Cross-S, Novel-C, and Novel-S]
Experiment Conclusions
• Although there is a context effect, it is not as large as we expected, nor as stable.
• There are no significant differences in reaction times for any of the conditions.
• Participants do not perform well in the Novel condition
– this is due to a tendency to respond “Change” to all novel stimuli
Re-Simulation of Task
• The network was trained on the same data set that the participants were trained on.
• The network learned on all training and testing trials.
• We wanted a comparable measure of network performance.
• Used a variant of the Kullback-Leibler divergence measure:
$$KL = \sum_{i=1}^{n} \left[ g(y_i) \log \frac{g(y_i)}{f(y_i)} + \bigl(1 - g(y_i)\bigr) \log \frac{1 - g(y_i)}{1 - f(y_i)} \right]$$
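The divergence measure can be sketched in Python, reading it as a sum over units of g·log(g/f) + (1 − g)·log((1 − g)/(1 − f)). The slide does not state which of f and g is the network output and which is the target, so the argument names simply follow its symbols. The clipping constant `eps`, which avoids log(0), is an added assumption, and the example values are hypothetical.

```python
import math


def kl_measure(g, f, eps=1e-12):
    """Sum over units of g*log(g/f) + (1-g)*log((1-g)/(1-f)).

    g, f: sequences of unit activations in [0, 1], named after the
    symbols on the slide. ASSUMPTION: values are clipped by eps so the
    logarithms stay finite.
    """
    total = 0.0
    for g_i, f_i in zip(g, f):
        g_i = min(max(g_i, eps), 1.0 - eps)
        f_i = min(max(f_i, eps), 1.0 - eps)
        total += g_i * math.log(g_i / f_i)
        total += (1.0 - g_i) * math.log((1.0 - g_i) / (1.0 - f_i))
    return total


# Hypothetical activations: identical patterns give zero, mismatches grow the measure.
same = kl_measure([0.2, 0.7], [0.2, 0.7])
different = kl_measure([0.9], [0.1])
print(same, different)
```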
Simulation: Difference Measure
[Figure: K-L difference measure (y-axis, 0 to 60) across network “Days” 1 to 10 for Familiar, Crossed, and Novel stimuli]
Simulation: Report Change?
[Figure: K-L measure (y-axis, 0 to 80) across network “Days” 1 to 10 for Fam-Hit, Fam-FA, Cro-Hit, Cro-FA, Nov-Hit, and Nov-FA]
Internal Representations
• If we look at the internal representations formed by the network, we get an idea of why it behaves as it does...
[Figure: weights of Units 1 to 18 at training “Days” 1, 6 and 10]
Simulation Conclusions
• The Bayesian Generative Network qualitatively matched the performance of the participants.
• Furthermore, analysis of the internal structure of the network offers an explanation for the participants’ behaviour.
– The network failed to learn to represent novel items.
– Thus, if the first generated representation is garbage, and the second generated representation is garbage, then the comparison will be garbage, and the result is a “change” response.
Assessing Representations
• The models predicted that participants in the one novel condition would fail to learn to represent the novel items.
• Unfortunately, we can’t open up a person to see what their internal representation is.
• We can, however, ask them.
– Specifically, we can test their recognition of “novel” items following training and compare these to truly new items.
Experiment 2
• 10 Participants
• Participants were trained on the same data as in Experiment 1 but were run for only 2 days.
• At the conclusion of the training, participants were given a “new/old” task in which they saw the 12 old training items, the 6 old novel items, and 12 new items.
• Participants saw a single character, and made the judgement “old” or “new”.
Experiment 2: Results
• Participants were about 70% correct at detecting “Old” items.
[Figure: proportion of “Old” responses (y-axis, 0 to 1) by stimulus presented (Old, Novel, New)]
• Participants were no better at recognizing old “Novel” items than truly “New” items.
Learning Context
• The Bayesian Generative Network is able to learn higher-order information such as which characters appear in which positions.
• It is able to both simulate and explain the performance of participants trained on a contextual learning task.
• It is able to predict new findings!
• Can we expand the model?
Symmetric Diffusion Network
• Symmetric Diffusion Networks (SDN) are a class of networks that explicitly embody many of the implicit assumptions made by the Bayesian Generative Network.
• SDNs can be viewed as a more general form of the Bayesian Generative Network.
Symmetric Diffusion Network
[Figure: three-layer network with a Representation Layer, P(H); a Mediating Layer; and a Surface Layer, P(D)]
[Figure: the same network, labelled Supervised Learning]
[Figure: the same network, labelled Unsupervised Learning]
SDN Representation
• One advantage of the SDN is that it is able to learn continuous probability distributions.
• That is, it can learn multiple representations for the same input data.
• For example, the SDN is able to learn multiple meanings for the word “charge”.
The Ambiguity Paradox
• Symmetric Diffusion Networks allow us to address the Ambiguity Paradox.
• Ambiguous words are responded to faster than unambiguous words in a lexical decision task.
• Ambiguous words are responded to more slowly than unambiguous words in a semantic relatedness task.
The Ambiguity Advantage
[Figure: lexical decision (“Is it a word?”) reaction times in ms (y-axis, 540 to 700) for Unambiguous, Ambiguous, and Non-Word items in BM '96 and PJ '00]
Examples: Unambiguous: chance; Ambiguous: charge; Non-Word: chathe
The Ambiguity Disadvantage
[Figure: relatedness judgement (“Is it related?”) reaction times in ms (y-axis, 780 to 840) for Unambiguous and Ambiguous items]
Examples: the items charge, chance, and chathe from the “Is it a word?” task, shown with the probes fee, luck, and thake
One Possible Explanation
• “Efficient then Inefficient” Hypothesis
– Efficient: Previous models have suggested that the ambiguity advantage results from a “blend” state (e.g., Joordens & Besner, 1994).
– Inefficient: The ambiguity disadvantage occurs in relatedness judgements because it takes longer to settle into a correct meaning.
An Alternative Explanation
• The Symmetric Diffusion Network offers an alternative explanation.
Ambiguity Advantage
[Figure: “Semantic Space” containing three meanings, “a measure of how likely it is that some event will occur”, “a financial liability”, and “a pleading describing some wrong or offense”, with the words “chance”, “complaint”, “tax”, and “charge” placed in the space]
Ambiguity Disadvantage
[Figure: “Semantic Space” containing the same three meanings, “a measure of how likely it is that some event will occur”, “a financial liability”, and “a pleading describing some wrong or offense”, with the words “charge” and “complaint” placed in the space]
Preliminary Conclusions
• Symmetric Diffusion Networks are able to learn ambiguous meanings (in contrast to other models).
• They provide a plausible account of the ambiguity paradox.
• They suggest new empirical studies.
• Larger network simulations are underway.
What have we learned?
• Introduced a class of connectionist networks that embody Bayesian principles.
• Using the IA model as inspiration, we:
– Compared the letter representations learned versus the hard-coded representations.
– Simulated, explained, & predicted empirical data on context learning.
– Addressed the ambiguity paradox.
The Next 20+ Years
• Continue research on learning and how it interacts with the IA model and aspects of the reading process.
• Explore the Bayesian framework and how it relates to connectionism to a fuller extent.
• Make links to neurophysiology
– can we find evidence of this type of learning and representation at the neural systems level?
The “Take Home” Message
• We are able to effectively model aspects of the reading process with connectionist networks embodying Bayesian principles!
• These networks are able to qualitatively simulate observed data.
• These networks are able to predict new findings.
• Using very simple principles, these networks offer plausible explanations for a range of behaviours.
Acknowledgements
Jay McClelland
Michael Lewicki
Tai Sing Lee
Michael Harm
David Noelle
Chris Kello
Darren Piercey