
Advanced Review

Connectionist perspectives on language learning, representation and processing

Marc F. Joanisse¹* and James L. McClelland²

¹Psychology/Brain and Mind Institute, The University of Western Ontario, London, ON, Canada
²Department of Psychology, Stanford University, Stanford, CA, USA
*Correspondence to: [email protected]

The field of formal linguistics was founded on the premise that language is mentally represented as a deterministic symbolic grammar. While this approach has captured many important characteristics of the world's languages, it has also led to a tendency to focus theoretical questions on the correct formalization of grammatical rules while de-emphasizing the role of learning and statistics in language development and processing. In this review we present a different approach to language research that has emerged from the parallel distributed processing or 'connectionist' enterprise. In the connectionist framework, mental operations are studied by simulating learning and processing within networks of artificial neurons. With that in mind, we discuss recent progress in connectionist models of auditory word recognition, reading, morphology, and syntactic processing. We argue that connectionist models can capture many important characteristics of how language is learned, represented, and processed, as well as providing new insights about the source of these behavioral patterns. Just as importantly, the networks naturally capture irregular (non-rule-like) patterns that are common within languages, something that has been difficult to reconcile with rule-based accounts of language without positing separate mechanisms for rules and exceptions. © 2015 John Wiley & Sons, Ltd.

How to cite this article: WIREs Cogn Sci 2015. doi: 10.1002/wcs.1340

INTRODUCTION

Formal approaches to linguistics following Chomsky's Generative Grammar framework envision language representation as an assembly of symbols (e.g., words and phrases) and a grammar of rules that operates on them.1–3 Questions in formal linguistics have thus tended to focus on the correct formalization of the rules, what their scope is, and how they are learned. In this review, we discuss a completely different approach to thinking about language representations, from the parallel distributed processing (PDP) or 'connectionist' point of view. This perspective eschews the concepts of symbols and rules in favor of a model of the mind that closely reflects the functioning of the brain. As we will discuss, this approach allows us to account for a wide range of linguistic data using a much more restricted set of assumptions. We begin with a brief overview of the connectionist enterprise and its basic assumptions about how mental processes can be studied using networks of artificial neurons.

Conflict of interest: The authors have declared no conflicts of interest for this article.

OVERVIEW OF CONNECTIONISM

The connectionist approach to language builds on some key guiding assumptions about the nature of mental representations4,5:

1. Knowledge is represented as patterns of numerical activity across large sets of simple processing units: Mental states reflect the activation of neurons in the brain. These patterns are distributed, such that knowledge of individual concepts or categories occurs through the activation of many individual processing units. Likewise, no single neuron uniquely encodes a concept or category; rather, individual neurons can be re-used to encode many different concepts.

2. Processing occurs via transformations of patterns of activity across large sets of connections: Neurons in the brain are massively interconnected. This allows information to be retrieved and processed by transforming activity among large assemblies of artificial neurons.

3. Learning occurs as the confluence of innate but domain-general architectural and learning mechanisms, plus experience: Networks learn via changes in the strength of connections among interconnected units, in response to external inputs (the environment). This process is governed by general laws of learning that are not specific to any single type of process. Just as importantly, these neurons are not organized haphazardly. Rather, they have distinct biologically specified architectural characteristics that also influence how learning proceeds (see Note a). (A code sketch illustrating assumptions 1 and 2 follows this list.)
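To make assumptions 1 and 2 concrete, the sketch below shows the computation standardly attributed to a single artificial neuron (a squashed weighted sum of its inputs), and how a 'concept' is simply the joint activity of several such units. This is a minimal illustration of our own, not code from any published model; all names and parameter values are invented.

```python
import numpy as np

def unit_activation(inputs, weights, bias=0.0):
    """Activation of one simple processing unit: a squashed weighted sum.

    'inputs' are activations of sending units; 'weights' are the strengths
    of the connections from those units. The logistic squashing function
    keeps activation between 0 and 1, loosely analogous to a bounded
    neural firing rate.
    """
    net_input = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-net_input))

# A distributed pattern: a 'concept' is the joint activity of many units,
# each computed from the same input pattern through its own weights.
rng = np.random.default_rng(0)
input_pattern = rng.random(8)             # activity of 8 sending units
weight_matrix = rng.normal(size=(5, 8))   # 5 receiving units x 8 connections each
pattern = np.array([unit_activation(input_pattern, w) for w in weight_matrix])
print(pattern)  # one distributed activity pattern across 5 units
```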

A central component of the connectionist enterprise is to develop computational simulations of key phenomena. This allows us to make explicit assumptions about the nature of the processes and representations of interest. Implementing these into a model then provides an explicit test of these assumptions, as well as a way to test hypotheses about them. In addition, the results of models provide new hypotheses that can be tested empirically in humans.

Connectionist models encompass a number of simplifying assumptions that abstract away from actual brains in some important ways; specifically, they tend to contain many fewer processing units than one finds in the brain. In addition, these models are made up of artificial neurons that represent rates of neural firing as static activation levels, which change in response to inputs from the environment and from other units. Finally, the learning mechanisms tend to be computationally simpler than those that we know govern actual learning in neurons. The purpose of these simplifying assumptions is to create models that capture the assumptions laid out above, while keeping the model sufficiently simple to be implemented within a computer program.

Connectionism Applied to Language

The connectionist enterprise was conceived as a way to address a wide range of cognitive phenomena.

Just as importantly, it is seen as a unifying theory, because it assumes all types of mental knowledge can be understood within it. Thus, it does not assume a strong distinction between language and other types of knowledge. In this sense, connectionism is in conflict with some of the guiding assumptions of the generative linguistics framework, which has historically built on the idea that language is learned and represented using mechanisms that are distinct from those governing other types of knowledge.

Formal linguistics itself grew out of a concern about the learnability of language and the need to establish innate language-specific mechanisms of learning.7 Consequently, the re-emphasis that connectionism places on learning and statistics might appear like a regression of sorts. That said, we argue here that connectionist mechanisms are able to learn and encode complex knowledge in ways that are not trivial. As we will show, connectionist models can capture the rule-like patterns that are observed in language. Likewise, the patterns of learning observed in these models can closely resemble the way that children learn language. And importantly, the models also capture the irregular (non-rule-like) patterns that pervade languages, without requiring a separate mechanism to do so.

AUDITORY WORD RECOGNITION

An important contribution of connectionist theories of language processing has been the idea of dynamicism and interactivity in language processing. That is, these models lend themselves well to processes that involve recognizing inputs through the interaction of bottom-up sensory information and top-down contextual/experiential information. Here we consider one such phenomenon, that of spoken word recognition. This task can be seen as a 'hard problem' in language processing; the listener must rapidly segment individual words from a connected spoken utterance, and then identify them from among many different competing forms. This is especially difficult given variability in the acoustic cues of individual phonemes (e.g., effects of coarticulation, in which a phoneme is realized differently depending on what other phonemes precede and follow it).

Earlier models of auditory word recognition abstracted away from these issues by characterizing the task as one of lexical access. On this view, a sensory input is first broken down into its constituent phonemes, and these are then used to search for a discrete entry within a mental lexicon.8 This is seen as a serial process in which the input is compared against all known lexical forms until a matching form is identified (e.g., Ref 9). Importantly, this type of serial search is proposed to be independent of the perceptual mechanisms used to map acoustic inputs onto phonetic and phonemic features.

McClelland and Elman10 set out an alternative view, in the form of the connectionist TRACE model (see Note b). The model consists of three layers of neurons, used to represent auditory, phonemic, and word-specific information (Figure 1). It simulates word recognition by taking as input a time-varying acoustic-phonetic representation of a word, which in turn activates the word's corresponding phonemes and ultimately a single word-level unit that uniquely identifies it. So, for example, the word BAT is recognized by presenting as an input a sequence of acoustic-phonetic patterns that correspond to this word (i.e., a numerical activity pattern corresponding to the presence or absence of different phonetic features in each of its phonemes); activation then propagates to the phonemic layer to activate individual units that encode the phonemes /b/, /æ/, and /t/, and finally a single word-level unit that uniquely represents the concept 'bat'.

FIGURE 1 | The TRACE model of auditory word recognition. Acoustic input activates a layer of feature units (power, voiced, acute, diffuse, grave), which feed a layer of phoneme units (e.g., /p/, /t/, /a/, /b/, /r/), which in turn feed word units (e.g., bat, cat, rope, doctor).

A key characteristic of the model is its dynamical nature. The network receives an auditory input that changes over time, rather than all the information at once. Time is divided into discrete processing 'cycles' in which activation is propagated from one layer of neurons to another. Thus, words are recognized incrementally, by slowly ramping up the activation of the correct units at the phoneme and word levels. Critically, this is different from a serial search model, in which the system must search through individual lexical entries one at a time until the correct one is found. Here, all forms compete for selection in parallel. The model is also interactive; it contains connections that project both bottom-up and top-down, so that activation at the word level can influence activation at the phoneme level. As we show below, this has important consequences for how the model processes information, especially in the case of ambiguous or missing inputs.
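The flavor of these interactive dynamics can be conveyed with a drastically reduced interactive-activation sketch: two word units and four phoneme units exchange bottom-up and top-down excitation while competing within each layer. This is our own toy illustration; the weights, competition strengths, and update rule are arbitrary, and the real TRACE model's machinery for aligning units with points in time is omitted.

```python
import numpy as np

# Toy inventory: four phoneme units and two word units.
phonemes = ['b', 'a', 't', 'k']
words = {'bat': ['b', 'a', 't'], 'cat': ['k', 'a', 't']}

# Excitatory word <-> phoneme connections, used in both directions.
W = np.array([[1.0 if p in ph else 0.0 for p in phonemes]
              for ph in words.values()])

phon_act = np.zeros(len(phonemes))
word_act = np.zeros(len(words))
bottom_up = np.array([0.8, 0.9, 0.7, 0.1])   # noisy acoustic input favoring 'bat'

for cycle in range(30):
    # Each cycle: bottom-up plus top-down excitation, within-layer
    # competition (each unit is inhibited by the other units' total
    # activity), and gradual movement of activations toward net input.
    phon_net = bottom_up + 0.3 * (W.T @ word_act) - 0.2 * (phon_act.sum() - phon_act)
    word_net = 0.3 * (W @ phon_act) - 0.4 * (word_act.sum() - word_act)
    phon_act = np.clip(phon_act + 0.1 * (phon_net - phon_act), 0.0, 1.0)
    word_act = np.clip(word_act + 0.1 * (word_net - word_act), 0.0, 1.0)

print(dict(zip(words, word_act.round(2))))   # 'bat' ends up dominant
```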

The model provides explanations for a range of phenomena in speech perception. In the interest of space, we focus here on the broad category of 'lexical' effects, which are concerned with word-level phenomena. TRACE emphasizes the role of top-down influences in speech processing acting on the phoneme layer as well as the feature layer. Feedback connections from the word layer to the phoneme layer allow the model to supplement bottom-up sensory information with top-down word-specific information, which has a number of desirable consequences. Take for instance the phoneme restoration effect11: when one of a word's phonemes is replaced with a burst of white noise (e.g., the /s/ in 'legislature'), listeners nevertheless report hearing the missing sound. So, for example, they have difficulty reporting whether a sound has been deleted and replaced by the noise or whether the noise has simply been added to the word, confirming that they are experiencing an auditory illusion in which the missing phoneme has been restored.

The phoneme restoration effect appears to occur because listeners use top-down information to supplement the imperfect bottom-up sensory information. This can be simulated within TRACE by presenting the model with an incomplete acoustic input. For instance, /b#ek/ (the word 'break' but with the features of /r/ replaced with random noise, denoted by the # symbol) still activates three of the four phoneme units corresponding to the target word. This in turn leads to partial activation of the correct word-level unit. Words in this model have feedback connections that activate their constituent phonemes. In the case of the word 'break', this means that the /r/ phoneme unit can become activated through top-down activation from the word-level 'break' unit even when the bottom-up (perceptual) information is incomplete or incorrect. As a result, the model is able to 'repair' the input by activating a phoneme that was in fact missing from the input.

Top-down effects also allow the model to divide an unsegmented input stream like /barti/ into its two constituent words bar and tea. However, this tendency is weaker when the longer word is itself a familiar form. That is, the input /parti/ also tends to activate the words par and tea, albeit to a much lesser extent than the longer word party. Notably, these sorts of patterns fall naturally out of the dynamics of the model, due to the assumption in the model that word units compete with each other to the extent that they encompass overlapping portions of the spoken input.

Lexical effects can also interact with sublexical information in interesting ways. One well-studied finding is that listeners' categorization of an ambiguous phoneme can be biased toward producing familiar words. For example, while listeners show the usual categorization profile for a voice onset time (VOT) continuum between the phonemes /d/ and /t/ in isolation, categorization profiles tend to shift if these are presented in the context of a familiar carrier word. For instance, presenting an alveolar stop with an ambiguous VOT in the context of '_ask' yields a subtle bias toward categorizing it as /t/ rather than /d/.12 This occurs even when listeners are asked to identify the initial consonant and ignore the word it is embedded in. This effect again falls naturally out of how TRACE identifies phonemes: partial inputs will activate the word-level representation of 'TASK', and this projects back to the /t/ unit within the phoneme layer, yielding a subtle but reliable shift in the model's categorization curve along the /t/-/d/ continuum (Figure 2).

FIGURE 2 | Lexicality effect in the phoneme categorization profile of the TRACE model, plotted as percent /d/ responses against voice onset time (VOT, 0-50 ms) for the dask-task continuum. The categorization of a midpoint (ambiguous) stop consonant shifts as a function of the word in which it is embedded. As in humans, the model shows a preference toward real words over a nonword, but only when the phoneme's voicing parameter is near the category boundary.

Recent Developments

Although the TRACE model is now over 20 years old, interest in the model appears to be growing; the rate of citations of the original work has in fact increased since 2001. One reason for this is the growing corpus of behavioral studies examining the dynamics of spoken word recognition using eyetracking. Tanenhaus et al.13,14 established a 'visual world' paradigm in which they present an array of visual objects as subjects hear words or sentences corresponding to it. They have found that listeners tend to show eye movements to the corresponding object starting about 200 ms after hearing its name (for an overview, see Ref 15). Strikingly, listeners also produce eye movements to objects corresponding to target words' phonological competitors. So, for instance, hearing the auditory word candle yields fixations to a picture of a candle, but also to pictures of 'candy' and 'sandal' (Figure 3(a)).13

FIGURE 3 | Eye-tracking data showing competition effects from onset and rhyme competitors in a visual world paradigm. Both (a) adult listeners (fixation proportions over time since target onset) and (b) the TRACE model (predicted fixation probabilities over processing cycles) show comparable competition effects, marked by a larger proportion of eye movements to the referent (e.g., 'beaker') and to either type of phonologically related competitor (cohort, e.g., 'beetle'; rhyme, e.g., 'speaker') relative to a phonologically unrelated foil (e.g., 'carriage'). (Reprinted with permission from Ref 13. Copyright 2015 Elsevier).

What is notable is that both the timing and proportion of looks to target pictures and their phonological competitors closely match what the TRACE model yields when presented with a similar task. That is, when the model is presented with candle, it also tends to activate phonological competitor words like candy and sandal (Figure 3(b)). Tanenhaus et al.16 propose that this is no coincidence, and that there is a close link between eye movements to a given object and the activation of that word. As illustrated in Figure 3(b), this is captured within TRACE via differing degrees of activation of the competing words over time. Moreover, the model's dynamics closely match eye movement rates in humans. The target word is not selected immediately but instead shows activation ramping up over time. Concurrently, we see activation of competitor words rise and fall as a function of the degree to which they match the provided input. This includes an earlier effect of cohort competitors (words matching the initial phonemes; candle-candy) and a somewhat later interference effect from rhyme competitors (candle-sandal).
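The linking hypothesis used in this line of work converts the model's word-unit activations into predicted fixation probabilities with a variant of the Luce choice rule. The sketch below shows the general form of that computation; the activation values and the scaling constant k are invented for illustration rather than taken from the published simulations.

```python
import numpy as np

def fixation_probs(activations, k=7.0):
    """Map word-unit activations to predicted fixation probabilities via an
    exponentiated Luce choice rule: P(i) = exp(k * a_i) / sum_j exp(k * a_j).
    The scaling constant k is a free parameter fit to eye-movement data."""
    strengths = np.exp(k * np.asarray(activations))
    return strengths / strengths.sum()

# Hypothetical activations part-way through hearing 'candle':
# target, cohort ('candy'), rhyme ('sandal'), unrelated picture.
acts = [0.55, 0.40, 0.20, 0.05]
print(fixation_probs(acts).round(2))  # target draws most looks, cohort next
```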

The visual world paradigm has also been used to examine other types of lexical effects, related to word frequency and phonological neighborhood density.14,17 Here again, the data appear to closely match the predictions of TRACE, underlining the usefulness of the model in understanding the dynamics of auditory word recognition.

Controversy has surrounded the assumption in TRACE that lexical information really can feed back down to the phoneme level. Some have questioned the need for this, arguing that lexical influences can be taken into account in a postperceptual decision stage. One response to this has been to note that feedback to the phoneme level can have 'knock-on' effects, facilitating (1) the processing of neighboring phonemes or (2) the processing of subsequent tokens of the identified phoneme itself (see Ref 18 for a review). One specific example should serve to illustrate the general form of the argument. Suppose a listener encounters an individual with an unfamiliar dialect, in which some particular speech sound, say the /ʃ/ phoneme, is pronounced in a way that is unfamiliar to the listener. Lexical context may help the listener to identify this sound when it occurs in a context such as /fulɪ#/ (i.e., the word foolish ending with the novel instance of /ʃ/). If activation then flows top-down to activate the phoneme unit for /ʃ/, this could trigger the adjustment of the incoming connections to the /ʃ/ unit, so that next time the same input will activate /ʃ/ more strongly, adapting the listener's network to the speaker's unfamiliar dialect.
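One simple way to cash out this retuning idea is a Hebbian-style weight update in which the connections into a phoneme unit drift toward the input patterns that co-occur with its (lexically driven) activation. The sketch below is our own illustration of the logic of the argument, not an implemented mechanism of TRACE; the feature vectors and learning rate are arbitrary.

```python
import numpy as np

def adapt_phoneme_weights(w, feature_input, phoneme_act, lr=0.05):
    """Hebbian-style retuning of the connections feeding one phoneme unit.

    'feature_input' is the (unfamiliar) acoustic pattern; 'phoneme_act' is
    the unit's activation, here driven largely by top-down lexical feedback.
    Weights move toward the input in proportion to that activation, so the
    same pronunciation will excite the unit more strongly next time."""
    return w + lr * phoneme_act * (feature_input - w)

w_sh = np.array([0.2, 0.7, 0.1, 0.5])      # current weights into the /sh/ unit
dialect = np.array([0.6, 0.4, 0.3, 0.9])   # the speaker's unusual rendition
for _ in range(20):                        # repeated exposures in lexical context
    w_sh = adapt_phoneme_weights(w_sh, dialect, phoneme_act=0.8)
print(w_sh.round(2))  # weights have shifted toward the new pronunciation
```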

TRACE has also influenced how researchers understand word recognition processes in bilinguals. One of the key questions in this field concerns the extent to which bilingual speakers maintain separate phonological and/or lexical representations of their two languages. An early model of this is Grosjean's BIMOLA model,19 which proposed separate parallel phonological and lexical layers for two languages, receiving inputs from a common shared feature layer. This model proposes that listeners use the acoustic inputs to activate both sets of layers in parallel, selecting the correct word based on which language generates the strongest output.

Dijkstra and Van Heuven20 have proposed a competing model of bilingual word processing that posits a much weaker division between the two competing languages. Their model deals specifically with reading rather than spoken word recognition, but nevertheless builds on the same principles of interactive activation as TRACE and BIMOLA. It proposes that words of both languages are maintained within a single mechanism, and are held separate in processing thanks to top-down connections from language-level nodes. A potential benefit of this model is the ability to account for the finding that bilingual individuals tend to show activation of competitor words across languages. For instance, French/English bilinguals show crossmodal priming effects such as faster recognition of the word BREAD when it is preceded by the word PAIN (which is French for bread).21 Such findings suggest that words in the two languages are not held completely separately during processing. As the literature on bilingual word processing develops, it will be interesting to see how competing connectionist models can help adjudicate among different theories of how two languages are represented in one mind.

Work on TRACE continues, spurred by the release of a new implementation of the model that can be run on modern computers (jTRACE).22 Also noteworthy is a proposal from Hannagan et al.23 of how the TRACE model might better capture the temporal nature of speech. The original network simulated the temporal nature of the speech signal within a static input scheme that presented different time points concurrently. By more accurately capturing how spoken words unfold over time, this refinement might provide even more fine-grained insights into speech perception phenomena.

READING AND PAST TENSE

Perhaps the best-known connectionist models of language have focused on the related phenomena of visual word recognition and past tense morphology. While these models have arisen to deal with somewhat different phenomena in psycholinguistics, the facts are similar across the two. Specifically, they are concerned with the mechanisms by which we acquire and process the regularities in language in parallel with the exceptional cases that also occur. Consider the case of past tense in English: a large majority of verbs (about 87%) are marked as past tense by adding a variant of the -ed suffix (walk-walked and need-needed). The ending is also highly productive, such that novel forms nearly always take the -ed form. Thus, listeners typically judge that a nonword form like wug or a neologism like blog will take a regular ending, as in wugged and blogged. On this basis, it is assumed that regular past tense verbs are not stored outright, but rather are produced using a generative rule that transforms a verb stem into a past tense form by concatenating the -ed suffix. However, there are also a number of irregular verbs in English that defy such a rule (e.g., take-took, sleep-slept, and go-went; cf. *taked, *sleeped, and *goed). A popular theory has posited separate cognitive mechanisms for applying a rule to regular forms, and for memorizing word-specific knowledge of irregular forms.24

Reading in English presents a similar challenge. The mapping from print to sound is generally regular. For instance, words that begin with the letter B usually also begin with the /b/ phoneme; similarly, words that end in AVE tend to rhyme with each other (GAVE, SAVE, RAVE, CAVE, and PAVE). These regularities can then generalize to nonwords (MAVE) and neologisms (BLOG), which suggests readers encode regularities within a productive mechanism. That said, English spelling is rife with exceptions (e.g., HAVE does not rhyme with CAVE; the W in SWORD is silent; THROUGH, ROUGH, and DROUGHT are all pronounced differently).

FIGURE 4 | Rumelhart and McClelland's25 model of past tense. A phonological representation of the present tense (e.g., /stap/) is mapped onto a phonological representation of the past tense (e.g., /stapt/).

How do we handle both the productive and exceptional aspects of language? The answer from the connectionist standpoint is a single distributed mechanism that encodes both within a single network of connections. With respect to past tense, Rumelhart and McClelland25 (RM86) proposed a model designed to learn the mapping between a verb's present and past tense forms. The model receives as input the phonological form of a present tense verb and has to produce, as an output, the verb's past tense (Figure 4). The model is trained on a representative corpus of English monosyllabic verbs that includes both regular and irregular forms. It uses a learning algorithm that adjusts connection weights based on experiences with correct past tense forms, such that the model's performance gradually improves with experience.
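The sketch below conveys the spirit of this kind of error-driven learning: a single-layer pattern associator trained with the delta rule to map present-tense phonological codes onto past-tense codes. The random binary vectors stand in for RM86's 'Wickelfeature' phonological representation, and the corpus, coding scheme, and learning rate are all invented for illustration.

```python
import numpy as np

def code(word, size=30):
    """Stand-in phonological code: a fixed random binary vector per word,
    playing the role of RM86's Wickelfeature representations."""
    local = np.random.default_rng(sum(ord(c) for c in word))
    return (local.random(size) > 0.5).astype(float)

verbs = [('walk', 'walked'), ('need', 'needed'),
         ('take', 'took'), ('go', 'went')]   # regulars and irregulars together
X = np.array([code(present) for present, _ in verbs])
Y = np.array([code(past) for _, past in verbs])

# Single-layer pattern associator trained with the delta rule: each
# weight changes in proportion to the output error times the input.
W = np.zeros((Y.shape[1], X.shape[1]))
for epoch in range(500):
    for x, y in zip(X, Y):
        out = 1.0 / (1.0 + np.exp(-(W @ x)))   # logistic output units
        W += 0.1 * np.outer(y - out, x)

out = 1.0 / (1.0 + np.exp(-(X @ W.T)))
print(np.abs(out.round() - Y).sum(axis=1))     # per-verb errors approach 0
```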

The resulting network is able to learn both regular and irregular forms within the same architecture, without appealing to the concepts of either 'rules' or 'memorized lexical entries'. The model also shows good generalization to novel forms, suggesting it can take advantage of similarities in the English past tense (e.g., given blug, it can produce blugged).

Seidenberg and McClelland26 (herein, SM89) have proposed a similar approach to understanding visual word recognition. Word knowledge is modeled as the confluence of orthographic, phonological, and semantic codes, each encoded within separate layers of a network (Figure 5). Learning to read involves learning the mapping among these three types of knowledge. Their original instantiation focused specifically on mapping orthography to phonology; their model was presented with the orthographic form of an English word as input, and learned to output its phonological form. The model was trained on a corpus of several hundred English words, providing it with experience with both regular and irregular words. One important innovation was the use of a frequency-weighted training vocabulary. In this approach, the model was presented with different words at different rates, in a way that reflects the statistical properties of English. This was in turn reflected in the model's connection strengths for different types of patterns.

FIGURE 5 | Seidenberg and McClelland's26 model of reading, comprising interconnected orthographic, phonological, and meaning layers. Portions in black depict the model as it was originally implemented.

The fully trained network showed a number of desirable patterns of behavior: it tended to have greater difficulty with irregulars than regulars, marked by somewhat higher unit-wise differences between the desired and actual output values (quantified as 'sum-squared error' or SSE, which can be conceptualized as the difficulty the model has in computing the correct output). Word frequency also influenced learning, such that lower frequency forms tended to be more difficult to produce (reflected in higher SSEs). Finally, the model showed a frequency by regularity interaction in which the highest SSEs were observed for low frequency irregulars compared to high frequency irregulars and both high and low frequency regulars. This pattern closely resembles skilled readers' reaction times for similar words, and suggests the model is accurately capturing the cognitive mechanisms used in visual word recognition.
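Both of these ingredients, the SSE performance measure and frequency-weighted training, are easy to state precisely. The sketch below shows both; the word list and frequency counts are made up for illustration.

```python
import numpy as np

def sum_squared_error(target, output):
    """SM89's performance measure: the unit-wise squared difference between
    the desired and the actual output pattern, summed over output units.
    Higher SSE indicates more difficulty computing the correct form."""
    return float(np.sum((np.asarray(target) - np.asarray(output)) ** 2))

# Frequency-weighted training: words are sampled for training in rough
# proportion to their corpus frequency (these counts are hypothetical).
rng = np.random.default_rng(0)
vocab = ['have', 'gave', 'save', 'mave']
freq = np.array([1200.0, 300.0, 80.0, 0.0])   # 'mave' is a nonword: never trained
training_sample = rng.choice(vocab, size=10, p=freq / freq.sum())
print(training_sample)

print(sum_squared_error([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.1]))  # 0.22
```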

As discussed above, both the reading and past tense models appear to learn regular and irregular forms in parallel, and show output patterns consistent with what we observe in human productions. Interestingly, the way that these models learn is also instructive. For instance, it has been noted that past tense learning follows a nonlinear, U-shaped, pattern. Specifically, children show a tendency to produce errors on irregular forms that they previously produced correctly,27,28 and these errors often take the form of over-regularizations (e.g., *taked instead of took). Proponents of a generative grammar perspective have suggested that this pattern occurs due to the overapplication of a rule to irregulars.24,29 On this view, children initially use a memorization procedure to encode both present and past tense forms in their lexicon; later they discover the past tense rule but tend to overapply it to all forms; finally, they learn to use the rule only for regulars, and memorize only irregular past tenses.

Connectionist models provide a different way of conceptualizing this process. The RM86 model examined the effect of changing the size of the training vocabulary over time. Initially the model was trained on a small set of high-frequency verbs, many of them irregulars. While it showed good initial performance on these items, accuracy showed an initial decline after the training vocabulary size was subsequently increased. The reason for this was that the proportion of regular verbs in the model's vocabulary was initially quite small. Increasing the vocabulary size also increased the ratio of regular to irregular verbs, and consequently the model was able to pick up on the consistency of present-past mappings among these regulars. One interesting consequence is that the model tended to produce a high degree of over-regularization errors at this point in training, similar to what is observed in children. With further training, the over-regularization errors gradually disappeared, and the model regained high levels of accuracy on both regulars and irregulars.

Both the SM89 and RM86 models have generated a great deal of debate, much of it focusing on their ability to accurately capture human-like data, and the extent to which their shortcomings reflect an overall failure of the architecture or inadequacies of detail. As a result, a number of follow-up models have been put forward aimed at accounting for a broader range of phenomena. Here we discuss a few of the key advances. The first of these is the use of improved phonological representations to encode word forms. Neither the RM86 nor the SM89 model generalized to nonwords as accurately as adult humans do. One reason appears to be the phonological coding scheme that was used. These models used individual units to encode triplets of phonemes in a word; for instance, sleep is encoded by separate units that represent '#sl', 'sli', 'lip', and 'ip#'. More recent models have used a different approach in which a word's phonemes are divided in a more structured way into discrete consonant and vowel 'slots': for instance, Daugherty and Seidenberg30 used a CCVVCC coding scheme that encodes sleep as [sli_p_], with underscores denoting unused slots. Using some variant of such a scheme yields much stronger generalization rates both in reading (e.g., Ref 31) and past tense (e.g., Ref 30).
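The slot-based scheme is simple to illustrate. In the sketch below, a monosyllable's phonemes are assigned to fixed onset, nucleus, and coda slots; the toy vowel inventory and slot-filling rules are simplified stand-ins for the phonetic-feature representations used in the actual models.

```python
def ccvvcc_code(phonemes):
    """Place a monosyllable's phonemes into fixed CCVVCC slots, with '_'
    marking unused slots (in the spirit of Daugherty and Seidenberg30;
    the slot-assignment details here are simplified). Vowels fill the VV
    core; consonants fill the onset and coda around it."""
    vowels = set('aeiou')  # toy inventory; real models use phonetic features
    v_start = next(i for i, p in enumerate(phonemes) if p in vowels)
    v_end = max(i for i, p in enumerate(phonemes) if p in vowels)
    onset = phonemes[:v_start]
    nucleus = phonemes[v_start:v_end + 1]
    coda = phonemes[v_end + 1:]
    pad = lambda seg, n: list(seg) + ['_'] * (n - len(seg))
    return pad(onset, 2) + pad(nucleus, 2) + pad(coda, 2)

print(ccvvcc_code(list('slip')))   # ['s', 'l', 'i', '_', 'p', '_'], i.e. /slip/ 'sleep'
print(ccvvcc_code(list('bat')))    # ['b', '_', 'a', '_', 't', '_']
```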

Work growing out of the RM86 and SM89 models continues. For instance, the scope of these models has been expanded by including semantic units that help to better account for lexical-level effects.32,33 In addition, some researchers have advocated a 'neuroconstructivist' approach, which seeks to incorporate the interaction between the structure of the input and experience-dependent reorganization of the neural architecture within PDP models.34 Importantly, improvements in how these models account for human data have not been accomplished at the cost of other desirable characteristics of the earlier models, which have generally been retained in the updated instantiations.

Also of note has been the extension of these models to languages other than English. With respect to morphology, this has included work on German plurals,35 and noun marking in Serbian.36 Such models address the extent to which learning is influenced by the structure of morphological systems. For instance, German includes many different types of irregulars, and the regular form appears to apply in a minority of cases. Likewise, Serbian marks nouns for number, gender, and case, but the system as a whole is only 'quasi-regular', such that no single form represents a classic regular. Instead, large patterns of similarity exist among forms, with many exceptional cases also integrated within these patterns. Both these statistical profiles can be accommodated by connectionist models, supporting the view that this approach is informative about a wide range of linguistic data.

Similarly, the connectionist approach has also begun to provide useful insights into reading in languages with different characteristics from English. For instance, Yang et al.37 have adapted the SM89 architecture to reading in Chinese, where each symbol represents an entire word or concept. Given its dissimilarity to alphabetic languages, some have argued Chinese reading involves a solely lexical process in which words are recognized holistically, without access to sublexical sound-based representations.38 Nevertheless, closer inspection reveals the existence of phonological sub-units in Chinese that form semi-regular mappings between print and sound. And indeed, Yang et al.'s model does appear to take advantage of these somewhat hidden phonological regularities in Chinese. Their findings thus support the view that skilled reading of Chinese involves the same types of reading mechanisms as those used in English, and one does not need to assume different cognitive architectures to account for cross-linguistic data.

Neuropsychological Data in Children and Adults

There has also been a recent resurgence of interest in models of reading and past tense due to the observation of neuropsychological double dissociations in processing regular and irregular forms. The reading literature includes several classic descriptions of patients with acquired alexia following brain injury showing a specific deficit in reading either exception words (known as 'surface' dyslexia)39 or nonwords (called 'deep' or 'phonological' dyslexia).40 Likewise, developmental dyslexia in children has also been described as falling into surface/deep subtypes, marked by difficulty reading exceptions or nonwords, respectively.41

One explanation of double dissociations assumes functional modularity, in which separate neurocognitive systems are responsible for processing rules and exceptions, and are differentially impaired in different syndromes. Indeed, such findings may at first appear inconsistent with a connectionist view, in which a single mechanism is used to encode all forms. To address this, contemporary connectionist approaches concede a certain degree of specificity, while still treating all types of items in a uniform architecture in which damage to different parts of the network can differentially affect items of different types. Plaut et al.31 revisited the SM89 reading model (Figure 5) by implementing it with completely interconnected orthographic, phonological, and semantic units. They showed that different types of reading difficulties arise as a result of damage to each of these groups of units, or by severing different sets of connections among them. For example, damage to connections linking orthography to phonology yielded specific difficulty with irregulars, simulating surface dyslexia. In contrast, damage to the phonological layer yielded a distinct deficit in reading nonwords, simulating phonological dyslexia. Harm and Seidenberg42 took a similar approach to simulating developmental dyslexia, implementing different types of pre-existing damage in a connectionist model of reading prior to learning.
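In simulation terms, 'damage' of this kind usually amounts to removing a random subset of units or connections from a trained network and re-testing its behavior. A minimal sketch of connection lesioning follows; the pathway name, matrix size, and lesion severity are arbitrary choices of ours rather than values from Plaut et al.31

```python
import numpy as np

def lesion(weights, proportion, rng):
    """Simulate damage to one component of a trained network by zeroing a
    random proportion of its connections. Lesioning different pathways
    (e.g., orthography-to-phonology vs. the phonological layer itself)
    can then be compared for their behavioral consequences."""
    w = weights.copy()
    mask = rng.random(w.shape) < proportion
    w[mask] = 0.0
    return w

rng = np.random.default_rng(0)
W_orth_to_phon = rng.normal(size=(40, 40))   # one pathway of a reading model
W_damaged = lesion(W_orth_to_phon, proportion=0.3, rng=rng)
print((W_damaged == 0).mean())               # about 30% of connections removed
```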

Neuropsychological dissociations have also been observed in past tense processing. For instance, Ullman et al.43 identified patients who had difficulty producing nonword forms (which typically take a regular -ed ending), and others who had specific difficulty with irregular forms. This again could suggest damage to dissociable brain mechanisms subserving rules and exceptions, respectively. Nonetheless, these findings do not preclude a connectionist explanation. Building on the earlier work with reading models, Joanisse and Seidenberg33 suggested that dissociations in past tense arise from damage to brain regions responsible for phonology or semantics. They proposed a model of past tense that included both a phonological component somewhat similar to the earlier SM89 model, and a semantics layer used to uniquely encode the meanings of individual word forms. Different types of brain damage were simulated by artificially lesioning groups of units responsible for coding either phonological or semantic knowledge. As predicted, the two types of damage yielded distinct patterns of impairment on nonwords and irregulars, due to differences in the degree to which these types of forms rely on phonological and semantic information.

The explanation for why irregulars and nonwords can be differentially impaired in these simulations is as follows: although connectionist networks are homogeneous in their use of a connectionist mechanism, they do encode different types of information across different sets of units and connections. Different forms rely to differing degrees on these types of knowledge, and consequently can be impacted to greater or lesser degrees by damage to a specific component of the model. Specifically, nonwords rely more strongly on phonological knowledge, due to the importance of phonology in learning spelling-to-sound consistency. In contrast, irregulars have an inconsistent phonological relationship between present and past tense forms and thus are less susceptible to phonological damage. Instead, the network relies on support from other mechanisms (e.g., the semantic layer in the Joanisse and Seidenberg33 past tense model) to learn the idiosyncrasies of irregular forms' mappings.

Several researchers have noted that connectionist models also provide a useful basis for understanding developmental deficits in forming past tenses and other word inflections, or in learning to read. One approach44,45 focuses on the effects of phonological deficits that might make it difficult to detect or represent aspects of speech phonology, thereby impairing access to the subtle phonetic cues used to mark regularly inflected forms in English. The past tense marker (often a subtle 't' or 'd' sound added to the end of the base wordform) is particularly weak phonetically, and this may contribute to difficulty mastering the regular pattern. Typically, exceptions differ more from their regular counterparts, due in many cases to a vowel change, possibly along with other changes (e.g., see-saw and buy-bought), and would thus be less susceptible to perceptual or phonological difficulties. Another approach focuses on network characteristics that can differentially impact exceptional and regular forms in single-system models like the RM86 network. Indeed, Thomas and Karmiloff-Smith46 have argued that double dissociations may reflect distinct anomalous distortions of a single underlying network, differentially affecting regular and exception forms, rather than separate mechanisms for regular forms and exceptions.

CONNECTIONIST APPROACHES TO SYNTACTIC AND SEMANTIC PROCESSING

In addition to addressing the processing of single words, connectionist approaches have also been extended to examine syntactic and semantic processing. One key theme of early work47,48 was the demonstration that fairly simple connectionist models could learn to rely on long-distance syntactic dependencies, such as number agreement between the head noun and main-clause verb in a sentence like 'The boys who saw the girl like ice-cream' (note that 'like' must agree with 'boys', though this noun is further away from the verb than is 'girl'). Such phenomena had long been held to demand a domain-specific language acquisition device preprogrammed with knowledge of core principles of language. The use of a simple connectionist model that simply learns to predict successive elements in word sequences to capture such dependencies was a dramatic departure from this thinking, and suggested that no such device was really necessary.
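The models in question are simple recurrent networks trained only to predict the next word; the hidden layer receives a copy of its own previous state, which lets information about the head noun persist across the intervening relative clause. The sketch below is a stripped-down illustration in that spirit (a two-sentence 'corpus', arbitrary layer sizes, and one-step error propagation), not a reimplementation of the published models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-sentence 'corpus'. Only the head noun (boy/boys) signals whether
# the verb five words later should be 'likes' or 'like'.
sents = [['the', 'boys', 'who', 'saw', 'the', 'girl', 'like', 'icecream'],
         ['the', 'boy', 'who', 'saw', 'the', 'girl', 'likes', 'icecream']]
vocab = sorted({w for s in sents for w in s})
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 20

W_xh = rng.normal(0, 0.5, (H, V))   # input -> hidden
W_hh = rng.normal(0, 0.5, (H, H))   # previous hidden state (context) -> hidden
W_hy = rng.normal(0, 0.5, (V, H))   # hidden -> next-word prediction

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

# Elman-style training: predict the next word at every position; the error
# is propagated back only one step, with the previous hidden state simply
# copied in as context rather than backpropagating through time.
for epoch in range(1000):
    for s in sents:
        h_prev = np.zeros(H)
        for t in range(len(s) - 1):
            x = one_hot(idx[s[t]])
            h = np.tanh(W_xh @ x + W_hh @ h_prev)
            y = np.exp(W_hy @ h)
            y /= y.sum()                        # softmax over the vocabulary
            err = y - one_hot(idx[s[t + 1]])    # cross-entropy error signal
            dh = (W_hy.T @ err) * (1.0 - h * h)
            W_hy -= 0.1 * np.outer(err, h)
            W_xh -= 0.1 * np.outer(dh, x)
            W_hh -= 0.1 * np.outer(dh, h_prev)
            h_prev = h

def predict(context):
    h = np.zeros(H)
    for w in context:
        h = np.tanh(W_xh @ one_hot(idx[w]) + W_hh @ h)
    y = np.exp(W_hy @ h)
    return y / y.sum()

p = predict(['the', 'boys', 'who', 'saw', 'the', 'girl'])
print('like:', round(float(p[idx['like']]), 2),
      'likes:', round(float(p[idx['likes']]), 2))
```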

Related research has built on these early successes, capturing a wide range of phenomena in on-line language processing, including the role of word meaning as well as syntactic information in correctly uncovering the relationships among the constituents of sentences (e.g., Ref 49). As one example of such a phenomenon, consider the sentence 'The spy shot the policeman with the _____'. If the final word is 'revolver', readers interpret this item as the instrument used to carry out the action (shooting). If the final word is 'binoculars', however, they interpret this item as an object associated with (perhaps being held by) the policeman. The assignments reverse if 'shot' is replaced by 'saw'. Many studies show that humans are highly sensitive to these aspects of word meaning, using them on-line to affect their interpretations of such sentences. Furthermore, differences in the frequency with which nouns enter into particular roles with respect to particular verbs affect the speed of sentence comprehension and the likelihood of temporary misinterpretation.49 To date, most of these models have focused on processing of sentences in normal adults, and full accounts of disorders of sentence processing remain to be developed within a connectionist framework. That said, there are a number of connectionist models addressing both receptive and productive aspects of the deficits exhibited by patients suffering from aphasia.50,51

Another body of connectionist work focuses on disorders of semantic processing in patients with a condition often called 'semantic dementia' (SD). This condition can arise from any one of several different progressive neuropathological disorders, when these affect a particular set of brain areas centered around the anterior inferior temporal cortex.52 The disorder appears to affect the very knowledge of the things words refer to, and this might lead some to think such a deficit should be excluded from a discussion of perspectives on and models of language. However, as one would expect from the interactive perspective inherent in the connectionist approach, this object knowledge impairment leads to quite a striking pattern of language impairment.

Along with a progressive loss of object knowledge, first affecting infrequent and atypical things, SD patients also show a striking preference for typicality, both in words and objects.53 Given a choice between 'frute' or 'fruit' in a lexical decision task, SD patients tend to choose 'frute', the item with the more typical spelling. Similarly, given a choice between a typicalized elephant (one whose ear has been replaced by the more typical ear of a monkey) and a real elephant in an object decision task, they tend to choose the typicalized elephant. Parallel preferences for typical and linguistically regular items occur in single word reading, past tense inflection, and other tasks. As we should expect based on the properties of the connectionist models discussed throughout this article, corresponding deficits all arise from damage to units or connections in a simple interactive connectionist network that learns from paired presentations of visual and nonvisual semantic information about objects and phonological and orthographic information about these objects' names.54,55

DEEP LEARNING

One of the most exciting recent developments in connectionism comes from the applied field, in which 'deep networks' are being used to classify large and complex datasets.56 These models involve multiple hidden layers that mediate between the input and the output. They are trained using backpropagation and related algorithms, in a way that allows them to develop increasingly abstract representations of the data and thus discover complex and nonobvious patterns within datasets. The successes of these networks are typically discussed in terms of machine learning applications, as in voice recognition57 and visual object categorization.58 However, these models are also being successfully applied to issues in sentence processing, including parsing and interpretation of the sentiment expressed in a sentence.59,60 Thus, there is strong potential to apply deep learning mechanisms to questions in psycholinguistics, and we would expect the results to provide useful insights into the way in which language is learned, represented, and processed.
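Architecturally, the idea is straightforward: stack several hidden layers between input and output so that each layer re-represents the one below it. The forward-pass sketch below illustrates only this structure; the layer sizes, rectified-linear units, and classification framing are our illustrative choices, and a real system would train all of the weights with backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(n_out, n_in):
    # He-style initialization, a common choice for rectified-linear units.
    return rng.normal(0, np.sqrt(2.0 / n_in), (n_out, n_in))

# A 'deep' network: several hidden layers interposed between input and
# output, each re-representing the layer below it.
sizes = [100, 64, 32, 16, 5]   # input, three hidden layers, output
weights = [layer(o, i) for i, o in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for W in weights[:-1]:
        h = np.maximum(0.0, W @ h)   # rectified-linear hidden units
    logits = weights[-1] @ h
    e = np.exp(logits - logits.max())
    return e / e.sum()               # softmax over output classes

print(forward(rng.random(100)).round(3))  # e.g., sentiment-class probabilities
```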

One challenge for this view is how we might analyze the organization of deep learning networks. They consist of very large sets of artificial neurons, and their organization into multiple hidden layers adds further to their complexity. As a result, one might suppose that understanding how and why they produce certain behaviors could be especially difficult, if not impossible. That said, the analytic tools at our disposal are also continuing to develop, and there have already been some proposals of ways in which we can understand the performance of these systems.61 Indeed, the concern about understanding and analyzing complex connectionist networks is one that has been raised from the outset. However, the problem has not proven insurmountable so far, and there is every reason to think that we will continue to develop appropriate analytic approaches as these networks continue to scale up.

OTHER STATISTICAL APPROACHES

The connectionist enterprise represents a shift from symbol-based accounts of the mind to a more probabilistic or statistical approach. However, this is not the only approach that takes such a view. For instance, the Bayesian view of cognition also seeks to account for a range of language and cognition behaviors via sets of probabilistically weighted constructs.62 In some ways, the Bayesian and connectionist approaches appear compatible, given their commitment to the idea that behavior can be explained via the interaction of multiple sources of probabilistically weighted information.

That said, there are differences between connectionist approaches and some Bayesian models. For example, many Bayesian models build in one or a subset of specifically structured 'hypothesis spaces', while connectionist models can instead be seen as exploring a more continuous hypothesis space that can capture a wider range of representational possibilities, and can more adequately capture the fact that natural data only approximate any specific structure type. In addition, Bayesian models often rely on sources of information that may be abstractions far removed from the actual mechanisms that serve to explain patterns of learning and behavior. For instance, noting that the token frequency of a word is an important predictor of reading times in a Bayesian model has, in our view, much less explanatory value than showing the learning mechanism in a PDP model that explains precisely why these frequency effects emerge. For further discussion of this issue, see Jones and Love63 and McClelland et al.64


CONCLUSIONS

In this review, we have presented an overview of connectionist modeling of language, focusing both on early efforts and more recent developments. The approach provides a distinctive perspective on language learning and representation, one that is relevant not only to language processing as it occurs in typical adults but also to acquisition and to disorders of language processing. Specifically, it eschews ideas of domain-specific representational and learning mechanisms, and suggests instead that we can understand language phenomena using a simple set of cognitive principles. The approach also seeks to put the 'learning' back into language learning phenomena, by providing a mechanism that discovers complex linguistic patterns thanks to statistically structured patterns in the input. The approach explains the 'shape of change' (nonlinearities observed in development emerge naturally out of the way in which these types of models learn information), as well as deficits that can arise from effects of damage after learning or anomalies in the characteristics of the developing system. Connectionist models also address a wide range of phenomena in sentence processing, and have proven especially useful in modeling semantic learning and disorders of knowledge representations in SD.

A distinct challenge of connectionism is the concern that one must become a modeler in order to incorporate the theoretical assumptions of the connectionist enterprise into one's research program. It is becoming increasingly clear that this is not the case.

Instead, many new behavioral studies of language learning, processing, and impairment have begun to incorporate many of the principles of connectionism. Especially noteworthy is the increasing interest in studies of how statistical learning allows infants to rapidly acquire phonological categories,65 learn to segment words from the continuous speech stream,66 and analyze the distributional properties of these words to acquire syntactic representations.67

NOTES

a. Although we would argue these are fundamental guiding principles of the connectionist enterprise, there is variability in the extent to which individual models reflect each of them. For instance, with respect to assumption 1, some models do use localist representations, in which single units represent whole concepts or categories, rather than distributed representations, in which items are represented by ensembles of units that also participate in representing other items. In some cases, these are simplifying assumptions made in the interest of keeping models computationally tractable. In other cases,6 this localist coding scheme reflects a strong theoretical claim about the nature of mental representations.

b. Note that McClelland and Elman proposed two implementations of TRACE to account for somewhat different phenomena. Here we focus on the implementation that was named 'TRACE-II' in the original work.

ACKNOWLEDGMENTS

MFJ is supported by operating grants from the Canadian Institutes of Health Research and the Natural Sciences and Engineering Research Council (Canada).

REFERENCES1. Chomsky N. On certain formal properties of grammars.

Inf control 1959, 2:137–167.

2. Newmeyer FJ. Language Form and Language Function.Cambridge, MA: MIT Press; 1998.

3. Pinker S, Jackendoff R. The faculty of language: What’sspecial about it? Cognition 2005, 95:201–236.

4. McClelland JL, Rumelhart DE, The PDP ResearchGroup. Parallel Distributed Processing: Explorations inthe Microstructure of Cognition, vol. II. Cambridge,MA: MIT Press; 1986.

5. Smolensky P. Grammar-based connectionist approachesto language. Cognit Sci 1999, 23:589–613.

6. Bowers JS, Vankov II, Damian MF, Davis CJ. Neuralnetworks learn highly selective representations in orderto overcome the superposition catastrophe. Psychol Rev2014, 121:248–261.

7. Piatelli-Palmarini M. Language and Learning: TheDebate between Jean Piaget and Noam Chomsky.London: Routledge & Kegan Paul; 1980.

8. Forster KI. Accessing the mental lexicon. In: Wales RJ,Walker E, eds. New Approaches to Language Mecha-nisms. Amsterdam: North-Holland; 1976, 257–287.

9. Marslen-Wilson W. Functional parallelism in spokenword recognition. Cognition 1987, 25:71–102.


10. McClelland JL, Elman JL. The TRACE model of speech perception. Cogn Psychol 1986, 18:1–86.
11. Warren RM. Perceptual restoration of missing speech sounds. Science 1970, 167:392–393.
12. Ganong WF. Phonetic categorization in auditory perception. J Exp Psychol Hum Percept Perform 1980, 6:110–125.
13. Allopenna PD, Magnuson JS, Tanenhaus MK. Tracking the time course of spoken word recognition: evidence for continuous mapping models. J Mem Lang 1998, 38:419–439.
14. Dahan D, Magnuson JS, Tanenhaus MK. Time course of frequency effects in spoken-word recognition: evidence from eye movements. Cogn Psychol 2001, 42:317–367.
15. Tanenhaus MK, Spivey-Knowlton MJ, Eberhard KM, Sedivy JC. Integration of visual and linguistic information in spoken language comprehension. Science 1995, 268:1632–1634.
16. Tanenhaus MK, Magnuson JS, Dahan D, Chambers C. Eye movements and lexical access in spoken-language comprehension: evaluating a linking hypothesis between fixations and linguistic processing. J Psycholinguist Res 2000, 29:557–580.
17. Magnuson JS, Tanenhaus MK, Aslin RN, Dahan D. The microstructure of spoken word recognition: studies with artificial lexicons. J Exp Psychol Gen 2003, 132:202–227.
18. McClelland JL, Mirman D, Holt LL. Are there interactive processes in speech perception? Trends Cogn Sci 2006, 10:363–369.
19. Grosjean F. Exploring the recognition of guest words in bilingual speech. Lang Cogn Proc 1988, 3:233–274.
20. Dijkstra A, Van Heuven WJB. The BIA model and bilingual word recognition. In: Grainger J, Jacobs A, eds. Localist Connectionist Approaches to Human Cognition. Hillsdale, NJ: Lawrence Erlbaum; 1998, 189–225.
21. Beauvillain C, Grainger J. Accessing interlexical homographs: some limitations of a language-selective access. J Mem Lang 1987, 26:658–672.
22. Strauss TJ, Harris HD, Magnuson JS. jTRACE: a reimplementation and extension of the TRACE model of speech perception and spoken word recognition. Behav Res Methods 2007, 39:19–30.
23. Hannagan T, Magnuson JS, Grainger J. Spoken word recognition without a TRACE. Front Psychol 2013, 4:563. doi:10.3389/fpsyg.2013.00563.
24. Pinker S. Rules of language. Science 1991, 253:530–535.
25. Rumelhart DE, McClelland JL. On learning the past tenses of English verbs. In: McClelland JL, Rumelhart DE, The PDP Research Group, eds. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. II, Chapter 18. Cambridge, MA: MIT Press; 1986, 216–271.
26. Seidenberg MS, McClelland JL. A distributed, developmental model of word recognition and naming. Psychol Rev 1989, 96:523–568.
27. Bybee JL, Slobin DI. Rules and schemas in the development and use of the English past tense. Language 1982, 58:265–289.
28. Kuczaj S. The acquisition of regular and irregular past tense forms. J Verbal Learning Verbal Behav 1977, 16:589–600.
29. Marcus GF, Pinker S, Ullman M, Hollander M, Rosen TJ, Xu F, Clahsen H. Overregularization in language acquisition. Monogr Soc Res Child Dev 1992, 57:1–178.
30. Daugherty K, Seidenberg MS. Rules or connections? The past tense revisited. In: Proceedings of the 14th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum; 1992, 259–264.
31. Plaut DC, McClelland JL, Seidenberg M, Patterson KE. Understanding normal and impaired word reading: computational principles in quasi-regular domains. Psychol Rev 1996, 103:56–115.
32. Harm MW, Seidenberg MS. Computing the meanings of words in reading: cooperative division of labor between visual and phonological processes. Psychol Rev 2004, 111:662–720.
33. Joanisse MF, Seidenberg MS. Impairments in verb morphology after brain injury: a connectionist model. Proc Natl Acad Sci USA 1999, 96:7592–7597.
34. Westermann G, Ruh N. A neuroconstructivist model of past tense development and processing. Psychol Rev 2012, 119:649–667.
35. Hahn U, Nakisa RC. German inflection: single route or dual route? Cogn Psychol 2000, 41:313–360.
36. Mirkovic J, Seidenberg MS, Joanisse MF. Probabilistic nature of inflectional structure: insights from a highly inflected language. Cognit Sci 2011, 35:638–681.
37. Yang J, McCandliss BD, Shu H, Zevin JD. Simulating language-specific and language-general effects in a statistical learning model of Chinese reading. J Mem Lang 2009, 61:238–257.
38. Zhou X, Marslen-Wilson W. The nature of sublexical processing in reading Chinese characters. J Exp Psychol Learn Mem Cogn 1999, 25:819–837.
39. Patterson KE, Marshall JC, Coltheart M. Surface Dyslexia: Neuropsychological and Cognitive Studies of Phonological Reading. London: Lawrence Erlbaum; 1985.
40. Beauvois MF, Derouesné J. Phonological alexia: three dissociations. J Neurol Neurosurg Psychiatry 1979, 42:1115–1124.
41. Manis F, Seidenberg M, Doi L, McBride-Chang C, Peterson A. On the basis of two subtypes of developmental dyslexia. Cognition 1996, 58:157–195.


42. Harm MW, Seidenberg MS. Phonology, reading acquisition, and dyslexia: insights from connectionist models. Psychol Rev 1999, 106:491–528.
43. Ullman M, Corkin S, Coppola M, Hickok G, Growdon JH, Koroshetz WJ, Pinker S. A neural dissociation within language: evidence that the mental dictionary is part of declarative memory, and that grammatical rules are processed by the procedural system. J Cogn Neurosci 1997, 9:289–299.
44. Leonard L. Children with Specific Language Impairment. Cambridge, MA: MIT Press; 1998.
45. Tallal P, Miller S, Fitch R. Neurobiological basis of speech: a case for the preeminence of temporal processing. In: Tallal P, Galaburda AM, Llinas RR, von Euler C, eds. Temporal Information Processing in the Nervous System: Special Reference to Dyslexia and Dysphasia. New York, NY: New York Academy of Sciences; 1993, 27–47.
46. Thomas MSC, Karmiloff-Smith A. Modelling language acquisition in atypical phenotypes. Psychol Rev 2003, 110:647–682.
47. Elman JL. Finding structure in time. Cognit Sci 1990, 14:179–211.
48. Elman JL. Distributed representations, simple recurrent networks, and grammatical structure. Mach Learn 1991, 7:195–224.
49. MacDonald MC, Pearlmutter NJ, Seidenberg MS. The lexical nature of syntactic ambiguity resolution. Psychol Rev 1994, 101:676–703.
50. Dell GS, Schwartz MF, Martin N, Saffran EM, Gagnon DA. Lexical access in aphasic and nonaphasic speakers. Psychol Rev 1997, 104:801–838.
51. Gotts SJ, Plaut DC. Connectionist approaches to understanding aphasic perseveration. Semin Speech Lang 2004, 25:323–334.
52. Pereira JMS, Williams GB, Acosta-Cabronero J, Pengas G, Spillantini MG, Xuereb JH, Hodges JR, Nestor PJ. Atrophy patterns in histologic vs. clinical groupings of frontotemporal lobar degeneration. Neurology 2009, 72:1653–1660.
53. Rogers TT, Lambon Ralph MA, Hodges J, Patterson K. Object recognition under semantic impairment: the effects of conceptual regularities on perceptual decisions. Lang Cogn Proc 2003, 18:625–662.
54. Dilkina K, McClelland JL, Plaut DC. A single-system account of semantic and lexical deficits in five semantic dementia patients. Cogn Neuropsychol 2008, 25:136–164.
55. Rogers TT, Lambon Ralph MA, Garrard P, Bozeat S, McClelland JL, Hodges JR, Patterson K. The structure and deterioration of semantic memory: a neuropsychological and computational investigation. Psychol Rev 2004, 111:205–235.
56. Hinton G, Osindero S, Teh YW. A fast learning algorithm for deep belief nets. Neural Comput 2006, 18:1527–1554.
57. Mohamed A, Dahl G, Hinton G. Deep belief networks for phone recognition. Science 2009, 4:1–9. doi:10.4249/scholarpedia.5947.
58. Le QV, Ranzato MA, Monga R, Devin M, Chen K, Corrado GS, Dean J, Ng AY. Building high-level features using large scale unsupervised learning. In: Proceedings of the International Conference on Machine Learning, Edinburgh, Scotland, June 26–July 1, 2012.
59. Socher R, Bauer J, Manning CD, Ng AY. Parsing with compositional vector grammars. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, August 2013, 455–465. Association for Computational Linguistics. Available at: http://www.aclweb.org/anthology/P13-1045. (Accessed December 1, 2014).
60. Socher R, Perelygin A, Wu JY, Chuang J, Manning CD, Ng AY, Potts C. Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, October 2013, 1631–1642. Association for Computational Linguistics. Available at: http://www.aclweb.org/anthology/D13-1170. (Accessed December 1, 2014).
61. Saxe AM, McClelland JL, Ganguli S. Learning hierarchical category structure in deep neural networks. In: Knauff M, Paulen M, Sebanz N, Wachsmuth I, eds. Proceedings of the 35th Annual Meeting of the Cognitive Science Society. Austin, TX: Cognitive Science Society; 2013, 1271–1276.
62. Chater N, Manning CD. Probabilistic models of language processing and acquisition. Trends Cogn Sci 2006, 10:335–344.
63. Jones M, Love BC. Bayesian fundamentalism or enlightenment? On the explanatory status and theoretical contributions of Bayesian models of cognition. Behav Brain Sci 2011, 34:169–231.
64. McClelland JL, Botvinick MM, Noelle DC, Plaut DC, Rogers TT, Seidenberg MS, Smith LB. Letting structure emerge: connectionist and dynamical systems approaches to understanding cognition. Trends Cogn Sci 2010, 14:348–356.
65. Maye J, Werker JF, Gerken L. Infant sensitivity to distributional information can affect phonetic discrimination. Cognition 2002, 82:B101–B111.
66. Saffran JR, Aslin RN, Newport EL. Statistical learning by 8-month-old infants. Science 1996, 274:1296–1298.
67. Chemla E, Mintz TH, Bernal S, Christophe A. Categorizing words using frequent frames: what cross-linguistic analyses reveal about distributional acquisition strategies. Dev Sci 2009, 12:396–406.
