learning linguistic structure with simple recurrent networks
Learning linguistic structure with simple recurrent networks
February 20, 2013
Elman’s Simple Recurrent Network (Elman, 1990)
• What is the best way to represent time?
– Slots?
– Or time itself?
• What is the best way to represent language?
– Units and rules?
– Or connectionist learning?
• Is grammar learnable?
– If so, are there any necessary constraints?
The Simple Recurrent Network
• The network is trained on a stream of elements with sequential structure.
• At step n, the target for the output is element n+1.
• The pattern on the hidden units is copied back to the context units.
• After learning, the network comes to retain information about the preceding elements of the string, allowing its expectations to be conditioned on an indefinite window of prior context.
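The copy-back architecture described above can be sketched as follows; the layer sizes, random weights, and one-hot input stream are illustrative stand-ins, not taken from Elman’s simulations.

```python
# Minimal sketch of a Simple Recurrent Network forward pass:
# input + context -> hidden -> output, with the hidden pattern
# copied back to the context units after every step.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 5, 8, 5  # hypothetical layer sizes

W_ih = rng.normal(0, 0.1, (n_hidden, n_in))      # input -> hidden
W_ch = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden
W_ho = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srn_forward(sequence):
    """Run a sequence of one-hot vectors through the SRN,
    returning the output (next-element prediction) at each step."""
    context = np.zeros(n_hidden)   # context units start at rest
    outputs = []
    for x in sequence:
        hidden = sigmoid(W_ih @ x + W_ch @ context)
        outputs.append(sigmoid(W_ho @ hidden))
        context = hidden.copy()    # copy-back: hidden -> context
    return np.array(outputs)

# A one-hot stream of 4 elements from a 5-symbol alphabet
seq = np.eye(n_in)[[0, 2, 1, 3]]
preds = srn_forward(seq)
print(preds.shape)  # one prediction vector per time step
```

Training would then push each step’s output toward the next element of the stream; only the untrained forward pass is shown here.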
Learning about words from streams of letters (200 sentences of 4-9 words)
Similarly, SRNs have also been used to model learning to segment words in speech (e.g., Christiansen, Allen and Seidenberg, 1998)
Learning about sentence structure from streams of words
Learned and imputed hidden-layer representations (average vectors over all contexts)
‘Zog’ representation derived by averaging the vectors obtained by inserting the novel item in place of each occurrence of ‘man’.
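The imputation step is just an average of hidden-layer vectors collected across contexts. A minimal sketch; the three vectors below are made up for illustration:

```python
# Hypothetical sketch: impute a representation for a novel word
# ('zog') by averaging the hidden-layer vectors obtained when it is
# substituted for each occurrence of a familiar word ('man').
import numpy as np

def imputed_representation(hidden_vectors):
    """Average hidden-unit vectors collected across contexts."""
    return np.mean(hidden_vectors, axis=0)

# Pretend we recorded the hidden pattern in three 'man' contexts:
vecs = np.array([[0.2, 0.8, 0.1],
                 [0.4, 0.6, 0.3],
                 [0.3, 0.7, 0.2]])
zog = imputed_representation(vecs)
print(zog)  # elementwise mean over the three contexts
```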
Within-item variation by context
Analysis of SRNs using Simpler Sequential Structures (Servan-Schreiber, Cleeremans, & McClelland)
(Figure panels: The Grammar; The Network)
Hidden unit representations with 3 hidden units
(Figure panels: True Finite State Machine; Graded State Machine)
Training with Restricted Set of Strings
21 of the 43 valid strings of length 3-8
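The training strings come from a small finite-state grammar. The slides do not reproduce the transition diagram, so the grammar below is a hypothetical one in the Reber-grammar style used by Servan-Schreiber et al.; it shows how such strings are generated by a random walk from the start state to the end state.

```python
# Hypothetical finite-state grammar (Reber-grammar style). Each state
# maps to a list of (symbol, next-state) choices; a string is emitted
# by walking randomly from state 0 until the end state is reached.
import random

GRAMMAR = {
    0: [("T", 1), ("P", 2)],
    1: [("S", 1), ("X", 3)],
    2: [("T", 2), ("V", 4)],
    3: [("X", 2), ("S", 5)],
    4: [("P", 3), ("V", 5)],
}
END = 5

def generate(rng):
    state, symbols = 0, ["B"]          # 'B' marks string start
    while state != END:
        sym, state = rng.choice(GRAMMAR[state])
        symbols.append(sym)
    symbols.append("E")                # 'E' marks string end
    return "".join(symbols)

rng = random.Random(1)
strings = [generate(rng) for _ in range(5)]
print(strings)
```

Restricting training to a subset of the valid strings (as in the slide) then amounts to filtering the generator’s output, or enumerating all valid strings of length 3–8 and sampling 21 of them.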
Progressive Deepening of the Network’s Sensitivity to Prior Context
Note: Prior Context is only maintained if it is prediction-relevant at intermediate points.
Elman (1991)
Noun–Verb Agreement and Verb Successor Prediction
• Histograms show summed activation for classes of words:
– W = who
– S = period (end of sentence)
– V1/V2 and N1/N2/PN indicate singular, plural, or proper
– For verbs:
• N = no direct object
• O = optional direct object
• R = required direct object
Prediction with an embedded clause
PCA components representing agreement and verb argument constraints
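The analysis style here can be sketched as follows, under the assumption that hidden-unit vectors recorded at each word are centered and projected onto their principal components, so that trajectories along individual components can be read as tracking agreement or argument structure. The data below are random stand-ins for real hidden states.

```python
# Sketch of PCA over hidden-unit trajectories (assumed details).
import numpy as np

def pca_project(hidden_states, n_components=2):
    """Project a (steps x units) matrix of hidden states
    onto its top principal components."""
    X = hidden_states - hidden_states.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

rng = np.random.default_rng(0)
states = rng.normal(size=(12, 70))  # e.g. 12 time steps, 70 hidden units
proj = pca_project(states)
print(proj.shape)  # (12, 2): a 2-D trajectory through the sentence
```

Plotting `proj` step by step gives the kind of state-space trajectory shown on these slides.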
Components tracking constituents within clauses of different types.
Role of the Prediction-Relevance of the Head in Carrying Context Across an Embedding
• If the network at right is trained with symmetrical embedded strings, it does not reliably carry the prior context through the embedding (and thus fails to correctly predict the final letter, especially for longer embeddings).
• If, however, subtle asymmetries in the transitional probabilities are introduced (as shown), performance in predicting the correct letter after emerging from the embedding becomes perfect (although very long strings were not tested).
• This happens because the initial context ‘shades’ the representation, as shown on the next slide.
Hidden unit representations in the network trained on the asymmetrical embedded sub-grammars.
Representations of the same internal sequence in different sub-grammars are more similar than those of different sequences in the same sub-grammar.
– The model is capturing the similarity of corresponding nodes across the two sub-grammars.
– Nonetheless, it is able to ‘shade’ these representations so as to predict the correct final token.
Importance of Starting Small?
• Elman (1993) found that his 1991 network did not learn a corpus of sentences with many embeddings if the training corpus was stationary from the start.
• However, he found that training went much better if he either:
– Started with only simple sentences and gradually increased the fraction of sentences with embedded clauses, or
– Started with a limited memory (the context was erased after 3 time steps), then gradually increased the interval between erasures.
• This forced the network to ‘start small’, and seemed to help learning.
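The two regimens can be sketched as simple phase schedules. The phase boundaries and percentages below are illustrative; only the starting values (no embeddings, a 3-step memory window) come from the slides.

```python
# Hypothetical sketch of Elman's (1993) two 'starting small' regimens.
def complexity_schedule(phase):
    """Regimen 1: fraction of training sentences containing
    embedded clauses, growing across phases (values illustrative)."""
    fractions = [0.0, 0.25, 0.5, 0.75]
    return fractions[min(phase, len(fractions) - 1)]

def memory_window(phase):
    """Regimen 2: time steps before the context units are erased
    (None = context is never erased)."""
    windows = [3, 4, 5, None]
    return windows[min(phase, len(windows) - 1)]

for phase in range(5):
    print(phase, complexity_schedule(phase), memory_window(phase))
```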
A Failure of Replication
• Rohde and Plaut revisited ‘starting small’.
• They considered the effects of adding semantic constraints.
• They also used different training parameters; with Elman’s parameters, the network appeared to settle into local minima.
Grammar and Semantic Constraints
Complex regimen: 75% of sentences contain embeddings throughout.
Simple regimen: Start without embeddings; increase in steps up to 75%.
• Rohde and Plaut generally found an advantage for ‘starting big’:
– Performance was generally better using the final corpus in which 75% of sentences contain embeddings from the start (Complex regimen), compared to starting with only simple sentences and gradually increasing % of embeddings (Simple regimen).
• An advantage for starting small only occurred when:
– The final corpus contained embeddings in 100% of sentences.
– Semantic and agreement constraints between head noun and embedded verb were both completely eliminated (Corpus A’).
Conditions A-E are ordered by proportion of sentences in which semantic constraints operate between head noun and subordinate clause (from 0 to 1.0).
In A through E above, the subordinate verb always agrees in number with the head noun where appropriate; not so in A’.
Effect of initial weight range (Elman used ±0.001)
Discussion
• Specific questions about the SRN:
– Can it be applied successfully to other tasks?
– Is its way of representing context psychologically realistic?
– Can something like it be scaled up to address languages with large vocabularies?
• More general questions about language:
– Is language acquisition a matter of learning a grammar?
– Are innate constraints required to learn it?