TRANSCRIPT
Gradience and Similarity in Sound, Word, Phrase and Meaning
Jay McClelland
Stanford University
Collaborators
Dave Rumelhart Mark Seidenberg Dave Plaut Karalyn Patterson Matt Lambon Ralph Cathy Harris Gary Lupyan Lori Holt Brent Vander Wyk Joan Bybee
The Compositional View of Language (Fodor and Pylyshyn, 1988)
Linguistic objects may be atoms or more complex structures like molecules.
Molecules consist of combinations of atoms that are consistent with structural rules.
Mappings between form and meaning depend on structure-sensitive rules.
This allows languages to be combinatorial, productive, and systematic.
[ John [ hit [the ball] ] ]    [ [w [ei [t] ] ] [^d] ]
S → NP VP; VP → V NP; NP → …
word → stem + affix; stem → {syl} + syl′ + {syl}; syl → {onset} + rhyme
rhyme → nuc + {coda}
Subj → Agent; Verb → Action; Obj → Patient
Vi + past → stemi + [^d]
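The structural rules above can be made concrete with a toy sketch (entirely illustrative; the grammar, vocabulary, and function name are invented for this example, not taken from the talk): molecules (phrases) are built from atoms (words) by recursively applying rules such as S → NP VP.

```python
# Toy phrase-structure grammar: a sketch of the compositional view.
# The rule set and vocabulary here are invented for illustration.
import random

RULES = {
    "S":  [["NP", "VP"]],
    "VP": [["V", "NP"]],
    "NP": [["John"], ["the ball"], ["the dog"]],
    "V":  [["hit"], ["saw"]],
}

def expand(symbol, rng):
    """Recursively rewrite a symbol until only words (atoms) remain."""
    if symbol not in RULES:
        return [symbol]
    out = []
    for child in rng.choice(RULES[symbol]):
        out.extend(expand(child, rng))
    return out

rng = random.Random(0)
print(" ".join(expand("S", rng)))  # e.g. "John hit the ball"
```

Because the rules apply regardless of which atoms fill each slot, the grammar is combinatorial and productive in exactly the sense Fodor and Pylyshyn describe.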
Critique
The number of units present in an expression is not always clear
The number of different categories of units is not at all clear
Real native ‘idiomatic’ language ability involves many subtle patterns not easily captured by rules
There is no generally accepted framework for characterizing how rules work
How many mountains?
There is less discreteness in some cases than others
And more in some domains than in others
Some cases in language where it is hard to decide on the number of units
How many words?
Cut out, cut up, cut over; cut it out? Barstool, shipmate; another, a whole nother
How many morphemes? Pretend, prefer, predict, prefabricate Chocoholic, chicketarian Strength, length; health, wealth; dearth, filth
How many syllables? Every, memory, livery; leveling, shoveling; evening…
How many phonemes? Teach, boy, hint, swiftly, softly Memory, different What happened to you?
Cases in which it is unclear how many types of units are needed
Object types: species (California redwoods; butterflies along a mountain range); types of tomatoes; restaurants (Japanese, Italian, seafood)
Linguistic types: word meanings (ball, run); segment types (fuse, fusion; dirt, dirty; cf. sturdy)
Characterizations of how rules work
Rule or exception (Pinker et al.): V + past → stem + /^d/; go → went; dig → dug; keep → kept; say → said
General and specific rules (Halle, Marantz): V + past → stem + /^d/; if stem ends in 'eep': 'ee' → 'eh'; if stem = say: 'ay' → 'eh'
Output-oriented approaches: OT, e.g. 'No Coda'; Bybee's output-oriented past tense schemas:
A lax vowel followed by a dental, as in hit, cut, bid, waited
'ah' or 'uh' followed by a (preferably nasalized) velar, as in sang, flung, dug …
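The 'general and specific rules' idea can be made concrete with a small sketch in which candidate rules are tried from most to least specific, and the first match wins (the rule set and function name are invented for illustration; real proposals are far richer):

```python
# Specificity-ordered past tense rules: a toy sketch of the
# Halle/Marantz-style idea that specific rules preempt general ones.
def past_tense(stem):
    # Word-specific rule: say -> said
    if stem == "say":
        return "said"
    # Subregularity: stems ending in 'eep' -> 'ept' (keep -> kept)
    if stem.endswith("eep"):
        return stem[:-3] + "ept"
    # General (default) rule: add the regular suffix, spelled '-ed'
    if stem.endswith("e"):
        return stem + "d"
    return stem + "ed"

for v in ("like", "keep", "sleep", "say", "mint"):
    print(v, "->", past_tense(v))
# like -> liked, keep -> kept, sleep -> slept, say -> said, mint -> minted
```

Note that the discreteness of the rule ordering is exactly what the rest of the talk questions: in the network models below, the 'specific' and the 'general' blend as a matter of degree instead.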
How do the general and the specific work together?
Past tenses: like → liked but keep → kept; pay → paid but say → said
English spelling→sound mapping: mint, hint, … but pint; save, wave, … but have
Meanings of sentences: John saw a dog / John saw a doctor
Can the contexts of application of the more specific patterns be well defined?
For the past tense:
Generally, words with more complex rhymes will be more susceptible to reduction: *VV[S]t, where [S] stands for a stop consonant
Item frequency and number of other similar items both appear to contribute
For spelling to sound:
Sources of spelling are lost in history
But item frequency and similar neighbors play important roles
For constructions:
Characterization of constraints is generally relatively vague and seems to be a matter of degree
Subj: human; V: saw; Obj: professional → 'paid a visit to'
John saw an accountant / John saw an architect / The baby saw a doctor / The boy saw a doctor
Perhaps similarity to neighbors plays an important role here as well
Summary
Linguistic objects vary continuously in their degree of compositionality and in their degree of systematicity
While some forms seem highly compositional and some forms seem highly regular/systematic, there is generally a detectable degree of specificity in every familiar form (Goldberg)
Even nonce forms reflect specific effects of specific ‘neighbors’
It may be useful to adopt the notion that language consists of tokens selected from a specified taxonomy of units and that linguistic mappings are determined by systems of rules…
BUT, an exact characterization is not possible in this framework
Units and rules are meta-linguistic constructs which do not play a role in language processing, language use or language acquisition.
These constructs impede understanding of language change
What will the alternative look like?
It will be a system that allows continuous patterns over time (articulatory gestures and auditory waveforms) to generate graded and distributed internal representations that capture linguistic structure and mappings in ways that respect both the continuous and discrete aspects of linguistic structure, without enumeration of units or explicit representation of rules.
Using neural network models to capture these ideas
Units in Neural Network Models
Many neural network models rely on distributed internal representations in which there is no discrete representation of linguistic units.
To date most of these models have adopted some sort of concession to units in their inputs and outputs.
We do this because we have not yet achieved the ability to avoid doing so, not because we believe these units exist.
A Connectionist Model of Word Reading (Plaut, McC, Seidenberg & Patterson, 1996)
Task is to learn to map spelling to sound, given spelling-sound pairs from 3000 word corpus.
Network learns gradually from frequency weighted exposure to pairs in the corpus.
For each presentation of each item: Input units corresponding to spelling
are activated. Processing occurs through
propagation of activation from input units through hidden units to output units, via weighted connections.
Output is compared to the item’s pronunciation.
Small adjustments to connections are made to reduce difference.
[Figure: network with input letter units M I N T mapped through hidden units to output phoneme units /m/ /I/ /n/ /t/]
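The training loop just described can be sketched with a toy network (a drastically simplified stand-in for the Plaut et al. model; the four-word corpus, slot encodings, layer sizes, and learning rate are all invented for this illustration):

```python
# Toy spelling-to-sound network: input letter units, hidden units,
# output phoneme units, trained by small weight adjustments.
# Everything here (corpus, encoding, sizes) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
words = ["mint", "mine", "hint", "pint"]
prons = ["mInt", "maIn", "hInt", "paInt"]   # crude placeholder pronunciations

letters = sorted(set("".join(words)))
phones = sorted(set("".join(prons)))

def one_hot(s, alphabet, n_slots):
    """Concatenated one-hot vectors, one slot per serial position."""
    v = np.zeros(n_slots * len(alphabet))
    for i, ch in enumerate(s):
        v[i * len(alphabet) + alphabet.index(ch)] = 1.0
    return v

n_slots = max(len(p) for p in prons)
X = np.stack([one_hot(w, letters, n_slots) for w in words])
Y = np.stack([one_hot(p, phones, n_slots) for p in prons])

W1 = rng.normal(0.0, 0.5, (X.shape[1], 20))   # input -> hidden weights
W2 = rng.normal(0.0, 0.5, (20, Y.shape[1]))   # hidden -> output weights
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(5000):
    h = sig(X @ W1)                  # activation spreads to hidden units
    out = sig(h @ W2)                # ... and on to the phoneme units
    d2 = (out - Y) * out * (1 - out) # compare output to pronunciation
    d1 = (d2 @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d2             # small connection adjustments
    W1 -= 0.5 * X.T @ d1
```

Because MINT, MINE, HINT, and PINT share input units and connections, whatever is learned about one word's letters transfers to the others, which is the point the next slides develop.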
Aspects of the Connectionist Model
Mapping through hidden units forces network to use overlapping internal representations.
-Allows sensitivity to combinations if necessary
-Yet tends to preserve overlap based on similarity
Connections used by different words with shared letters overlap, so what is learned tends to transfer across items.
Processing Regular Items: MINT and MINE
Across the vocabulary, consistent co-occurrence of M with /m/, regardless of other letters, leads to weights linking M to /m/ by way of the hidden units.
The same thing happens with the other consonants, and most consonants in other words.
For the Vowel I: If there’s a final E produce /ai/ Otherwise produce /I/
Processing an Exception: PINT
Because PINT overlaps with MINT, there's transfer:
Positive for N -> /n/ and T -> /t/
Negative for I -> /ai/
Of course P benefits from learning with PINK, PINE, POST, etc.
Knowledge of regular patterns is hard at work in processing this and all other exceptions.
The only special thing the network needs to learn is what to do with the vowel.
Even this will benefit from weights acquired from cases such as MIND, FIND, PINE, etc.
[Figure: network with input letter units P I N T mapped through hidden units to output phoneme units /p/ /ai/ /n/ /t/]
Model captures patterns associated with ‘units’ of different scopes without explicitly representing them.
The model learns basic regular correspondences, generalizes appropriately to non-words. mint, rint; seat, reat; rave, mave…
It learns to produce the correct output for all exceptions in the corpus. pint, bread, have, etc…
It is sensitive to sub-regularities such as special vowels with certain word-final clusters, c-conditioning, final-e conditioning…
sold, nold; book, grook; plead, tread, ?klead; bake, dake; rage, dage / rice, bice
Shows graded sensitivity modulated by frequency to item-specific, rhyme-specific, and context-sensitive correspondences.
[Figure: error / settling time by frequency (high vs. low) for pint, bread, hint, and dent]
How does it work?
Correspondences of different scopes are represented in the connections between the input and the output that depend on them.
Some correspondences, e.g. in the word-initial consonant cluster, are highly compositional, and the model treats them this way.
Others, such as those involving the pronunciation of the vowel, are highly dependent on context, but to a degree that varies with the type of item.
Elman’s Simple Recurrent Network
Finds larger units with coherent internal structure from time series of inputs.
Series are usually discretized at conventional linguistic unit boundaries, but this is just for simplicity.
Uses hidden unit state from processing of previous input as context for next input.
Elman networks learn syntactic categories from word sequences
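The SRN idea can be sketched as follows (a toy illustration, not Elman's simulation: the repeating corpus and layer sizes are invented, and backprop is truncated to one step, with the copied-back context treated as a fixed input):

```python
# Minimal Elman-style simple recurrent network: the previous hidden
# state is fed back as context for the next input, and the network
# learns to predict the next symbol. All details here are assumptions.
import numpy as np

rng = np.random.default_rng(1)
seq = "abcabcabc" * 30              # toy corpus with a predictable pattern
symbols = sorted(set(seq))
eye = np.eye(len(symbols))

H = 8                               # hidden/context units
Wxh = rng.normal(0, 0.5, (len(symbols), H))
Whh = rng.normal(0, 0.5, (H, H))    # context (previous hidden state) weights
Why = rng.normal(0, 0.5, (H, len(symbols)))
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for _ in range(20):
    h = np.zeros(H)                 # context starts empty each pass
    for t in range(len(seq) - 1):
        x = eye[symbols.index(seq[t])]
        y = eye[symbols.index(seq[t + 1])]    # target: the next symbol
        h_new = sig(x @ Wxh + h @ Whh)        # context feeds back in
        out = sig(h_new @ Why)
        d_out = (out - y) * out * (1 - out)
        d_h = (d_out @ Why.T) * h_new * (1 - h_new)
        Why -= 0.5 * np.outer(h_new, d_out)
        Wxh -= 0.5 * np.outer(x, d_h)         # one-step backprop only:
        Whh -= 0.5 * np.outer(h, d_h)         # context treated as fixed
        h = h_new

h, correct = np.zeros(H), 0
for t in range(len(seq) - 1):
    h = sig(eye[symbols.index(seq[t])] @ Wxh + h @ Whh)
    correct += symbols[int(np.argmax(h @ Why))] == seq[t + 1]
print(correct / (len(seq) - 1))     # fraction of next symbols predicted
```

The graded hidden states, not any built-in category labels, are what come to encode the distributional structure of the sequence.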
Elman (1991) Explored Long-Distance Dependencies
N-V agreement and verb successor prediction
[Figure: predicted activations for S, who, Vp, Vs, N at successive word positions, showing number agreement in the main clause]
Prediction with an embedded clause
[Figure: predicted activations for S, who, Vp, Vs, N at successive word positions in a sentence with an embedded clause]
Attractor Neural Networks
Advantages:
Discreteness as well as continuity
Captures general and specific in a single network, for semantic as well as spelling-sound regularity
General information is learned faster and is more robust to damage, capturing development and learning
Adding context would allow context to shade or select meaning
Can we do without units on the input and the output?
I think it will be crucial to do so because speech gestures are continuous. They have attractor-like characteristics but also vary continuously in many ways and as a function of a wide range of factors.
It will then be entirely up to the characteristics of the processing system to exhibit the relevant partitioning into units.
Keidel’s model that learns to translate from continuous spoken input to articulatory parameters.
The input to the model is a time series of auditory parameters from actual spoken CV syllables.
Output is the identity of the C and the V, but…
It should be possible to translate from auditory input to the continuous articulatory movements that would ‘imitate’ the input. An important future direction
Units and Rules as Emergents
In all three example models, units and rules are emergent properties that are matters of degree.
We can choose to talk about such things as though they have an independent existence for descriptive convenience but they may have no separate mechanistic role in language processing, language learning, language structure, or language change.
Although many models use ‘units’ in their inputs and outputs, the claim is that this is a simplification that actually limits what the model can explain.
Beyond the Phone and the Phoneme
Some additional problems with the notions of phonetic segment.
Model of gradual language change exhibiting pressure to be regular and to be brief.
Just a Few of the Problems with Segments in Phonology
Enumeration of segment types is fraught with problems.
No universal inventory; there are cross-language similarities of segments but every segment is different in every language (Pierrehumbert, 2001).
When we speak, the articulation of the same "segment" depends on:
Phonetic context
Word frequency and familiarity
Degree of compositionality, which in turn depends on frequency
Number of competitors
Many other aspects of context…
Presence/absence of aspects of articulation is a matter of degree: nasal 'segment', release burst, duration / degree of approximation to closure in l's, d's and t's…
Language change involves a gradual process of reduction/adjustment.
Segments disappear gradually, not discretely. What is it half way through the change?
The approach misses out on some of the global structure of spoken language that needs to be taken into account in any theory of phonology.
A model of language change that produces irregular past tenses (with Gary Lupyan)
Our initial interest focused on quasi-regular exceptions:
Items that add /d/ or /t/ and reduce the vowel: did, made, had, said, kept, heard, fled…
Items already ending in /d/ or /t/ that change (usually reduce) the vowel: hid, slid, sat, read, bled, fought…
We suggest these items reflect historical change sensitive to:
Pressure to be brief, contingent on comprehension
Consistency in mapping between sound and meaning
Two constraints on communication
The spoken form I produce is constrained:
To allow you to understand
To be as short as possible given that it is understood.
[Diagram: My Intended Meaning → Speech → Your understanding of what I said; Your Intended Meaning → Speech → My understanding of what you said]
Simplified version of this actually explored by Lupyan and McClelland (2003)
The network has a phonological word pattern and a corresponding semantic pattern for the present and past tense forms of 739 verbs.
It is trained with the phonological word form as input, and this is used to produce a semantic pattern.
The error at the output layer is back-propagated allowing a change in the connection weights.
The error is also back-propagated to the input units, and is used to adjust the phonological word pattern.
There is also a pressure on the phonological word form representation to be simpler, depending on how well the utterance was understood (summed error at the output units).
The improved phonological word form is then stored in the list.
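The loop just described can be sketched in miniature (a hypothetical toy, not the actual Lupyan & McClelland simulation: the sizes, learning rates, and the 'understood' measure are all invented assumptions): error adjusts both the weights and the phonological input itself, and a brevity pressure shrinks the stored form in proportion to how well it was understood.

```python
# Toy sketch of dual pressure on a word form: be understood, be brief.
# All dimensions, rates, and the comprehension measure are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_phon, n_sem = 12, 8
W = rng.normal(0.0, 0.3, (n_phon, n_sem))
phon = rng.uniform(0.2, 1.0, n_phon)     # stored phonological word form
sem = (rng.uniform(0, 1, n_sem) > 0.5).astype(float)   # target meaning
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
initial_length = phon.sum()

for _ in range(2000):
    out = sig(phon @ W)                  # comprehension attempt
    err = out - sem
    d = err * out * (1 - out)
    W -= 0.2 * np.outer(phon, d)         # ordinary weight learning
    phon -= 0.05 * (d @ W.T)             # error also reshapes the input form
    understood = 1.0 / (1.0 + (err ** 2).sum())
    phon -= 0.01 * understood * phon     # brevity pressure when understood
    phon = np.clip(phon, 0.0, 1.0)

print(initial_length, phon.sum())        # the stored form has shrunk
```

The equilibrium is the interesting part: units that carry the meaning resist reduction, while uninformative ones fade, mirroring the claim that regularity and role in the mapping to meaning protect inflection.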
[Diagram: what I say when I want to communicate a particular message → your understanding of what I said]
Model Details: L&M Simulation 2a
Semantic patterns
‘Quasi-componential’ representations of tense plus base word meaning are created, based on including tense information in the feature vectors passed through the encoder network.
The representation of past tense varies somewhat from word to word.
Phonological patterns have one unit per phoneme but long vowels or diphthongs have an extra unit, plus a unit for the syllabic ‘ed’. Initialized with binary values (0,1).
Although units still stand for phonemes, presence/absence is a matter of degree.
Learning rate for the representation is slow relative to learning rate for the weights.
739 monosyllabic verbs, frequency weighted. Training corpus is fully regularized at the start of the simulation.
Simulation of Reductive Irregularization Effects
In English, frequent items are less likely to be regular.
Also, d/t items are less likely to be regular.
The same effects emerge in the simulation.
While the past tense is usually one phoneme longer than the present, this is less true for the high-frequency past tense items.
Reduction of high-frequency past tenses affects a phoneme other than the word-final /d/ or /t/. Regularity, and its role in the mapping to meaning, protects the inflection.
Further Simulations
Simulation 2b showed that when irregulars were present in the training corpus, the network tended to preserve their irregularity.
In ongoing work an extended model shows a tendency to regularize low-frequency exceptions.
Simulation 2c used fully componential semantic representation of past tense, resulting in much less tendency to reduce.
Discussion and Future Directions
The work discussed here is a small example of what needs to be accomplished, even for a model of phonology.
Extending the approach to continuous speech input will be a big challenge
Extending continuous speech to full sentences as input and output will be a bigger challenge still
Neural network approaches are gaining prominence as processing power grows, and these things will be increasingly possible.
It will still be useful to notate specific linguistic units, but machines will not need them to communicate – no more than our minds need them to speak and understand.