bucld2011

Statistical word segmentation of Zipfian frequency distributions

Chigusa Kurumada Linguistics, Stanford

Stephan C. Meylan Psychology, Stanford

Michael C. Frank Psychology, Stanford


Segmentation of running speech

I l o ve y o u

Saffran, Newport et al. (1996); Saffran, Aslin et al. (1996); Jusczyk (1997); Perruchet et al. (1998); Aslin (1998); Brent (1999); Swingley (2005); Thiessen et al. (2005); Monaghan & Christiansen (2010), among others


Example

Listen to a Japanese-speaking mother’s speech and find “words”


Where is your daddy?


Words occur at different frequencies


The naturalistic word frequency distribution

Zipfian distribution

Zipf (1965)


This talk

Effects of a Zipfian distribution of word frequencies in speech segmentation

• 2 large-scale web-based segmentation experiments

The skewed distribution supports word segmentation

• Implications for existing models


A potential problem for statistical word segmentation?

Pre-tty-ba-by

TP = 0.2

(Saffran, Newport, & Aslin, 1996)

(Goldwater et al., 2009)

Uniform Zipfian

TP = 1.0
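The TP values above are syllable-to-syllable transitional probabilities. As a minimal sketch, the forward TP of a syllable pair can be estimated as count(AB) / count(A) over a syllable stream. The three-syllable words and the word order below are hypothetical and are not taken from the experiments.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Forward TP for each adjacent pair (A, B): count(A then B) / count(A)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Only syllables that have a successor can start a pair.
    first_counts = Counter(syllables[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

# Hypothetical lexicon of three words with no shared syllables.
words = {"A": ["go", "la", "bu"], "B": ["pa", "do", "ti"], "C": ["bi", "da", "ku"]}
order = "ABCACBABCBAC"  # hypothetical word order of the stream
stream = [syl for w in order for syl in words[w]]

tps = transitional_probabilities(stream)
print("within word  go -> la:", tps[("go", "la")])
print("across words bu -> pa:", tps[("bu", "pa")])
```

In a stream like this, within-word pairs have a TP of 1.0 while pairs that span a word boundary have lower TPs, which is the cue that statistical segmentation accounts rely on.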


Question 1: Is segmentation of a Zipfian language more difficult?

[Figure: 6, 12, 24, and 36 types; uniform vs. Zipfian]


Experiment 1: Task (on Mechanical Turk)

Exposure: 300 word tokens

Subjects: 246 individuals in the 8 conditions

(6, 12, 24, 36 types * uniform/zipfian)

Test: two-alternative forced-choice (2AFC) task

go-la-bu la-bu-bi
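The exposure conditions (300 word tokens spread over 6 to 36 word types, uniform or Zipfian) can be sketched as below. The slides do not say how token counts were assigned, so this sketch assumes frequency proportional to 1/rank for the Zipfian case and a simple rule for handing out the rounding remainder; `token_counts` is a hypothetical helper.

```python
def token_counts(n_types, n_tokens=300, zipfian=True):
    """Split n_tokens over n_types word types.

    Assumption: Zipfian weights are 1/rank (exponent 1); uniform weights are equal.
    """
    weights = [1 / rank if zipfian else 1.0 for rank in range(1, n_types + 1)]
    total = sum(weights)
    counts = [int(n_tokens * w / total) for w in weights]  # round down
    # Give the tokens lost to rounding to the most frequent types.
    for i in range(n_tokens - sum(counts)):
        counts[i % n_types] += 1
    return counts

for n in (6, 12, 24, 36):
    print(n, "uniform:", token_counts(n, zipfian=False)[:3], "...")
    print(n, "zipfian:", token_counts(n, zipfian=True)[:3], "...")
```

Under these assumptions the most frequent Zipfian type gets far more tokens than any uniform type, while the least frequent Zipfian types get far fewer.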


Results 1: Proportion correct in each condition

[Figure: proportion correct by number of word types (6, 12, 24, 36); panels: Uniform, Zipfian]


Results 2: Effects of the (log) input token frequency


Experiment 1: Summary

The standard 2AFC paradigm

• Robust segmentation ability

• Strong effects of unigram (log) frequencies

No effects of uniform vs. Zipfian


Which one’s Daddy? Is it Daddy? That’s Daddy. Is that Daddy too?

Segmentation from the chunk-finding perspective

Chunking (Orban et al. 2008)

Bortfeld et al. (2005)

mommy’s sock familiar new

Brent & Cartwright (1996), Brent (1999), Goldwater et al. (2009), Perruchet & Vinter (1998)

Dahan & Brent (1999), Conway et al. (2010), van de Weijer (2001), Cunillera et al. (2010), Lew-Williams et al. (2011)


Question 2

Is segmentation based on a Zipfian distribution more accurate when words are presented in context?

[Figure: 6, 9, 12, and 24 types; uniform vs. Zipfian]


Experiment 2: Task

Orthographic manual segmentation (50 sentences)

Unlike the 2AFC (go-la-bu vs. mo-go-la):

• words are presented in context
• active search for words
• time-course of learning
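The results for this task are reported as recall (% correct). The slides do not spell out the scoring procedure, so the following is only one plausible sketch: a word counts as correct when the participant's segmentation has boundaries at both of its edges and none inside it. The function names and the example strings are hypothetical.

```python
def word_spans(segmented):
    """Return (start, end) character spans of each word, ignoring the spaces."""
    spans, start = set(), 0
    for word in segmented.split():
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def token_recall(gold, response):
    """Fraction of gold words whose exact span appears in the response."""
    gold_spans = word_spans(gold)
    if not gold_spans:
        return 0.0
    return len(gold_spans & word_spans(response)) / len(gold_spans)

gold = "golabu padoti bidaku"      # hypothetical correct segmentation
response = "golabu pado tibidaku"  # hypothetical participant segmentation
print(token_recall(gold, response))
```

Here only the first word is recovered exactly, so recall for this sentence is one out of three.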


Results 1: 6 word types

[Figure: recall (% correct) across trials; lines: Uniform, Zipfian]


[Figure: recall (% correct) for 6, 9, 12, and 24 word types; lines: Uniform, Zipfian]


A mixed logit model predicting correct segmentation

LogFrequency (p<0.001)

LogFrequency (p=0.9)

LogFrequency (p<0.001)

target word, word before, word after
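As a rough sketch, a mixed logit model with the three log-frequency predictors named above could be written as follows. The random-effects structure is not given on the slide, so the by-participant intercept $u_i$ is an assumption, as are the symbol names.

```latex
\operatorname{logit} P(\text{correct}_{ij}) =
  \beta_0
  + \beta_1 \log f_{\text{target},j}
  + \beta_2 \log f_{\text{before},j}
  + \beta_3 \log f_{\text{after},j}
  + u_i,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
```

Here $i$ indexes participants, $j$ indexes word tokens in the test sentences, and $f$ is the input token frequency of the target word, the word before it, or the word after it.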


Contextual bootstrapping

The average log frequency of all the words that appeared on the left (p<0.001)

No main effect or interaction with the distribution types (i.e., uniform vs. Zipfian).

The average log frequency of all the words that appeared on the right (p<0.07)

target word
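The two context predictors above can be sketched as the mean log frequency of a target word's left and right neighbours. This is a minimal illustration under the assumption that frequencies are raw token counts over the exposure sentences; `mean_neighbor_log_freq` and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def mean_neighbor_log_freq(sentences, target):
    """Mean log token frequency of the words directly left and right of target."""
    freq = Counter(w for s in sentences for w in s)
    left, right = [], []
    for s in sentences:
        for i, w in enumerate(s):
            if w != target:
                continue
            if i > 0:
                left.append(math.log(freq[s[i - 1]]))
            if i + 1 < len(s):
                right.append(math.log(freq[s[i + 1]]))

    def mean(xs):
        return sum(xs) / len(xs) if xs else None  # None if no neighbour on that side

    return mean(left), mean(right)

# Hypothetical exposure sentences, already split into words.
sentences = [["a", "b", "c"], ["a", "b"], ["b", "a"]]
left_mean, right_mean = mean_neighbor_log_freq(sentences, "b")
print(left_mean, right_mean)
```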


Experiment 2: Summary

[Figure legend: Zipfian, uniform]

• Clear advantage of a Zipfian distribution

• The advantage is mediated by (log) token frequency


Conclusion

I l o ve y o u

The Zipfian structure of natural language supports word recognition in context


Thanks to: Stanford Language Cognition Lab, Eve Clark, Tom Wasow, Dan Jurafsky, and Noah Goodman (Stanford), T. Florian Jaeger (University of Rochester), Josh Tenenbaum (MIT)

For a full text of this paper, visit the Stanford Language Cognition Lab website: http://langcog.stanford.edu/publications.html

Thank you!


Meghan Sumner website: http://www.stanford.edu/~sumner/
