bucld2011
DESCRIPTION
Kurumada, C., Meylan, S.C., & Frank, M.C. (2011b). “Statistical word segmentation of Zipfian frequency distributions”. Paper presented at BUCLD 36, November 5th.
TRANSCRIPT
Statistical word segmentation of Zipfian frequency distributions
Chigusa Kurumada Linguistics, Stanford
Stephan C. Meylan Psychology, Stanford
Michael C. Frank Psychology, Stanford
2
Segmentation of running speech
I l o ve y o u
Saffran, Newport et al. (1996); Saffran, Aslin et al. (1996); Jusczyk (1997); Perruchet et al. (1998); Aslin (1998); Brent (1999); Swingley (2005); Thiessen et al. (2005); Monaghan & Christiansen (2010), among others
3
Example
Listen to a Japanese-speaking mother’s speech and find “words”
4
Where is your daddy?
5
Words occur at different frequencies
6
The naturalistic word frequency distribution
Zipfian distribution
Zipf (1965)
7
This talk
Effects of a Zipfian distribution of word frequencies in speech segmentation
• 2 large-scale web-based segmentation experiments
The skewed distribution supports word segmentation
• Implications for existing models
8
A potential problem for statistical word segmentation?
Pre-tty-ba-by
TP = 0.2
(Saffran, Newport, & Aslin, 1996)
(Goldwater et al., 2009)
Uniform Zipfian
TP = 1.0
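The TP values on this slide are transitional probabilities between adjacent syllables. As a minimal sketch, assuming the usual definition TP(B | A) = count(AB) / count(A), they could be computed as follows. The toy syllable stream and the function name are illustrative assumptions, not material from the talk.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Map each adjacent pair (a, b) to TP(b | a) = count(a b) / count(a)."""
    # Count every adjacent syllable pair in the stream.
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable as the first member of a pair (the last syllable has no successor).
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Hypothetical toy stream: "pretty" is always intact, but what follows it varies.
stream = "pre tty ba by pre tty do ggy pre tty ba by".split()
tps = transitional_probabilities(stream)
within_word = tps[("pre", "tty")]   # "tty" always follows "pre"
across_words = tps[("tty", "ba")]   # "ba" follows "tty" only some of the time
```

In this toy stream the within-word pair has a higher TP than the pair that spans a word boundary, which is the cue statistical segmentation accounts rely on.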
9
Question 1: Is segmentation of a Zipfian language more difficult?
Conditions: 6, 12, 24, or 36 word types; uniform or Zipfian distribution
10
Experiment 1: Task (on Mechanical Turk)
Exposure: 300 word tokens
Subjects: 246 individuals in the 8 conditions
(6, 12, 24, 36 types × uniform/Zipfian)
Test: 2-alternative forced-choice (2AFC) task
go-la-bu la-bu-bi
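The design crosses the number of word types with a uniform or Zipfian frequency distribution over a fixed 300-token exposure. A minimal sketch of how per-type token counts might be allocated is below. It assumes Zipfian frequency is proportional to 1/rank and uses largest-remainder rounding. Both choices and the function name are assumptions, since the slides do not say how the frequencies were derived.

```python
def token_counts(n_types, n_tokens, zipfian):
    """Split n_tokens across n_types, uniformly or proportionally to 1/rank (assumed Zipfian form)."""
    if zipfian:
        weights = [1.0 / rank for rank in range(1, n_types + 1)]
    else:
        weights = [1.0] * n_types
    total = sum(weights)
    # Ideal (fractional) share of tokens for each type.
    raw = [w / total * n_tokens for w in weights]
    counts = [int(x) for x in raw]
    # Largest-remainder rounding: give leftover tokens to the largest fractional parts.
    leftover = n_tokens - sum(counts)
    by_fraction = sorted(range(n_types), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts

uniform_counts = token_counts(6, 300, zipfian=False)
zipfian_counts = token_counts(6, 300, zipfian=True)
```

Under these assumptions the uniform condition gives every type the same count, while the Zipfian condition concentrates tokens on the top-ranked types and leaves the lowest-ranked types with far fewer exposures.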
11
Results 1: Proportion correct in each condition
[Figure: proportion correct by number of word types (6, 12, 24, 36), with separate Uniform and Zipfian panels]
12
Results 2: Effects of the (log) input token frequency
13
Experiment 1: Summary
The standard 2AFC paradigm
• Robust segmentation ability
• Strong effects of unigram (log) frequencies
No effects of uniform vs. Zipfian
14
Which one’s Daddy? Is it Daddy? That’s Daddy. Is that Daddy too?
Segmentation from the chunk-finding perspective
Chunking (Orban et al. 2008)
Bortfeld et al. (2005)
mommy’s sock familiar new
Brent & Cartwright (1996), Brent (1999), Goldwater et al. (2009), Perruchet & Vinter (1998)
Dahan & Brent (1999), Conway et al. (2010), van de Weijer (2001), Cunillera et al. (2010), Lew-Williams et al. (2011)
15
Question 2
Is segmentation based on a Zipfian distribution more accurate when words are presented in context?
Conditions: 6, 9, 12, or 24 word types; uniform or Zipfian distribution
16
Experiment 2: Task
Orthographic manual segmentation (50 sentences)
Unlike the 2AFC (go-la-bu vs. mo-go-la):
• words are presented in context
• active search for words
• time-course of learning
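Manual segmentation responses of this kind can be scored by recall, the share of true words whose boundaries the participant reproduced exactly. The sketch below assumes that each sentence is compared as a set of character spans. This scoring rule, the function names, and the example words are assumptions for illustration, since the slides do not spell out the scoring procedure.

```python
def word_spans(words):
    """Convert a list of words into (start, end) character spans over the joined string."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def recall(gold_words, response_words):
    """Fraction of gold words whose exact span also appears in the response."""
    gold = word_spans(gold_words)
    return len(gold & word_spans(response_words)) / len(gold)

# Hypothetical sentence: the participant wrongly splits the middle word.
gold = ["golabu", "mogola", "bibu"]
response = ["golabu", "mogo", "la", "bibu"]
score = recall(gold, response)
```

Comparing spans rather than word strings means a word only counts as found if both of its boundaries were placed correctly.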
17
Results 1: 6 word types
[Figure: recall (% correct) over trials, Uniform vs. Zipfian]
18
[Figure: recall (% correct), Uniform vs. Zipfian, in four panels: 6, 9, 12, and 24 word types]
19
A mixed logit model predicting correct segmentation
Log frequency (p < 0.001)
Log frequency (p = 0.9)
Log frequency (p < 0.001)
target word, word before, word after
20
Contextual bootstrapping
The average log frequency of all the words that appeared on the left (p<0.001)
No main effect or interaction with the distribution types (i.e., uniform vs. Zipfian).
The average log frequency of all the words that appeared on the right (p<0.07)
target word
21
Experiment 2: Summary
[Figure legend: Zipfian, uniform]
• Clear advantage of a Zipfian distribution
• The advantage is mediated by (log) token frequency
22
Conclusion
I l o ve y o u
The Zipfian structure of natural language supports word recognition in context
23
Thanks to: Stanford Language Cognition Lab, Eve Clark, Tom Wasow, Dan Jurafsky, and Noah Goodman (Stanford), T. Florian Jaeger (University of Rochester), Josh Tenenbaum (MIT)
For a full text of this paper, visit the Stanford Language Cognition Lab website: http://langcog.stanford.edu/publications.html
Thank you!
24
Meghan Sumner website: http://www.stanford.edu/~sumner/