linguistic structure in aggregate variationnerbonne/teach/rema-stats-meth-seminar/presentati… ·...
TRANSCRIPT
Linguistic Structure in AggregateVariation
John NerbonneRijksuniversiteit Groningen
Summer, 2005
Aggregation in Variation
Thesis: Language variation must be studied in the aggregate.
• Detailed studies of single features ([aI] vs. [a], [æ] vs. [æ@])are at best inconclusive, at worst misleading.
• Bloomfield (1933) noted how confusing details are; Coseriu(11956,1975) warned against “atomism” in dialectology.
• But question: is the aggregate linguistically structured?
We focus here on the question of linguistic structure.
1
Outline
• Question
• Aggregating Technique
• Experiment on Southern Vowels in LAMSAS
• Results
• Reflections
2
Question
Aggregate pronunciation distance:
• Is reliable, given > 20 pronunciations/site (Cronbach α > 0.8)
• Correlates with naive speakers’ judgements (r ≈ 0.65)Gooskens & Heeringa (2003), Heeringa (2004: Chap. 7)
• Is predictable from geography (Heeringa & Nerbonne, 2001)
• Provides analytic foundation for dialect continua as organizingprinciple
But there’s little assumption of linguistic structure in this work.
Question: What linguistic elements determine aggregatepronunciation distance (if any)?
3
Factor Analysis
• Extract from correlation matrix those elements which reliablycorrelate
• Used in social science research to find common (underlying),e.g., in questionnaires
– Check reactions to local dialect vs. standard– Status factor: intelligence, education, knowledgeable– Sympathy factor: honest, sympathetic, unpretentious
• Leading idea: examine correlations among linguisticvariables, extract commonalities
4
Material
• Separate LAMSAS material into roughly 200 vowelpronunciations
– first vowel in <Alabama>, last vowel in <good morning>
• For each vowel, for each pair of sites, measure distance invowel pronunciation
– use LAMSAS feature chart as basis for distance
• Given that factor analysis will identify vowel occurences thatfunction similarly (in distinguishing sites), the linguistichypothesis is that these will reflect linguistic structure(phonemic identities, phonological processes).
5
Sites Grouped to Complete Matrices
999
1717
171717
1729
292929
17171744
4
44
44426
264
13
2626
26
1313
131313
13131313 4
6136
666 29
292923
232323 1616
242424
24
16
2424
2424
3 333
16161616
16
279927
2727
2714
1427
99
17272727
9
99
32
321414
14
25
14
5514
525 5
5555
1111
3 33
52525252525
2511
11 11 111121
3232
1919 1915 19
15 19 19 19
9
329
328
3232
328
8 32
15 1515
1526 15
2
26
10
26
18 1818
21212111 21
2121
3021 21
2130
1212
3030
77
121212
77
7
77
77
7
1119
3030
227
7
19
22
19
22222
2
22
222
2020
2
2020
2020
20
313131 31
10
1010
1010
10
28
28
28
2828
281
11
6
Site Matrices
Per vowel we obtain a distance matrix (site × site):
Wheeling Winston Raleigh Richmond CharlotteWheeling 0 41 44 45 46Winston 41 0 16 34 36Raleigh 44 16 0 37 38Richmond 45 34 37 0 20Charlotte 46 36 38 20 0
We then derive for each pair of vowels, the correlationcoefficient, i.e., the degree to which they indicate the samedistance between sites.
7
Vowel Matrix
Per vowel-pair we obtain correlation coefficient (vowel × vowel)correlations:
morning1 Tuesday2 pallet2 thunderstorm2 first1morning1 1 0.02 −0.01 0.73 0.056
Tuesday2 1 0.23 −0.03 0.02
pallet2 1 0.006 0.09
thunderstorm2 1 0.043
first1 1
This CORRELATION MATRIX is analysed for COMMON FACTORS.
We used varimax as an estimation procedure (in R): onlyorthogonal, no oblique rotations.
Condition: KCM/Bartlett’s test of sphericity (variables aresufficiently distinct): p < 0.001
8
Loadings
1 3 5 7 9 11 13 15 17 19
Factor Analysis
Factor
SS
load
ings
010
2030
9
Importance of Factors
1 3 5 7 9 11 13 15 17 19
Factor Analysis
Factor
Pro
port
ion
Var
0.00
0.05
0.10
0.15
10
Extreme Factor Loadings
1
2 3Arkansas 1Georgia 1New York 2forty 1wardrobe 1good morning 2thunderstorm 3Saturday 2
weatherboarding 2weatherboarding 3
Virginia 1hundred 2
thunderstorm 2
year ago 2
yesterday 2
Sunday before last 4
froze over 3
twice as good 2
Maryland 1
January 3
South Carolina 2
St. Louis 2
Cincinnati 2
February 3
Baltimore 2
Tennessee 2Tuesday 1Florida 1
Missouri 2closet 2
Massachusetts 4
what tim
e is it 4hundred 1kitchen 2pallet 2
white ashes 3N
ew Y
ork 1C
hicago 1draining 2
Sunday before last 2
Sunday week 2
foggy 2Thursday 2
Wednesday 2
Tuesday 2
Saturday 3
forty 2
thirty 2
Georgia 2
yest
erda
y 2
Ark
ansa
s 1
hund
red
2
Geo
rgia
1
thun
ders
torm
3
weath
erbo
ardi
ng 2
Satur
day
2
good
mor
ning
2
New Y
ork 2
Virgini
a 1forty 1
thunderstorm
2
wardrobe 1
year ago 2
weatherboarding 3
Sunday before last 4
New
Eng
land
1N
ew Y
ork
1T
uesd
ay 1
Flo
rida
1S
t. Lo
uis
2hu
ndre
d 1
Ten
ness
ee 2
Cin
cinn
ati 2
Bal
timor
e 2
Mas
sach
uset
ts 4
clos
et 2
kitc
hen
2M
isso
uri 2
palle
t 2C
hica
go 1
Sout
h Car
olin
a 2
white
ash
es 3
what t
ime
is it 4
Febru
ary 3
drain
ing 2
January
3
good morning 1
good day 1
Sunday before last 2Sunday week 2Tuesday 2Wednesday 2Thursday 2Saturday 3thirty 2forty 2Georgia 2
foggy 2
11
Extremes on Factors 3 & 2
0.0 0.2 0.4 0.6 0.8
0.0
0.2
0.4
0.6
Factor 3
Fac
tor
2
Alabama4
Cincinnati2February2
February3
Georgia2
January3Massachusetts1
Missouri1
Russia2
South_Carolina2
Tennessee3
Washington1
a_little_ways3
backlog2
bottom1
canal2
chair1
chimney2
clearing_up1
driven1
dry_spell1
dry_spell2
eight1
first1
frost1
good_morning3
half_past_seven3
hearth1
night1
one1
rising2
rose1
seventy1
sixth1
steady_drizzle2 steady_drizzle3
sundown2
thirty1
thunderstorm1
twenty−seven2
twenty1
twenty2
weatherboarding4
white_ashes1
yesterday3
12
Extremes on Factors 3 & 1
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
Factor 3
Fac
tor
1
April2
Arkansas2
Baltimore1
Baltimore2 Chicago1Cincinnati2
February2
February3
Florida2
Florida3
January3
Kentucky2
Kentucky3
Massachusetts1
Missouri1
North_Carolina2
Russia1
Russia2
Sunday_before_last1
Sunday_before_last2
Sunday_week1
Sunday_week2
Tennessee3
Washington2
Wednesday2
bottom1
chair1
chimney2
closet2
draining2
dry_spell1
eight1
first1
five1
frost1
good_day1
good_morning1
good_morning3 gully1
gully2
he_died_with2
hearth1
hog_pen1
kitchen2
my_wife1
pallet2
rising2
roof1seventy2
soot1
steady_drizzle1
steady_drizzle3
sundown1
sunup1
sunup2
thirty1
thunderstorm1twenty1 twenty2
twice_as_good3
umbrella1
umbrella3
what_time_is_it4
white_ashes1
white_ashes3
13
Extremes on Factors 3 & 2
0.0 0.2 0.4 0.6
0.0
0.2
0.4
0.6
0.8
Factor 2
Fac
tor
3
Alabama4
Cincinnati2
February2
February3
Georgia2
January3
Massachusetts1
Missouri1
Russia2
Tennessee3
Washington1a_little_ways3
backlog2bottom1
canal2
chair1
chimney2
driven1dry_spell1
dry_spell2
eight1first1
frost1
half_past_seven3
hearth1
night1
one1
rising2
rose1
seventy1
sixth1
steady_drizzle2
steady_drizzle3
sundown2
thirty1
thunderstorm1
twenty1
twenty2
weatherboarding4
white_ashes1
yesterday3
14
Extremes on Factors 3 & 2
0.0 0.2 0.4 0.6
0.0
0.2
0.4
0.6
0.8
Factor 2
Fac
tor
3
Alabama4
Cincinnati2
February2
February3
Georgia2
January3
Massachusetts1
Missouri1
Russia2
Tennessee3
Washington1a_little_ways3
backlog2bottom1
canal2
chair1
chimney2
driven1dry_spell1
dry_spell2
eight1first1
frost1
half_past_seven3
hearth1
night1
one1
rising2
rose1
seventy1
sixth1
steady_drizzle2
steady_drizzle3
sundown2
thirty1
thunderstorm1
twenty1
twenty2
weatherboarding4
white_ashes1
yesterday3
15
Factor 1 Loadings
closet2 0.884 kitchen2 0.880pallet2 0.874 white ashes3 0.869Tennessee2 0.856 Cincinnati2 0.851Baltimore2 0.844 Massachusetts4 0.830Chicago1 0.816 draining2 0.812
[@] vs. [1]
Florida1 0.842 [O] vs. [A] St. Louis2 0.821 [u] vs. [0]hog pen1 0.585 [O] vs. [A] Tuesday1 0.796 [u] vs. [0]
Missouri2 0.857 [0@] vs. [0@ffl]
16
Factor 1: Geography
17
Phonological Alternations Factor 1
[@] vs [1] [O] vs [A] [u] vs. [0]
Conclusion
• The first factor is sensitive to phonological alternations alongthe North-South division
18
Factor 2 Loadings
weatherboarding2 0.936 Saturday2 0.926Virginia1 0.905
[Vr] vs. V] (including [Ä] vs. [@])
good morning2 0.929 New York2 0.922forty1 0.906 thunderstorm3 0.893
[O@] vs. [Oˇ @]
19
Factor 2: Geography
20
Phonological Alternations Factor 2
[Ä] vs. [@] [O@] vs. [Oˇ @]
• The second factor is sensitive to alternations distinguishingthe Piedmont area, especially the absence of syllable final [r].
• Does [r]-lessness promote the lowering of [O]?
21
Factor 3 Loadings
Wednesday2 0.967 Saturday3 0.961thirty2 0.928 foggy2 0.854
[1ˆ ] vs. [1]
Georgia2 0.876 Tennessee1 0.766sofa2 0.760 good day1 0.775Russia2 0.751 good morning1 0.738
[@] vs. [1] (!)[E] vs. [Eˆ ][Ú] vs. [Úffl ]
22
Factor 3: Geography
23
Phonological Alternations Factor 3
[1ˆ ] vs. [1] [E] vs. [Eˆ ] [Ú] vs. [Úffl ]
• Only the [1ˆ ] vs. [1] distinction seems to pick out West Virginiaas opposed to Virginia, North Carolina, Maryland, andDelaware.
24
Noncontrasting Vowels (in Factor Analysis)
he died with1 April2 seven2 kitchen1 Chicago3he died with3 France1 twelve1 January2 Louisiana3New England2 Missouri3 bureau1 St. Louis1 February1Sunday week3 attic2 ten1 second2 all at once1half past seven1 backlog1 bottom2 froze over1 Alabama2what time is it1 chimney1 driven1 dry spell1 dry spell2New Orleans2 fourteen2 broom1 froze over2 Tennessee3half past seven2 eleven2 mantel1 hog pen2 Charleston2Sunday before last5 my wife2 night1 northeast2 northwest2steady drizzle1 quilt1 rose1 second1 a little ways2twenty-seven1 seventy1 sofa1 tomorrow1 Washington3twenty-seven2 three1 pallet1 January1 Baltimore1twenty-seven3 thirteen2 twenty1 wardrobe2 bureau2white ashes2
25
Tentative Conclusions
• Linguistic structure is exploited in dialectal distinctions. Forexample, phonemic distinctions are consistent across lexicalitems.
• Factor analysis effectively identifies linguistic structure inmass comparision
• The technique is enabled by the numeric measure of distancebetween segments.
• Total explained variance is low, only 36% in the first threefactors. Data is noisy.
• Some factors link non-trivial linguistic variations, e.g., [@] vs.[1] on the one hand with [E] vs. [Eˆ ] on the other
26
Future Work
• Identifying which variations to focus on (e.g., [@] vs. [1]) wrt agiven factor is subjective. Can we systematize this?
• Can this technique suggest deeper linguistic relationships,e.g., different concrete alternations that are loaded for thesame factor?
• Are there more general, e.g., data-mining techniques, thatcould be used to probe in data for which no numericalmeasure of difference has been established?
27