SIGN AND SIGNAL
DERIVING LINGUISTIC GENERALIZATIONS FROM
INFORMATION UTILITY
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF LINGUISTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Uriel Cohen Priva
August 2012
http://creativecommons.org/licenses/by-nc-nd/3.0/us/
This dissertation is online at: http://purl.stanford.edu/wg646gh4444
© 2012 by Uriel Cohen Priva. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Daniel Jurafsky, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Arto Anttila
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Paul Kiparsky
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Christopher Manning
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Meghan Sumner
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Why do languages have such different phonological processes even though all speakers
share the same cognitive, articulatory and perceptual constraints? American English
preserves sounds such as /p/ and /g/ even though they are absent from the sound
systems of many of the world’s languages, but reduces sounds such as /t/ even though
it is one of the most frequently used sounds cross-linguistically. In contrast, Romance
languages reduce /s/, which American English preserves. What makes American
English have this particular set of phonological processes and not processes that
affect other languages?
I show that by assuming that speakers attempt to maximize the amount of infor-
mation they transmit while minimizing the amount of effort required to transmit that
information, it is possible to determine which sounds are more likely to be affected
by reduction processes and which sounds are more likely to be preserved in each lan-
guage. Unlike cognitive, perceptual and articulatory constraints, which are the same
for speakers of all languages, the amount of information languages assign to linguistic
elements, such as individual sounds, varies markedly. The more information a sound
carries, the more effort speakers are willing to expend to transmit it faithfully to
listeners. The trade-off between maximizing information and minimizing effort forms
the basis for a new framework I call MULE (Most information Utility, Least Effort).
MULE predicts preservation and reduction patterns in English and Arabic at the
levels of performance, competence and change, thereby providing a partial answer to
the actuation problem (Weinreich et al. 1968). MULE also predicts cross-linguistic
generalizations. I show that in American English, Egyptian Arabic and Spanish,
highly informative sounds are more likely to benefit from the perceptual prominence
of the onsets of stressed syllables. Similarly, the balance between effort and information
successfully predicts cross-linguistic asymmetries between the frequencies of less
effortful sounds and more effortful sounds. As such, MULE enhances the explanatory
power of linguistic theory, and provides a disciplined way to integrate phonetics and
information theoretic considerations.
To my parents
Acknowledgments
There are no single authors in academic research. New research builds on previous
work and on the authors’ interaction with others: those who taught them how to do
research, and if they are lucky, those who taught them how to become better people.
For the past few years I was fortunate to be a graduate student at Stanford Linguistics,
the most collaborative, inspiring, and nurturing academic environment I know.
I owe many thanks to my professors, colleagues, administrative staff and friends at
Stanford for helping me overcome the perils of being an expatriate in an American
graduate school.
I am grateful to my committee for their intellectual support and patience. Each
of them has had an important role in my development as a graduate student. One
of the greatest benefits of going to Stanford was a chance to work with my adviser,
Dan Jurafsky. Dan’s curiosity, intellect and inherent dislike for pretheoretical consid-
erations are responsible for the most rapid exchanges of ideas that I have ever had.
Dan’s trust in me and his willingness to support me and to keep me on the right track
never ceased to astonish me. My graduate school experience and my work would have
suffered greatly if it were not for Dan’s intellectual influence and personal example.
Paul Kiparsky, a true Renaissance man, introduced me to many of the theoretical
concepts and challenges that have ultimately shaped my work. Our long conversa-
tions about linguistics, politics and life’s mysteries never took a predictable course,
and every turn revealed new insights, problems and solutions. Arto Anttila’s precise
approach to research encouraged me to better ground my arguments. Arto read the
greatest number of versions of any of my manuscripts, and I can see the effect of
his detailed feedback in almost every page. Chris Manning’s ability to understand
immediately any topic I wished to discuss and to provide me with useful input and
criticism is admirable. Even more admirable is his ability to teach me not only how
to argue for my ideas, but more importantly how not to argue for them. Finally,
Meghan Sumner showed me the beauty of phonetics and is responsible for my desire
to ground phonological arguments in phonetics. Meghan sets an example at both
the academic and the personal level. In doing so she inspires others to follow. Thank
you!
One of the things I am most grateful for in my experience as a graduate student is
the sense of having an adoptive family, of a home away from home. While many took
part in making the department warm and welcoming, a few deserve special thanks.
Beth Levin’s door was always open for students to wander in, and I often sought her
counsel. Beth would listen, advise and help in whichever way she could. The fact
that she did all that and at the same time made sure I met every necessary deadline
is remarkable. Penny Eckert listened, encouraged, and offered advice. I made many
more random trips to the departmental kitchen in hope of catching a word with her.
Ivan Sag made sure I remained inspired and provided humor-infused theoretical and
personal insights. Vera Gribanova made post-graduation work seem reachable, and
provided more support than I dared to wish for. I remember and appreciate your
support.
I thank my many friends at Stanford for helping me survive graduate school with
a smile on my face. I had a wonderful and supportive cohort. Matthew Adams has
been my fellow phonologist, counselor and friend from my first weeks at Stanford,
and shared with me the ups and downs of graduate school. Roey Gafter miraculously
managed to know how I felt and what I was about to do well before I did. I consider
myself fortunate to have you two as my friends. Inbal Arnon, Elisabeth Norcliffe
and Hal Tily successfully lured me into San Francisco to prove that there’s life that
does not involve Stanford even if it does involve linguistics (and the sounds of a
band playing at Revolution). I would not have known Laura Smith and Fabian
Goppelsroeder had I not gone to Stanford, and that would have been a terrible thing
indeed. To my Israeli friends, thank you for making sure you were just a phone call
away.
Ariela Raviv has unwittingly undertaken the task of standing for sanity and com-
mon sense in my life through graduate school, a task she managed with ease, equipped
with wickedly sharp humor, a down-to-earth approach and very little patience for
delusions of any sort. She proposed once that by doing so she put me through med
school even though I am not a real doctor. She has a point there, and there is a few
years’ worth of chat history to prove that. For all of her support and for her stout
belief in me, I am grateful.
Finally, I dedicate this thesis to my parents, who wanted me to go abroad and
pursue what was good for me, even when it was clearly hard for them that I was away.
I am grateful for their encouragement to keep challenging myself, and for always being
happier with my accomplishments than anyone else.
Contents
Abstract
Acknowledgments
1 Introduction
1.1 Explaining the phonology of a language
1.2 Language specific patterns
1.3 Predicting language-specific patterns
1.4 Balancing effort and information: MULE
1.5 Sign and signal
1.6 Cross-linguistic generalizations
1.7 The explanatory power of MULE
2 Information content affects performance
2.1 Introduction
2.2 Stop deletion and duration paradox
2.3 Previous accounts
2.3.1 Phonetic accounts
2.3.2 Markedness accounts
2.3.3 Frequency accounts
2.3.4 Local predictability accounts
2.4 Informativity
2.5 Segment duration and deletion studies
2.5.1 Studies overview
2.5.2 Intervocalic consonant duration
2.5.3 Intervocalic segment deletion
2.5.4 Postvocalic segment duration
2.5.5 Postvocalic segment deletion
2.6 Conclusion
2.7 Segment Alignment Process
2.8 Calculating information theoretic measurements
2.9 Deletion and duration models
2.9.1 Intervocalic segment duration model
2.9.2 Intervocalic segment deletion model
2.9.3 Postvocalic segment duration model
2.9.4 Postvocalic segment deletion model
3 Faithfulness as Information Utility
3.1 Introduction
3.2 Parallel weakening processes
3.2.1 Weakening patterns are not arbitrary
3.2.2 Same language, same segments, multiple processes
3.2.3 Same language, different dialects, similar processes
3.2.4 The challenge of explaining parallel weakening
3.3 The proposed account
3.3.1 Outline – replacing markedness hierarchies
3.3.2 Using information utility and effort to predict parallel weakening
3.4 Implementation
3.4.1 Implementation overview
3.4.2 Measuring information utility
3.4.3 Integrating effort into MULE
3.5 MULE in OT
3.5.1 Effort and information utility in OT
3.5.2 Binary comparisons in OT
3.5.3 Real-valued comparisons in OT
3.6 The necessity of multiple scales and language-specificity
3.6.1 The difference between MULE and current theories
3.6.2 Parallel weakening in standard OT
3.6.3 No single scale can replace markedness
3.6.4 Universal scales do not suffice
3.7 Information-theoretic accounts
3.7.1 Information-theoretic explanations
3.7.2 Functional load as entropy
3.7.3 Frequency
3.7.4 Predictability
3.7.5 Informativity accounts
3.7.6 Why information-theoretic accounts do not suffice
3.8 Variable deletion rates of stems and affixes
3.8.1 The contrast between American English and Puerto Rican Spanish
3.8.2 The information of English verbal -ed morpheme
3.8.3 The information of Spanish plural -s morpheme
3.8.4 MULE’s predictions are measurable
3.9 Conclusion
4 Lexicon, usage and information
4.1 Introduction
4.2 Methodology and sources of data
4.2.1 The choice of test cases
4.2.2 Statistical models
4.3 Studies
4.3.1 American English
4.3.2 Spanish
4.3.3 Egyptian Arabic
4.4 General discussion
4.5 Conclusion
5 Predicting segment distribution universals
5.1 Introduction
5.2 Consistent asymmetries between complex and simple segments
5.3 Solution: maximizing information per effort
5.4 Survey 1: effort or complexity?
5.5 Survey 2: voiceless stops
5.6 Markedness as effort
5.6.1 Missing pieces in the puzzle
5.6.2 Sanskrit stop frequencies
5.6.3 Indonesian stop frequencies
5.7 Conclusion
6 Conclusions
List of Tables
2.1 Buckeye word-medial stop relative durations
2.2 Buckeye word-medial stop deletion probabilities
2.3 Buckeye word-medial stop duration, deletion and probability
2.4 Buckeye word-medial stop duration, deletion and informativity
2.5 Segment properties
2.6 Environment properties
2.7 Variables of interest
2.8 Dictionary and surface alignment penalties
2.9 Buckeye to CMU valid substitution
3.1 Egyptian Arabic Informativity-based information estimates
3.2 English Informativity-based information estimate
3.3 Spanish Informativity-based information estimate
3.4 Post-vocalic pre-consonantal word-final obstruent deletion controls
3.5 Post-vocalic pre-consonantal word-final obstruent deletion fixed effects
3.6 Functional load of English with different final consonant deletion, scaled by a factor of 100
4.1 Segment phonological properties
4.2 Sample American English stress data
4.3 American English pure phonology model
4.4 American English information model
4.5 Sample Spanish stress data
4.6 Spanish pure phonology model
4.7 Spanish information model
4.8 Sample Egyptian Arabic stress data
4.9 Egyptian Arabic pure phonology model
4.10 Egyptian Arabic information model
5.1 Voiceless and voiced segment probabilities
List of Figures
5.1 Distance of total orders from /t/>/k/>/p/
Chapter 1
Introduction
1.1 Explaining the phonology of a language
One of the goals of linguistic theory is to explain which languages and linguistic
processes are possible, and which languages and linguistic processes cannot exist. The
same goal applies in the context of phonology. Phonology should be able to explain
the existence of observed phonological processes and be able to exclude phonological
processes that do not exist. Consider the case of word-final deletion. Puerto Rican
Spanish variably deletes word-final /s/ (Poplack, 1980). American English does not
delete word-final /s/ but variably deletes word-final /t/ (Guy, 1991, among others). In
contrast, cases of word-final consonant epenthesis are exceptionally rare. Phonology
as a theory should define the set of possible languages such that it includes languages
that delete word-final /s/, languages that do not delete word-final /s/ (but possibly
delete word-final /t/), and excludes languages that epenthesize consonants word-
finally.
But describing the set of possible languages does not suffice. Phonology should
have another goal as well: to be able to explain the correspondence between the
phonology of a language and other properties of the language such as its lexicon
and its usage patterns. The first goal requires phonological theory to describe the
phonology of a language that does not delete word-final /s/ but does delete word-
final /t/. The second goal requires phonological theory to explain why American
English has that particular phonology, rather than a phonology in which word-final
/t/ is preserved and word-final /s/ is deleted. I label the second goal of phonological
theory the correspondence problem. The correspondence problem takes many forms
and applies at all levels of phonological representation. At the level of linguistic
performance, answering the correspondence problem can take the form of explaining
the different deletion rates of segments in a language. At the level of competence and
change, the goal of the correspondence problem overlaps in part with the actuation
problem in Weinreich et al. (1968): what causes some language at some point in time
to undergo a particular change process. The focus of the correspondence problem is
not timing, however, but rather the set of processes that are likely to affect some
languages but not others.
1.2 Language specific patterns
Chapters 2 and 3 deal with several cases that demonstrate that the phonology of a
language is not random, but corresponds to other properties of the language. Chapter
2 explains what leads the consonants of American English to have the duration and
deletion rates that they have. I show that the durations and deletion rates of American
English consonants do not follow from the phonetic properties of its segments. /g/
has the longest duration of all voiced stops, even though Ohala (1981) argues that
maintaining voicing is more difficult for dorsals. Similarly, Ohala claims that /p/ is
the least audible voiceless stop. Less audible stops are more likely not to be perceived
by listeners, which may lead to increased deletion rates, yet in American English /p/ is
the least likely to delete of all voiceless stops. The phonetic markedness of /g/ and /p/
is supported by cross-linguistic evidence. /g/ is absent from more sound inventories
than any other voiced stop, and /p/ is absent from more sound inventories than any
other voiceless stop (Sherman, 1975). Therefore, the long duration of /g/ and the
low likelihood of deleting /p/ do not follow from universal tendencies, but rather from
the specific properties of American English. What causes American English to have
such unusual patterns?
Chapter 3 explains why languages can be affected by multiple parallel weakening
processes that target specific segments but not other segments. Many varieties of
English have some form of /t/-weakening in intervocalic and word-final positions. In
Irish English alone, different varieties spirantize, tap or debuccalize intervocalic /t/
(1.1, data from Raymond 2004).
(1.1) Variety              ‘butter’
      Northern varieties   [bʌṱəɹ]
      Southern varieties   [bʌɾəɻ]
      Vernacular Dublin    [bʊʔɐ]
Tapping varieties are incompatible with debuccalizing varieties. Tapping changes the
manner of articulation of /t/ but preserves its place of articulation. Debuccalization
loses the place of articulation of /t/ but preserves its manner of articulation. The
common ancestor of tapping and debuccalizing varieties is therefore a variety in which
/t/ is a coronal stop, albeit a weakening-prone coronal stop. What makes /t/ prone
to undergo weakening in English and not in other languages?
It is not possible to argue that English /t/ is prone to weaken because it is a /t/.
Cross-linguistically, /t/-weakening processes are not very common, as the language
surveys in Kirchner (1998) and Gurevich (2004) show. Moreover, other languages
have a similar plethora of processes that target a specific segment, except that the
segment is not /t/. Arabic, for instance, has multiple /q/-weakening processes (1.2, data from
Kaye and Rosenhouse 1997), but does not similarly weaken /t/.
(1.2) Dialect     baqara ‘cow’ (MSA)
      Druze       [baqara]
      Nazareth    [bakara]
      Jerusalem   [baʔara]
      NW Jordan   [bagara]
What causes English to have many /t/-weakening processes, and Arabic to have many
/q/-weakening processes? Even if phonology can easily describe all the processes
listed here, it should also be able to explain why English is the target of several
/t/-weakening processes, and Arabic of several /q/-weakening processes.
1.3 Predicting language-specific patterns
What are the factors that cause some language to have a particular set of phonological
processes? Linguistic theory focuses on the tension between universal constraints
and their language-specific interactions. Universal constraints apply to all languages
equally, but in each language they interact in a different way. Different interactions
lead to differences between languages. Optimality Theory (Prince and Smolensky,
1993) is a typical example of that approach; constraints are taken to be universal, and
their relative ranking is language-specific. There is a strong motivation to consider
all constraints universal. The articulatory, perceptual and psychological abilities and
limitations involved in communication are common to the speakers of every human
language. Solutions to the correspondence problem therefore cannot follow from the
introduction of language-specific constraints. Instead, such solutions should constrain
the possible rankings of universal constraints, such that the ranking that yields a
specific outcome is more likely to be found in some languages but not in others. The
question is therefore how to motivate different rankings in different languages.
Phonological markedness is often motivated by the phonetic properties of human
language such as the articulation and perception of sounds (Kirchner, 1998; Flem-
ming, 2004; Steriade, 2008, among many others). Some segments are considered more
effortful to produce than others, and some distinctions are more difficult to perceive
in certain phonetic environments. Phonetic constraints cannot be language-specific,
as they apply to every language in the exact same way. If it is more difficult to
tell whether pre-consonantal and word-final obstruents are voiced or not, it would be
equally difficult to do so in every language with a similar inventory of sounds. Thus,
phonetics-based reasoning plays an important role in defining the set of constraints
that universally forbid the existence of certain languages and processes.
Phonological theory posits another type of markedness, namely marked faithful-
ness (Kiparsky, 1994; de Lacy, 2002, among others). Marked faithfulness signifies the
greater pressure to preserve marked elements. Marked faithfulness can be used to
explain the case of word-final /t/-deletion in American English, from which the more
marked /k/ and /p/ are exempt. In such an analysis, /k/ and /p/ can resist the
weakening process that affects /t/ because the language ranks the pressure to preserve
them higher than it ranks the pressure to preserve /t/. Marked faithfulness is difficult
to motivate on phonetic grounds. If some segment is more difficult to articulate or
perceive, what phonetic principle would encourage speakers to preserve it while not
preserving less effortful segments?
1.4 Balancing effort and information: MULE
I propose that speakers attempt to maximize the amount of information they transmit
(Most information Utility) and minimize the amount of articulatory and perceptual
effort communication requires (Least Effort), or MULE. If the goal of communication
is to transmit information between speakers and listeners, speakers should be willing
to put in more effort in order to transmit more information. In MULE, marked
faithfulness is motivated by the preservation of information in human language, and
markedness is motivated by the reduction of perceptual and articulatory effort.
One theoretical benefit of using information as the basis for marked faithfulness is
that unlike the articulatory and perceptual properties of speech, different languages
assign varying amounts of information to different linguistic elements. Therefore, the
pressure to preserve linguistic elements is not the same cross-linguistically.
1.5 Sign and signal
What role does information play in human language? The reduction of frequent
words was observed by Sibawayhi, an Arabic grammarian of the 8th century
(Al-Nassir, 1993; Carter, 2004). Zipf (1929) expected frequently used linguistic elements
to undergo reduction, which increases the ease of articulation. Zipf (1935)
described over-frequent sounds in a hypothetical language in terms of lack of informa-
tion: “completely unessential to a perfectly adequate conveyance of any word.” Zipf’s
latter description is very close to the terms that an account based on information the-
ory (Shannon, 1948) would use. In information theory, the amount of information
that an event carries is the negative log probability of observing the event. Everything
else being equal, frequent linguistic elements carry less information than infrequent
ones, and can therefore be reduced.
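
The definition above can be sketched in a few lines of Python. This is only an illustration: the counts are invented rather than taken from any corpus, the function name surprisal_bits is hypothetical, and the choice of base 2 (information measured in bits) is an assumption, since the text does not fix the base of the logarithm.

```python
import math

def surprisal_bits(count, total):
    """Information, in bits, of an event observed `count` times in `total` observations."""
    if not 0 < count <= total:
        raise ValueError("count must be positive and no larger than total")
    # Negative log probability, with probability estimated as relative frequency.
    return -math.log2(count / total)

# Hypothetical counts, invented for illustration; not corpus data.
frequent_element = surprisal_bits(500, 1000)  # relative frequency 0.5
rare_element = surprisal_bits(1, 1000)        # relative frequency 0.001
```

Under these made-up counts the frequent element carries far less information than the rare one, which is the sense in which frequent elements are candidates for reduction.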
Information theory allows Zipf’s predictions to be extended from frequency to
predictability. Some linguistic elements can be infrequent, yet predictable in context.
The English word abode is less frequent than the word mansion, yet in the context
of my humble —, abode is more predictable than mansion, and therefore carries less
information. If speakers are careless in their pronunciation of abode in this context,
listeners can still recover the speakers’ intention more easily than they could for a
less predictable word. Recent research has linked
predictability in context to phonetic and syntactic reduction (Jurafsky et al., 2001;
van Son and Pols, 2003; Aylett and Turk, 2004; Pluymaekers et al., 2005; Levy and
Jaeger, 2007; Jaeger, 2010, among many others). These studies show that linguistic
elements that are predictable in context are more likely to be omitted or otherwise
reduced.
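
The contrast between frequency and predictability in context can be made concrete with a small Python sketch. All counts below are invented for illustration and are not corpus data; the names unigram_info and contextual_info are hypothetical, and probabilities are estimated as simple relative frequencies, which is an assumption of the sketch rather than the method of the studies cited above.

```python
import math

# Hypothetical counts, invented for illustration; not corpus data.
unigram_counts = {"abode": 10, "mansion": 50}
corpus_size = 1_000_000
# Hypothetical counts of words following the context "my humble".
after_my_humble = {"abode": 90, "mansion": 1, "opinion": 9}

def bits(probability):
    """Negative log probability in bits (base 2 is an assumption)."""
    return -math.log2(probability)

def unigram_info(word):
    """Information of a word estimated from its overall frequency."""
    return bits(unigram_counts[word] / corpus_size)

def contextual_info(word):
    """Information of a word given the context 'my humble'."""
    total = sum(after_my_humble.values())
    return bits(after_my_humble[word] / total)
```

With these counts, abode carries more information than mansion when only overall frequency is considered, but less information than mansion once the context is taken into account.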
There are many possible reasons that could lead linguistic elements with high
frequency or high predictability to be reduced or omitted. Some of the reasons are
speaker-internal and may or may not affect communication. Other reasons follow from
communication principles. It is easier to recover frequent and predictable words,
which may lead speakers to be lax in their pronunciation (Aylett and Turk, 2004,
among others). In this view frequent and highly predictable linguistic elements are
redundant, and their reduction makes language more efficient. A related view assumes
that speakers attempt to transmit as much information as language allows without
causing miscommunication (Levy and Jaeger, 2007; Jaeger, 2010). In this view trans-
mitting too much information or too little information is to be avoided, leading to the
omission and reduction of frequent and predictable linguistic elements, but possibly
also to the temporal elongation of unpredictable and infrequent linguistic elements.
MULE builds on research that regards the amount of information linguistic elements
hold as an important factor in human communication, but differs from it in two important
ways. First, there is ample research in phonology and phonetics that shows
that phonological markedness is related to articulatory and perceptual effort. For in-
stance, vowel inventories show a balance between effort and low confusability (Flem-
ming, 2004). Languages do not have vowel inventories in which higher effort does not
decrease confusability. Therefore, it is not necessary to assume that low information
leads to reduction. It suffices to assume that low-information (frequent, predictable)
linguistic elements do not resist effort reduction as well as high-information linguistic
elements do.
The other difference between MULE and the research it builds on is the impor-
tance it gives to the difference between language and abstract communication systems.
There is a tension between the optimization of a signal in a communication channel,
and the role of individual signs in the signal. Languages seem to differ from abstract
communication systems in what they regard as the information-carrying unit. In ab-
stract communication systems, the identity of each sign in a communication channel
is irrelevant. A /t/ in some word is completely unrelated to a /t/ in another word.
The fact that both /t/ sounds are phonetically similar to one another is coincidental.
But language does seem to care about the identity of linguistic elements. Cohen Priva
(2008) showed that speakers of American English omit consonants that are usually
predictable even in contexts in which they are not predictable, and preserve conso-
nants that are usually unpredictable even in contexts in which they are predictable.
Languages seem to optimize the communication signal while taking individual lin-
guistic elements (or signs) as the relevant level for optimization.
These two differences are the focus of this dissertation. In chapter 2 I show that
the durations and deletion rates of American English consonants correspond to the
amount of information consonants carry. The duration of /g/ is longer than the dura-
tion of other voiced stops because it carries an unusually high amount of information,
even though it takes greater articulatory effort to maintain the voicing of a dorsal
stop. The deletion rates of /p/ are lower than the deletion rates of other voiceless
stops because it is highly informative as well. The amount of information each con-
sonant carries is an aggregate of every instance of that segment in the language, and
applies even when a usually predictable segment is unpredictable, or when a usually
unpredictable segment happens to be predictable. The correct predictions for the
relative durations and deletion rates of American English consonants require speakers to carry
over the amount of information each consonant usually holds in the language to every
individual instance of that consonant.
Similar principles solve the puzzle of chapter 3 by predicting the accumulation
of weakening processes that target specific segments in a given language. I show
that in English, /t/ is usually very predictable, which leads it to carry an unusually
low amount of information compared to other languages. In Arabic, /q/ carries less
information than less marked segments, and weakens because its phonetically marked
articulation cannot be justified by its relatively low information. It is not possible to
claim that /t/ is more effortful in English than it is in other languages. It is similarly
impossible to argue that Arabic reduces /q/ because it is too frequent, when more
frequent stops are not reduced. In both languages the key to predicting language-
specific weakening lies in the balance between information and effort, and in the
ability to attribute the average amount of information a linguistic element carries in
a language to each individual instance of that element.
1.6 Cross-linguistic generalizations
The balance between the preservation of information and effort-reduction applies dif-
ferently in every language because the amount of information a linguistic element
carries varies among languages. However, since the same principles apply cross-
linguistically, it is possible to move from language-specific patterns to predicting
cross-linguistic generalizations. This is the focus of chapters 4–5.
Chapter 4 shows that the assumption that language attempts to preserve infor-
mation predicts the distribution of segments in perceptually salient positions. I show
that in English, Spanish and Egyptian Arabic, segments that carry a lot of infor-
mation are more likely to appear in the onsets of stressed syllables. In the previous
two chapters, more informative linguistic elements justified the expenditure of addi-
tional articulatory effort. In the case of stressed syllables the perceptual prominence
of stressed syllables is distributed unevenly to more informative and less informative
linguistic elements. The more informative a linguistic element is, the more likely it
is to benefit from perceptual prominence. Thus language is shown to distribute its
resources such that the preservation of information is guaranteed. Moreover, chapter
4 shows that the considerations that apply to linguistic performance (chapter 2) as
well as linguistic competence and change (chapter 3) apply at the level of the lexicon
and the usage patterns of the language.
In chapter 5 I demonstrate how the balance between information and effort pre-
dicts that the frequency of some segments such as /t/ will usually be higher than
the frequency of other segments such as /p/. A pure information-theoretic account
predicts that all segments would be equally frequent, while a pure effort-avoidance
account predicts that all effortful segments would be avoided. Neither prediction
holds in any language. I show that the frequency of segments in each language
matches their phonetic properties (Ohala, 1981) and their absence from the sound
systems of the world’s languages (Sherman, 1975). Thus, most voiceless stops are
usually more frequent than their voiced counterparts, but /p/ is not necessarily more
frequent than /b/, just as their phonetic properties and their absence from sound
systems cross-linguistically would predict. I discuss a new prediction that /t/ will be
more frequent than /k/ and /k/ will be more frequent than /p/. The cases in which
this prediction does not hold illuminate the need to factor both articulatory effort and
perceptual confusability into phonological markedness. Together, these predictions
show how even though the amount of information linguistic elements hold is different
in every language, all languages attempt to optimize their sound systems and lexicons
to achieve a balance between information and effort.
1.7 The explanatory power of MULE
MULE is the idea that transmitting information is a goal in human language, and
that languages are biased to allow the expenditure of linguistic resources in order to
transmit information. When more information is being transmitted, the expenditure
of additional resources is justified. The following chapters show how MULE solves
four separate cross-linguistic puzzles.
Chapter 2
Information content affects
performance
2.1 Introduction
Speakers vary the duration of segments and occasionally delete (omit) segments. Such
phonetic properties are not taken to be governed by speakers’ linguistic competence.
Some speakers’ /t/ may be longer than their /k/, or they may delete /p/ more
frequently than they delete /k/, but such tendencies will not lead to a conclusion
that they have different competence grammars.1 However, in this chapter I show that
the duration and deletion patterns of segments in American English are systematic.
Speakers of American English do typically have longer and shorter segments, and
some segments are omitted more frequently than other segments. I demonstrate that
the key factor that leads to the different durations and occasional deletion rates is
the information content of segments. Everything else being equal, the higher the
information content of a segment, the longer its duration will be and the less likely
it is to be deleted. This finding shows that information content affects linguistic
performance, and foretells the generalization of such tendencies into competence-
related phonological rules.
In evaluating the information content of segments, I consider well-known factors
such as frequency and predictability, but show that informativity (Cohen Priva, 2008;
1 I exclude here languages in which segment duration is contrastive.
Piantadosi et al., 2011), the average or expected predictability of each segment, plays
a key role in explaining segment duration and deletion patterns. Informativity is the
amount of information a linguistic element usually has, across the entire language,
and can therefore explain why segments that are usually unpredictable have longer
duration and are less likely to delete even when they are predictable in context.
I begin this chapter by showing that American English voiceless, voiced and nasal
stops each have different duration and deletion rates among places of articulation –
/b/ is shorter and more likely to delete than /g/, but /k/ is shorter and more likely
to delete than /p/. This pattern highlights some of the shortcomings of markedness-
based theories (Prince and Smolensky, 1993; de Lacy, 2002), and is not predictable
from articulatory and perceptual factors (Ohala, 1981). I show why frequency (Zipf,
1929) and predictability-based explanations (Aylett and Turk, 2004; van Son and van
Santen, 2005) are also insufficient and necessitate the introduction of informativity. I
evaluate the relative effect information content measurements and in particular infor-
mativity have on segment duration and occasional deletion in several corpus studies
that weigh the effect each predictor has while factoring out other predictors. Finally,
I discuss the implications that the move to informativity has on the way information
theoretic constraints interact with linguistic performance and competence.
2.2 Stop deletion and duration paradox
In American English, the duration and deletion likelihood of word-medial nasal
and oral stops are not consistent across places of articulation. While the voiced labial
stop /b/ has a shorter duration and is more likely to delete than the voiced dorsal stop
/g/, the nasal and voiceless labial stops /m/ and /p/ have longer duration and are
less likely to delete than the nasal and voiceless dorsal stops /N/ and /k/. Table
2.1 provides the mean duration of each stop, and Table 2.2 provides the deletion
probabilities of each stop.
It is important to notice that the question asked here has to do with the rela-
tionship between underlying phonemes and their surface form. The question is not
what is the duration of glottalized /t/ (/t/→[P]), but rather what is the duration of
the phoneme /t/, which may happen to be glottalized in this context. Glottalization
in this interpretation is a process that can affect the duration of /t/. This duration
change should be explained, rather than taken as given.
I use data calculated using the invaluable Buckeye Corpus (Pitt et al., 2007),
which contains transcribed and time-aligned interviews with speakers of American
English in Columbus, Ohio, collected and annotated by researchers at the Ohio State
University. I matched the corpus’ underlying (dictionary) word representations with
their actual pronunciation. Words that did not have the same number of vowels as
their CMU dictionary (Weide, 1998) equivalents were excluded. Segments that had
no surface equivalent were considered to be deleted. This means that an articulatory
merger of two segments is regarded as a deletion of one of the segments.2 3 The
duration of the segments was multiplied by the rate of speech of the speaker, yielding
a more robust assessment of duration, in essence the proportion between the actual
duration of the segment and the mean duration of all segments in that utterance.4
The difference in mean duration is significant among all voiced, voiceless, and nasal
stops, with the exception of /n/ and /N/, which do not differ from one another
(Welch Two-Sample t-test p < 0.05).5 Similarly, the difference in likelihood to delete
is significant among all voiced, voiceless, and nasal stops (Fisher’s Exact Test p <
10^-5).6 The data includes several speakers that were not used in Cohen Priva (2008).
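The rate normalization described above can be sketched in Python. This is a minimal illustration, not the procedure used in the study: it assumes that speech rate is computed per utterance as segments per second, and the function names and the toy utterance are hypothetical.

```python
def speech_rate(segment_durations):
    """Segments per second for one utterance (durations given in seconds)."""
    return len(segment_durations) / sum(segment_durations)


def relative_duration(duration_s, rate):
    """Duration expressed in 'segments': seconds * (segments / second)."""
    return duration_s * rate


# Hypothetical utterance; these durations are not taken from the corpus.
utterance = [0.05, 0.12, 0.08, 0.15]
rate = speech_rate(utterance)
relative = [relative_duration(d, rate) for d in utterance]
```

Under this assumption the relative durations within an utterance average to 1, so a value above 1 marks a segment that is longer than the utterance's mean segment and a value below 1 marks a shorter one.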
What motivates the different duration and deletion ratios across places of ar-
ticulation? In the next sections I consider several possible explanations including
articulatory and perceptual factors, markedness, predictability and frequency, and
show that they do not suffice. I then reintroduce the concept of segment informa-
tivity, the expected value of segment predictability, and show how it provides the
2 The deleted segment was considered to be the one less similar to the segment present in the output form, using a similarity metric detailed in §2.7.
3 See §2.7 for a more complete description of the alignment process.
4 When segment duration, which is measured in seconds, is multiplied by speech rate, which is measured in segments per second, the resulting unit of measurement is "segments". Measuring a segment's duration in segments in essence compares the segment to other segments. Stressed vowels tend to have longer durations than unstressed vowels, and most consonants have shorter durations still. Therefore, the duration in segments of diphthongs is > 1, and the duration in segments of short consonants such as /b/ is < 1.
5 Welch Two Sample t-test for stop durations: /d/ < /b/ (p < 10^-15), /b/ < /g/ (p < 0.05), /t/ < /k/ (p < 10^-15), /k/ < /p/ (p < 10^-15), /N/ < /n/ (p > 0.97), /n/ < /m/ (p < 10^-15)
6 Fisher's Exact Test for stop deletion: /d/ > /b/ (p < 10^-15), /b/ > /g/ (p < 10^-12), /t/ > /k/ (p < 10^-15), /k/ > /p/ (p < 10^-15), /n/ > /N/ (p < 10^-7), /N/ > /m/ (p < 10^-6)
Table 2.1: Buckeye word-medial stop relative durations
Place     Voiceless Stops      Voiced Stops         Nasal Stops
          Segment  Duration    Segment  Duration    Segment  Duration
Labial    /p/      1.123       /b/      0.805       /m/      0.881
Dorsal    /k/      1.032       /g/      0.829       /N/      0.773
Coronal   /t/      0.775       /d/      0.587       /n/      0.773
Table 2.2: Buckeye word-medial stop deletion probabilities
Place     Voiceless Stops        Voiced Stops           Nasal Stops
          Segment  Del. Prob.    Segment  Del. Prob.    Segment  Del. Prob.
Labial    /p/      0.013         /b/      0.113         /m/      0.025
Dorsal    /k/      0.020         /g/      0.054         /N/      0.046
Coronal   /t/      0.160         /d/      0.175         /n/      0.072
missing link to accounting for the different durations and deletion ratios.
2.3 Previous accounts
2.3.1 Phonetic accounts
Ohala (1981) lists at least two differences that may account for the asymmetry be-
tween the hierarchy of voiced and voiceless stops. First, it is more difficult to maintain
voicing in /g/ than in /b/ due to the smaller amount of space between the vocal folds
and the closure. In contrast, /p/ is the least audible voiceless stop. Both asymmetries
emerge in the cross-linguistic frequencies of gapped inventories – inventories in which
one or more of the stops does not exist. There are more gapped inventories in which
/g/ does not exist than gapped inventories in which /b/ does not exist, and more
gapped inventories in which /p/ does not exist than gapped inventories in which /k/
does not exist (Sherman, 1975).
While both asymmetries are well-established, their effect on the duration and
deletion ratios of American English stops is expected to be the opposite of what is
observed. If it is more difficult to maintain voicing in /g/ than in /b/, the duration
of /g/ should be shorter than the duration of /b/, but in American English the
duration of /b/ is shorter than the duration of /g/. If /p/ is less audible than other
voiceless stops, listeners will record more instances in which they did not perceive a
/p/ in places in which speakers have articulated a /p/, and will therefore have more
exemplars of (perceived) p-deletion which they may feel more comfortable repeating,
but in American English /k/ deletes more than /p/ does.
It seems that the difference between the durations and deletion rates of English
stops is not due to articulatory and perceptual reasons. I will now consider other
possible explanations: phonological markedness, frequency and predictability.
2.3.2 Markedness accounts
Markedness hierarchies, as in Prince and Smolensky (1993) and de Lacy (2002) are
used to provide an explanation for phone-specific patterns in phonology. In this
section I spell out such an attempt for the relevant data. Since coronals, which
are considered less marked than labials and dorsals, delete more frequently, we may
suggest that more marked segments delete less frequently than less marked ones.
Since coronals are taken to be less marked than labials and dorsals (Prince and
Smolensky, 1993; de Lacy, 2002), a markedness-based account can solve the contrast
between coronals and non-coronals (dorsals and labials). Word medially, coronals
have shorter duration and have higher deletion ratios than labials and dorsals, with
the exception of /N/, which has shorter duration and deletes more than /n/. The
tableau in (2.1) sketches a system in which coronals delete in coda positions, but
labials and dorsals do not delete, by using marked-faithfulness constraints of the
form Max{K} to preserve dorsals and Max{P} to preserve labials (I am not arguing
here for the existence of such constraints).
(2.1) Only coronals delete in codas
   tat.tap.tak    Max{K}   Max{P}   *NoCoda   Max
   tat.tap.tak                      ***!
☞  ta.tap.tak                       **        *
   tat.ta.tak              *!       **        *
   tat.tap.ta     *!                **        *
   ta.ta.tak               *!       *         **
   tat.ta.ta      *!       *        *         **
   ta.tap.ta      *!                *         **
   ta.ta.ta       *!       *                  ***
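The evaluation in tableau (2.1) can be reproduced with a short Python sketch. It assumes the standard Optimality Theory interpretation of strict domination, modeled here as lexicographic comparison of violation tuples; the helper names are hypothetical, and the candidate set is limited to forms obtained by deleting codas from the input.

```python
from itertools import product

SYLLABLES = ["tat", "tap", "tak"]
# Ranking assumed from tableau (2.1): Max{K} >> Max{P} >> *NoCoda >> Max.


def candidates():
    """Yield (form, deleted_codas) for every way of deleting codas."""
    for keep in product([True, False], repeat=len(SYLLABLES)):
        form = ".".join(s if k else s[:-1] for s, k in zip(SYLLABLES, keep))
        deleted = [s[-1] for s, k in zip(SYLLABLES, keep) if not k]
        yield form, deleted


def violations(form, deleted):
    """Violation counts ordered by rank: Max{K}, Max{P}, *NoCoda, Max."""
    codas = sum(len(syllable) == 3 for syllable in form.split("."))
    return (deleted.count("k"), deleted.count("p"), codas, len(deleted))


# Tuples compare lexicographically, which mirrors strict constraint domination.
winner = min(candidates(), key=lambda c: violations(*c))[0]
print(winner)
```

Reordering the entries of the tuple returned by `violations` corresponds to reranking the constraints, which makes it easy to check what a different hierarchy would predict.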
However, the binary distinction between coronals and non-coronals does not ac-
count for the different durations and deletion ratios between dorsals and labials. If
the least marked (coronals) delete more than more marked stops (labials and dorsals),
then the data suggests that /b/ should be less marked than /g/ since it deletes more
frequently than /g/: Max{K} ≫ Max{P}. However, /k/ deletes more frequently
than /p/ does, and /k/ should be less marked than /p/: Max{P} ≫ Max{K}. One
way to solve this problem is to further specify the Max constraints so that they specify
not only place but also voicing: Max{K,voiced} ≫ Max{P,voiced} and Max{P} ≫ Max{K}.
Should the markedness of place of articulation differ between voiced and voiceless
stops? Indeed, in the UPSID database (Maddieson, 1984) more languages have /k/
than /p/ (403 : 375), and more languages have /b/ than /g/ (287 : 253), even though
this difference is not significant (Fisher test, p>0.05). If it were significant, it could be
used to argue that /g/ is more marked than /b/ and that /p/ is more marked than /k/.
Using that markedness hierarchy to argue for marked faithfulness constraints that
specify both place of articulation and voicing would explain the difference between
the durations and deletion ratios of voiced and voiceless oral stops using the following
hierarchy of Max constraints: Max{K, voiced} ≫ Max{P, voiced}, and Max{P} ≫ Max{K}.
However, using a similar procedure for nasal stops yields incorrect predictions.
More languages in UPSID have /m/ than /N/ (425 : 237). In comparison with
voiceless stops, this difference is significant (Fisher test, p < 0.001). If these data
are used to conclude that Max{K,nasal} ≫ Max{P,nasal}, the prediction would be
that /m/ should be shorter and more likely to delete than /N/, but this is not
the case in American English.
This proposal has several additional complications at the theoretical and functional
level. First, de Lacy (2002, §6.4.2) argues specifically against marked-faithfulness
accounts for the family of Max constraints. Second, while constraint conjunction
is used liberally in markedness accounts to account for the conjunction of several
markedness features as in Ito and Mester (2003), the above-mentioned proposal uses
the conjunction of marked-faithfulness constraints, which may have other undesirable
consequences. Finally, if marked segments are less frequent in the world’s languages,
it is not clear why speakers would preserve marked segments rather than unmarked
ones, going against the cross-linguistic tendency.
As de Lacy (2002, §6.4.2) argues, it is problematic to use marked-faithfulness ac-
counts to predict that coronals should delete in cases where labials and dorsals do
not delete. In this section I have shown that this approach is equally problematic in
accounting for the variable deletion rates of other stops, as it necessitates the con-
junction of marked faithfulness constraints, and even if such constraints are allowed
in the system, the predictions that follow from such accounts are incorrect.
2.3.3 Frequency accounts
There is evidence that information theoretic concerns (Shannon, 1948) affect linguistic
behavior. Zhao and Jurafsky (2009) report that associating word frequency with word
reduction goes back to observations made by Sibawayhi, an Arabic grammarian of the
8th century (Al-Nassir, 1993; Carter, 2004). Zipf (1929) claims that the reduction of
frequent linguistic elements follows from usage – frequent elements are under a greater
pressure to become efficient. Greater efficiency implies simplification and reduced
duration. Perhaps when the duration of segments is reduced too much deletion also
follows.
Reducing frequent segments can be interpreted as beneficial from a stricter in-
formation theoretic interpretation. Given no other information, frequent linguistic
elements are more likely to occur than less frequent elements, and can therefore be
more easily recovered by listeners. If some information is available, for instance if
the listeners know that they heard a voiceless stop, but do not know which one they
heard, guessing that it was a /t/ is a better bet than guessing /k/ or /p/, since /t/ is
more frequent than /k/ or /p/.
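The "best guess" reasoning can be made concrete with the segment probabilities that Table 2.3 lists for the three voiceless stops. The sketch below is only an illustration of the argument: it assumes that the listener's sole knowledge is that the segment was one of /t/, /k/ or /p/, and the variable names are hypothetical.

```python
# Segment probabilities for the voiceless stops, as listed in Table 2.3.
prob = {"t": 0.074, "k": 0.031, "p": 0.015}

# Renormalize over the set the listener has narrowed the choice down to.
total = sum(prob.values())
posterior = {seg: p / total for seg, p in prob.items()}

best_guess = max(posterior, key=posterior.get)
```

With these figures, guessing /t/ is correct in roughly 62% of the cases in which a voiceless stop was heard but not identified.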
A simple way to measure the frequency of a segment σk in a language is to count
the number of times that segment was encountered in a subset of the language (2.2).
It is then possible to compare each segment’s frequency with the frequency of all
segments in the same subset (2.3). Since (2.3) is the same for all segments σi, it
is possible to always divide (2.2) by (2.3) to yield the maximum likelihood estimate
(MLE) of the probability of seeing σk, (2.4).
(2.2) The frequency of σk
$\#(\sigma_k)$
(2.3) The frequency of all segments, summed
$\sum_i \#(\sigma_i)$
(2.4) The probability of σk
$\Pr(\sigma_k) \approx \dfrac{\#(\sigma_k)}{\sum_i \#(\sigma_i)}$
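The estimate in (2.4) amounts to dividing each segment's count by the total count. A minimal Python sketch follows; the function name and the toy sample are hypothetical, and each character of the sample stands for one segment.

```python
from collections import Counter


def segment_probabilities(segments):
    """MLE of Pr(segment): its count over the count of all segments."""
    counts = Counter(segments)            # (2.2), for every segment at once
    total = sum(counts.values())          # (2.3)
    return {seg: n / total for seg, n in counts.items()}  # (2.4)


# Hypothetical toy sample, not drawn from any of the corpora.
sample = list("tatapataka")
probs = segment_probabilities(sample)
```

Because the denominator is shared by all segments, ranking segments by raw count and ranking them by estimated probability always give the same order.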
If frequency effects are extended from the reduction of words to the reduction of
segments, they can provide a good explanation for the differences among voiced stops
and voiceless stops. Table 2.3 shows the frequency of the nine stops across the Fisher
(Cieri et al., 2004, 2005), Switchboard (Godfrey and Holliman, 1997) and Buckeye
(Pitt et al., 2007) corpora of spoken English (see §2.8), side by side with their relative
duration and deletion ratios (as shown in tables 2.1 and 2.2). /b/ and /k/ are more
frequent than /g/ and /p/ respectively, and as predicted by the proposed frequency
account, their duration is shorter and they delete more frequently than /g/ and /p/.
However, this explanation does not account for the contrast between /m/ and /N/.
/m/ is more frequent than /N/, yet its duration is longer, and its relative deletion
ratio is lower than the deletion ratio of /N/.
Table 2.3: Buckeye word-medial stop duration, deletion and probability
Place     Voiceless Stops              Voiced Stops                 Nasal Stops
          σ    Dur.   Del.   Prob.     σ    Dur.   Del.   Prob.     σ    Dur.   Del.   Prob.
Labial    /p/  1.123  0.013  0.015     /b/  0.805  0.113  0.017     /m/  0.881  0.025  0.035
Dorsal    /k/  1.032  0.020  0.031     /g/  0.829  0.054  0.010     /N/  0.773  0.046  0.011
Coronal   /t/  0.775  0.160  0.074     /d/  0.587  0.175  0.039     /n/  0.773  0.072  0.062
2.3.4 Local predictability accounts
A local predictability-based account takes the “best guess” story one step further.
Consider the case of the -tion suffix in English in the word ‘explanation’. When
speakers reach the sequence [S@] they already know that [n] will follow. In other words,
hearing [n] after the sequence of segments that precede it does nothing except confirm
the speaker’s expectations that [n] will follow. In itself, /n/ provides no information.
It is reasonable, therefore, that speakers could delete it, without harming the listener’s
ability to comprehend the word.7 In information theoretic terms, hearing /n/ does
not provide any new information.
I will define the amount of information a segment σ provides in the context c
that it appears in using conditional probability as in (2.5). In order to transform the
conditional probability of seeing σ in the context c to amounts of information, I take
the negative log of (2.5) to yield (2.6), which is measured in bits. In the example
above, listeners are positive that [n] is going to follow, which makes their estimate of
Pr (n|[#Ekspl@neIS@) equal 1, and the amount of information provided by seeing /n/
in that context zero, as the log of 1 is 0 – no information is gained.
The same principle applies when seeing a segment in some context is very improb-
able or very probable. The higher the conditional probability of seeing a segment in
its context is, the less information it provides, and vice versa. From the functional
perspective, speakers can reduce or omit predictable segments with less harm to their
listeners, who can easily recover the reduced or omitted segments.
7 In order to perform this kind of accommodation, the speaker is not required to have a complete knowledge of the listener's state of mind. Minimal accommodation models in which speakers model themselves as listeners suffice.
(2.5) The conditional probability of seeing σ in context c
$\Pr(\sigma \mid c)$
(2.6) The negative log probability of seeing σ in context c
$-\log_2 \Pr(\sigma \mid c)$
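The transformation in (2.6) can be checked numerically. The sketch below assumes nothing beyond the definition itself; the function name is hypothetical, and the second probability is the one reported for 'battle' in (2.16).

```python
import math


def information_bits(probability):
    """Information in bits of a segment with the given conditional probability."""
    return -math.log2(probability)


# A fully predictable segment (probability 1) carries no information.
certain = information_bits(1.0)
# Probability reported in (2.16) for /t/ in 'battle', given the word so far.
surprising = information_bits(0.0002)
```

Halving the conditional probability of a segment adds exactly one bit to the information it provides, so the value for 'battle' comes out at a little over 12 bits.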
Different theories may define the context c differently or not use it at all. It is pos-
sible to set c to provide varying levels of information. The simplest approach is to set c
not to provide any information. This is a simple transformation of segment probabil-
ity often labeled uniphone (2.7). A common approach, taken for instance in Raymond
et al. (2006), is to take one or two preceding segments as the context, yielding specific
measurements often labeled biphone (2.8) and triphone (2.9) respectively. Biphones
and triphones approximate a very local context that can approximate phonotactics
to a certain degree.8 Another approach, taken in van Son and Pols (2003), is to try
to approximate the amount of information that is associated with a segment at a
word-prediction level, by taking as context all the preceding segments (2.10). This
measurement is the one used above to describe the case of the final /n/ in the word
‘explanation’.
(2.7) Uniphone
$-\log_2 \Pr(\sigma)$
(2.8) Biphone
$-\log_2 \Pr(\sigma_i \mid \sigma_{i-1})$
(2.9) Triphone
$-\log_2 \Pr(\sigma_i \mid \sigma_{i-2}\sigma_{i-1})$
(2.10) Information given all previous segments
$-\log_2 \Pr(\sigma_i \mid \sigma_0 \ldots \sigma_{i-1})$
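The four definitions differ only in how much of the preceding material counts as context. A minimal Python sketch of that difference follows. The function name is hypothetical, and treating the word boundary `#` as a possible context symbol is an assumption made here for illustration.

```python
def context(segments, i, n=None):
    """Context for segments[i].

    n=0 gives no context (uniphone), n=1 the previous segment (biphone),
    n=2 the two previous segments (triphone); n=None gives every preceding
    segment in the word, with '#' marking the word boundary.
    """
    padded = ["#"] + list(segments)
    i += 1  # shift the index to account for the boundary symbol
    if n is None:
        return tuple(padded[:i])
    return tuple(padded[max(0, i - n):i])


word = ["b", "æ", "t", "l"]  # hypothetical segmentation of 'battle'
```

For the /t/ at index 2, `context(word, 2, 2)` returns `("b", "æ")`, and `context(word, 2)` returns `("#", "b", "æ")`.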
8 In English /Ä/ follows unstressed vowels in very few words, making words such as adventurer atypical of English phonotactics.
In order to estimate the probability of seeing a segment in context, it is common to
use counts, following the same maximum likelihood approach that was used above
for calculating segment probabilities. The probability of seeing a phone in context is
estimated to be (2.11), the number of times the speakers encountered σ in the context
c over the number of times they encountered the context c.
(2.11) Estimate for conditional probability of seeing σ in the context of c
$\Pr(\sigma \mid c) \approx \dfrac{\#(\sigma, c)}{\#(c)}$
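The count-based estimate in (2.11) can be combined with (2.6) to obtain the information of each segment in each context. The Python sketch below is an illustration under simplifying assumptions: contexts are the n preceding segments, words are padded with the boundary symbol `#`, every word counts once regardless of its frequency, and the function name and three-word lexicon are hypothetical.

```python
import math
from collections import Counter


def conditional_information(words, n):
    """Map (context, segment) to -log2 of the count-based estimate in (2.11)."""
    pair_counts, context_counts = Counter(), Counter()
    for word in words:
        padded = ["#"] * n + list(word)   # pad so early segments have a context
        for i in range(n, len(padded)):
            c = tuple(padded[i - n:i])
            pair_counts[(c, padded[i])] += 1
            context_counts[c] += 1
    return {
        pair: -math.log2(count / context_counts[pair[0]])
        for pair, count in pair_counts.items()
    }


# Hypothetical three-word lexicon; each character stands for one segment.
info = conditional_information(["tap", "tak", "pat"], n=1)
```

In this toy lexicon `a` always follows `t`, so it provides no information in that context, while `p` after `a` is one of three equally likely continuations and provides about 1.58 bits.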
Local predictability in its various forms has been shown to affect linguistic per-
formance. Aylett and Turk (2006) show that syllable nuclei are shorter when they
are locally predictable from context. Similar studies demonstrate the same for other
levels of linguistic representation, such as consonants, morphemes and words (Pluy-
maekers et al., 2005; van Son and Pols, 2003; van Son and van Santen, 2005; Jurafsky
et al., 2001). Other studies link local predictability to syntactic planning (Levy and
Jaeger, 2007; Jaeger, 2010) and use it to provide a basis for markedness (Hume, 2008).
Adapting this view to segment duration and deletion ratios yields an expectation that
more predictable segments would have shorter duration and delete more frequently
than less predictable ones. This view is parallel to the view presented above for fre-
quent segments. Raymond et al. (2006) have shown such effects for the deletion of
word-medial /t/ and /d/ at syllable onsets (but surprisingly enough, the opposite
held in codas).9
Local predictability accounts for a broad range of phenomena, but it cannot be
completely correlated with the propensity of a segment to delete. While it is generally
true that segments are more likely to delete the more predictable they are, some
segments delete even when they are not locally predictable, while other segments do
not delete even when they provide no information. Consider the cases of /t/-deletion
in examples (2.12) and (2.13), and the cases of /d/-deletion in examples (2.14) and
(2.15) from the Buckeye corpus, where /t/ and /d/ delete even though they are rather
surprising.
9 The experiments conducted in this study show no such reversal of predictability, but did not focus on /t/ and /d/.
(2.12) ‘notice’ → [noU@s]
(2.13) ‘battle’ → [bæl"]
(2.14) ‘sudden’ → [s@n"]
(2.15) ‘order’ → [6@~]
The deleted /t/s and /d/s are unpredictable if context is measured from the
beginning of the word as in (2.16).10 The same holds for a context of two previous
segments (triphone), as in (2.17).
(2.16) Word Prob.
‘notice’: Pr(t|[#noU) = 0.011
‘battle’: Pr(t|[#bæ) = 0.0002
‘sudden’: Pr(d|[#s@) = 0.007
‘order’: Pr(d|[#6r) = 0.00032
(2.17) Word Prob.
‘notice’: Pr(t|noU) = 0.011
‘battle’: Pr(t|bæ) = 0.00029
‘sudden’: Pr(d|s@) = 0.01
‘order’: Pr(d|6r) = 0.0008
On the other hand, /p/ is the only segment that appears after [w@~kS6] in ‘work-
shops’, and /m/ is the only segment that can appear after [@kæd@] in ‘academy’ 11,
and even locally they are not very surprising (2.18).
(2.18) Word Prob.
‘workshops’: Pr(p|S6) = 0.398
‘academy’: Pr(m|d@) = 0.048
Similarly, the corpus contains just a single case of /m/-deletion out of 286 instances
of /m/ in words that begin with ‘home’, even though /m/ follows hoU in a ratio of 2:5,
10 The # symbol stands for a beginning of a word.
11 That is, they are fully recoverable from previous context.
and despite the high frequency of the word ‘home’.12 Local predictability does not
predict the existence of /t/ and /d/ deletion processes in English, and the absence of
/p/ and /m/ deletion processes (though they still delete, but at lower rates; see table
2.2).
That predictability does not always coincide with reduced duration and higher
deletion rates is further demonstrated by the fact that deletion processes may re-
move segments that do carry information, contrary to expectation. Consider the
diachronic case of the deletion of French plural markers that led to the conflation of
the plural form of words such as pommes with their singular forms (pomme), or the
case of Puerto Rican Spanish (Hochberg, 1986) where /s/-deletion removes agreement
markers and conflates second and third person verb forms. While such local loss of
information may lead to compensation elsewhere (for instance, it may increase the
use of pronouns), it is hard to claim that deletion processes in language only serve to
improve communication.
Neither frequency-based explanations nor local predictability effects can adequately
explain the duration and deletion ratios of English stops. Frequency accounts cannot
explain why /N/’s duration is shorter and why it deletes more than /m/, and local
predictability accounts do not explain why unpredictable /t/s delete more frequently
than predictable /p/s. The next section will propose a way to bridge the gap be-
tween the frequency and local predictability accounts by introducing the concept of
segment informativity. Like frequency, segment informativity does not depend on lo-
cal context, and like local predictability, it emphasizes the importance of the varying
“usefulness” of various segments from an information theoretic perspective.
2.4 Informativity
There is a tension between frequency-based explanations and local predictability ac-
counts. Local predictability accounts view phonological changes as driven by func-
tional concerns: speakers delete or reduce uninformative elements. Frequency, in this
view, is only an approximation due to lack of knowledge about the context. When
speakers do not know what the context is, frequency functions as zero context “local”
predictability. But careful examination of the data shows that frequent segments may
also delete in cases where they remove information that is not locally reconstructible,
as is the case with /s/-deletion in Puerto Rican Spanish. Similarly, frequency-based
accounts fail to predict that /ŋ/, which is usually very predictable from prior context
in English, will delete as frequently as it does, since it is a relatively infrequent
segment. In other words, what both accounts fail to capture is that predictable segments
behave as predictable even when they are not: they “carry over” their being usually
predictable to contexts where they are unpredictable. This is the gap that segment
informativity tries to bridge.

12 For discussions of reduction in frequent words, see Bell et al. (2009).

I propose that language users record how useful and informative a segment usually
is in the language. This knowledge then forms an expectation of the utility of each
segment, which in turn affects the duration and deletion ratios of those segments.
This model predicts that less informative segments will have shorter durations and
be more likely to delete even when they are unpredictable given local context, other
things being equal, and that more informative segments will have longer durations
and be less likely to delete even when they are predictable given local context. Thus,
/d/ in ‘sudden’ may be deleted because /d/ is expected to provide less information,
even though it is locally unpredictable (and does provide information), as is the case
in (2.14), repeated here as (2.19).
(2.19) ‘sudden’ → [sən̩]
To approximate segment informativity, I take the segment’s negative log predictability
given some definition of context (2.20) and average it over every context in which
that segment appears, weighting each context by its probability given the segment
(2.21). This averaging yields the expected value of that segment’s negative log
predictability.
(2.20) The local predictability of σ in context c
       $-\log_2 \Pr(\sigma \mid c)$
(2.21) The informativity of σ
       $-\sum_{c} \Pr(c \mid \sigma) \log_2 \Pr(\sigma \mid c)$
In this chapter I set local context to be every preceding segment in the same
word, adopting the view presented in van Son and Pols (2003), which points to word
recognition as the relevant task for the contribution a segment makes. When matched
against segment duration and deletion ratios, the different deletion hierarchies across
different stop types are approximated by each segment’s informativity: /p/ and /m/
are more informative and delete less than /k/ and /ŋ/, but /b/ is less informative
and deletes more than /g/, as table 2.4 shows. Informativity was calculated using
the same corpora used to calculate segment frequency (see §2.8).
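As an illustration of (2.20) and (2.21), the following minimal Python sketch computes informativity over a toy lexicon, taking all preceding segments in the same word as context. The function name, the three-word lexicon and its token counts are hypothetical, and the sketch does not treat the end of a word as a possible outcome; it is not the estimation procedure of §2.8.

```python
from collections import defaultdict
from math import log2

def informativity(lexicon):
    """Return {segment: informativity} for a {word: token_count} lexicon.

    Each word is a tuple of segments; the context of a segment is the
    tuple of all segments that precede it in the same word.
    """
    joint = defaultdict(float)      # count of (context, segment) pairs
    ctx_total = defaultdict(float)  # count of each context
    seg_total = defaultdict(float)  # count of each segment
    for word, count in lexicon.items():
        for i, seg in enumerate(word):
            ctx = word[:i]
            joint[(ctx, seg)] += count
            ctx_total[ctx] += count
            seg_total[seg] += count

    info = defaultdict(float)
    for (ctx, seg), n in joint.items():
        p_seg_given_ctx = n / ctx_total[ctx]   # Pr(sigma | c), as in (2.20)
        p_ctx_given_seg = n / seg_total[seg]   # Pr(c | sigma), weight in (2.21)
        info[seg] -= p_ctx_given_seg * log2(p_seg_given_ctx)
    return dict(info)

# Hypothetical toy lexicon: word (as segments) -> token count.
toy_lexicon = {("a", "t"): 3, ("a", "p"): 1, ("i", "t"): 4}
scores = informativity(toy_lexicon)
```

In this toy lexicon "t" is usually predictable from what precedes it, so its informativity is low, while "p" is always surprising in its only context and scores higher.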
Table 2.4: Buckeye word-medial stop duration, deletion and informativity
           Voiceless stops              Voiced stops                 Nasal stops
Place      σ    Dur.   Del.   Info.     σ    Dur.   Del.   Info.     σ    Dur.   Del.   Info.
Labial     /p/  1.123  0.013  3.656     /b/  0.805  0.113  3.923     /m/  0.881  0.025  2.437
Dorsal     /k/  1.032  0.020  2.261     /g/  0.829  0.054  4.693     /ŋ/  0.773  0.046  0.276
Coronal    /t/  0.775  0.160  1.357     /d/  0.587  0.175  1.632     /n/  0.773  0.072  1.720
The informativity of segments predicts the asymmetry in the duration and deletion
rates of English stops. However, much of what informativity explains has already been
addressed by existing accounts. Local predictability, for instance, may explain why /ŋ/
deletes more than /m/, which none of the other accounts could achieve. Moreover, segment
informativity is highly correlated with other factors such as segment frequency,
segment local predictability and place of articulation. How can we tell whether
informativity is significant in its own right given other explanatory variables? To ascertain
its contribution it is necessary to control for phonetic, phonological and other
information theoretic factors. In the next sections I test the effect that informativity and
other information theoretic measurements have on segment duration and deletion ratios
using multivariate regression studies.
2.5 Segment duration and deletion studies
2.5.1 Studies overview
The puzzle of stop duration and deletion ratios is a convenient example for demonstrating
that the durations and occasional deletion ratios of segments are systematic:
some segments have shorter duration and delete more frequently than other segments.
Moreover, the observed durations and deletion ratios of American English segments
cannot be explained using only
phonetic explanations, as less audible segments such as /p/ (Ohala, 1981) delete less
frequently than the more audible voiceless stop /k/. Similarly, the duration of /g/ is
longer than the duration of /b/, even though it is more difficult to maintain voicing
for dorsal stops. Accounts which are based on segment frequency and segment local
predictability do not predict the observed durations and deletion ratios either, as /m/ is
more frequent than /ŋ/, but has longer duration and deletes less frequently than /ŋ/,
and /t/ and /d/ delete even when they are unpredictable in context,
while predictable segments such as /p/ delete less often.
But in order to know how each factor affects segment duration and deletion ratios
it is necessary to weigh the effect each predictor has against the possible effect of other
factors. If informativity, frequency and local predictability have the expected effect
even after other factors have been controlled for, it would validate their explanatory
power beyond the American English stops puzzle.
Cohen Priva (2008) studied the residual effect informativity has on the deletion of
word-medial onsets and codas while controlling for other phonological, phonetic and
information theoretic factors. The studies were performed only on part of the Buckeye
corpus (Pitt et al., 2007), and suffered from multiple collinearities. Additionally, those
studies were not applied to the effect segment informativity has on the duration of
word-medial segments (though some of the segment duration data was explored in
Cohen Priva and Jurafsky 2008).
The following studies are designed to complement and address the shortcomings
of the previous studies in several ways:
• The information theoretic measurements are measured using a significantly
larger collection of spoken English (see §2.8).
• Data from all forty speakers in the Buckeye corpus is used to measure seg-
ment duration and deletion ratios, whereas previous studies used only twenty
speakers.
• The duration of segments is measured alongside deletion ratios.
• The collinearities between information theoretic measurements are removed.
• Raymond et al. (2006) and Cohen Priva (2008) tested the effect various factors
have on onsets and codas separately. This chapter takes a different approach.
Since the concepts ‘coda’ and ‘onset’ group together different phonological en-
vironments, this chapter studies the role of information theoretic variables in
intervocalic and postvocalic preconsonantal positions. Thus, there are fewer
phonological features to control for in each data set, and risk of collapsing the
difference between different phenomena is reduced.
The two environments are subject to different weakening pressures. Delet-
ing postvocalic preconsonantal segments simplifies syllable clusters by chang-
ing VCCV sequences to VCV sequences, simplifying CCV(C) syllables to less
marked CV(C) syllables and CVC syllables to unmarked CV syllables. In con-
trast, deleting intervocalic consonants complicates syllable structure as it cre-
ates marked onset-less syllables. Intervocalic consonants are subject to lenition
processes such as spirantization which do not always affect postvocalic positions
(American English tapping is one such case, Kahn 1976).
• Liquids and glides are excluded as there are too few of them for the number of
controls that are required to describe the differences among them. Therefore
the chance of overfitting the data is decreased.
2.5.2 Intervocalic consonant duration
Introduction As summarized above, intervocalic positions are often the locus of
various lenition processes such as spirantization and sonorization. In American
English one such outcome is tapping. However, it is not a typical environment for
deletion, as the data set used in this study shows: 4.2% of the segments in
intervocalic positions delete, while 5.9% of the segments in postvocalic preconsonantal
positions delete (Fisher’s Exact Test, p < 0.001). Intervocalic positions are therefore
a good test case to investigate what effect information theoretic variables have on
segment duration, and subsequently on segment deletion.
Method and materials The underlying representation of every word in the Buck-
eye corpus was matched with its actual pronunciation as described in §2.7. The
duration of matched segments was recorded, and kept alongside the word they were
part of and their phonological environment. Rate of speech was calculated as the
number of underlying segments per second. I excluded all deleted segments and seg-
ments with unusually long or short duration (top and bottom 2.5% of each segment).
This procedure yielded 27,353 observations.
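The exclusion step and the rate-of-speech measure can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the 2.5% and 97.5% cut points are computed with Python's default quantile interpolation (the text does not say which percentile definition was used), and the span over which rate of speech was measured is left to the caller.

```python
from collections import defaultdict
from statistics import quantiles

def trim_by_segment(observations):
    """Drop observations in the top and bottom 2.5% of their own segment.

    observations: list of (segment, duration) pairs, deleted segments
    already excluded.
    """
    by_segment = defaultdict(list)
    for segment, duration in observations:
        by_segment[segment].append(duration)

    bounds = {}
    for segment, durations in by_segment.items():
        cuts = quantiles(durations, n=40)      # 39 cut points, 2.5% apart
        bounds[segment] = (cuts[0], cuts[-1])  # 2.5th and 97.5th percentiles

    return [(s, d) for s, d in observations
            if bounds[s][0] <= d <= bounds[s][1]]

def speech_rate(n_underlying_segments, seconds):
    """Rate of speech as underlying segments per second."""
    return n_underlying_segments / seconds

# Hypothetical data: one segment type with durations 1..100 (arbitrary units).
toy = [("t", float(d)) for d in range(1, 101)]
kept = trim_by_segment(toy)
```

Computing the cut points separately for each segment keeps inherently short segments from being discarded merely for being short.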
I used linear regression to model the log duration of segments. This means that
the regression attempts to fit the weights $\beta_0 \ldots \beta_n$ in the formula in (2.22), where y is
a vector of observed durations, $x_1 \ldots x_n$ are vectors of predictors, and $\varepsilon$ stands for possible
noise. The formula can be rewritten as (2.23) and (2.24). The formula in
(2.24) shows that the regression is a regression of multipliers: a binary feature whose
coefficient equals 0.1 does not indicate that the duration is 0.1 milliseconds longer
when this feature is ‘true’, but rather that the duration is $e^{0.1}$ times as long when
the feature is ‘true’ as it would have been had this feature been ‘false’. A zero
coefficient therefore would not affect the duration since $e^{0} = 1$. Significance is tested
by estimating how likely it is that the coefficient of a variable is really positive or
negative (that it is not zero).
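(2.22) $\log(y) \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \varepsilon$
(2.23) $e^{\log(y)} \approx e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \varepsilon}$
(2.24) $y \approx e^{\beta_0} \cdot e^{\beta_1 x_1} \cdot e^{\beta_2 x_2} \cdot \ldots \cdot e^{\beta_n x_n} \cdot e^{\varepsilon}$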
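The multiplicative reading of (2.24) can be checked on a small example. The sketch below fits the one-predictor special case of (2.22) by ordinary least squares, which for a single binary predictor reduces to a difference of group means of log duration. The durations and the function name are hypothetical, and the model is far simpler than the multivariate regressions reported in this chapter.

```python
from math import exp, log
from statistics import mean

def fit_binary_log_model(durations, feature):
    """Least-squares fit of log(y) = b0 + b1 * x for one binary predictor x."""
    logs_false = [log(y) for y, x in zip(durations, feature) if not x]
    logs_true = [log(y) for y, x in zip(durations, feature) if x]
    b0 = mean(logs_false)        # intercept: mean log duration when x is false
    b1 = mean(logs_true) - b0    # coefficient: shift in mean log duration
    return b0, b1

# Hypothetical durations in seconds; the 'true' group is 10% longer.
durations = [0.050, 0.060, 0.055, 0.066]
feature = [0, 0, 1, 1]

b0, b1 = fit_binary_log_model(durations, feature)
multiplier = exp(b1)  # how many times as long segments are when x is true
```

Here the coefficient is about 0.095, and exponentiating it recovers the factor of 1.1 between the two groups rather than an additive difference in seconds.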
I used the phonological control variables in table 2.5 to control for the base properties
of the segment in question. In addition, I used the phonological control variables
in table 2.6 to control for the properties of the neighboring vowels and word-level variables.
Phrases were taken to be the Buckeye Corpus speakers’ turns (as divided by the
Buckeye Corpus annotators).
Table 2.5: Segment properties
Variable   Value             Segments
Manner     stop              /p, t, k, b, d, g/
           affricate         /tʃ, dʒ/
           nasal             /m, n, ŋ/
           fricative         /f, v, θ, ð, s, z, ʃ, ʒ, h/
           liquid            /l, r/
           glide             /w, j/
Place      glottal           /h/
           labial            /p, b, v, f, m/
           dorsal            /k, g, ŋ, j/
           coronal           all others
Subplace   dental            /θ, ð/
           post-alveolar     /tʃ, dʒ, ʃ, ʒ/
           ∅                 all others
Voicing    voiced (binary)   /b, v, ð, d, z, dʒ, ʒ, g/
I used the step() function (Hastie and Pregibon, 1992; Venables and Ripley,
2002) in R (R Development Core Team, 2012) to allow the best non information
theoretic model to be chosen automatically, and then added four information theoretic
variables of interest: word frequency, segment probability (uniphone), segment
informativity and the local predictability of the segment, all defined in table 2.7, and
estimated from spoken corpora following a procedure detailed in §2.8. It is important
to note that informativity is residualized using uniphone as the baseline and that
local predictability is residualized using both uniphone probability and
informativity.13 Thus, these factors will only be significant if they improve the model beyond
the (unconstrained) effect of the variables they are residualized over.

13 This means that residual informativity is the original value of informativity minus an approximation of informativity using frequency in a linear regression of the form informativity(segment) ≈ intercept + uniphone(segment), and local predictability is similarly residualized using a formula of the form negative log predictability(segment in context) ≈ intercept + frequency(segment) + informativity(segment).

Table 2.6: Environment properties
Variable          Value
Stress            neighboring vowel has primary, secondary or no stress
POS duration      the median duration of all segments with the same part of speech
Phrase distance   distance from the end of the phrase in words, logged
Start position    distance from the beginning of the word in segments, logged
End position      distance from the end of the word in segments, logged
Table 2.7: Variables of interest
Variable                       Value
Word frequency                 the frequency of the word, logged
Segment probability            the negative log unigram probability of observing the segment (2.7)
Segment informativity          the informativity of the segment (2.21) using all earlier segments in
                               the same word as context, residualized using segment probability
Segment local predictability   the negative log local predictability of the segment (2.10) using all
                               earlier segments in the same word as context, residualized using
                               segment probability and segment informativity
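Residualization of the kind described for informativity can be sketched as follows: regress the variable on its baseline and keep what the baseline does not explain. The function name and the numbers are hypothetical, and the sketch covers only the single-baseline case; residualizing local predictability over two baselines would require a two-predictor regression.

```python
from statistics import mean

def residualize(target, baseline):
    """Return target minus its least-squares linear fit on baseline."""
    mean_x, mean_y = mean(baseline), mean(target)
    sxx = sum((x - mean_x) ** 2 for x in baseline)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(baseline, target))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(baseline, target)]

# Hypothetical per-segment scores (not values from this chapter).
uniphone = [1.0, 2.0, 3.0, 4.0]
informativity_raw = [1.5, 1.9, 3.2, 3.8]

residual_informativity = residualize(informativity_raw, uniphone)
```

By construction the residuals have zero mean and are uncorrelated with the baseline, so any effect they show in a later regression cannot be credited to the baseline variable.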
The model was reevaluated using a mixed effects model with the identity of
the word as a random effect using R’s lme4 package (Bates et al., 2011). The
pMCMC-values reported below were evaluated using the function pval.fnc() from
R’s languageR package (Baayen, 2011). pMCMC values are computed using Markov
chain Monte Carlo sampling of the data and are more conservative than p values. See
Baayen et al. (2008) for further details.
Results All four variables of interest affected the duration of intervocalic conso-
nants. High word frequency predicted shorter segment duration (pMCMC < 0.001),
as predicted by many previous studies (Zipf, 1949; Bell et al., 2009, among oth-
ers). High uniphone score (negative log segment probability) was correlated with
longer duration (pMCMC < 0.001). Similarly, residual informativity was correlated
with longer duration (pMCMC < 0.001), and so was residual local predictability
(pMCMC < 0.001).14
Among the control variables, the duration of segments that followed stressed vowels
was shorter (pMCMC < 0.001), but the duration of segments that preceded
stressed vowels was longer (pMCMC < 0.001), an asymmetry that is explained by
American English tapping patterns: /t, d/ tap in intervocalic contexts that follow
stressed syllables and precede unstressed syllables (Kahn, 1976), and the duration of
taps ([ɾ]) is shorter than the duration of [t] and [d]. The duration of segments was
significantly affected by their part of speech (pMCMC < 0.01). Finally, segments
were shorter the further they were from phrase-final position, as predicted by
end-of-phrase lengthening. For a complete list of control variables and their effect, please
see §2.9.1.
Discussion The results show the role of information theoretic measurements in affecting
the duration of segments. After controlling for phonological features such as
place of articulation and phonetic properties such as rate of speech, the duration of
segments that had high uniphone score (low probability), high informativity and that
were unpredictable in their context was relatively longer. The durational modulation
of segments is affected by both segment-level information theoretic factors such as
informativity and by word-level information theoretic factors such as word frequency.
These results establish the importance of segment informativity as they show
that informativity affects not only deletion ratios, but also the duration of segments.
Reduced duration may in turn lead to deletion, which is the focus of the following
section.
14 Word frequency is measured in log number of words in several corpora. Higher frequency yields higher log frequency. Frequent, predictable and low-informativity segments provide less information and therefore have lower uni/bi/triphone or local predictability or informativity scores. They are measured using negative log (conditional) probability.
2.5.3 Intervocalic segment deletion
Introduction The previous study shows that the duration of segments is affected
by information theoretic factors. But diminished duration does not have to lead to
deletion. Deletion may happen when the duration of a segment is reduced beyond
the minimal duration that would allow the articulators to pronounce it. For different
segments the threshold beyond which deletion would occur may be different. Deletion
may therefore be independent of durational modulation, and independently driven.
Another important property of deletion processes is that durational modulation
is not considered to be part of competence grammar.15 Though occasional deletion of
segments is not considered to be part of competence grammar either, it is very likely
that what eventually becomes part of competence grammar begins as frequent occasional
deletion that subsequently gets grammaticalized and preserved in the grammar.
Thus, though word-medial /t/-deletion is not part of English grammar, word-final
/t/-deletion is allowed in many English dialects. In contrast, similar processes such
as /k/-deletion are not allowed in either environment. It is not far-fetched to claim
that the occasional deletion of word-medial /t/ is related to word-final deletion, and
perhaps driven underlyingly by similar factors.
It is therefore important to verify that information theoretic factors affect the
deletion ratios of segments in the same environment in which durational effects were
found.
Method and materials The corpus used in this study was aligned using the same
procedure described in the previous study and in §2.7. The control variables used in
the previous study were used in this study as well. I used the phonological control
variables in table 2.5 to control for the base properties of the segment in question.
Finally, I used the phonological control variables in table 2.6 to control for the properties
of both vowels and word-level variables. Since deleted segments were not excluded,
the same procedure yielded 30,052 observations.
15 Several exceptions to this generalization do exist. Consonant and vowel duration is contrastive in many languages. This set of studies focuses on the duration of obstruents, which is not contrastive in English.

I used logistic regression to estimate how each factor affects the likelihood of segments
to delete. The best model that excludes information theoretic factors was
chosen using the step() function in R. Subsequently, I added the same four information
theoretic variables of interest, as defined in table 2.7, and estimated from spoken
corpora using the CMU dictionary (Weide, 1998) for phonemic (underlying) representation
and the Switchboard (Godfrey and Holliman, 1997), Fisher (Cieri et al.,
2004, 2005) and Buckeye (Pitt et al., 2007) corpora for word counts.16 The model
was reevaluated using a mixed effects model with the identity of the word as a random
effect using R’s lme4 package (Bates et al., 2011). pMCMC-values (Baayen
et al., 2008) are not reported here since lme4 does not allow them to be computed
for logistic regressions. Instead, p-values are reported.
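For readers less familiar with logistic regression, the sketch below shows how a fitted coefficient translates into a change in the odds of deletion. The intercept, the coefficients and the function names are hypothetical values chosen for illustration; they are not estimates from this study.

```python
from math import exp

def deletion_probability(intercept, coefficients, predictors):
    """Inverse-logit of the linear predictor: the modeled Pr(deletion)."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + exp(-linear))

def odds(p):
    return p / (1.0 - p)

# Hypothetical coefficients: [word frequency, residual informativity].
intercept = -3.0
coefficients = [0.4, -0.6]

p_low_info = deletion_probability(intercept, coefficients, [1.0, 0.0])
p_high_info = deletion_probability(intercept, coefficients, [1.0, 1.0])

# Raising informativity by one unit multiplies the odds by exp(-0.6).
odds_ratio = odds(p_high_info) / odds(p_low_info)
```

A negative coefficient therefore lowers the odds of deletion by a constant factor per unit of the predictor, whatever the values of the other predictors.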
Results As expected, high word frequency increased the likelihood to delete (p
< 0.001), and high informativity score decreased the likelihood to delete. Uniphone
segment probability (its negative log probability) did not affect the segment’s likelihood
to delete, and neither did its residual contextual predictability.
Among the control variables, previous stress did not affect the segments’ likelihood
to delete but the following stress did: segments that were followed by stressed vowels
were less likely to delete (p < 0.001). Segments’ likelihood to delete was significantly
affected by their part of speech (p < 0.001).17 Finally, segments in words that were
more distant from phrase-final positions were more likely to be deleted (p < 0.001).
For a complete list of control variables and their effect, please see §2.9.2.
Discussion The lack of effect for the segment’s uniphone probability and local pre-
dictability may be due to the greater number of observations required in a logistic
regression than in a linear regression, which may also have caused the effect of pre-
ceding stressed vowel to disappear. However, there may be other reasons. As I noted
before, deletion and reduction in duration do not necessarily have identical causes. It
is therefore important that in both the duration study and the deletion study, both
the amount of information the segment carries and the amount of information the
word carries affected segment duration and deletion in the expected direction: the
more information the word or the segment held, the longer their duration was and
the less likely they were to be deleted.

16 For further details, see §2.8.
17 The coefficient for part of speech has to be ≥ 0. The variable contains for each segment the log mean duration of segments that have the same part of speech. If the part of speech were irrelevant to segment duration, the coefficient would have been close to 0.

2.5.4 Postvocalic segment duration
Introduction The goal of this study and the subsequent study is to verify that
the information theoretic effects found for intervocalic consonants persist in other
environments. This environment replaces the coda environment used in Raymond
et al. (2006), Cohen Priva (2008) and Cohen Priva and Jurafsky (2008), even though
some of the segments in such positions are actually the first consonant in a complex
onset. For instance, CELEX (Baayen et al., 1995) treats the /s/ in estrange and the
/p/ in appreciate as the first consonant of the second syllable in both words.
The expectation is that the correlation between duration and information that was
observed in intervocalic context would be replicated in postvocalic preconsonantal
environment as well.
Method and materials The procedure used in this study is very similar to the one
used in the intervocalic duration study described in §2.5.2. The procedure described in
§2.7 was used to align the dictionary representations of the Buckeye corpus with their
actual pronunciation. Segment duration, rate of speech and phonological properties
were collected as detailed above. I excluded all deleted segments and segments with
unusually long or short duration (top and bottom 2.5% of each segment). This
procedure yielded 35,081 data points.
As in the previous duration study, I used linear regression to model the log duration
of segments. I used the phonological control variables in table 2.5 to control for
the base properties of the segment in question. I used the same phonological control
variables for the following consonant, as well as a variable that indicates whether the
two consonants share a place of articulation. Finally, I used the phonological control
variables in table 2.6 to control for the properties of the preceding vowel and word-level
variables.
I used the step() function to allow the best non information theoretic model to be
chosen automatically, and then added the same four information theoretic variables
of interest: word frequency, segment probability (uniphone), segment informativity
and the local predictability of the segment, all defined in table 2.7, and estimated
from spoken corpora following a procedure detailed in §2.8.
The model was reevaluated using a mixed effects model with the identity of the
word as a random effect using R’s lme4 package. The pMCMC-values reported below
were evaluated using the function pval.fnc() from R’s languageR package.
Results The results were similar to the intervocalic consonant duration case. All
four variables of interest affected the duration of postvocalic preconsonantal consonants
in the predicted direction. High word frequency predicted shorter segment duration (pMCMC
< 0.001), high uniphone score was correlated with longer duration (pMCMC <
0.001), and so were informativity (pMCMC < 0.001) and local predictability (pMCMC
< 0.05).
The control variables followed a similar pattern to the intervocalic duration study.
The duration of segments that followed stressed vowels was shorter (pMCMC <
0.001). The duration of segments was significantly affected by their part of speech
(pMCMC < 0.001). Segments were shorter the further they were from phrase-final
position, as predicted by end-of-phrase lengthening. For a complete list of control
variables and their effect, please see §2.9.3.
Discussion The results provide further support for the intervocalic consonant duration
study. The same variables had effects in the same direction, which shows that the
correlation between duration and information was not environment-specific, but rather a
fundamental property of American English segments.
2.5.5 Postvocalic segment deletion
Introduction As with the intervocalic case, it is not necessary that reduced du-
ration would lead to deletion. As I argued above, high ratios of occasional deletion
patterns are arguably a necessary step before optional and obligatory deletions are
encoded in speakers’ competence grammar.
I replicate the intervocalic study of consonant deletion in postvocalic preconsonan-
tal positions to see whether in this environment, too, high information is correlated
with reduced likelihood to delete.
Method and materials The corpus used for this study is identical to the one used
for the duration study in the same environment, except that deleted segments were
not excluded. The control variables used in the previous study were used in this
study as well. I used the phonological control variables in table 2.5 to control for
the base properties of the segment in question. I used the same phonological control
variables for the following consonant, as well as a variable that indicates whether
the two consonants share a place of articulation. Finally, I used the phonological
control variables in table 2.6 to control for the properties of the preceding vowel and word-level
variables. Since deleted segments were not excluded, the same procedure yielded
39,265 observations.
I used logistic regression to estimate how each factor affects the likelihood of seg-
ments to delete. The best model that excludes information theoretic factors was
chosen using the step() function in R. Subsequently, I added the same four informa-
tion theoretic variables of interest, as defined in table 2.7. The model was reevaluated
using a mixed effects model with the identity of the word as a random effect using
R’s lme4 package. pMCMC-values are not reported here since lme4 does not allow
them to be computed for logistic regressions. Instead, p-values are reported.
Results Three of the information theoretic variables affected segments’ likelihood
to be deleted. High word frequency predicted higher probability to be deleted (p <
0.001), high uniphone score decreased the likelihood to be deleted (p < 0.05), and
high informativity decreased segments’ likelihood to be deleted (p < 0.05). However,
there was no residual effect for local predictability.
The control variables followed a pattern similar to the previous study. Segments
that followed a vowel with primary stress were more likely to be deleted (p < 0.001)
than those that followed unstressed vowels, but those that followed vowels with secondary
stress did not differ from those that followed unstressed vowels. Segments’ likelihood to be
deleted was significantly affected by their part of speech (p < 0.001). Segments
were more likely to be deleted the further they were from phrase-final position.
For a complete list of control variables and their effect, please see §2.9.4.
Discussion Segment deletion in postvocalic positions followed a similar pattern to
segment duration in the same environment for both the information theoretic factors
and the various controls. The repeated similarity to duration studies suggests that
the effect information theoretic variables have on segment duration and deletion ratios
is consistent. High amount of information at both the word level and segment level
leads to increased duration and preservation whereas low amount of information is
associated with reduction in duration and ultimately deletion.
The significance of most variables was lower across the board compared with the
previous study. The absence of significance for some variables may be due to the
loss of predictive power in logistic regressions and not due to fundamentally different
factors.
2.6 Conclusion
Both the theoretical analysis of oral and nasal stop deletion ratios in English and
the data-driven experiments of medial consonant deletion demonstrate the appeal
of an approach that uses a context-independent measurement for phone usefulness
in explaining the typology of consonant deletion processes. However, this out-of-
context measurement emerges from contextual considerations: speakers are biased by
how useful a phone usually is, regardless of the context in which it appears. There
is no clear functional justification for using this aggregate instead of the in-context
predictability, which suggests that informativity becomes part of the knowledge kept
about each phone. This provides us with a relatively rare view into the relationship
between functional and non-functional considerations in human language. Functional
considerations (in this case local predictability) shape a mental representation (in this
case phone informativity), which is then reflected back in language usage.
Previous accounts cannot provide a comprehensive explanation for the in-language
typology of consonant deletion. The different durations and deletion ratios of Ameri-
can English segments cannot be explained by phonetic or phonological reasons such as
markedness or articulatory and perceptual biases. Frequency and local predictability
predict some of the variance that phonological theories do not account for, but do
not explain why phones can be both infrequent and likely to delete, or completely
predictable yet stable. At the same time, previous information-theoretic measures
such as local predictability cannot be dismissed, as the various models supported the
use of local predictability and segment frequency. Informativity complements current
accounts by providing a mechanism that accounts for the shortcomings of current
theories.
This chapter revisits and expands on the findings of Cohen Priva (2008) and Co-
hen Priva and Jurafsky (2008). It establishes the importance of information content
and in particular segment informativity to performance-related phonetic and phono-
logical phenomena, the duration and deletion ratios of segments. As already hinted
in this chapter, the next step is to see how the same factors lead to competence-based
phenomena such as the actuation of phonological processes, and subsequently how
such processes are encoded in the lexicon and reflected in usage preferences.
2.7 Segment Alignment Process
The Buckeye corpus (Pitt et al., 2007) provides several values for each word: the
speaker of the word, the duration of the word, the word in English (2.25), the
dictionary (idealized) phonetic form (2.26), the word’s actual pronunciation (2.27) and
its part of speech (2.28).
(2.25) notice: notice
(2.26) n ow t ih s: /noʊtɪs/
(2.27) n ow ah s: [noʊəs]
(2.28) VBP: present tense verb, not 3rd person
Part of the challenge in using the corpus is to establish, in a disciplined way,
that in (2.27) /t/ was dropped by the speaker, that /ɪ/ surfaced as [ə], and
that the other segments remained unchanged. One way to align two strings
is to minimize an edit-distance metric between them, and to use the list of
edits that yields the minimal edit distance. One such metric is the Levenshtein
distance (Levenshtein, 1966). The simplest calculation of edit distance between
strings assigns an identical penalty to each of the three edit operations: substitution,
deletion and insertion. Thus, the (minimal) edit distance between
the strings “bests” and “guest” is 3: ‘b’ was replaced with ‘g’, ‘u’ was inserted and
‘s’ was deleted.18
It is not advisable to use equal penalties for the three edit operations because some
substitutions are phonologically plausible and other substitutions are not motivated.
The penalties for each insertion, deletion and substitution were therefore modified
to reflect phonological plausibility. I used the penalties in table 2.8 to align the
underlying representations with their surface forms. The results of the alignment
process (the number of each segment aligned with each other segment, excluding
vowels) can be found at the end of this section.
Table 2.8: Dictionary and surface alignment penalties
Source          Target                 Penalty
segment         same segment           0.0
segment         ∅ (deletion)           1.0
∅               segment (insertion)    1.0
vowel           another vowel          0.3
nasal           another nasal          0.3
dorsal stop     another dorsal stop    0.3
sibilant        another sibilant       0.3
/t,d,θ,ð/       another /t,d,θ,ð/      0.3
/l,l̩/           another /l,l̩/          0.3
/r,ɚ/           another /r,ɚ/          0.3
/t,d,ʔ,ɾ/       another /t,d,ʔ,ɾ/      0.3
/p,b,f,v/       another /p,b,f,v/      0.3
/m,b/           another /m,b/
vowel           non-vowel              5.0
non-vowel       vowel                  5.0
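The weighted alignment described in this section can be sketched as a standard dynamic-programming edit distance followed by a backtrace. This is an illustration only, not the code used for the dissertation: the names `align`, `unit_cost` and `toy_cost` are hypothetical, only the vowel rows of table 2.8 are encoded, and the penalty of 1.0 for consonant pairs that fall outside any listed class is an assumption.

```python
# Reduced vowel set in Buckeye/CMU-style labels (illustrative, not exhaustive).
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er", "ey",
          "ih", "iy", "ow", "oy", "uh", "uw"}


def unit_cost(a, b):
    """Plain Levenshtein substitution penalty: every substitution costs 1."""
    return 0.0 if a == b else 1.0


def toy_cost(a, b):
    """Substitution penalty using only the vowel rows of table 2.8."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.3          # vowel -> another vowel
    if (a in VOWELS) != (b in VOWELS):
        return 5.0          # vowel <-> non-vowel
    return 1.0              # assumption: consonant pair not covered by a class


def align(source, target, sub_cost, indel=1.0):
    """Return (minimal penalty, list of (source, target) pairs; None = no segment)."""
    n, m = len(source), len(target)
    # dist[i][j] is the minimal penalty for aligning source[:i] with target[:j].
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + indel
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
                dist[i - 1][j] + indel,   # deletion of source[i - 1]
                dist[i][j - 1] + indel,   # insertion of target[j - 1]
            )
    # Walk back from the corner; ties are broken arbitrarily (match first).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and abs(
            dist[i][j] - dist[i - 1][j - 1] - sub_cost(source[i - 1], target[j - 1])
        ) < 1e-9:
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and abs(dist[i][j] - dist[i - 1][j] - indel) < 1e-9:
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


if __name__ == "__main__":
    print(align(list("bests"), list("guest"), unit_cost)[0])
    print(align("n ow t ih s".split(), "n ow ah s".split(), toy_cost))
```

With `toy_cost`, the dictionary form in (2.26) aligns with the surface form in (2.27) by pairing `t` with nothing (a deletion) and `ih` with `ah` (a vowel substitution), which is the reading described in the text.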
In order to get reliable predictability scores, the Switchboard (Godfrey and Holliman,
1997) and Fisher (Cieri et al., 2004, 2005) corpora were used to provide word
counts in addition to the Buckeye corpus. For many of these words the Buckeye
corpus did not provide dictionary representations, and the CMU dictionary was used
instead. The Buckeye dictionary representation is similar to the CMU representation,
but the two are not identical. The substitutions in table 2.9 were allowed, and the word
was allowed to have its CMU representation. Any other substitution meant that the
word was excluded from the data.
18 If two routes have the same penalty, it is not defined which one is better. The (minimal) edit distance between “ab” and “bc” is 2, but there are two ways to get to that value: substitute ‘b’ for ‘a’ and ‘c’ for ‘b’ (1+1), or delete ‘a’ and insert ‘c’ (1+1). In these cases the algorithm arbitrarily decides between the two.
Table 2.9: Buckeye to CMU valid substitution
CMU                      Buckeye
any vowel                any vowel
/s/                      /ʃ/
/ʃ/                      /s/
/s/                      /z/
/z/                      /s/
/t/                      /d/
/d/                      /t/
/ɹ/                      /ɚ/
/ɚ/                      /ɹ/
any vowel + /l,m,n,ɹ/    /l̩,m̩,n̩,ɚ/ respectively
Finally, the word-level files listed above had to be aligned with the segment-level
files which contained segment duration. Segments in either file that did not have an
equivalent in the other file were removed from the data.
phone aa ae ah ao aw ay b ch d dh dx eh ehn el en er
b 0 0 0 0 0 0 3877 0 0 0 0 0 0 0 0 0
ch 0 0 0 0 0 0 0 634 0 0 0 0 0 0 0 0
d 0 0 0 0 0 0 1 1 2683 10 1361 0 0 0 0 0
dh 0 0 0 0 0 0 0 0 0 1282 29 0 0 0 0 0
f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
hh 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
jh 0 0 0 0 0 0 0 15 1 0 1 0 0 0 0 0
k 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
l 0 0 0 0 0 0 0 0 0 1 1 0 0 345 0 0
m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
n 0 0 0 0 0 0 0 0 4 0 1 0 0 0 648 0
ng 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
p 0 0 0 0 0 0 112 0 0 0 0 0 0 0 0 0
r 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 364
s 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0
sh 0 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0
t 0 0 0 0 0 0 1 434 516 2 2820 0 0 0 0 0
th 0 0 0 0 0 0 3 0 5 137 3 0 0 0 0 0
v 0 0 0 0 0 0 10 0 0 0 1 0 0 0 0 0
w 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
z 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
zh 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
phone f g hh jh k l m n ng nx p r
b 0 0 0 0 0 0 11 0 0 0 11 2
ch 0 0 0 0 1 0 0 0 0 0 0 0
d 0 2 0 15 0 1 2 10 0 2 0 2
dh 0 0 0 0 0 1 0 0 0 0 0 0
f 1862 0 0 0 0 0 0 0 0 0 4 0
g 0 1446 0 0 18 0 0 0 3 0 0 0
hh 0 0 324 0 0 0 0 0 0 0 0 0
jh 0 0 0 760 0 0 0 0 0 0 0 0
k 0 96 4 0 8208 0 0 0 0 0 0 0
l 0 0 0 0 0 10415 1 2 0 1 0 1
m 0 0 0 0 0 0 5226 14 0 8 2 0
n 0 0 0 0 0 1 100 17020 42 3082 0 2
ng 0 1 0 0 1 0 0 62 2777 1 0 0
p 13 0 0 0 0 0 0 0 0 0 4680 0
r 0 0 1 1 0 1 0 0 0 0 0 12763
s 0 0 0 0 1 0 0 0 0 0 0 1
sh 0 0 0 0 0 0 0 0 0 0 0 0
t 0 0 1 1 0 1 0 5 0 3 2 2
th 3 0 0 0 0 1 1 3 0 0 3 0
v 36 0 0 0 0 0 0 0 0 0 0 2
w 0 0 0 0 0 0 0 0 0 0 0 0
y 0 0 0 0 0 0 0 0 0 0 0 0
z 0 0 0 0 0 0 0 0 0 0 0 0
zh 0 0 0 0 0 0 0 0 0 0 0 0
phone s sh t th tq v w y z em zh
b 0 0 0 0 0 91 5 0 0 0 0
ch 0 103 4 0 0 0 0 1 0 0 1
d 0 0 23 0 11 0 0 0 0 0 2
dh 0 0 0 2 0 0 0 0 0 0 0
f 0 0 0 0 0 11 0 0 0 0 0
g 0 0 0 0 0 0 0 1 0 0 0
hh 0 0 0 0 0 0 0 0 0 0 0
jh 0 8 0 0 0 0 0 0 3 0 50
k 0 0 1 0 1 0 0 0 0 0 0
l 0 0 0 1 0 0 10 0 0 0 0
m 0 0 0 0 0 0 1 0 0 180 0
n 1 0 35 0 2 0 0 0 0 6 0
ng 0 0 0 0 0 0 0 0 0 0 0
p 0 0 0 0 1 31 0 0 0 0 0
r 1 0 0 0 0 0 0 1 0 0 1
s 9141 213 2 0 0 0 0 0 107 0 8
sh 4 1513 0 0 0 0 0 0 2 0 4
t 7 13 6537 2 576 0 0 1 3 0 1
th 0 0 44 1181 20 0 0 0 1 0 0
v 0 0 0 0 1 4092 2 0 0 0 0
w 0 0 0 0 0 0 1658 0 0 0 0
y 0 0 0 0 0 0 0 1089 0 0 0
z 311 2 0 0 0 0 0 0 1617 0 14
zh 0 4 0 0 0 0 0 0 2 0 84
2.8 Calculating information theoretic measurements
In order to calculate the frequency, predictability and informativity of each segment,
I used several corpora of spoken American English. I collected word counts from the
Switchboard (Godfrey and Holliman, 1997), Fisher (Cieri et al., 2004) and Buckeye
(Pitt et al., 2007) corpora. Each word was assumed to have its phonetic representation
in the CMU dictionary (Weide, 1998). The following information theoretic variables
were assessed using maximum likelihood estimates from the corpora.
• Word frequency is the number of times the word appeared in the corpora.
• Segment probability is the number of times the segment appeared in the dictio-
nary representation of a word that appeared in the corpora (ignoring deletions,
lenitions etc.). The negative log (base 2) of segment probability is taken. This is
the number of bits of information that the segment holds if no other information
is known.
• Segment predictability is the number of times a segment appeared following
the segments that precede it from the beginning of the word, over the number
of occurrences that the segments that precede it appeared with any following
segment (or ended without any segment following). In (2.29) the corpus contains
just three words. The predictability of /s/ in the word talks is 0.25, as /s/ follows
the talk- prefix once for every four occurrences of the prefix. The negative log
(base 2) of predictability is taken. This is the number of bits of information
that the segment holds if no other information is known except the preceding
segments.
(2.29) Sample word counts:
talk 200
talks 100
talking 100
The predictability of /s/ in the word talks:
$$\frac{100}{200 + 100 + 100} = 0.25$$
The negative log predictability of /s/ in the word talks is $-\log_2 0.25 = 2$.
• Segment informativity is the weighted average of the negative log predictability
of all the occurrences of a segment. In (2.30) the corpus consists of six words,
and /s/ appears in two of them: in the word talks, in which it follows talk- with
probability of 0.25 ($-\log_2 0.25 = 2$), and in the word walks, in which
it follows walk- with probability of 0.5 ($-\log_2 0.5 = 1$). /s/ appears in talks
100 times, and in walks 300 times. The informativity of /s/ in this corpus is
$$\frac{100 \cdot 2 + 300 \cdot 1}{100 + 300} = 1.25$$
(2.30) Sample word counts:
talk 200
talks 100
talking 100
walk 150
walks 300
walking 150
The predictability of /s/ in the word walks:
$$\frac{300}{150 + 300 + 150} = 0.5$$
The negative log predictability of /s/ in the word walks is $-\log_2 0.5 = 1$.
• Residual segment informativity: informativity is highly collinear with frequency.
In order to remove that collinearity, segment informativity is residualized using
segment probability. Suppose that, across all the observations, the informativity
of the segments is $\vec{y}$ and their (negative log) probability is $\vec{x}$. A linear regression
is performed, which fits $\vec{y} \approx a\vec{x} + b$. The predicted value of this regression is
$\mathrm{predicted}(\vec{y}) = a\vec{x} + b$ (the values of $a$ and $b$ are chosen so that the predictions
of the regression best fit $\vec{y}$). Rather than use $\vec{y}$ to approximate informativity in
subsequent regressions, $\vec{y} - \mathrm{predicted}(\vec{y})$ is used, thereby leaving only that part
of informativity which is not explained by segment probability.
• Residual segment predictability: both informativity and frequency are highly
collinear with negative log predictability. In order to remove that collinearity,
negative log predictability is residualized using negative log probability and
informativity. Suppose that, across all the observations, the negative log predictability
is $\vec{z}$, the informativity is $\vec{y}$ and the negative
log probability is $\vec{x}$. A linear regression is performed, which fits $\vec{z} \approx a\vec{x} + b\vec{y} + c$
(the values of $a$, $b$ and $c$ are chosen so that the predictions of the regression
best fit $\vec{z}$). The predicted value of this regression is $\mathrm{predicted}(\vec{z}) = a\vec{x} + b\vec{y} + c$.
Rather than use $\vec{z}$ to approximate negative log predictability in subsequent
regressions, $\vec{z} - \mathrm{predicted}(\vec{z})$ is used, thereby leaving only that part of negative
log segment predictability which is not explained by informativity or negative
log probability.
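The measures defined in this list can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind the reported values: the function names are hypothetical, the toy corpus is the one in (2.30) with CMU-style segment labels chosen for readability, and `residualize` covers only the one-predictor case used for residual informativity (the two-predictor case for residual predictability would need a multiple regression).

```python
import math
from collections import defaultdict

# Toy corpus from (2.30): each word is a tuple of segments with its count.
TOY = {
    ("t", "ao", "k"): 200,
    ("t", "ao", "k", "s"): 100,
    ("t", "ao", "k", "ih", "ng"): 100,
    ("w", "ao", "k"): 150,
    ("w", "ao", "k", "s"): 300,
    ("w", "ao", "k", "ih", "ng"): 150,
}


def prefix_counts(word_counts):
    """Token count of every word-initial segment sequence, including whole words."""
    counts = defaultdict(int)
    for segments, count in word_counts.items():
        for end in range(len(segments) + 1):
            counts[segments[:end]] += count
    return counts


def neg_log_predictability(segments, position, counts):
    """Bits carried by segments[position] given the segments that precede it."""
    with_segment = counts[segments[:position + 1]]
    prefix_only = counts[segments[:position]]
    return -math.log2(with_segment / prefix_only)


def informativity(segment, word_counts):
    """Token-weighted average of negative log predictability for one segment."""
    counts = prefix_counts(word_counts)
    total_bits, total_tokens = 0.0, 0
    for segments, count in word_counts.items():
        for position, current in enumerate(segments):
            if current == segment:
                bits = neg_log_predictability(segments, position, counts)
                total_bits += count * bits
                total_tokens += count
    return total_bits / total_tokens


def residualize(y, x):
    """Residuals y - (a*x + b) of the least-squares line fitted to (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return [yi - (a * xi + b) for xi, yi in zip(x, y)]


if __name__ == "__main__":
    print(informativity("s", TOY))
```

On the toy corpus this reproduces the worked values in the text: 2 bits for /s/ in talks, 1 bit for /s/ in walks, and an informativity of 1.25 for /s/.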
2.9 Deletion and duration models
2.9.1 Intervocalic segment duration model
Factor Estimate MCMCmean HPD95lower HPD95upper pMCMC Pr(>|t|)
Intercept -2.7052 -2.6631 -2.8509 -2.4613 0.0001 0.0000
rate of speech (log) -0.3285 -0.3304 -0.3483 -0.3126 0.0001 0.0000
manner: nasal -0.6809 -0.6673 -0.6894 -0.6451 0.0001 0.0000
manner: stop -0.5599 -0.5366 -0.5561 -0.5175 0.0001 0.0000
manner: affricate -0.0217 -0.0167 -0.0590 0.0279 0.4460 0.3736
primary stress vowel next 0.1755 0.1857 0.1619 0.2091 0.0001 0.0000
secondary stress vowel next 0.0946 0.1091 0.0825 0.1383 0.0001 0.0000
seg. is voiced -0.3690 -0.3746 -0.3929 -0.3566 0.0001 0.0000
seg. POA: coronal 0.7533 0.7475 0.6748 0.8228 0.0001 0.0000
seg. POA: dorsal 1.0556 1.0412 0.9690 1.1181 0.0001 0.0000
seg. POA: labial 0.8054 0.7900 0.7199 0.8619 0.0001 0.0000
seg. subplace: dental -0.5154 -0.5162 -0.5586 -0.4747 0.0001 0.0000
seg. subplace: palatal -0.2114 -0.2172 -0.2743 -0.1551 0.0001 0.0000
base duration for POS (log) 0.0580 0.0785 0.0267 0.1325 0.0042 0.0660
primary stress vowel precedes -0.1309 -0.1264 -0.1462 -0.1050 0.0001 0.0000
secondary stress vowel precedes -0.2240 -0.2178 -0.2444 -0.1900 0.0001 0.0000
distance from word end (log) 0.0510 0.0468 0.0262 0.0660 0.0001 0.0000
distance from end of phrase (log) -0.0205 -0.0206 -0.0246 -0.0166 0.0001 0.0000
distance from word start (log) -0.0010 -0.0090 -0.0330 0.0181 0.4842 0.9447
distance from start of phrase (log) 0.0140 0.0138 0.0094 0.0182 0.0001 0.0000
word frequency (log) -0.0081 -0.0089 -0.0133 -0.0045 0.0001 0.0047
seg. probability (log) 0.1343 0.1400 0.1261 0.1532 0.0001 0.0000
seg. informativity 0.0999 0.0967 0.0865 0.1071 0.0001 0.0000
seg. predictability (log) 0.0083 0.0080 0.0043 0.0116 0.0001 0.0001
2.9.2 Intervocalic segment deletion model
Factor Estimate Std. Error z value Pr(>|z|)
Intercept -9.1568 0.8661 -10.5722 0.0000
rate of speech (log) 2.5127 0.1672 15.0255 0.0000
primary stress vowel next -2.0467 0.1557 -13.1488 0.0000
secondary stress vowel next -1.1507 0.1836 -6.2666 0.0000
manner: nasal 1.9423 0.1772 10.9623 0.0000
manner: stop 1.8325 0.1711 10.7124 0.0000
manner: affricate 11.7542 189.5645 0.0620 0.9506
base deletion ratio for POS (log) 0.6699 0.0544 12.3206 0.0000
seg. POA: coronal -3.9470 0.4161 -9.4858 0.0000
seg. POA: dorsal -5.1174 0.4343 -11.7838 0.0000
seg. POA: labial -4.0775 0.3813 -10.6943 0.0000
seg. is voiced 0.6088 0.1154 5.2745 0.0000
distance from word start (log) 1.0311 0.1109 9.2935 0.0000
distance from end of phrase (log) 0.2736 0.0348 7.8542 0.0000
distance from word end (log) 1.0864 0.0941 11.5499 0.0000
seg. subplace: dental 1.1162 0.2910 3.8355 0.0001
seg. subplace: palatal -13.4598 189.5624 -0.0710 0.9434
word frequency (log) 0.2603 0.0223 11.6516 0.0000
seg. probability (log) -0.1632 0.1116 -1.4625 0.1436
seg. informativity -0.2852 0.0907 -3.1439 0.0017
seg. predictability (log) -0.0646 0.0278 -2.3268 0.0200
2.9.3 Postvocalic segment duration model
Factor Estimate MCMCmean HPD95lower HPD95upper pMCMC Pr(>|t|)
Intercept -0.9082 -0.9158 -1.1088 -0.7221 0.0001 0.0000
rate of speech (log) -0.5007 -0.5001 -0.5174 -0.4829 0.0001 0.0000
manner: nasal -0.4795 -0.4710 -0.4952 -0.4467 0.0001 0.0000
manner: stop -0.3601 -0.3534 -0.3831 -0.3238 0.0001 0.0000
manner: affricate 0.2568 0.2645 0.1349 0.3927 0.0002 0.0003
base duration for POS (log) 0.0837 0.0967 0.0425 0.1503 0.0001 0.0046
distance from end of phrase (log) -0.0861 -0.0857 -0.0900 -0.0814 0.0001 0.0000
primary stress vowel precedes -0.1404 -0.1417 -0.1567 -0.1261 0.0001 0.0000
secondary stress vowel precedes -0.1437 -0.1412 -0.1718 -0.1102 0.0001 0.0000
seg.: voiced -0.4137 -0.4128 -0.4436 -0.3827 0.0001 0.0000
next seg.: voiced 0.1785 0.1807 0.1628 0.2001 0.0001 0.0000
next seg. manner: glide 0.2221 0.2303 0.1909 0.2685 0.0001 0.0000
next seg. manner: liquid 0.2561 0.2471 0.2194 0.2763 0.0001 0.0000
next seg. manner: nasal 0.0118 0.0135 -0.0355 0.0601 0.5892 0.6595
next seg. manner: stop -0.0713 -0.0743 -0.0914 -0.0569 0.0001 0.0000
next seg. manner: affricate -0.0726 -0.0758 -0.1208 -0.0275 0.0016 0.0071
seg. POA: dorsal -0.0434 -0.0429 -0.0809 -0.0039 0.0334 0.0476
seg. POA: labial -0.0872 -0.0970 -0.1311 -0.0649 0.0001 0.0000
distance from word end (log) -0.0669 -0.0706 -0.0899 -0.0532 0.0001 0.0000
seg. subplace: dental -0.5014 -0.5176 -0.6524 -0.3831 0.0001 0.0000
seg. subplace: palatal -0.1672 -0.2036 -0.3175 -0.1008 0.0004 0.0063
next seg. has same POA -0.0035 0.0071 -0.0150 0.0291 0.5188 0.7754
distance from word start (log) -0.0480 -0.0496 -0.0735 -0.0269 0.0001 0.0003
word frequency (log) -0.0121 -0.0123 -0.0163 -0.0082 0.0001 0.0000
seg. probability (log) 0.0854 0.0944 0.0745 0.1140 0.0001 0.0000
seg. informativity 0.0454 0.0515 0.0391 0.0631 0.0001 0.0000
seg. predictability (log) 0.0041 0.0041 -0.0001 0.0080 0.0470 0.0627
2.9.4 Postvocalic segment deletion model
Factor Estimate Std. Error z value Pr(>|z|)
Intercept -3.8959 0.5217 -7.4675 0.0000
rate of speech (log) 1.3780 0.1138 12.1109 0.0000
base deletion ratio for POS (log) 0.8591 0.0666 12.8927 0.0000
next seg. manner: glide -1.8102 0.3099 -5.8413 0.0000
next seg. manner: liquid 0.8470 0.0841 10.0758 0.0000
next seg. manner: nasal 0.4439 0.1596 2.7820 0.0054
next seg. manner: stop 0.1320 0.0649 2.0330 0.0421
next seg. manner: affricate 0.7641 0.1555 4.9152 0.0000
manner: nasal 1.4454 0.0919 15.7216 0.0000
manner: stop 1.2202 0.1100 11.0919 0.0000
manner: affricate -0.9003 204.4568 -0.0044 0.9965
seg. POA: dorsal -0.4479 0.1932 -2.3181 0.0204
seg. POA: labial 0.5075 0.1272 3.9911 0.0001
seg. is voiced 1.7014 0.1211 14.0458 0.0000
next seg. is voiced -0.9739 0.0651 -14.9563 0.0000
distance from word start (log) 0.7052 0.0758 9.3054 0.0000
seg. subplace: dental 4.3805 0.3735 11.7274 0.0000
seg. subplace: palatal -9.1966 131.8608 -0.0697 0.9444
next seg. has same POA 0.1213 0.0928 1.3072 0.1911
distance from end of phrase (log) 0.0798 0.0254 3.1454 0.0017
primary stress vowel precedes 0.1084 0.0600 1.8071 0.0708
secondary stress vowel precedes -0.0346 0.1466 -0.2358 0.8136
word frequency (log) 0.0519 0.0115 4.5175 0.0000
seg. probability (log) -0.6912 0.0840 -8.2335 0.0000
seg. informativity -0.3961 0.0438 -9.0479 0.0000
seg. predictability (log) 0.0356 0.0160 2.2248 0.0261
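One way to read the informativity rows of the two deletion tables is to convert the estimates into odds ratios. The sketch below assumes the deletion models are logistic regressions, so that estimates are on the log-odds scale; the z statistics are consistent with that, but the link function is an assumption here, and the helper name `odds_ratio` is hypothetical. The two coefficients are copied from sections 2.9.2 and 2.9.4.

```python
import math

# Estimates for "seg. informativity" in the two deletion models above.
INFORMATIVITY_COEF = {
    "intervocalic": -0.2852,
    "postvocalic": -0.3961,
}


def odds_ratio(coef):
    """Multiplicative change in the odds of deletion per one-unit increase.

    Valid only if the coefficient is on the log-odds scale (assumed here).
    """
    return math.exp(coef)


if __name__ == "__main__":
    for model, coef in INFORMATIVITY_COEF.items():
        print(f"{model}: odds ratio {odds_ratio(coef):.3f}")
```

Both values fall below 1, which matches the direction reported in the tables: under this reading, each additional bit of informativity lowers the odds that a segment deletes.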
Chapter 3
Faithfulness as Information Utility
3.1 Introduction
Across languages, weakening processes such as lenition, deletion and neutralization
may target any given segment of speech. Currently, linguistic theory has no
means of predicting why some segments are more likely to undergo weakening in
some languages than in others. However, it is clear that in a number of unrelated
languages (e.g. English, Arabic and Huallaga Quechua) a segment or a set
of segments is targeted by parallel weakening processes: several weakening
processes that are unrelated to one another. In some of these cases certain segments are
weakened in multiple environments of a single language variety, while in other cases
certain segments are weakened
in a single environment, but the weakening pattern differs among several varieties of
the language. When parallel weakening processes can be shown to be structurally
or functionally unrelated to one another, they present a significant challenge to lin-
guistic theory, as there is currently no disciplined way to attribute a tendency to
weaken to a subset of segments in a language. The possibility of accounting for a
segment’s tendency to weaken touches upon the famous actuation problem presented
in Weinreich et al. (1968), as such an account would shed light on some of the reasons
that lead to actuation of structural changes in human language. Here, I show how
characterizing linguistic faithfulness as a pressure to preserve information – as quan-
tified in information theory – can lend insight into why language-specific pressures
weaken particular linguistic elements. Combined with extant effort-based accounts of
markedness, I show it is possible not only to describe, but also to explain and predict
language-specific weakening patterns.
Despite the cross-linguistic rarity of weakening processes that target coronals
exclusively (Gurevich, 2004), American English has at least two unrelated /t,d/-
weakening processes: intervocalic tapping (Kahn 1976 and Zue and Laferriere 1979
among others) and word-final deletion (Guy 1991 among others). Similarly, in Huallaga
Quechua (Weber, 1989), /q/ surfaces as [g] or [ɣ] in intervocalic contexts and
as [x] before voiceless obstruents; it deletes word-finally. In both languages intervocalic
and word-final weakening share only the segment being targeted, as they occur
in different environments and result in different outcomes. Both American English
and Huallaga Quechua therefore exhibit a case of parallel weakening processes of a
segment in a single variety.
Dialects of Arabic and UK English provide an even more puzzling case of parallel
weakening processes, as in both languages there are different, parallel weakening
processes of the same segment in different varieties: /t/-debuccalization, tapping
and spirantization in UK English dialects (Mathisen 1999; Urszula 2004; Raymond
2004 among many others), as in the Irish English varieties in (3.1; data from Hickey
2009), and /q/→[g], /q/→[ʔ] and /q/→[k] weakening in Arabic dialects, as in the
dialects in (3.2; data from Kaye and Rosenhouse 1997), all spoken in close geographical
proximity.
(3.1) Variety ‘butter’
Northern varieties [bʌtˆəɹ]
Southern varieties [bʌɾəɻ]
Vernacular Dublin [bʊʔɐ]
(3.2) Dialect baqara ‘cow’ (MSA)
Druze [baqara]
Nazareth [bakara]
Jerusalem [baʔara]
NW Jordan [bagara]
It is important to note that both UK English and Arabic demonstrate parallel weak-
ening and cannot be regarded as a more extreme application of weakening along a
single cline. In English, the tapping of /t/ retains the place but not the manner
of its articulation while the debuccalization of /t/ retains the manner but not the
place of its articulation. In Arabic, debuccalizing varieties can be shown not to have
undergone a /q/→[k] process, as /k/ remains a distinct phoneme in such dialects.
Currently, linguistic theory cannot explain why several varieties of a single language
would undergo such similar yet independent structural changes.
Some of the current theoretical explanations for the typology of possible and
impossible structural changes in language (Weinreich et al.’s constraints problem)
rely on finding an equilibrium between two or more universal forces. For example,
Optimality Theory (Prince and Smolensky, 1993) balances markedness constraints
with faithfulness constraints. The principle of least effort (Zipf, 1949) balances the
speaker’s economy with the auditor’s economy – speakers’ desire to reduce their ef-
fort is bounded by their need to make themselves understood. Modern phonological
theories that balance effort with perceptual confusability (Flemming, 2004; Boersma,
2003) can also be viewed as balancing speakers’ economy with listeners’ economy.1
By balancing two opposing forces such theories rule out certain linguistic configu-
rations or structural changes. For instance, a change that is undesirable on both
scales (such as increasing effort as well as confusability) is predicted not to occur.
Such theories allow for multiple possible equilibria when a change is desirable on one
scale and undesirable on another (such as increasing effort while decreasing confus-
ability or decreasing effort while increasing confusability). However, since multiple
equilibria are available for universal functional forces it is impossible to predict which
equilibrium each language will choose.2 Consequently, no theory that is based on
balancing universal forces can predict which segments are likely to be weakened in
some language.
To compensate for these limitations, I introduce a new functional force, the preser-
vation of information utility. Information utility represents the amount of information
speakers attribute to linguistic elements, and its preservation applies variably: that
is, the more information a linguistic element carries, the stronger the desire to pre-
serve it. I show that like articulatory effort and confusability, information utility can
be estimated by speakers over language use. However, unlike articulatory effort and
confusability, the information utility of linguistic elements differs across structurally
1 In pragmatics the Q and R Principles of Horn (1984) can be viewed as balancing the same two forces, and Horn does indeed make this comparison.
2 Eternal optimization (Boersma, 2003) relies on the inability to choose among multiple good universal equilibria.
similar languages. Therefore, while the desire to preserve information utility is univer-
sal, a particular linguistic element may be subject to stronger or weaker preservation
forces in some languages as opposed to others. Indeed, other things being equal, lan-
guages in which a given linguistic element (such as /t/) has relatively low information
utility will be less likely to preserve that element.
Building on both effort and information utility, I show that the distribution of
weakening processes in different languages emerges from the interaction between two
opposing forces: avoiding effort and maximizing information utility. Effort avoidance
follows from the principle of least effort (Zipf, 1949), and applies variably: that is, the
more effort the articulation of a given linguistic element requires, the less speakers
will be willing to make that effort. Information utility maximization applies variably
as well – the higher the information utility of a linguistic element, the stronger the
pressure to correctly transmit it to listeners. Under this framework, weakening pro-
cesses are actuated when the effort required by the articulation of a linguistic element
exceeds the effort justified by its information utility. The proposed account suggests,
then, that the actuation of weakening processes is explicable in information-theoretic
terms.
Optimality Theory is well-suited to describe the way a grammar may balance
leveling and preserving forces. In OT structural leveling is motivated by marked-
ness constraints, while preservation is motivated by faithfulness constraints. To bet-
ter subject the proposed constraints to empirical scrutiny, in what follows, I model
markedness in terms of effort, and faithfulness in terms of information utility. I then
show how these two opposing forces can be used to aptly predict the distribution of
weakening processes across languages, once given this quantitative characterization.
3.2 Parallel weakening processes
3.2.1 Weakening patterns are not arbitrary
A central goal of every phonological theory is to describe phonological
alternations. Both diachronic sound change and synchronic alternations demonstrate
that segments are mutable under certain conditions, and various theoretical frame-
works describe how a particular structural change can spread through the sound
system of language communities and speakers (Kiparsky 1995; Pierrehumbert 2001
among many others). However, such theories do not attempt to predict what would
lead a given language to undergo particular structural changes rather than others.
Pierrehumbert (2001), for instance, acknowledges that “in a complete model of his-
torical change it will be necessary to offer some explanation of why certain languages
at certain times begin to permit particular leniting changes while not permitting
others.” Ohala (2003, §3, p. 684) goes further by suggesting that it is impossible to
predict the actuation of structural changes: “as far as the initiation of sound change
is concerned, this question may be unanswerable and not worth pursuing.”
In linguistic theory, unanswerable problems typically arise either when a given
change exhibits a (seemingly) random pattern or when that change is caused by
extra-linguistic sources, such as language contact. However, as I aim to show, the
accumulation of weakening processes that target a single segment or set of segments
in one language can be regarded neither as random, nor as solely explicable in terms
of extra-linguistic processes. On the contrary, I contend that there is ample room in
linguistic theory to try and predict which linguistic elements are likely to undergo
weakening in a given language.
3.2.2 Same language, same segments, multiple processes
Current theoretical frameworks are unable to give a consistent account of the
accumulation of language-specific weakening processes that target a set of segments.
American English, which demonstrates a range of /t,d/-weakening and optional
deletion processes, proves illustrative. In American English, final /t/ and /d/ delete at
varying rates (Guy 1991 among others), intervocalic /t/ and /d/ are tapped (Kahn
1976; Zue and Laferriere 1979 among others) and even word-medial /t/ and /d/ are
more likely to delete than other stops (chapter 2). Yet no principled explanation has been
proposed for the multitude of processes that target /t/ and /d/.
Moreover, extant theory fails to explain the selective nature of such processes.
For instance, final consonant deletion in English targets /t/ and /d/ and not other
stops. Similarly, only /t/ and /d/ tap intervocalically, while other oral stops do not
become sonorants in the same environment. While they share a set of segments,
the two processes are otherwise unrelated, as they arise from different pressures that
apply in different environments. In American English the difference between the two
environments is evident in the different outcomes of the two processes: sonorization
in intervocalic environments, and deletion word-finally.
Many theories including Optimality Theory (Prince and Smolensky, 1993; Mc-
Carthy and Prince, 1995) can explain these two processes independently of one an-
other, but might just as easily account for a minimally different English, k-English, in
which /t,d/-tapping remains unchanged, /t,d/-deletion does not occur, but /k/ does
delete word-finally. There is thus no principled reason on offer for why English, as
commonly spoken, and not k-English, is the observed outcome of these processes.
Moreover, difficulties arise in accounting for similar processes independently from
one another. The chief problem is the likelihood of finding such a system. In a cross-
linguistic comparison of intervocalic lenition processes, Kaplan (2010) reports that out
of the 136 languages and dialects in Gurevich (2004) that also have a full consonant
inventory, only American English exclusively weakens coronals intervocalically. In a
survey of lenition processes in 272 languages and dialects in Kirchner (1998), obliga-
tory word-final deletion applies exclusively to coronal stops only in Umbrian (mostly
to /d/), but Buck (1904, p. 146, footnote 1) reports that final /k/ was also “weakly
sounded” and “frequently omitted in writing.” Additionally, excluding dialects of
English, only two languages (Middle Egyptian, Limbu) have some other word-final
coronal-only lenition process.3 Since both processes are rare cross-linguistically, the
odds of having both occur in a single language by chance alone are very small.
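As a rough illustration of this argument, suppose (purely as assumptions) that the survey proportions above are usable estimates of how often each pattern arises, and that the two patterns arise independently of one another. Taking the intervocalic coronal-only pattern to be attested in 1 of 136 languages, and a word-final coronal-only pattern in at most 3 of 272 (Umbrian, Middle Egyptian and Limbu), the chance of finding both in one language is on the order of:

```latex
P(\text{both}) \approx \frac{1}{136} \times \frac{3}{272} = \frac{3}{36992} \approx 8 \times 10^{-5}
```

The exact figure depends heavily on how the surveys are counted, so only its order of magnitude is meaningful here.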
The case of multiple processes targeting the same set of segments is not unique
to American English. In Huallaga Quechua (Weber 1989, also summarized in Kirchner
1998), /q/ surfaces as [g] or [ɣ] in intervocalic contexts, as [x] before voiceless
obstruents, and deletes word-finally. Only /q/ displays such a range of processes in
Huallaga Quechua.
In both American English and Huallaga Quechua, the respective weakening
processes contrast sharply in both the environment in which they apply (intervocalically,
word-finally) and the outcome of the process (lenition, deletion). However, one
can argue that even when the outcome of the process remains constant across envi-
ronments, any case in which environment-specific weakening spreads to environments
3 Word-finally, Limbu has a /t/→[ʔl] process and Middle Egyptian has a /t/→[ʔ] process.
that do not motivate the application of the original process may be regarded as a
case of parallel processes.
Such is the case of Uyghur (Hahn and Ibrahim 1991, summarized in Kirchner
1998), which has a /ɢ/→[ʁ] process that applies both in intervocalic contexts and
word-initially. While in intervocalic contexts Uyghur has additional spirantization
processes that affect all velar and uvular stops as well as /b/, only /ɢ/ also weakens
word-initially (only the weakening of /ɢ/ “drifted” to another position). Word-initial
environments do not typically lead to spirantization, suggesting that the
spirantization of /ɢ/ in word-initial contexts does not follow from the same cause as the
segment’s intervocalic spirantization. Therefore, the word-initial application of
ɢ-spirantization does not merely represent a wider environment for a single process,
but rather a separate, parallel process with an identical outcome.
The accumulation of weakening processes that target a specific set of segments
appears in several unrelated languages. However, current linguistic theory does not
predict that such accumulations can be linguistically motivated. The next section
discusses a closely related phenomenon – parallel weakening across several language
varieties.
3.2.3 Same language, different dialects, similar processes
The accumulation of language-specific weakening processes that target a set of segments
appears not only in several environments in a single language variety, but also
across several varieties of a single language.
In several varieties of British English, /t/ is the target of several different socially-
conditioned weakening processes in intervocalic environments. The most famous and
widespread pattern is debuccalization (as in Mathisen 1999, among others), in which
/t/ surfaces as a glottal stop. Other processes can also be found. In Irish English
varieties /t/ surfaces as [R], [P], [h], ∅ (deletion) and [t„], an apico-alveolar fricative
(Raymond, 2004).4 Similarly, in West Midlands English varieties /t/ may surface as
[t] (unchanged), [P] and [R] (Urszula, 2004).
The weakening processes that lead to the emergence of [P] and [R] realizations of
/t/ are incompatible with one another. When /t/ glottalizes as in (3.3a), its place of
4I follow Raymond (2004) in using a t-based symbol for that fricative.
articulation is not preserved, but it is still a voiceless stop. When /t/ is tapped as in
(3.3b), its place of articulation is preserved, but the manner of articulation is not, as
/t/ becomes an approximant.5
(3.3) (a) ‘water’ [wOP@]
(b) ‘water’ [wORÄ]
Glottalization must therefore evolve from a stage in which intervocalic /*t/ is still
a stop, and tapping must evolve from a stage in which intervocalic /*t/ is still a
coronal. These are therefore parallel weakening clines (3.4). Each individual step on
a weakening cline is a parallel weakening process.
(3.4) Two weakening clines are considered parallel when both clines target an
identical set of linguistic elements, and the input form of either cline could not
have been the output form of the other cline.
Consider an ancestor of P-varieties and R-varieties in which /*t/ is still a voiceless
coronal stop, the t-variety. Since neither change has occurred yet, the speakers of
the t-variety have no evidence that a change is about to happen. However, when
the varieties eventually split, they undergo similar parallel weakening processes. The
varieties that would end up as P-varieties debuccalize or coarticulate a glottal stop
with the voiceless coronal stop, whereas the varieties that would end up as R-varieties
tap, voice or sonorize the voiceless coronal stop.
Parallel weakening processes can be explained in one of three ways. The first
approach is to assume that the parallel processes have indeed developed independently
of one another, by unrelated grammatical changes that happened to achieve similar
results in both P-varieties and R-varieties. The second approach is to assume that even
though the processes are independent of one another, parallel processes are a result
of the ongoing contact between the two families of varieties (a shared innovation).
The third approach is to search for some property of English that would have led
both processes to occur, something that would make the English /t/ more prone to
undergo weakening than other segments and therefore more vulnerable to change.
5A similar argument can be made for varieties in which /t/ surfaces as a fricative [t„].
The argument against the first approach – the independent change account – is
easily stated, and indeed is similar to the argument mentioned in the previous section,
against the independent co-occurrence of multiple different processes that target the
same segments in different environments. In short, the chances of such similar changes
co-occurring independently are quite small. Note that while varieties in which /t/
weakens share that weakening as a common property, the particular property of /t/
that is not being preserved is different in each case. For instance, while P-varieties
require that /t/ cannot have a place of articulation in intervocalic contexts, they
do preserve the place of articulation of other segments. R-varieties and t„-varieties,
meanwhile, require a segment to be sonorant or continuant in intervocalic contexts,
while allowing other stops to faithfully surface as stops in similar contexts. Since the
grammars of all varieties evolved from a grammar in which /t/ surfaced as a coronal
stop, and since the odds of having any intervocalic weakening of exclusively coronal
stops are quite small (in the Kaplan 2010 survey of 136 languages and dialects, only
dialects of English weaken exclusively coronal stops in intervocalic positions), it is
highly improbable that both processes would occur in different dialects by chance
alone.
Let us turn then to the second approach. Could contact between two families
be a good reason for the appearance of parallel processes? Hypothetically, a speaker
of a /t/-variety that is exposed to a P-variety (for instance) might perceive that /t/
is prone to weakening in intervocalic positions and is therefore subject to change.
At a glance, this proposal seems plausible. However, there is a problem with this
suggestion. The exposure to a P-variety would also include exposure to a particular
solution to that problem: glottalization. Were the speaker to embark on a different
weakening process, it would require the speaker to heed only one of these cues, the
weakness of /t/, and not the particular solution, glottalization.
Parallel weakening clines are not unique to English. Classical Arabic /*q/ (Modern
Standard Arabic /q/) has many different surface forms across spoken Arabic
dialects. Kaye and Rosenhouse (1997) report that /*q/ tends to weaken to [P], [k],
[g] and less frequently to [G], as seen in the dialects presented in (3.2), repeated here
as (3.5).
(3.5) Dialect      baqara (Modern Standard Arabic) ‘cow’
      Druze        /baqara/
      Nazareth     /bakara/
      Jerusalem    /baPara/
      NW Jordan    /bagara/
One could give a trivial account for the many forms /q/ takes in Arabic’s different
dialects by claiming that each form is a stage on a single cline, namely the one in
(3.6), in which it is possible to imagine successive weakening processes leading from
/q/ to /P/.
(3.6) q → G → g → k → P
However, the proposed series of changes in (3.6) has to be rejected, since it would
collapse phonemic distinctions that remain intact in dialects such as Egyptian Arabic
(Kilany et al., 1997). In Egyptian Arabic /*q/ surfaces as [P], at the very end of
the hypothetical cline in (3.6). If /q/ weakening had followed the series of changes
proposed in (3.6), then q → G → g would have collapsed /q/ and /g/, and g → k
would have collapsed /g/ and /k/. Yet in Egyptian Arabic all three remain distinct.6
The /q/→[g] and /q/→[k] processes are therefore independent of the /q/→[P] process
that is found in Egyptian Arabic, providing another case of parallel weakening.
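The merger argument against (3.6) can be traced mechanically. The following is a minimal sketch rather than a reconstruction of the historical facts: it assumes that each step on the cline is an unconditioned change affecting every phoneme that currently has the source realization, and it uses the labels q, g and k (hypothetical shorthand) for the three phonemes that stay distinct in Egyptian Arabic.

```python
# Hypothetical sketch: push three phonemes through the single cline in (3.6)
# and compare the result with a direct, parallel /q/ -> [P] process.
CLINE = ["q", "G", "g", "k", "P"]  # q -> G -> g -> k -> P, as in (3.6)


def apply_step(inventory, source, target):
    """Rewrite every phoneme currently realized as `source` to `target`."""
    return {phoneme: (target if surface == source else surface)
            for phoneme, surface in inventory.items()}


def run_cline(inventory, cline):
    """Apply each step of the cline in order and return the final mapping."""
    for source, target in zip(cline, cline[1:]):
        inventory = apply_step(inventory, source, target)
    return inventory


# Each phoneme starts out realized as itself (labels are illustrative only).
start = {"q": "q", "g": "g", "k": "k"}

single_cline = run_cline(start, CLINE)   # every phoneme ends up as P
direct = apply_step(start, "q", "P")     # only /q/ changes

print("single cline:", single_cline)
print("direct process:", direct)
```

Under these assumptions the single cline merges all three phonemes, whereas the direct process keeps them apart, which is the situation reported for Egyptian Arabic.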
Both Arabic and UK English thus exhibit parallel clines of a single segment in
different dialects. In both cases, it is unreasonable to claim that the shared target of
the weakening process is merely due to chance. At the same time, dialectal contact
must also be ruled out. What remains to be seen is whether the third approach might
prove fruitful: whether, perhaps, UK English has some linguistic property that leads
it to weaken /t/ preferentially over other segments, while Arabic has some linguistic
property that leads it to preferentially weaken /q/. This question is the focus of the
rest of this chapter.
6 [g] and [k] are the surface forms of Classical Arabic /*Ã/ and /*k/, respectively. /*g/ is the predecessor of /*Ã/ in Proto-Semitic, and may represent an earlier stage of Arabic.
3.2.4 The challenge of explaining parallel weakening
The accumulation of parallel weakening processes in a given language suggests the
existence of language-specific conspiracies (Kisseberth, 1970). Taken together, the
cases of English, Arabic and Huallaga Quechua provide illustrative examples of parallel
weakening on the opposite ends of the scale of voiceless stops. While /t/ is
unmarked and frequent in the world’s languages (Maddieson, 1984), /q/ is marked
and infrequent. Thus, any universal scale that could apply to both /t/ and /q/ could
also apply to any other subset of voiceless stops (and the standard treatments of
markedness hierarchies can target other subsets of voiceless stops; see discussion in
§3.6). In a similar vein, there is no obvious language-specific functional scale that
could group the English /t/ and Arabic /q/ together and exclude the other voice-
less stops. In both languages /t/ is the most frequent voiceless stop (data from the
studies in §3.5), and the articulatory, acoustic and perceptual effects both /t/s have
are similar, yet English /t/ is subject to parallel weakening processes and the Ara-
bic /t/ is not. It is not clear what linguistic factors can account for the different
language-specific weakening patterns.
The goal of this chapter is not only to provide an explanation or a formal descrip-
tion for each process separately, but to find a common cause for each of these parallel
weakening patterns: in short, to explain weakening, rather than describe it.
3.3 The proposed account
3.3.1 Outline – replacing markedness hierarchies
Optimality Theory’s standard treatment of linguistic phenomena balances marked-
ness on the one hand and faithfulness on the other. Markedness represents the various
leveling forces that affect linguistic performance, and does not apply uniformly. It
is standard practice to assume markedness hierarchies that correspond to linguistic
universals. For instance, dorsal segments are considered more marked than coronal
segments, and therefore any language that allows dorsals to appear in some position
(complex onsets, syllable codas) should also allow coronals to appear in the same
position. Markedness is opposed by faithfulness, which represents the various pre-
serving forces that operate in language. Faithfulness is assumed to follow the same
hierarchy as markedness, and does not apply uniformly either. In §3.6 I show that if
the two scales are treated as identical, or if both scales follow universal hierarchies,
language-specific weakening cannot be predicted.
In recent years markedness has often been identified with phonetic factors such
as articulatory effort avoidance and increased confusability (Kirchner, 1998; Steriade,
2008; Boersma, 2003; Flemming, 2004). Following this practice, markedness hier-
archies may be identified with various levels of effort and confusability. However,
phonetic factors alone cannot capture why faithfulness is not applied uniformly. For
instance, though no one claims that word-final /t/ should require more effort than
word-final /k/, it is notable that American English allows word-final deletion of /t/
but not (correspondingly) of /k/.
Here, I build on previous work on information-theoretic (Shannon, 1948) ap-
proaches to language (van Son and Pols 2003; Aylett and Turk 2004; Levy and Jaeger
2007; Jaeger 2010 and others) to introduce a new universal functional force, infor-
mation utility, which motivates the preservation of linguistic elements. Infrequent,
unpredictable and informative linguistic elements provide more information than fre-
quent, predictable and uninformative linguistic elements, and therefore have higher
information utility. I propose that high information utility necessitates low confusabil-
ity and justifies the expenditure of effort to achieve low confusability. Accordingly,
I propose that languages place linguistic elements with high information utility in
prominent positions, and that speakers match their willingness to preserve linguis-
tic elements with the amount of information they expect those elements to provide.
Importantly, the information utility of linguistic elements is language-specific. For
example, American English /s/ may provide more information than Spanish /s/, and
may therefore be subject to a greater preserving force than Spanish /s/. The differ-
ent amounts of information emerge from language use and are therefore available to
speakers, much like universal properties such as effort.
The proposed account combines the novel treatment of information utility as a
preserving force with current effort-avoidance accounts. I assume that speakers attempt
to preserve as much information as possible (Most information Utility) while
avoiding effort (Least Effort), or MULE for short. More specifically, the higher an
element’s information utility is, the less confusable it should be, and by necessity,
the more effort its articulation justifies.7 The link between information utility and
effort is monotonically increasing: as the articulatory effort increases, the minimal
amount of information utility that could justify that effort also increases (or at
least does not diminish). The exact threshold function of minimal information utility
per effort is language-specific.
Weakening is predicted to occur when a linguistic element’s information utility is not
high enough to justify its faithful pronunciation.
In a sense, MULE is an extension of Haspelmath (2006), who shows that marked-
ness is not a uniform concept and proposes to replace it with frequency. Similarly,
Hume (2004) proposes to replace markedness with predictability. However, there are
key differences. In MULE, the information-theoretic measurements represent faithful-
ness rather than markedness, and information estimates (mainly informativity, Cohen
Priva 2008; Piantadosi et al. 2011) are used to evaluate information utility rather than
frequency or predictability. The use of information estimates is discussed in §3.4.
3.3.2 Using information utility and effort to predict parallel
weakening
In MULE linguistic elements will tend to weaken when their information utility is too
small to justify the articulatory effort associated with their faithful pronunciation.
The graph in (3.7) represents a sample language L1 in some linguistic environment
V (for instance, syllable coda). There are four linguistic elements in L1 marked as
α, β, γ and δ. Their position in the graph is determined by the effort that their
pronunciation requires in V (3.8) and by the information utility (3.9) each of them
provides. The diagonal line passing through the graph is the monotonically increasing
relationship between effort and information utility in L1. The gray zone represents
the area in which an element’s information utility is not high enough to justify its
faithful articulation. In (3.7), α and γ are outside the gray zone but β and δ are
inside the gray zone. This means that the information utility of β and δ is not high
enough to justify their faithful pronunciation, and they are more likely to be targeted
7 Information utility is just one in a range of possible utility accounts which include pragmatic, semantic, syntactic, morphological and sociolinguistic factors.
by a weakening process in the V environment of L1 than α and γ are.
(3.7) [Figure: α, β, γ and δ in L1, plotted by effort against information utility.]
(3.8)
effort (α in V ) < effort (β in V ) < effort (γ in V ) < effort (δ in V )
(3.9)
information-utility (β) < information-utility (α) <
information-utility (δ) < information-utility (γ)
Notably, comparing effort and information utility is not straightforward. First,
there is currently no account that assigns a quantity of effort to a linguistic element.
Instead, effort is evaluated by comparison: it is possible to say that voicing /b/ is
easier than voicing /g/ (Ohala, 1981), but not that /b/ requires some amount of
effort and /g/ requires twice as much effort. Second, even if it were possible to assign
numeric values to both effort and information utility, each measurement would have
different units, and it might not be possible to know beforehand what kind of scaling
would allow a comparison between the two measurements. Moreover, it is not unlikely
that the comparison between the different measurements is language-specific: some
languages might require a higher amount of information utility for the same amount
of effort. In sample language L2 (3.10) the information utility and articulatory effort
of pronouncing α, β, γ and δ in environment V are the same as in L1, but a different
function links information utility and effort, making α and β rather than β and δ
prone to undergo weakening in the environment V in L2.
(3.10) [Figure: α, β, γ and δ in L2, plotted by effort against information utility.]
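The contrast between L1 and L2 can be illustrated numerically. The following is a minimal sketch and not part of the proposal itself: the numeric effort and information values are arbitrary stand-ins chosen only to respect the orderings in (3.8) and (3.9), and the two threshold functions are hypothetical examples of monotonically increasing links between effort and information utility.

```python
# Arbitrary values (assumption) that respect the orderings in (3.8) and (3.9).
EFFORT = {"alpha": 1, "beta": 2, "gamma": 3, "delta": 4}
INFO = {"beta": 1, "alpha": 2, "delta": 3, "gamma": 4}


def weakening_prone(threshold):
    """Elements whose information utility falls below the threshold
    that the language sets for their level of effort (the gray zone)."""
    return sorted(e for e in EFFORT if INFO[e] < threshold(EFFORT[e]))


def l1_threshold(effort):
    """Hypothetical L1 link: required utility grows steeply with effort."""
    return float(effort)


def l2_threshold(effort):
    """Hypothetical L2 link: high baseline, shallow growth with effort."""
    return 2.5 + 0.1 * effort


l1_prone = weakening_prone(l1_threshold)  # beta and delta
l2_prone = weakening_prone(l2_threshold)  # alpha and beta
print("L1:", l1_prone)
print("L2:", l2_prone)
```

With identical effort and information values, changing only the threshold function moves the weakening-prone set from β and δ to α and β, as described for L1 and L2.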
Thus, assessing which element is more likely to undergo weakening relies on comparing
both the articulatory effort required by each linguistic element and the information
utility it provides. The simplest case is binary comparison: if some linguistic
element e1 requires more articulatory effort than another element e2 and provides less
information utility than e2 does, then e1 will necessarily be more prone to undergo
weakening than e2, since no monotonically increasing function linking effort and
information utility could target e2 for weakening without also targeting e1. For
instance, the linguistic element
β in (3.7) and (3.10) requires more articulatory effort than α in environment V and
provides less information utility, and indeed there is no way to put α in the gray zone
without also putting β in the gray zone. §3.5.2 demonstrates how binary comparison
can be used to account for the debuccalization of /q/ in Egyptian Arabic.
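The claim that binary comparison holds under any monotonic link can be checked by brute force on a small grid. This is a sketch under stated assumptions: effort and information utility are reduced to the arbitrary integer levels used for (3.8) and (3.9), and a threshold function is represented as a nondecreasing tuple giving the minimal utility required at each effort level.

```python
from itertools import combinations_with_replacement

# Arbitrary integer levels (assumption) respecting (3.8) and (3.9).
EFFORT = {"alpha": 1, "beta": 2, "gamma": 3, "delta": 4}
INFO = {"beta": 1, "alpha": 2, "delta": 3, "gamma": 4}


def weakened(thresholds):
    """thresholds[i] is the minimal utility that justifies effort level i + 1."""
    return frozenset(e for e in EFFORT if INFO[e] < thresholds[EFFORT[e] - 1])


# combinations_with_replacement over a sorted range yields exactly the
# nondecreasing 4-tuples, i.e. every monotonic threshold on this grid.
monotone_thresholds = list(combinations_with_replacement(range(6), 4))
outcomes = {weakened(t) for t in monotone_thresholds}

# beta needs more effort and carries less information than alpha,
# so no monotonic threshold should weaken alpha while sparing beta.
alpha_without_beta = [o for o in outcomes if "alpha" in o and "beta" not in o]
beta_without_alpha = [o for o in outcomes if "beta" in o and "alpha" not in o]

print("alpha weakens without beta:", alpha_without_beta)
print("beta can weaken without alpha:", bool(beta_without_alpha))
```

On this grid no monotonic threshold weakens α without β, while the reverse pattern is available, which matches the asymmetry described for α and β.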
Another alternative is to compare the real-valued distance between the information
utility of different linguistic elements while controlling for the effect effort may have.
For instance, if several languages have a similar segment inventory but differ with
respect to the information utility of each segment, it is possible to compare the
information utility that segments have in each language, since the articulatory effort
and confusability of the segments should remain constant. Consider the two languages
in (3.11–3.12). Both languages have two oral stops, /t/ and /k/, but the information
utility of /t/ is lower in (3.11) than in (3.12), while the information utility of /k/ is
comparable. If we assume that the effort required to pronounce /t/ and /k/ is similar
in both languages, and just compare information utility across the two languages,
it becomes apparent that it is easier to link effort and utility so that /t/ does not
provide enough information utility in (3.11) than in (3.12). §3.5.3 shows how
real-valued comparison can be used to account for multiple /t,d/-weakening processes in
American English.
(3.11) [Figure: /t/ and /k/ plotted by effort against information utility.]
(3.12) [Figure: /t/ and /k/ plotted by effort against information utility.]
3.4 Implementation
3.4.1 Implementation overview
In order to correctly test the predictions made by MULE, it is necessary first to know
how to quantify its different components: articulatory effort, perceptual confusability
and information utility.
3.4.2 Measuring information utility
As a measure, information utility relies on the amount of information speakers esti-
mate a linguistic unit would hold. In this chapter I approximate information utility
by using the informativity of linguistic units (Cohen Priva, 2008; Piantadosi et al.,
2011), as discussed in chapter 2.
The amount of information that some linguistic element e provides when the context c
in which it appears is already known is the predictability of e given c (3.13), which
can be estimated using counts as in (3.14). The less probable e is given c, the more
information e provides in that context.
(3.13) $-\log \Pr(e \mid c)$
(3.14) $-\log \frac{\mathrm{count}(e, c)}{\mathrm{count}(c)}$
Importantly, the predictability of a linguistic element can change markedly from
one context to another. The segment /m/ provides almost no information in some
contexts (after ‘acade–’) but a great deal in others (in ‘dim’, since /m/ rarely
follows /dI/). Predictability is taken
to be an important factor in a number of language processing accounts (Levy and
Jaeger 2007; Raymond et al. 2006 to name a few). As a measure, it relies on the
amount of information a given element provides in a particular context.
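The count-based estimate in (3.14) can be sketched on a toy lexicon. The words, counts and the choice of context below are assumptions made for illustration only: the context of a segment is taken to be all preceding segments in the word, orthographic strings stand in for phonemic transcriptions, and logarithms are base 2 so that the result is in bits.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: (word as a string of segments, usage count).
LEXICON = [("dim", 2), ("dig", 6), ("did", 12), ("academy", 5)]


def prefix_counts(lexicon):
    """Usage-weighted count of every word-initial segment sequence."""
    counts = Counter()
    for word, n in lexicon:
        for i in range(len(word) + 1):
            counts[word[:i]] += n
    return counts


def predictability(segment, context, counts):
    """log2(count(c) / count(e, c)), which equals -log2 of the ratio in (3.14)."""
    return math.log2(counts[context] / counts[context + segment])


counts = prefix_counts(LEXICON)
m_after_acade = predictability("m", "acade", counts)  # fully predictable here
m_after_di = predictability("m", "di", counts)        # rare continuation here
print(m_after_acade, m_after_di)
```

In this toy lexicon the same segment carries no information after ‘acade’ and several bits after ‘di’, mirroring the contrast drawn in the text.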
However, predictability alone cannot successfully capture the exceptionless prop-
erties of sound change. For instance, in a language that licenses a weakening process
that targets some segment, that segment may weaken regardless of how much in-
formation it conveys in a particular context. The conspiracies discussed above all
constitute cases in which a segment is prone to undergo weakening uniformly in some
linguistic environment, not merely in words in which it is predictable. To better
account for this, I use informativity, the average (or expected) information value of
linguistic elements which remains constant across contexts. A linguistic element that
usually provides little information will have low informativity, regardless of how much
information it provides in the actual context it appears in. For example, English /N/
appears mostly as the second segment of the ‘–ing’ suffix, in very predictable contexts.
Therefore, /N/ has low informativity even in the few contexts in which it does provide
a lot of information. Informativity (3.15) is defined as the weighted average of the
predictability of a linguistic element given all possible contexts, c ∈ C. I will discuss
the use of informativity in §3.7, and compare it to some of its alternatives such as
frequency and predictability.
(3.15) $-\sum_{c \in C} \frac{\Pr(e, c)\,\log \Pr(e \mid c)}{\Pr(e)}$
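Informativity as defined in (3.15) amounts to averaging the predictability of a segment over all of its occurrences. The sketch below illustrates this on a hypothetical four-word lexicon; the words, the counts, the use of N as a stand-in symbol for the velar nasal, and the choice of preceding segments as context are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon: (word as a string of segments, usage count).
LEXICON = [("siN", 4), ("riN", 4), ("sit", 1), ("rat", 1)]


def informativity(lexicon):
    """Usage-weighted average of -log2 Pr(e | c) over all occurrences of e,
    where the context c is the preceding segments in the word (assumption)."""
    prefix = Counter()
    for word, n in lexicon:
        for i in range(len(word) + 1):
            prefix[word[:i]] += n

    total_bits = defaultdict(float)
    occurrences = Counter()
    for word, n in lexicon:
        for i, segment in enumerate(word):
            # log2(count(c) / count(e, c)) is -log2 of the estimate in (3.14)
            bits = math.log2(prefix[word[:i]] / prefix[word[:i + 1]])
            total_bits[segment] += n * bits
            occurrences[segment] += n
    return {seg: total_bits[seg] / occurrences[seg] for seg in occurrences}


info = informativity(LEXICON)
print({seg: round(bits, 3) for seg, bits in sorted(info.items())})
```

Because N mostly occurs where it is expected, its average stays low even though one of its contexts is not fully predictable, which is the property that distinguishes informativity from context-specific predictability.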
3.4.3 Integrating effort into MULE
In line with current phonetic theories, MULE treats articulatory effort and perceptual
confusability as the functional forces that motivate markedness. Importantly, previ-
ous work has shown that effort cannot be modeled without taking the phonological
context into consideration (Pouplier, 2003). I therefore model effort and confusability
jointly in MULE. Doing so is necessary because effort has to be evaluated in each
linguistic environment separately. For instance, intervocalic voicing of obstruents
decreases their distinctness from the surrounding vowels and may be considered a
reduction in effort and an increase in confusability (Boersma, 2003). However, main-
taining a distinction in voicing in codas is considered difficult (Steriade, 2008). The
effort associated with intervocalic voiceless obstruents should therefore be higher than
the effort associated with voiced obstruents. Conversely, the effort associated with
voiceless obstruents in codas should be lower than that of voiced obstruents in codas.
In order to model articulatory effort and perceptual confusability jointly I assume
that given some amount of articulatory effort, speakers choose the least confusable
form allowed for that amount of effort. This assumption follows Flemming (2004). For
any given segment an increase in the articulatory effort is motivated by an attempt to
decrease confusability; increased effort that does not result in decreased confusability
(or support additional distinctions) is ruled out.8
For the most part, MULE only requires comparison of functional pressures: i.e., an
estimate of which surface form requires more effort than other surface forms. Therefore,
for the purposes of this chapter, whenever two similar segments are directly
compared, the standard OT markedness hierarchy (de Lacy, 2002) is used to rank
effort, and whenever two dissimilar segments are compared, their respective frequencies
in the world’s languages (computed from Maddieson’s 1984 UPSID) are used to the
same end.
For future applications of MULE it would be desirable to use more fine-grained
comparisons, as would follow from the study of phonetics, integrating articulatory
effort (Ohala 1981 among others), and confusability (Steriade 2008 among others).
8 The organization of the language allows additional mechanisms that will not be discussed here, such as moving linguistic elements that require low confusability to prominent positions.
3.5 MULE in OT
3.5.1 Effort and information utility in OT
MULE deals with the optimization of speakers’ performance across two axes: maximizing
the information utility of linguistic elements while minimizing the effort required for
their articulation.
OT provides a mechanism to distinguish between the two: markedness constraints
lead to simplification, and faithfulness constraints lead to preservation. I therefore
propose an OT implementation of MULE that bases faithfulness on information util-
ity, and markedness on effort and reduced perceptibility. The OT version of MULE
allows integration with current phonological theories.
Like standard OT with marked faithfulness, the ranking of the faithfulness hierar-
chy is predetermined and based on the distribution of segments in the language, and
so is the ranking of the markedness hierarchy. Faithfulness and markedness might not
be ranked in the same order, but both are observable by speakers by their exposure
to the language. Therefore, the ranking in a MULE OT grammar is not more difficult
to learn than standard OT ranking.
Faithfulness is expressed using information utility, and has a total order ranking
that is predetermined by the distribution of linguistic elements in the language: the
higher the information estimate of some element, the more the language tries to
preserve it.9 If E is a type of linguistic element (feature, segment, morpheme), then for
every two elements of type E, e1, e2, faithfulness to e1 will be greater than faithfulness
to e2 if and only if the information estimate of e1 is greater than the information
estimate of e2, as shown in (3.16).
(3.16) $\forall e_1, e_2 \in E.\ \text{info-estimate}(e_1) > \text{info-estimate}(e_2) \leftrightarrow \text{faithfulness}(e_1) \gg \text{faithfulness}(e_2)$
9 It is quite possible that speakers would not be certain whether one linguistic element has higher information utility than some other linguistic element. In that case the alternatives diverge based on the theoretical framework. If the framework allows two constraints to remain unranked with respect to one another, they will remain unranked. In a multiple grammars framework, speakers may have both possible rankings. Other frameworks may collapse the distinction between the two elements.
Segment and feature markedness is expressed likewise, by a total order ranking on
the effort of articulating a segment or a feature so that they may be correctly perceived
in the environments they appear in, and resist environment-optimizing pressures. If
E is a type of linguistic element and V is an environment (syllable coda, unstressed
syllable), then for every two elements of type E, e1, e2, the markedness of e1 will be
greater than the markedness of e2 if and only if the effort required to pronounce e1 in
the environment V so that it is perceptually identifiable is greater than the effort that
it takes to pronounce e2 in the environment V so that it is perceptually identifiable,
as shown in (3.17).
(3.17) $\forall e_1, e_2 \in E.\ \text{effort}(e_1 \text{ in } V) > \text{effort}(e_2 \text{ in } V) \leftrightarrow {*}V(e_1) \gg {*}V(e_2)$
In order to rank markedness of segments across all environments it is possible to cal-
culate an aggregate measure of effort that would rank the effort of linguistic elements
without an environment or at a special ∅ environment.
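The two predetermined hierarchies in (3.16) and (3.17) amount to sorting elements by their information estimates and by their effort in a given environment. The following is a minimal sketch with hypothetical element names and values; it assumes there are no ties, so that each ranking is a total order.

```python
def faithfulness_ranking(info_estimate):
    """Ident constraints from highest- to lowest-ranked, following (3.16)."""
    ordered = sorted(info_estimate, key=info_estimate.get, reverse=True)
    return [f"Ident({element})" for element in ordered]


def markedness_ranking(effort_in_env, env="V"):
    """Markedness constraints from highest- to lowest-ranked, following (3.17)."""
    ordered = sorted(effort_in_env, key=effort_in_env.get, reverse=True)
    return [f"*{env}({element})" for element in ordered]


# Hypothetical values for three elements in one environment.
info = {"e1": 3.2, "e2": 1.4, "e3": 2.6}
effort = {"e1": 1, "e2": 3, "e3": 2}

print(faithfulness_ranking(info))
print(markedness_ranking(effort))
```

The sketch only fixes the order inside each hierarchy; how the two hierarchies interleave is left to the language-specific grammar.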
3.5.2 Binary comparisons in OT
Binary comparison constraint ranking
Without the use of real-valued weights, the proposed OT implementation can only
implement binary comparison (§3.3.2). Consider a language with two linguistic el-
ements es and ew such that the effort required to pronounce es in environment V
is lower than the effort required to pronounce ew in the same environment, and the
information estimate of es is greater than the information estimate of ew as in (3.18).
es will weaken in V only if ew also weakens in V. The reason is that ew is more
marked than es (3.19), and faithfulness to es is greater than faithfulness to ew (3.20).
(3.18) [Figure: es and ew plotted by effort against information utility.]
(3.19) ${*}V(e_w) \gg {*}V(e_s)$
(3.20) $\text{Ident}(e_s) \gg \text{Ident}(e_w)$
Any environment V in which a markedness constraint targeting es outranks faithful-
ness to es (3.21) will by transitivity be an environment in which the parallel marked-
ness constraint targeting ew outranks faithfulness to ew (3.22).
(3.21) ${*}V(e_s) \gg \text{Ident}(e_s)$
(3.22) ${*}V(e_w) \gg {*}V(e_s) \gg \text{Ident}(e_s) \gg \text{Ident}(e_w)$
The ranking between faithfulness constraints and the reversed ranking between
markedness constraints in the case described above allows exactly three configura-
tions: both ew and es weaken as in (3.23), only ew weakens as in (3.24), and both ew
and es are preserved as in (3.25). There is no ranking in which es weakens, but ew
does not. In the examples weakened segments are struck out.
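(3.23) Both weaken
    Input: ew.es                | *V(ew) | *V(es) | Ident{es} | Ident{ew}
    ew.es (none struck out)     | *!     | *      |           |
    ew.es (es struck out)       | *!     |        | *         |
    ew.es (ew struck out)       |        | *!     |           | *
  ☞ ew.es (both struck out)     |        |        | *         | *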
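(3.24) Only ew weakens
    Input: ew.es                | *V(ew) | Ident{es} | *V(es) | Ident{ew}
    ew.es (none struck out)     | *!     |           | *      |
    ew.es (es struck out)       | *!     | *         |        |
  ☞ ew.es (ew struck out)       |        |           | *      | *
    ew.es (both struck out)     |        | *!        |        | *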
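(3.25) Both preserved
    Input: ew.es                | Ident{es} | Ident{ew} | *V(ew) | *V(es)
  ☞ ew.es (none struck out)     |           |           | *      | *
    ew.es (es struck out)       | *!        |           | *      |
    ew.es (ew struck out)       |           | *!        |        | *
    ew.es (both struck out)     | *!        | *         |        |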
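The claim that exactly three configurations are possible can be verified by enumerating every total ranking of the four constraints that respects the fixed orders in (3.19) and (3.20). This is a sketch under simplifying assumptions: each candidate is reduced to a pair of flags recording whether ew and es are weakened, every constraint assigns at most one violation, and evaluation is ordinary strict domination.

```python
from itertools import permutations

CONSTRAINTS = ["*V(ew)", "*V(es)", "Ident(es)", "Ident(ew)"]
# Fixed sub-rankings from (3.19) and (3.20): (higher, lower).
FIXED = [("*V(ew)", "*V(es)"), ("Ident(es)", "Ident(ew)")]

# Candidates as (ew weakened?, es weakened?).
CANDIDATES = [(False, False), (False, True), (True, False), (True, True)]


def violations(candidate):
    """Markedness penalizes preserved segments; Ident penalizes weakened ones."""
    ew_weak, es_weak = candidate
    return {
        "*V(ew)": int(not ew_weak),
        "*V(es)": int(not es_weak),
        "Ident(ew)": int(ew_weak),
        "Ident(es)": int(es_weak),
    }


def winner(ranking):
    """Strict domination: compare violation profiles lexicographically."""
    return min(CANDIDATES, key=lambda c: [violations(c)[k] for k in ranking])


def respects_fixed(ranking):
    return all(ranking.index(hi) < ranking.index(lo) for hi, lo in FIXED)


rankings = [r for r in permutations(CONSTRAINTS) if respects_fixed(r)]
outcomes = {winner(r) for r in rankings}
print(len(rankings), "admissible rankings")
print("attested outcomes (ew weak, es weak):", sorted(outcomes))
```

Six rankings are admissible, and between them they yield only the three outcomes shown in (3.23) to (3.25); the candidate in which es weakens while ew is preserved never wins.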
This example suggests a simple procedure for determining whether a segment is
expected to be prone to weakening (3.26).
(3.26) (a) Understand what causes the pressure to weaken.
       (b) Establish a comparison set: a list of segments that are under
           a similar pressure to weaken, i.e. all the segments that could be
           affected by the pressure to weaken in (a).
       (c) Rank the segments by effort.
       (d) Rank the segments by information estimates.
       (e) Find mismatches between the effort and information estimate
           rankings.
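Steps (c) to (e) of the procedure can be sketched as a search for mismatched pairs. The segment names and numbers below are hypothetical and serve only to show the mechanics; steps (a) and (b), which require linguistic analysis, are assumed to have already produced the comparison set.

```python
def mismatches(effort, info):
    """Pairs (a, b) in which `a` requires more effort than `b` in the
    relevant environment but has a lower information estimate, so that
    binary comparison predicts `a` to be more prone to weakening than `b`."""
    return [(a, b) for a in effort for b in effort
            if effort[a] > effort[b] and info[a] < info[b]]


# Hypothetical comparison set with effort ranks (c) and info estimates (d).
effort = {"x": 3, "y": 2, "z": 1}
info = {"x": 1.0, "y": 2.5, "z": 2.0}

pairs = mismatches(effort, info)  # step (e)
print(pairs)
```

Here x is mismatched with both y and z, whereas y and z are not mismatched with each other, so binary comparison alone makes no prediction about that pair.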
Binary comparison in MULE can predict a stronger pressure to weaken when the
effort and information estimate hierarchies are mismatched: there is some segment
that requires more effort to preserve in the relevant environment, but has a lower
information estimate. A counter-example to binary comparison in MULE would be
a case in which a segment that is both easier to maintain in some environment and
has a higher information estimate weakens before segments that are more difficult to
maintain in that environment and have lower information estimates.
Binary comparison account for Egyptian Arabic /q/-weakening
Binary comparison can be used to account for the parallel weakening of /*q/ in
Arabic. §3.2.3 introduced the puzzle of parallel weakening clines of Classical Arabic’s
/*q/ to different surface forms in modern dialects: [g], [P], [k], and [G]. Since some
of the weakening clines can be shown to have emerged in parallel, it seems that there
was some driving force in the stage of Arabic in which /*q/ was still a [q] that was
biased towards the possibility of having /*q/ weakening. The prediction in MULE is
that for some reason the information estimate of /q/ does not justify the amount of
effort that is associated with its faithful articulation.
In order to evaluate whether binary comparison justifies the weakness of /*q/, it
is necessary to understand in which environment /*q/ weakens, what is the pressure
that its weakening yields to (3.26a), which segments are under a similar pressure
to weaken (are in the comparison set, following 3.26b), rank the comparison set by
effort and information estimates (3.26c–d), and see whether the information estimate
of /*q/ is lower in that environment than that of other segments in its comparison
set (3.26e).
Most available corpora of Arabic are in Modern Standard Arabic, which is highly
influenced by literary standards rather than articulatory pressures. A notable ex-
ception is the corpus used in this study, LDC Egyptian Colloquial Arabic Lexicon
(Kilany et al., 1997), which contains the phonemic representation of words as they
are spoken in Egyptian Arabic, as well as usage counts for each word.
In Egyptian Arabic /*q/ surfaces as [P] (/q/→[P]), a debuccalization process that
applies in all environments. This means that the pressure /*q/ yielded to is akin to
OT’s *Place, forbidding segments with a place of articulation from expressing that
place. The comparison set consists of all the segments except the laryngeals (/h/
and /P/) and the pharyngeals (/Q/ and /è/). The laryngeals are excluded because
they are the outcome of debuccalization processes, and the pharyngeals are excluded
since their place of articulation is grouped together with laryngeals rather than with
other places of articulation starting with feature geometry (McCarthy, 1994) and
subsequently in OT (Lombardi, 1995).10 11
In order to evaluate the information estimate of each segment, the predictability
of each segment occurrence was calculated using the phonemic/phonetic transcrip-
tion and word counts as they appear in the LDC Egyptian Colloquial Arabic Lexicon
(Kilany et al., 1997), using the formula in (3.14). Geminate consonants were treated
as plain consonants followed by a special “gemination” symbol, rather than as two
occurrences of the same segment. Informativity was calculated as the weighted av-
erage of the predictability of all occurrences using the formula in (3.15). The list
of consonant informativity can be found in Table 3.1, which includes both core and
periphery segments, and distinguishes between the /P/ that has underlying /P/ and
tends to get deleted (/P/→[∅]), the underlying /q/ that surfaces as [P] (/q/→[P]) and
the underlying /q/ that is lexically specified to surface as [q] (/q/→[q]).12
Table 3.1: Egyptian Arabic Informativity-based information estimates
Phone      Informativity    Phone      Informativity    Phone    Informativity
/n/        1.488            /Q/        2.591            /f/      3.241
/l/        1.564            /S/        2.596            /z/      3.267
/R/        2.070            /j/        2.668            /g/      3.420
/t/        2.102            /b/        2.718            /è/      3.537
/dG/       2.119            /s/        2.889            /K/      3.811
/d/        2.316            /w/        2.957            /tG/     4.172
/m/        2.451            /k/        2.961            /DG/     4.594
/h/        2.506            /sG/       3.044            /x/      4.773
/P/→[∅]    2.561            /q/→[P]    3.169
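The calculation described above can be sketched as follows. Formulas (3.14) and (3.15) are not reproduced in this part of the text, so the sketch makes assumptions that should be checked against them: predictability is taken to be the negative log (base 2) probability of a segment given all preceding segments in the same word, and informativity is taken to be the count-weighted mean of that value over all occurrences of the segment. The gemination symbol, the function names and the toy lexicon are hypothetical.

```python
import math
from collections import defaultdict

GEM = ":"  # hypothetical symbol standing in for the special "gemination" symbol

def expand_geminates(phones):
    """Rewrite a doubled consonant as the plain consonant followed by GEM."""
    out = []
    for seg in phones:
        out.append(GEM if out and out[-1] == seg else seg)
    return tuple(out)

def segment_informativity(lexicon):
    """lexicon maps a tuple of phones to a word count."""
    context_total = defaultdict(float)  # tokens that continue a given prefix
    joint = defaultdict(float)          # tokens with a given (prefix, segment)
    for phones, count in lexicon.items():
        for i, seg in enumerate(phones):
            context_total[phones[:i]] += count
            joint[(phones[:i], seg)] += count
    weighted, total = defaultdict(float), defaultdict(float)
    for (prefix, seg), count in joint.items():
        predictability = -math.log2(count / context_total[prefix])
        weighted[seg] += count * predictability
        total[seg] += count
    return {seg: weighted[seg] / total[seg] for seg in total}

toy_lexicon = {("t", "a"): 3, ("k", "a"): 1}  # hypothetical words and counts
toy_informativity = segment_informativity(toy_lexicon)
```

In the toy lexicon the word-initial /k/ is rarer than the word-initial /t/ and therefore receives the higher value, while /a/ is fully predictable from what precedes it and receives a value of zero.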
The second step is to evaluate whether /q/ is more or less effortful than other
segments, and similarly to determine how it compares in terms of information utility.
10 McCarthy (1994) also characterizes the pharyngeals as approximants, which suggests that the effort-related pressures that target them are substantially different than those that target other segments.
11 Uvular fricatives (but not stops) may pattern with pharyngeals and laryngeals.
12 When Modern Standard Arabic words that contain /q/ are used in Egyptian Arabic, the /q/ may surface as [q], rather than as [P].
Based on cross-linguistic evidence, it is safe to assume that /q/ requires more effort
to articulate than all other voiceless oral stops, as the uvular place of articulation is
cross-linguistically marked. The information estimate (measured as informativity) of
/q/ is higher than all other voiceless stops, meaning that we cannot perform binary
comparison with other voiceless stops: /q/ is both more useful and requires more
effort than other voiceless stops. However, /q/ has lower information utility than a
number of fricatives (/z/, /f/ and /x/) and the voiced velar stop /g/, all of which are
more frequent than /q/ cross-linguistically: /q/ has only 52 occurrences in UPSID
(Maddieson, 1984), while /z/, /f/, /g/ and /x/ are more frequent cross-linguistically
(62, 180, 253, 94 occurrences, respectively).13 I can therefore assume that at least
a few of these segments require less articulatory effort to maintain their place of
articulation than /q/ does. The prediction is therefore that /q/ is more likely to
undergo debuccalization than the simpler fricatives would.
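The comparison in this paragraph can be checked mechanically. The sketch below uses only the informativity values from Table 3.1 and the UPSID counts quoted above; the function name is hypothetical, and cross-linguistic frequency is used as a rough proxy for lower effort, which is the assumption the text itself makes.

```python
# Informativity values copied from Table 3.1; "q" is the /q/ that surfaces as a glottal stop.
informativity = {"t": 2.102, "k": 2.961, "q": 3.169,
                 "f": 3.241, "z": 3.267, "g": 3.420, "x": 4.773}
# UPSID occurrence counts quoted in the text (Maddieson, 1984).
upsid = {"q": 52, "z": 62, "f": 180, "g": 253, "x": 94}

def mismatched_with(target, informativity, upsid):
    """Segments that are more informative than `target` and also more frequent
    cross-linguistically (assumed here to indicate lower effort)."""
    return sorted(seg for seg in upsid
                  if seg != target
                  and informativity[seg] > informativity[target]
                  and upsid[seg] > upsid[target])

mismatches = mismatched_with("q", informativity, upsid)
less_informative_stops = [s for s in ("t", "k") if informativity[s] < informativity["q"]]
```

The segments in `mismatches` are those for which binary comparison applies, while /t/ and /k/ end up in `less_informative_stops`, which is why no binary comparison with them is possible.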
It is important to note that binary comparison does not predict that /q/ must
weaken. Instead, it predicts that /q/-weakening is likely to occur in many grammars
in which weakening does occur, as /q/ is weaker than many other segments.
The notion of grammars here is in the Kiparsky (1993) sense of the word: multiple
competing grammars in the mind of speakers. In a multiple-grammar analysis, having
more grammars in which /q/ weakens represents a stronger pressure to weaken, as in
Anttila (1997).
Other accounts would have difficulty explaining /q/-lenition:
• /q/ is neither the most frequent nor the least frequent consonant in Egyptian
Arabic. An explanation based on frequency alone would predict that the most
frequent or least frequent segment would weaken first.
• /q/ is not the most marked segment in Egyptian Arabic, nor the least marked.
There are other uvulars in Egyptian Arabic that do not weaken, and the em-
phatic stops are more marked cross-linguistically. The fact that more marked
segments occur in Egyptian Arabic but may not weaken even in dialects in which
/q/ weakens makes it difficult to use a strict markedness-based account without
marked-faithfulness: more marked segments should have weakened first.
13 Since stops are more frequent than fricatives cross-linguistically and voiceless stops are more frequent than voiced stops, the difference is even more significant.
MULE has an advantage over the alternative accounts in not requiring /q/ to be on
the extreme edge of any single scale. It suffices that the information utility of /q/ is
not high enough to justify the effort of its faithful articulation, which makes it a
likely candidate for weakening.
3.5.3 Real-valued comparisons in OT
Real-valued comparison in real-valued models of OT
Implementing real-valued comparison (§3.3.2) in OT has to appeal to versions of OT
that can model variation using continuous values, such as Stochastic OT (Boersma,
1998; Boersma and Hayes, 2001) and MaxEnt OT (Goldwater and Johnson, 2003).
Currently, an OT grammar with constraint conjunction and marked faithfulness
(de Lacy, 2002, ch. 6) which is also real-valued would fit weights to each constraint
separately: the markedness of each environment, the markedness of each feature in
that environment, faithfulness to each feature and conjunction of features, etc. In
real-valued frameworks, fitting each constraint with its own weight takes the place
of strict constraint ordering: if the weight $w_1$ of some constraint $c_1$ is significantly
bigger than the weight $w_2$ of another constraint $c_2$, it means that $c_1$ outranks $c_2$. A
MULE version of real-valued OT frameworks can be stricter than current real-valued
frameworks: rather than fit each and every segment or feature with its own weight
$w_{1 \ldots n}$, the learning algorithm can assign a single weight $w_{\text{info-estimate}}$ to the
information estimate of all linguistic elements, reducing the number of weights that have
to be learned.14 Faithfulness to some segment $\sigma_1$ with $\text{info-estimate}(\sigma_1) = x$ will be
$w_{\text{info-estimate}} \cdot x$, and faithfulness to another segment $\sigma_2$ with $\text{info-estimate}(\sigma_2) = y$ will
be $w_{\text{info-estimate}} \cdot y$. In essence, the single weight $w_{\text{info-estimate}}$ will only scale information
estimates, and would not allow a reranking of faithfulness in any other way.
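The effect of a single scaling weight can be illustrated with a short sketch. The function names and the two candidate weights are hypothetical, the information estimates are the English values for /t/, /k/ and /p/ from Table 3.2, and the weight is assumed to be positive.

```python
def faithfulness_weights(info_estimates, w_info_estimate):
    """Faithfulness to each segment is its information estimate scaled by one weight."""
    return {seg: w_info_estimate * est for seg, est in info_estimates.items()}

def faithfulness_order(weights):
    """Segments from least to most protected by faithfulness."""
    return sorted(weights, key=weights.get)

english = {"t": 1.526, "k": 2.579, "p": 3.803}  # informativity, Table 3.2

order_small = faithfulness_order(faithfulness_weights(english, 0.5))
order_large = faithfulness_order(faithfulness_weights(english, 3.0))
```

Whatever positive value the learner chooses for the weight, the relative order of the faithfulness values stays the same. A model that fits one weight per segment would be free to reorder them.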
Real-valued comparison account for American English /t,d/-weakening
Introduction Real-valued comparison can rely on the distance between the information
estimates of different linguistic elements. If segments that require comparable
14 A single weight suggests a parsimonious model in which information utility is linearly correlated with preservation. A less parsimonious model would allow other monotonically increasing functions, but both alternatives are more easily falsifiable than models that fit each weight separately, as they fit fewer parameters.
effort across different languages provide less information than other segments in one
language compared to other languages, they are expected to be weaker in that lan-
guage. This is the case with English /t/ and /d/. American English has multiple
/t,d/-weakening processes (§3.2.2), and UK English has parallel /t/-weakening pro-
cesses (§3.2.3). A MULE real-valued comparison account would expect /t/ and /d/
to provide less information than they do in other languages.
In order to evaluate the prediction that /t/ and /d/ are indeed weaker in English
than in other languages, the information estimate of the different consonants in En-
glish, Spanish and Egyptian Arabic was evaluated. In all cases the informativity of
the segments was used to assess their information estimate. For the English table
(Table 3.2) informativity was calculated using the word counts from the Switchboard
corpus (Godfrey and Holliman, 1997) and the Buckeye corpus (Pitt et al., 2007).
Phonetic representation was taken to be the one used in the CMU pronunciation dic-
tionary (Weide, 1998). Informativity was calculated using the same method used to
calculate informativity for Arabic segments in §3.5.2. For the Spanish table (Table
3.3) the CALLHOME Spanish Lexicon (Garrett et al., 1996) dictionary was used for
both word counts and phonetic representation, after collapsing a number of phonetic
distinctions: voiced oral fricatives were collapsed with voiced oral stops. Both con-
versational and read (radio transcripts) word counts from the corpus were used to
calculate informativity. The Egyptian Arabic table (Table 3.1) uses the same data
listed in the previous section.
Table 3.2: English Informativity-based information estimate
Phone    Informativity    Phone    Informativity    Phone    Informativity
/N/      0.231            /s/      2.464            /j/      4.083
/z/      1.220            /k/      2.579            /S/      4.229
/t/      1.526            /m/      2.964            /f/      4.307
/d/      1.565            /D/      3.350            /h/      4.604
/n/      1.675            /w/      3.713            /T/      4.648
/Z/      1.899            /p/      3.803            /g/      4.735
/v/      1.977            /b/      3.843            /Ã/      4.758
/l/      2.217            /Ù/      3.849
Table 3.3: Spanish Informativity-based information estimate
Phone    Informativity    Phone    Informativity    Phone    Informativity
/R/      1.551            /l/      3.042            /z/      4.165
/n/      1.809            /m/      3.156            /Ù/      4.303
/s/      1.997            /N/      3.389            /h/      4.748
/t/      2.188            /p/      3.445            /f/      5.487
/d/      2.586            /w/      3.484            /r/      5.917
/j/      2.933            /b/      3.543            /S/      6.071
/k/      3.010            /g/      3.570
As with binary comparison, the first step is to identify the effort-increasing forces
that cause the weakening, in order to establish a comparison set: the phones that
successfully resist the same functional pressure to lenite and delete. For intervocalic
positions the relevant comparison set is all oral stops, since spirantization and
sonorization affect different subsets of oral stops in intervocalic environments: all
stops except the glottal stop and the emphatic stops in Biblical Hebrew (Gesenius, 1910),
and all voiced oral stops in Spanish (Harris, 1969). I follow Kirchner (1998) in treating
intervocalic lenition as reduction in effort (but see also objections to that approach
in Kaplan 2010).
Binary comparison makes no predictions regarding /t/- and /d/-weakening, since
they require less effort and have lower information utility than other oral stops. Real-
valued comparison predicts that if a phone has lower information utility in some
language than in other languages, it will be weaker in that language than in other
languages it may appear in. The information estimates of /t/ and /d/ in English are
lower than their information estimate in Spanish and Egyptian Arabic.15 Moreover,
the difference between the information estimate of /t/ and /d/ and the information
estimate of /k,p/, /g,b/ is greater in English (3.27) than it is in Spanish (3.28) or
Egyptian Arabic (3.29). Stochastic OT and MaxEnt OT adapted to fit MULE OT
15 Spanish and Egyptian Arabic /t/ and /d/ are dental, but I assume that the associated effort is approximately the same as their alveolar counterparts.
models are therefore more likely to rank faithfulness to /t/ and /d/ low enough that
they weaken more readily in English than in Spanish or Egyptian Arabic.
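(3.27) English (oral stops in order of increasing informativity; values from Table 3.2)
t 1.526 < d 1.565 < k 2.579 < p 3.803 < b 3.843 < g 4.735
(3.28) Spanish (values from Table 3.3)
t 2.188 < d 2.586 < k 3.010 < p 3.445 < b 3.543 < g 3.570
(3.29) Egyptian Arabic (values from Table 3.1)
t 2.102 < d 2.316 < b 2.718 < k 2.961 < g 3.420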
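The size of the gap between /t,d/ and the remaining oral stops can be made concrete with the values in Tables 3.1–3.3. The summary statistic used below (mean informativity of the other oral stops minus mean informativity of /t/ and /d/) is an assumption introduced for illustration; the text compares the scales in (3.27)–(3.29) without committing to a particular statistic.

```python
from statistics import mean

# Informativity of the oral stops, copied from Tables 3.2, 3.3 and 3.1.
stops = {
    "English": {"t": 1.526, "d": 1.565, "k": 2.579, "p": 3.803, "b": 3.843, "g": 4.735},
    "Spanish": {"t": 2.188, "d": 2.586, "k": 3.010, "p": 3.445, "b": 3.543, "g": 3.570},
    "Egyptian Arabic": {"t": 2.102, "d": 2.316, "k": 2.961, "b": 2.718, "g": 3.420},
}

def coronal_gap(values):
    """Mean informativity of the non-coronal oral stops minus that of /t/ and /d/."""
    coronals = [values[s] for s in ("t", "d")]
    others = [v for s, v in values.items() if s not in ("t", "d")]
    return mean(others) - mean(coronals)

gaps = {language: coronal_gap(values) for language, values in stops.items()}
```

Under this statistic the English gap is the largest of the three, in line with the claim that English /t/ and /d/ are unusually uninformative relative to the other oral stops.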
It is possible to test whether the low information estimate of English /t/ and /d/ is
what makes them more likely to weaken, by using quantifiable models
of variation such as MaxEnt OT. MaxEnt OT models (Goldwater and Johnson, 2003)
can compare several different outputs, but in this case it is possible to use a simpler
model which contrasts just two alternatives: a word-final or intervocalic consonant is
kept or weakens. Reducing the number of possible outcomes to two allows us to use
logistic regressions, which estimate the odds of one form relative to the other. If two out of
every three word-final /t/s weaken in a particular environment, then the weakening
ratio the regression would try to fit is 2 : 1 (see Bresnan and Nikitina 2009 for more
details on interpreting logistic regressions in the context of linguistic variation). Since
word-final deletion is easier to define and measure than intervocalic weakening, I used
a logistic regression to evaluate word-final deletion of obstruents in American English.
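The link between a two-candidate MaxEnt grammar and logistic regression can be sketched as follows. The constraint set is hypothetical and deliberately minimal: the retained form is assumed to violate a single markedness constraint, and the weakened form is assumed to violate faithfulness, scaled by the segment's information estimate. The function and parameter names are illustrative only.

```python
import math

def log_odds(p_weaken):
    """Log of the weakened-to-kept ratio that a logistic regression fits."""
    return math.log(p_weaken / (1.0 - p_weaken))

def p_weaken_maxent(w_markedness, w_info, info_estimate):
    """Probability of the weakened form in a two-candidate MaxEnt grammar."""
    harmony_kept = -w_markedness                 # kept form violates markedness once
    harmony_weakened = -w_info * info_estimate   # weakened form violates faithfulness
    normalizer = math.exp(harmony_kept) + math.exp(harmony_weakened)
    return math.exp(harmony_weakened) / normalizer

def p_weaken_logistic(w_markedness, w_info, info_estimate):
    """The same probability written as a logistic function."""
    return 1.0 / (1.0 + math.exp(-(w_markedness - w_info * info_estimate)))
```

With two out of three tokens weakening, `log_odds(2 / 3)` equals the natural logarithm of 2, which corresponds to the 2 : 1 ratio in the text. The two probability functions agree for any choice of weights, which is why the two-outcome case can be fit with an ordinary logistic regression.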
Methods and materials The model was evaluated using a mixed effects logistic
regression. The predicted value was whether a particular word-final consonant was
deleted or not. The comparison set chosen was word-final obstruents, as using only the
oral stops is not desirable in statistical modeling, since it would lead to overfitting
constraints such as place of articulation (three places of articulation or degrees of
freedom for just six phonemes). In order to make the environment more uniform,
the context was limited to post-vocalic, word-final obstruents that were followed by
a consonant in the following word. Function words were excluded since their word-
final deletion processes may differ from those of content words. Since Blevins
(2004, pp. 208–209) demonstrates that different manners of articulation are subject
to different pressures to delete, the phonological properties of each segment were used
as control variables that model the manner-specific pressures to delete word-final
obstruents.
Word-final obstruents were collected from the Buckeye corpus (Pitt et al., 2007),
and an edit-distance program determined whether they were deleted. Each word was
assumed to have its CMU dictionary (Weide, 1998) representation. Words that did
not appear in the CMU dictionary were removed, leaving approximately 9000 data
points. Word counts were established using both the Buckeye corpus and the Switch-
board corpus (Godfrey and Holliman, 1997) in order to evaluate how predictable the
final consonant was in its context.
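The text does not describe the edit-distance program, so the following is only a sketch of one way such a check could work: a standard Levenshtein alignment between the dictionary form and the transcribed form, followed by a test of whether the last dictionary phone was left without a counterpart. The function names and the ARPAbet-like phone labels are hypothetical.

```python
def deleted_positions(reference, observed):
    """Indices of reference phones that a minimum-edit alignment leaves unmatched."""
    n, m = len(reference), len(observed)
    # cost[i][j] is the edit distance between reference[:i] and observed[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = cost[i - 1][j - 1] + (reference[i - 1] != observed[j - 1])
            cost[i][j] = min(substitution, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    deleted = set()
    i, j = n, m
    while i > 0:  # trace back, preferring match or substitution over deletion
        if j > 0 and cost[i][j] == cost[i - 1][j - 1] + (reference[i - 1] != observed[j - 1]):
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            deleted.add(i - 1)
            i -= 1
        else:
            j -= 1  # extra phone in the observed form
    return deleted

def final_deleted(reference, observed):
    """True if the last dictionary phone has no counterpart in the observed form."""
    return len(reference) - 1 in deleted_positions(reference, observed)
```

For example, `final_deleted(("w", "eh", "s", "t"), ("w", "eh", "s"))` is true, whereas a final /d/ transcribed as [t] counts as a substitution rather than a deletion.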
The baseline OT model controlled for a number of phonological and
phonetic features. The phonological features in Table 3.4 were included for the word-
final segment and for the first segment of the following word. Information utility at
the word-level was controlled for by including the negative logged probability of the
word in which the consonant appeared (when conditional probabilities are not used
and context is ignored, the logged frequency, the logged probability and informativity
of a linguistic element are all the same). The identity of the speaker and the word in
which the consonant appeared were used as random effects.
On top of the baseline model, the following information-utility constraints were
included in the model:
(a) The predictability of the obstruent given the context it appeared in (3.13; page
63), using all the previous phones in the same word as context. The value used
was the difference between the predictability of the segment and its informa-
tivity, as the original values are highly collinear (informativity is the expected
value of the predictability).
(b) The informativity of the obstruent, calculated as in (3.15; page 64), using all
the previous phones in the same word as context. The actual values can be
found in Table 3.2 on page 74.
Table 3.4: Post-vocalic pre-consonantal word-final obstruent deletion controls
Control                          Applies to                           Values
Place of articulation            word-final segment                   labial, coronal, dorsal
Voiced                           word-final and following segments    true, false
Affricate                        word-final and following segments    true, false
Dental                           word-final and following segments    true, false
Palato-alveolar                  word-final and following segments    true, false
Nasal                            following segment                    true, false
Liquid / Glide                   following segment                    true, false
Rate of speech                                                        number of lexical segments per second (logged)
Shares place of articulation     word-final segment                   true, false
  with following segment
Syllable stress                  word-final segment                   primary, secondary, no stress
The model was fit using R (R Development Core Team, 2011). Due to the large
number of control variables, redundant variables were removed based on their AIC
(Akaike, 1974) using backward-elimination.
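The model in the text was fit in R. The sketch below is a generic Python illustration of AIC-based backward elimination and not a reproduction of that procedure. The `fit` callable, the predictor names and the toy log-likelihood gains are all hypothetical stand-ins for refitting the mixed-effects model.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood

def backward_eliminate(predictors, fit):
    """Drop one predictor at a time for as long as doing so lowers the AIC.
    fit(predictors) must return (log_likelihood, n_params)."""
    current = list(predictors)
    best = aic(*fit(current))
    while current:
        scored = [(aic(*fit([p for p in current if p != drop])), drop)
                  for drop in current]
        candidate_aic, drop = min(scored)
        if candidate_aic >= best:
            break  # every removal makes the model worse
        best = candidate_aic
        current.remove(drop)
    return current, best

# Hypothetical stand-in for model fitting: each predictor adds a fixed gain.
toy_gain = {"informativity": 10.0, "rate": 4.0, "noise": 0.2}

def toy_fit(predictors):
    log_likelihood = -100.0 + sum(toy_gain[p] for p in predictors)
    return log_likelihood, len(predictors) + 1  # +1 for the intercept

kept, kept_aic = backward_eliminate(["informativity", "rate", "noise"], toy_fit)
```

In the toy example the predictor with a negligible gain is removed because the two-unit penalty for its parameter outweighs its contribution to the likelihood, while the other two predictors are retained.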
Results The full results are in Table 3.5. As expected, several phonological and
phonetic factors influenced the deletion rates of word-final obstruents. As the rate
of speech increased, speakers were more likely to omit word-final obstruents (p <
0.005). Obstruents at the coda of syllables that had primary stress were less likely to
delete (p < 0.01), but secondary stress was not significantly different than no stress.
Voiced obstruents were less likely to delete than voiceless ones (p < 0.001) and dental
fricatives were more likely to delete than other obstruents (p < 0.05).
Final obstruents in low-probability words were less likely to delete (p < 0.05),
signifying that information-utility operates at several levels, not just that of the
segment. The relatively high p-value of word probability is due to the use of words as
a random effect: refitting the model without words as a random effect lowered the
p-value of word probability to $p < 10^{-8}$, which indicates that word frequency is a
strong predictor of deletion.
Importantly, the informativity of the obstruent significantly influenced deletion
likelihood ($p < 10^{-4}$). Highly informative segments were significantly less likely to
delete. No residual effect of contextual predictability of the segment was found, sug-
gesting that the choice to emphasize the role of informativity in evaluating information
estimates is justified. Additionally, it is interesting to note that the control variables
that treated stops, fricatives and affricates as different from one another did not im-
prove the explanatory power of the model and were removed from the model in the
backward-elimination process. The difference between stops and fricatives becomes
significant if informativity is removed from the model ($p < 10^{-13}$), which suggests that
informativity accounts for the difference between stops and fricatives in the best
model. Similarly, in the final model dorsals were not different than coronals, and
labials were more likely to delete than coronals (p < 0.005), despite the fact that
coronals delete more than both labials and dorsals, as shown in §3.7.4. The lack of
difference between dorsals and coronals, and the increased likelihood of labial deletion,
also follow from the inclusion of informativity.
Table 3.5: Post-vocalic pre-consonantal word-final obstruent deletion fixed effects
Variable                Estimate    Std. Error    z value    Pr(>|z|)
(Intercept)             -1.12257    0.98621       -1.138     0.25501
Primary stress          -0.72394    0.28007       -2.585     0.00974
Secondary stress        -0.53362    0.56354       -0.947     0.34368
Informativity           -1.05090    0.24197       -4.343     1.41e-05
Word-final is voiced    -0.99143    0.28257       -3.509     0.00045
Word-final is dorsal    -0.13582    0.46377       -0.293     0.76963
Word-final is labial     1.49029    0.45980        3.241     0.00119
Word-final is dental     2.90575    1.20661        2.408     0.01603
Rate of speech           0.46919    0.15641        3.000     0.00270
Word frequency          -0.09606    0.04489       -2.140     0.03236
Following voiced        -0.22264    0.13117       -1.697     0.08964
Following affricate     -0.99216    0.79798       -1.243     0.21375
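The informativity row of Table 3.5 can be unpacked with a few lines of arithmetic. The sketch assumes the usual Wald interpretation of a logistic regression coefficient (the z value is the estimate divided by its standard error, and the p-value is two-sided under a standard normal distribution); the variable names are illustrative.

```python
import math

estimate, std_error = -1.05090, 0.24197  # informativity row of Table 3.5

z_value = estimate / std_error
# Two-sided p-value under the standard normal distribution.
p_value = math.erfc(abs(z_value) / math.sqrt(2))
# Multiplicative change in the odds of deletion per unit of informativity.
odds_ratio = math.exp(estimate)
```

Under these assumptions the odds ratio comes out at roughly 0.35, which means that each additional unit of informativity multiplies the odds of deletion by about a third, other predictors being held constant.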
Discussion As predicted by MULE, the experimental model shows that as the in-
formation estimate of a segment increases, speakers are less likely to yield to the
pressure to delete. The information estimate proved to be significant after controlling
for possible phonological factors, many of which have been less influential in predict-
ing word-final deletion. This finding demonstrates the advantage of MULE over its
alternatives in providing the right bias that would explain why in English /t/ and /d/
undergo multiple weakening processes, even though both segments appear in many
languages, are unmarked, require relatively little effort, and are usually very frequent.
3.6 The necessity of multiple scales and language-specificity
3.6.1 The difference between MULE and current theories
The treatment of parallel weakening offered by MULE combines the universal scales
of effort and confusability with a language-specific scale of information utility. In this
section I show why any explanation for parallel weakening has to combine several
scales, and why at least one of those scales has to be language-specific.
3.6.2 Parallel weakening in standard OT
Current prominent theories in phonology cannot easily express what would make a
segment prone to undergo weakening in one language and not in others. While there
are clear ways to account for each and every existing process separately, expressing
proneness to undergo weakening is not trivial. For instance, the standard OT frame-
work (Prince and Smolensky, 1993; McCarthy and Prince, 1995) makes no prediction
with respect to the probability of weakening processes since the standard treatments
of markedness in OT allow targeting any subset of segments for weakening processes.
De Lacy (2002, §6.3) provides many examples of gapped inventories: languages in
which some subset of possible segments undergoes weakening. De Lacy assumes a
markedness hierarchy in which dorsals (K) outrank labials (P), which outrank coronals
(T), which outrank glottals (ʔ). Some of the languages weaken only the most
marked or only the least marked segments, but some weaken only intermediately
marked segments.
In some sound systems such as US English only unmarked segments weaken:
/t/ weakens but /k/ does not. Such sound systems necessitate marked faithfulness:
constraints that are violated when marked segments are weakened, but not when
unmarked segments are weakened. Conversely, the selective weakening of marked
segments in marked environments in neutralization processes requires constraint con-
junction (Smolensky 1995 who cites Smolensky 1993) as in Ito and Mester (2003). A
conjunction of two markedness constraints outranks its conjuncts, but may outrank
or be outranked by faithfulness constraints.
An OT grammar that allows both constraint conjunction and marked faithfulness
can describe weakening processes that target any subset of segments. The following
tableaux demonstrate selective coda weakening. In (3.30) only dorsals lenite, in (3.31)
only labials lenite and in (3.32) only coronals lenite.16 All other combinations in the
power set of {P,T,K} are likewise possible.
(3.30) No dorsal codas
  tat.tap.tak    NoCoda&*K  Ident{K}  Ident{P}  NoCoda&*P  Ident  NoCoda&*T
  tat.tap.tak    *!                             *                 *
  taʔ.tap.tak    *!                             *          *
  tat.taʔ.tak    *!                   *                    *      *
☞ tat.tap.taʔ               *                   *          *      *
  taʔ.taʔ.tak    *!                   *                    **
  tat.taʔ.taʔ               *         *!                   **     *
  taʔ.tap.taʔ               *                   *          **!
  taʔ.taʔ.taʔ               *         *!                   ***
16Max is not shown and outranks all other constraints, disallowing deletion.
(3.31) No labial codas
  tat.tap.tak    Ident{K}  NoCoda&*K  NoCoda&*P  Ident{P}  Ident  NoCoda&*T
  tat.tap.tak              *          *!                          *
  taʔ.tap.tak              *          *!                   *
☞ tat.taʔ.tak              *                     *         *      *
  tat.tap.taʔ    *!                   *                    *      *
  taʔ.taʔ.tak              *                     *         **!
  tat.taʔ.taʔ    *!                              *         **     *
  taʔ.tap.taʔ    *!                   *                    **
  taʔ.taʔ.taʔ    *!                              *         ***
(3.32) No coronal codas
  tat.tap.tak    Ident{K}  Ident{P}  NoCoda&*K  NoCoda&*P  NoCoda&*T  Ident
  tat.tap.tak                        *          *          *!
☞ taʔ.tap.tak                        *          *                     *
  tat.taʔ.tak              *!        *                     *          *
  tat.tap.taʔ    *!                             *          *          *
  taʔ.taʔ.tak              *!        *                                **
  tat.taʔ.taʔ    *!        *                               *          **
  taʔ.tap.taʔ    *!                             *                     **
  taʔ.taʔ.taʔ    *!        *                                          ***
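The three tableaux can be reproduced with a small strict-domination evaluator. The sketch below is an illustration, not part of the original analysis: the function names are hypothetical, the glottal stop is written ʔ to keep it distinct from the labial class P, and the violation profiles are assumed to be as in (3.30)–(3.32), with each conjoined constraint violated by a retained coda of the relevant place, Ident{K} and Ident{P} violated by a changed dorsal or labial, and Ident violated once per changed coda.

```python
from itertools import product

GLOTTAL = "ʔ"
CODAS = ("t", "p", "k")  # the three codas of the input tat.tap.tak

def violations(output_codas):
    """Violation counts for one candidate, given its three output codas."""
    t_changed, p_changed, k_changed = (o != i for o, i in zip(output_codas, CODAS))
    return {
        "NoCoda&*K": int(not k_changed),
        "NoCoda&*P": int(not p_changed),
        "NoCoda&*T": int(not t_changed),
        "Ident{K}": int(k_changed),
        "Ident{P}": int(p_changed),
        "Ident": int(t_changed) + int(p_changed) + int(k_changed),
    }

def winner(ranking):
    """Strict domination: compare violation profiles lexicographically."""
    candidates = product(*[(coda, GLOTTAL) for coda in CODAS])
    best = min(candidates, key=lambda cand: [violations(cand)[c] for c in ranking])
    return ".".join("ta" + coda for coda in best)

no_dorsal = ["NoCoda&*K", "Ident{K}", "Ident{P}", "NoCoda&*P", "Ident", "NoCoda&*T"]
no_labial = ["Ident{K}", "NoCoda&*K", "NoCoda&*P", "Ident{P}", "Ident", "NoCoda&*T"]
no_coronal = ["Ident{K}", "Ident{P}", "NoCoda&*K", "NoCoda&*P", "NoCoda&*T", "Ident"]
```

Calling `winner` with each of the three rankings selects the candidate in which only the dorsal, only the labial, or only the coronal coda is replaced by a glottal stop. Other permutations of the same six constraints can be tried in the same way to explore the remaining subsets.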
The reason that every subset of segments can be targeted by weakening is that
currently, markedness in OT has a single universal scale: in every language marked-
ness and marked-faithfulness are identical and share the exact same order (in classical
OT) or constrain the same sets of elements (Kiparsky, 1994; de Lacy, 2002). There-
fore, an OT grammar with a single universal markedness hierarchy can always order
the faithfulness and markedness constraints targeting a particular level of marked-
ness independently of the relative ranking of constraints targeting another level of
markedness. Thus the markedness constraints forbidding dorsals can be ranked above
or below the faithfulness constraint preserving dorsals independently of the relative
ranking between the markedness constraint forbidding labials and the faithfulness
constraints preserving labials. Since both theories allow every subset of segments
to undergo lenition or deletion, they can make no predictions with respect to which
segments are more likely to undergo weakening.
3.6.3 No single scale can replace markedness
OT uses a single scale for both markedness and marked-faithfulness, and it is the
reliance on a single scale that brings about the lack of predictive power. It is important
to understand that it is very unlikely that any functionally motivated scale could
substitute for the use of markedness, even if that scale were allowed to be language
specific. The contrast between parallel weakening in Arabic and English demonstrates
that problem well.
In Arabic /q/ attracts multiple weakening processes and in English /t/ attracts
multiple weakening processes. It is quite difficult to find any single scale that would
apply similarly to /t/ and /q/. /t/ is unmarked cross-linguistically, pronounced using
the tip of the tongue and is rather frequent in English. /q/, on the other hand, is
marked cross-linguistically, pronounced using the back of the tongue and is relatively
infrequent in Arabic (it is not the least frequent segment either, as it is one of the
most frequent root radicals). I am not aware of any other functionally motivated scale
that would treat the English /t/ and the Arabic /q/ as more similar to one another
than to other segments that do not undergo weakening. Each of the proposed scales
(coronal/dorsal contrast, cross-linguistic frequency and in-language frequency) would
predict that one segment should attract weakening and the other should not. OT’s
solution is to offer a configuration in which in some cases the high end of the scale is
targeted and in other cases the low end of the scale is targeted, but as I showed in the
previous section, that solution can describe any weakening pattern, and is therefore
not predictive or explanatory.
3.6.4 Universal scales do not suffice
Functional accounts in linguistics are often based on balancing two or more univer-
sal scales. The principle of least effort in Zipf (1949) argues that speakers attempt
to minimize their effort in articulation. But if that desire were not constrained by
some other functional force, speakers would arguably say nothing at all, conserving
all effort. Therefore, Zipf argues that speakers’ economy is bounded by their desire to
make themselves understood by their listeners. The speakers’ effort is therefore con-
strained by the listener’s capacity to understand what is said without putting in too
much effort. Similarly, several phonetically motivated theories in phonology (Flem-
ming, 2004; Boersma, 2003) balance articulatory effort with perceptual confusability.
Speakers’ attempt to reduce their effort may hinder the ability of the listener to tell
different segments apart, which limits the speakers’ effort reduction.
In theories that balance different functional forces such as the ones mentioned
above, it is easy to predict that a change that gains on all scales will always be
desirable, and a change that loses on all scales will always be undesirable. But
as Boersma (2003) shows, it is perfectly possible to gain on one scale while losing
on another, leading to a change that is neither desirable nor undesirable. Therefore,
such theories allow for multiple possible equilibria between the functional forces being
balanced. Since articulatory effort and perceptual confusability are based on human
physiology and psychology, they are expected to be universal, and have a similar effect
in all languages. Since multiple equilibria are available for every language to choose
from, each language may assign different importance to each functional force, leading
to different yet linguistically plausible outcomes in each language.
However, if all equilibria are equally plausible, language-specific patterns of par-
allel weakening cannot be predicted. Theories that are based on universal functional
forces would make similar predictions for languages that have similar contrasts among
segments. Languages that have a similar inventory of oral stops such as English, Span-
ish and Modern Hebrew would be expected to choose from a similar set of equilibria
when undergoing change. Spanish and Modern Hebrew would therefore be equally
likely to have a weakening-prone /t/ as English does, but this is not the case. Thus,
the theoretical solution for parallel weakening cannot rely exclusively on universal
forces.
3.7 Information-theoretic accounts
3.7.1 Information-theoretic explanations
MULE builds on several existing advances in using information theory (Shannon,
1948) to account for phonological weakening processes. In this section I review a
number of these accounts, showing their respective strengths and shortcomings.
Understanding their properties explains why I found it necessary not to rely solely
on information-theoretic accounts in order to account for parallel weakening
processes, and why I focused on informativity as the core value of information
estimates rather than on more commonly used information-theoretic measurements
such as frequency, predictability and entropy.
3.7.2 Functional load as entropy
Theoretical outline of functional load and entropy
Hockett (1955) proposes an information-theoretic approach to Martinet’s prediction
that languages would not collapse phonemic distinctions which would result in the
loss of too much information. The information-theoretic approximation was later
extended by Surendran and Niyogi (2006), who also showed that such approximations
of functional load make the wrong prediction with respect to a number of
complete neutralization cases.
The basic measurement in the quantification of functional load in Hockett (1955)
and Surendran and Niyogi (2006) is entropy. In a linguistic context the entropy of
a language is the expected (mean) predictability of each linguistic element given the
information that is already known to a listener. Consider the partial sentence in
(3.33):
(3.33) An ap. . .
The search engine I am using suggests that what I am really trying to search for when
I type (3.33) is an apple a day, but other completions are certainly possible. If we
kept playing this game, guessing one word or one sound at a time, we would be able
to estimate how predictable each word is. If we average the predictability across all
the cases, the end product would be an estimate of the entropy of English. Shannon
(1951) applies this strategy to the evaluation of the entropy of characters in printed
English.
When a language distinguishes between two or more classes that are treated as
identical in other languages its entropy increases, as the guessing game becomes
harder. In a language in which there is no gender marking, it is easier to predict
what the next pronoun is going to be in a context such as (3.34a), since there is no
need to distinguish between (3.34b) and (3.34c).
(3.34) (a) I have seen the new professor_i but I haven’t talked to . . .
(b) I have seen the new professor_i but I haven’t talked to her_i
(c) I have seen the new professor_i but I haven’t talked to him_i
The quantification of functional load using entropy relies on the difference in entropy
between a language as it currently is, and a minimally different language in which
some distinction is eliminated from the language. The more the entropy of a language
drops by eliminating a distinction, the more important that distinction is, making it
unlikely that the language would lose that distinction.17 For example, the functional
load of the difference between /k/ and /g/ in English is estimated by deducting the
entropy of a slightly modified English in which every /k/ and every /g/ are replaced
by a single different phoneme, /σ/, from the entropy of real English.
Measuring entropy differences as functional load has two important implications:
• It is not possible to evaluate the amount of information of a single linguistic
element, but it is possible to evaluate the information of the difference be-
tween linguistic elements. Changes that do not lead to the loss of information
(that do not collapse distinctions) have a zero cost. Many weakening processes
do not collapse distinctions, making functional load unable to predict which
information-preserving processes would be more likely to occur.
17 Hockett (1955) and Surendran and Niyogi (2006) also divide the difference by the entropy of the unmodified language, but since for a given language the division operation only scales the measurements, I omit it for simplicity’s sake.
• Collapsing two infrequent linguistic elements does not cost as much as collaps-
ing frequent ones. This is caused by counting observed events together with
unobserved events: each case in which a distinction is not lost counts towards
making that distinction unnecessary. In a language in which the ratio of /t/ to /d/
and of /k/ to /g/ is 3 : 1, and the ratio of /t/ to /k/ is
2 : 1, an entropy-based account would predict that collapsing /t/ and /d/ is
worse than collapsing /k/ and /g/. The predictability-based accounts presented
in the following sections predict that the distinction between /t/ and /d/ and
between /k/ and /g/ would be equally important.
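The frequency effect described in the second point can be illustrated with a small calculation. The sketch below is not part of the original study: it assumes a hypothetical inventory of only four segments drawn independently, with token counts chosen to match the ratios in the text (/t/ : /d/ = 3 : 1, /k/ : /g/ = 3 : 1, /t/ : /k/ = 2 : 1), and it uses the standard Shannon entropy. The function names are illustrative.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values() if c > 0)


def merge_cost(counts, a, b):
    """Entropy lost when segments a and b collapse into a single segment."""
    merged = {seg: n for seg, n in counts.items() if seg not in (a, b)}
    merged[a + b] = counts[a] + counts[b]
    return entropy(counts) - entropy(merged)


# Hypothetical counts that satisfy the ratios given in the text.
counts = {"t": 6, "d": 2, "k": 3, "g": 1}

td_cost = merge_cost(counts, "t", "d")
kg_cost = merge_cost(counts, "k", "g")
print(f"collapse /t/-/d/: {td_cost:.4f} bits")
print(f"collapse /k/-/g/: {kg_cost:.4f} bits")
```

Under these assumptions the /t/-/d/ merger costs exactly twice as much entropy as the /k/-/g/ merger, even though the two pairs have the same internal ratio, because the coronal pair accounts for twice as many tokens.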
As Surendran and Niyogi (2006) observed, functional load makes the wrong pre-
dictions in a number of cases. In the following section I present a corpus-based study
which shows that the same holds for the prediction of final consonant deletion in
English.
Applying functional load to English final consonant deletion
Introduction Deletion processes are a case in which functional load could come
into effect, since deletion may collapse two word forms together. Obligatory final
/t/-deletion could collapse walk and walked, Ben and bent and so forth. Obligatory
final /k/-deletion would affect fewer words, but would also collapse words: make and
may. I attempted to follow the predictions made by the functional load approach and
apply them to the case of final consonant deletion in English.
Materials and Methods It is possible to estimate which segment is more likely
to delete under the functional load account by measuring the difference between the
entropy of normal English, and the entropy of minimally different languages in which
only final /k/, only final /t/, only final /p/ and so forth have been deleted. I used word
counts from the Switchboard corpus (Godfrey and Holliman, 1997) and the Buckeye
Corpus (Pitt et al., 2007) and the representation of these words as it appeared in the
CMU dictionary (Weide, 1998). The entropy was evaluated using the frequency of
each word in Switchboard and Buckeye (a unigram language model). The entropy of
each language was evaluated as in (3.35).
(3.35) $-\sum_{\text{word}} \log_2 \frac{\text{word occurrences}}{\text{all word occurrences}}$
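The procedure described above can be sketched as follows. This is an illustration rather than the code used in the study: it assumes the standard Shannon entropy of a unigram model, in which each word's log probability is weighted by its relative frequency, and it uses a hypothetical six-word lexicon with invented counts and CMU-style transcriptions in place of the Switchboard and Buckeye counts. All function names are illustrative.

```python
from collections import Counter
from math import log2


def unigram_entropy(form_counts):
    """Shannon entropy (bits) of a unigram model over word forms."""
    total = sum(form_counts.values())
    return -sum(c / total * log2(c / total) for c in form_counts.values())


def delete_final(lexicon, segment):
    """Return the lexicon of a language in which word-final `segment` is deleted.

    Forms that become identical after deletion pool their counts.
    """
    merged = Counter()
    for pron, count in lexicon.items():
        if pron and pron[-1] == segment:
            pron = pron[:-1]
        merged[pron] += count
    return merged


def functional_load(lexicon, segment):
    """Entropy of the original language minus entropy after final deletion."""
    return unigram_entropy(lexicon) - unigram_entropy(delete_final(lexicon, segment))


# Hypothetical counts; pronunciations are tuples of CMU-style symbols.
toy_lexicon = {
    ("W", "AO", "K"): 40,        # walk
    ("W", "AO", "K", "T"): 20,   # walked
    ("B", "EH", "N"): 15,        # Ben
    ("B", "EH", "N", "T"): 5,    # bent
    ("M", "EY", "K"): 30,        # make
    ("M", "EY"): 10,             # may
}

for seg in ("T", "K", "P"):
    print(seg, round(functional_load(toy_lexicon, seg), 4))
```

In this toy lexicon no word ends in /p/, so deleting final /p/ conflates nothing and its functional load is zero, which is the property that the discussion of /b/ and /g/ turns on.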
Results The results are the differences in entropy between English and the language
in which each consonant was deleted word-finally. See Table 3.6, in which the values
are scaled by a factor of 100.
Table 3.6: Functional load of English with different final consonant deletion, scaled by a factor of 100
Phone  Delta Entropy    Phone  Delta Entropy    Phone  Delta Entropy
/z/    7.5248           /p/    1.7043           /r/    0.0458
/n/    5.2305           /k/    1.3200           /ŋ/    0.0317
/s/    4.6838           /l/    0.4695           /g/    0.0275
/v/    4.4194           /θ/    0.2652           /b/    0.0202
/d/    3.4752           /f/    0.1725           /ʃ/    0.0055
/t/    2.5521           /tʃ/   0.1477           /ð/    0.0027
/m/    2.5133           /dʒ/   0.0855           /ʒ/    0.0006
Among the stops, functional load predicts that the greatest amount of information
would be lost if /d/ and /t/ are deleted word-finally, followed by /p/, /k/, /g/ and
/b/. Moreover, deleting word-final /d/s causes the loss of at least 170 times as much
information as the deletion of /b/s.
Discussion The predictions made by functional load are incorrect as /d/ and /t/
are more likely to delete word-finally than other stops. The reason for the incorrect
prediction lies in the way information is measured in functional load accounts. Since
unobserved events are as important as observed events, the fact that very few words
are conflated by the deletion of word-final /b/ or /g/ predicts that word-final /b/ and
/g/ could be deleted without a significant loss of information. Every case in which a word is
not conflated with some other word by /b/ or /g/ deletion counts towards making
final /b/ and /g/ redundant. The amount of information conveyed by the /b/s and
/g/s that do appear is drowned by the number of cases in which they do not appear.
3.7.3 Frequency
Zipf (1929) expects frequently used segments to weaken more than infrequent ones.
Zipf’s prediction is based on the expectation that when words, morphemes and segments
are used more frequently than others, their articulation is under greater pressure
to become efficient. High efficiency would usually mean forms that are shorter, reduced and
require less articulatory effort. Following the terms used in this chapter, Zipf’s proposal
can be viewed as suggesting that frequent segments should be more prone to
undergo weakening than infrequent segments. This view is also expressed in Haspel-
math (2006, 2008), who relies on Zipf’s (1949) notion of frequency-dependent effort
to predict historical morphological change.
The intuitive prediction that frequent linguistic elements should require less effort
goes a long way in predicting weakening processes in American English. In the Buck-
eye corpus (Pitt et al., 2007) /t/ and /d/ are indeed more frequent than any other
oral stop, and they do delete more frequently word-finally and weaken in intervocalic
contexts. Chapter 2 shows that frequency goes deeper than that, as frequency also
successfully contrasts word-medial deletions in spontaneous speech for all oral stops,
as /k/ is more frequent and deletes more frequently than /p/ word-medially, and /b/
is more frequent and deletes more frequently than /g/ word-medially.
While intuitive and appealing, the prediction that frequent elements should weaken
more than infrequent elements has its shortcomings. First, frequency does not manage
to capture the asymmetry between nasal stops: /ŋ/ is less frequent than /m/,
but deletes more frequently word-medially. Second, it is not the case that the most
frequent segments are the first to weaken cross-linguistically. While in Egyptian Arabic
(Kilany et al., 1997) /t/ and /b/ are more frequent than other oral stops, it is
/q/, one of the less frequent oral stops, that undergoes regular weakening to /ʔ/.
As discussed in §3.2.2, coronal-targeting weakening processes are a minority among
all weakening processes, but coronals are often among the most frequent segments a
language has, and appear in more positions than other segments.
Third, purely frequency-based accounts (by definition) do not take effort into
account, but rather expect effort to follow from frequency. Therefore, the proposed
Zipfian account makes sense only if two linguistic elements require the same amount
of effort and differ only by frequency. When both frequency and effort differ, an
effort-reduction account may well predict that the less frequent linguistic element
should be reduced. Under the assumption that the effort is weighted by frequency,
the combined effort can be approximated by multiplying a linguistic element’s effort
with its frequency. In example (3.36), σ1 and σ2 are segments; σ1 is twice as
frequent as σ2, and requires one third of the effort required to pronounce σ2. σ2
requires greater weighted effort than σ1 as the effort associated with σ1 multiplied by
its frequency is smaller than the effort associated with σ2 multiplied by the frequency
of σ2. Therefore, σ2 would be under greater pressure to lenite, even though it is less
frequent.
(3.36) Segment  Effort  Frequency  Weighted effort
       σ1       e       2f         2ef
       σ2       3e      f          3ef
The amount of effort weighted by frequency therefore has an optimum that could
be violated by weakening the more frequent element. Zipf’s more general principle
of least effort does not predict that the more frequent segment would weaken first in
this case.
Finally, it is also possible that Zipf’s prediction, which was exemplified using words,
would not hold for highly frequent linguistic elements such as segments due to ceiling
effects on practice, as opposed to the contrast between frequent and infrequent words.
While it is true that frequent segments are articulated more frequently than infrequent
segments, a native speaker has had a chance to articulate infrequent segments a
substantial number of times, possibly enough to lead to optimized pronunciation for
all segments.
In summary, I argue that while frequency can be demonstrated to be a powerful
predictor of weakening processes, it cannot be used as an exclusive driving force in
actuating such processes, and has to be integrated with other functional forces such as
articulatory effort. One could argue that a minimally different version of MULE could
rely on frequency rather than on informativity in assessing information utility. The
differences between such theories should withstand empirical examinations similar to
the one performed in §3.5.3.
3.7.4 Predictability
Theoretical outline of predictability accounts
A third family of models uses predictability as a force that derives weakening and
other structural changes. A substantial amount of research demonstrates that pre-
dictable linguistic elements tend to have shorter duration than unpredictable ones as
in Jurafsky et al. (2001); van Son and Pols (2003); Aylett and Turk (2004); van Son
and van Santen (2005); Bell et al. (2009); Pluymaekers et al. (2005); Raymond et al.
(2006), to name a few. Hume (2004, 2008) suggests that high predictability leads to
instability, which in turn increases the likelihood of weakening. A segment that is more predictable is
therefore predicted to be more prone to undergo weakening than less predictable segments.
Final /t/ and /d/ have been demonstrated to delete more often when they are pre-
dictable from context than when they are not, as in the data presented by Guy (1991,
table 1). Predictability-based accounts correctly predict higher deletion rates for the
/t/ of ‘kept’ than for the /t/ of ‘walked’ since the past form is already conveyed by
the strong-verb inflection of ‘keep’ in ‘kept’, making the /t/ that follows ‘kep–’ more
predictable.
The advantage of predictability over frequency is in shifting the focus from the
mechanics of communication alone to the goal of communication. Like frequency, pre-
dictability can be interpreted as reflecting on the psychological reality of language, for
instance if predictable linguistic elements are easier to retrieve than less predictable
linguistic elements. Unlike frequency, predictability can also be interpreted as an
abstract quantification of the information speakers try to transmit across a commu-
nication channel (Aylett and Turk, 2004; Levy and Jaeger, 2007; Jaeger, 2010). If
speakers and listeners share probabilistic knowledge about the language and the con-
text of the utterance, speakers may reduce the utterance time of predictable (and
therefore redundant) information. The reduction in utterance time can in turn lead
to weakening.
The criticism applied in the previous section with respect to Zipf’s prediction that
frequent segments should weaken before less frequent segments can be used to argue
against relying exclusively on predictability as well. The problem is in determining
the optimal point beyond which no further optimization is necessary. Some segments
are shorter and less salient than others due to their phonetic properties. When these
short segments are also predictable in some context, it might be that they should not
be further reduced.
Another issue is that predictability-based accounts expect speakers to strive to
achieve the most efficient communication at the level of each and every utterance:
an unpredictable /t/ should be kept while a predictable /p/ should be eliminated.
But this assumption is contradicted by the exceptionless properties of sound change
– since American English has an optional rule that deletes word-final /t/s, but does
not have an equivalent rule that deletes word-final /p/, unpredictable /t/s may be
deleted in cases where predictable /p/ would be kept. This prediction is tested in the
following section.
Predictability in American English final stop deletion and duration
Introduction Under predictability accounts segment deletion is motivated by its
redundancy. In other words predictable segments are deleted. If segment deletion is
motivated by that segment’s predictability alone, then predictable /t/s should delete
just as frequently as predictable /p/s and /k/s.
Methods and materials Post-vocalic word-final stops were collected from the
Buckeye corpus (Pitt et al., 2007), and an edit-distance program determined whether
they were deleted. Each word was assumed to have its CMU dictionary (Weide, 1998)
representation. Words that did not appear in the CMU dictionary were removed.
Word counts were established using both the Buckeye corpus and the Switchboard
corpus (Godfrey and Holliman, 1997) in order to evaluate how predictable the final
consonant was in its context. Cases in which the word-final stop was followed by a
homorganic segment in the following word were excluded. Since very few words had
final /b/ or /g/, all voiced stops were excluded as well.
The resulting set was further limited in one case to completely redundant stops
(that is, segments that are completely predictable from the preceding segments) such
as the /k/ in alcoholic, the /p/ in gossip and the /t/ in closet. In another case, the
resulting set was restricted to segments that were less than 1 : 25 likely to follow the
preceding context such as the /k/ in kick, the /p/ in hop and the /t/ in bat. The latter
set will be labeled the surprising set in order to avoid confusion.
The number of segments that were transcribed as missing was compared with
the number of the segments that were transcribed as non-missing (though possibly
reduced). The resulting contingency tables were evaluated using Fisher’s exact test
(similar to χ2, but necessary because some of the cells have low counts). The null
hypothesis is that there is no difference between /k/, /p/ and /t/. Rejecting the
hypothesis would mean that the deletion ratios of the stops differ significantly.
Results
(3.37) Redundant (predictable) voiceless stops:
Stop Not deleted Deleted
p 29 4
t 195 12
k 128 4
The null hypothesis could only be rejected with marginal significance (p < 0.1).
The deletion rates of the redundant voiceless stops /k/, /p/ and /t/ are marginally
different from one another.
(3.38) Surprising (unpredictable) voiceless stops:
Stop Not deleted Deleted
p 28 0
t 22 5
k 71 0
The null hypothesis was rejected with p < 0.0005. The deletion rates of the
surprising voiceless stops /k/, /p/ and /t/ differ from one another.
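The test applied to (3.37) and (3.38) can be sketched in a few lines. The text does not say which implementation was used, so the code below is an assumption: it implements the usual extension of Fisher's exact test to an r × 2 table (the Fisher-Freeman-Halton test), enumerating every table with the observed margins and summing the probabilities of those that are no more likely than the observed one. The function names are illustrative; the counts are those reported above.

```python
from fractions import Fraction
from math import comb


def allocations(row_totals, deleted_total):
    """Yield every split of `deleted_total` deletions across the rows."""
    if len(row_totals) == 1:
        if 0 <= deleted_total <= row_totals[0]:
            yield (deleted_total,)
        return
    for d in range(min(row_totals[0], deleted_total) + 1):
        for rest in allocations(row_totals[1:], deleted_total - d):
            yield (d,) + rest


def table_weight(row_totals, deleted):
    """Unnormalized hypergeometric probability of one table (exact integer)."""
    weight = 1
    for n, d in zip(row_totals, deleted):
        weight *= comb(n, d)
    return weight


def exact_test(table):
    """Two-sided exact p-value for rows of (not deleted, deleted) counts."""
    row_totals = [kept + deleted for kept, deleted in table]
    observed = [deleted for _, deleted in table]
    observed_weight = table_weight(row_totals, observed)
    total = as_extreme = 0
    for alloc in allocations(row_totals, sum(observed)):
        w = table_weight(row_totals, alloc)
        total += w
        if w <= observed_weight:  # table is no more probable than the observed one
            as_extreme += w
    return float(Fraction(as_extreme, total))


# Rows are /p/, /t/, /k/; columns are (not deleted, deleted).
redundant = [(29, 4), (195, 12), (128, 4)]   # (3.37)
surprising = [(28, 0), (22, 5), (71, 0)]     # (3.38)

p_redundant = exact_test(redundant)
p_surprising = exact_test(surprising)
print(p_redundant, p_surprising)
```

Integer weights are compared before normalizing so that ties with the observed table are not affected by floating-point rounding.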
Discussion It is impossible to conclude that the three voiceless stops have differ-
ent deletion rates when they are completely redundant, though there is a trend that
suggests that /t/ deletes more than /p/ and /k/ in these cases. All redundant stops
delete, as predicted by predictability-based accounts. However, the difference emerges
among surprising (unpredictable) voiceless stops. As predicted by predictability
accounts, /p/ and /k/ do not delete when they are surprising in the
context they appear in, but /t/ does delete even when it is surprising, as follows from the
exceptionless properties of sound change. The difference between surprising stops is
not predicted by predictability-based accounts. It is therefore difficult to attribute
the difference in the deletion ratios of word-final stops to predictability alone.
Furthermore, a post-hoc analysis of deletion rates of /p/, /t/ and /k/ across
redundant (3.37) and surprising (3.38) conditions, shows that /p/ and /k/ do not
differ significantly across the two conditions (p > 0.1 and p > 0.25, respectively), but
/t/ deletes more frequently when it is surprising than when it is redundant (p < 0.05).
That /t/ would delete more when it is surprising than when it is redundant is not
predicted by predictability-based accounts (but see Raymond et al. 2006 who found
a similar effect for word-medial /t/ and /d/ in coda positions).
One of the interesting differences between (3.37) and (3.38) is the frequency of /t/
in each of these tables. While among the redundant voiceless stops /t/ is the most
frequent, it is the least frequent among the surprising voiceless stops. If speakers were
recording how redundant a segment is across all contexts, they could easily note that
/t/ is more frequently redundant than surprising. This observation lies at the heart
of informativity accounts, discussed in the following section.
3.7.5 Informativity accounts
In an attempt to account for the exceptionless tendencies of word-medial consonant
deletion, in chapter 2, I use a different information-theoretic approach. Rather than
rely on the actual predictability of a segment (3.39), speakers assess how important
a segment is by relying on how predictable it is on average (using the expected value
of the predictability), that is, its informativity (3.40).
(3.39) $-\log \Pr(\text{segment} \mid \text{context})$
(3.40) $E(-\log \Pr(\text{segment}))$
In information theory (Shannon, 1948) predictability is equated with providing
little unknown information, and being unpredictable means providing more informa-
tion. A segment that is usually predictable is therefore said to have low informativity
and a segment that is usually unpredictable has high informativity. The prediction
is that segments that are less informative would lenite more than segments that are
highly informative, everything else being equal.
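The contrast between (3.39) and (3.40) can be made concrete with a small sketch. It assumes that the expectation in (3.40) is taken over the contexts in which a segment occurs, each weighted by how often the segment appears in it, and that logarithms are base 2. The counts and context labels are hypothetical, and the function names are illustrative.

```python
from math import log2

# Hypothetical (context, segment) token counts.
counts = {
    ("A", "t"): 2, ("A", "k"): 2,
    ("B", "t"): 3, ("B", "k"): 1,
}


def surprisal(counts, context, segment):
    """Local measure (3.39): -log2 Pr(segment | context)."""
    context_total = sum(n for (c, _), n in counts.items() if c == context)
    return -log2(counts[(context, segment)] / context_total)


def informativity(counts, segment):
    """Average surprisal of `segment` over the contexts it occurs in (3.40)."""
    segment_total = sum(n for (_, s), n in counts.items() if s == segment)
    return sum(
        n / segment_total * surprisal(counts, c, s)
        for (c, s), n in counts.items()
        if s == segment
    )


print(surprisal(counts, "B", "k"))   # local: /k/ is surprising in context B
print(informativity(counts, "t"))    # averaged over all of /t/'s contexts
print(informativity(counts, "k"))
```

With these counts /t/ receives a lower informativity than /k/, so an informativity account would treat every /t/ as less important than every /k/, including the tokens of /t/ that happen to occur in a context where they are locally unpredictable.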
This prediction is borne out in American English, as /t/ is the least informative
oral stop, and /d/ is the second least informative oral stop. Like frequency, informativity
captures the asymmetries between /k/ and /p/ and between /b/ and /g/ (see
§3.7.3) but also manages to capture the asymmetry between /ŋ/ and /m/: /ŋ/ is less
informative and deletes more than /m/, despite being less frequent than /m/. Piantadosi
et al. (2011) show that word informativity approximates word length better
than word frequency in a range of languages.
The main problem with predicting weakening using informativity alone is that
a weakening process may be biased to target informative segments, not only unin-
formative ones. Calculating the informativity of segments in Egyptian Arabic using
the LDC Egyptian Colloquial Arabic Lexicon (Kilany et al., 1997) shows that /q/
– a target of many weakening processes in Arabic dialects – is one of the more in-
formative segments in Egyptian Arabic (see Table 3.1 on page 71). The problem
is not Arabic-specific and is likely to reoccur whenever informativity is applied to
infrequent segments, since informativity is highly correlated with logged frequency:
frequent segments are more predictable, other things being equal. Like frequency,
informativity has to be evaluated while taking effort into account.
3.7.6 Why information-theoretic accounts do not suffice
The fundamental problem shared by frequency, predictability and informativity ac-
counts is the existence of optimal amounts of weakening that should not be exceeded.
It is possible to weaken an element beyond what its frequency, predictability or in-
formativity would predict. Predictability and informativity accounts (and some fre-
quency accounts) rely on covert or overt assumptions about either recoverability or
redundancy avoidance (Aylett and Turk, 2004; Levy and Jaeger, 2007). The recov-
erability of predictable elements is greater than the recoverability of unpredictable
elements, other things being equal, since predictable elements make a “better guess”
than unpredictable ones (Haspelmath, 2008, §6.1). Likewise in information-theoretic
terms predictable elements provide less information than unpredictable elements, and
are therefore more redundant, other things being equal. However, if such pressures
have always existed, they should already be reflected in today’s sound systems: pre-
dictable and uninformative segments would already have weaker cues and shorter
durations before any new weakening process applies. In the Buckeye corpus (Pitt
et al., 2007) the mean duration of onset /t/s that surface as [t]s (/t/→[t]) is shorter
than the mean duration of /p/→[p] and /k/→[k], and the mean duration of /d/→[d]
is shorter than the mean duration of /b/→[b] and /g/→[g]. In this setting, /t/ should
only weaken more if it is not short enough. Beyond that point there is no functional
reason for information-theoretic accounts to predict weakening.
In summary, information-theoretic accounts do not suffice for predicting parallel
weakening processes. The inability of information-theoretic accounts to predict which
segments would be prone to undergo weakening is evident in the fact that segments
that undergo weakening can be frequent, predictable and have low informativity such
as English /t/, but can also be infrequent, unpredictable and have high informativity
such as Arabic /q/. However, using information-theoretic measurements to account
for the functional pressure to preserve information and withstand effort-avoidance
does have predictive power, as the experiments discussed in this chapter have shown.
3.8 Variable deletion rates of stems and affixes
3.8.1 The contrast between American English and Puerto
Rican Spanish
MULE uses information utility as a functional force that motivates the preservation
of linguistic elements. It therefore faces similar criticism as other functional accounts.
Labov (1994, ch. 19), for instance, criticizes functional approaches on a case-by-case
basis.
One of Labov’s concerns is the contrast between the case of final /t/-deletion in
American English as described in Guy (1991) and the case of final /s/-deletion in
Puerto Rican Spanish as described in Poplack (1980). In American English word-
final /t,d/-deletion varies with respect to the functional properties of the /t/ and /d/
in question: stem-final /t,d/ delete more frequently than past-morpheme /t,d/. For
semi-weak verbs in which both the stem changes and an affix is added such as ‘kept’,
deletion rates are higher than for the past morpheme, but lower than for stem-final
/t,d/. In Puerto Rican Spanish stem-final /s/ deletes less frequently than affixed /s/.
Additionally, among /s/ affixes, the deletion rates of plural markers vary – plural
markers on nouns delete less frequently than plural markers on adjectives. Labov
claims that functional arguments should predict the same pattern of weakening for
American English and Puerto Rican Spanish. Instead, in American English affixed
/t/ deletes less frequently than stem-final /t/, but the pattern is reversed in Puerto
Rican Spanish, in which stem-final /s/ deletes less frequently than affixed /s/.
I will not address all the concerns raised in Labov (1994), but I will show why
MULE does not predict an identical pattern for affix vs. stem deletion patterns in
different languages, and contrast the predictions made by MULE in the English and
Spanish cases.
In MULE the actuation of segment-deleting weakening processes is motivated by
balancing effort with information utility. For both the American English final /t/-
deletion case and the Spanish final /s/-deletion processes, the segments in question are
the same across all conditions (/t/ in American English and /s/ in Spanish). There is
therefore no need to take effort into account, and information utility has to account for
the variable deletion rates across the different conditions. Up until now, I considered
two sources for information utility – the information of the segment (modeled using
informativity and local predictability) and the information of the word (modeled using
word frequency). It is not unreasonable to assume that different morphemes carry
their own information. Under this proposal, speakers assign information to stems,
past tense markers and plural markers.
I adopt here a simple approach for combining the different sources of information
that contribute to the amount of information a linguistic element holds. I assume
that the information a linguistic element holds is a weighted sum of all the sources
contributing to its information. The information estimate of a /t/-morpheme is there-
fore a weighted average of the information the phoneme holds and information the
morpheme holds.
In order to establish whether stem-final /t/ holds more information than the past
morpheme /t/ in American English, the information estimate of the past tense mor-
pheme has to be evaluated across the usage patterns of American English. Similarly,
for each stem, the information estimate of the stem should be calculated and divided
among its segments. For instance, the information estimate of the /t/ in ‘walked’
will be estimated along the lines of (3.41), and the stem-final /t/ in the word ‘just’
would be estimated along the lines of (3.42). In the following examples ws is a weight
assigned to segment information estimate, and wm is a weight assigned to morpheme
information estimate.
(3.41) $w_s \cdot \text{info-estimate}(/t/) + w_m \cdot \text{info-estimate}(\text{past tense morpheme})$
(3.42) $w_s \cdot \text{info-estimate}(/t/) + w_m \cdot \frac{\text{info-estimate}(\text{the word `just'})}{4}$
The question is therefore whether the information estimate of the /t/ in a particular
word is bigger or smaller than the information estimate of past tense /t/.
The study in §3.5.3 showed that as the amount of information contained in a
particular word increases, so do the odds of avoiding word-final deletion in that word,
in line with the expectation that a high word information estimate would decrease
the likelihood of deleting a segment in that word. However, it is still necessary to
establish how much information the past morpheme encodes and compare it to other
words in English. In order to compare the deletion rates of the English past tense
morpheme to the Spanish plural markings, it is necessary to establish how much
information the Spanish plural morpheme encodes.
3.8.2 The information of English verbal -ed morpheme
Introduction The goal of this study is to establish the amount of information
contained in the past tense morpheme in American English so that it can be compared
with the amount of information contained in individual words.
Methods and materials In order to estimate how much information the past
tense morpheme encodes, it is necessary to establish how predictable it was in the
context it appeared in. In order to do that, I used -ed suffixes that appeared in
words that WordNet (Miller, 1995) listed as exclusively verbs. The word walked is
listed exclusively as a verb and was therefore included, but fixed is also listed as an
adjective, and was excluded from the estimates. The estimate uses the two preceding
words and the stem (walk for walked) as context, and evaluates how unpredictable
the /-ed/ suffix was in the context it appeared in. The list of the three words and
their counts was taken from the Google Web 1T 5-gram Corpus (Brants and
Franz, 2006), based on the 3-gram files (three words and their counts across the web).
For the sample data in (3.43), the calculation was (3.44). The information estimate
of the past tense morpheme was estimated to be the weighted average of all the cases
in which it appeared (its informativity).
(3.43)
Data                  Context           Suffix  Count
outside and walk      outside and walk  ∅       3450
outside and walked    outside and walk  ed      1523
outside and walking   outside and walk  ing     522
(3.44) $-\log_2 \Pr(\text{-ed} \mid \text{outside and walk}) = -\log_2 \frac{1523}{522 + 1523 + 3450}$
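The calculation in (3.44), and the averaging step that turns per-context values into a single estimate for the morpheme, can be sketched as follows. The per-context computation follows (3.44) directly. The averaging function is an assumption about what "weighted average of all the cases in which it appeared" amounts to: each context is weighted by the number of times the suffix occurs in it. The function names and the dictionary layout are illustrative, and the only data used are the three counts in (3.43).

```python
from math import log2


def suffix_surprisal(suffix_counts, suffix):
    """-log2 Pr(suffix | context), from the suffix counts observed in one context."""
    total = sum(suffix_counts.values())
    return -log2(suffix_counts[suffix] / total)


def suffix_informativity(contexts, suffix):
    """Average surprisal of `suffix`, weighting each context by the suffix's count."""
    weighted_sum = 0.0
    occurrences = 0
    for suffix_counts in contexts.values():
        n = suffix_counts.get(suffix, 0)
        if n == 0:
            continue  # the suffix never appears in this context
        weighted_sum += n * suffix_surprisal(suffix_counts, suffix)
        occurrences += n
    return weighted_sum / occurrences


# Counts from (3.43); the empty string stands for the bare stem (∅).
contexts = {"outside and walk": {"": 3450, "ed": 1523, "ing": 522}}

bits = suffix_surprisal(contexts["outside and walk"], "ed")
print(f"{bits:.3f} bits")
```

For this single context the -ed suffix carries a little under two bits. The figure reported in the Results is the average over all contexts in the corpus, not over this sample.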
Results The information estimate of the -ed morpheme using a two word and stem
context was 9.952 bits. By comparison, the information estimate of /t/ is only 1.526
(calculated as in §3.5.3, table 3.2).18 Had the -ed suffix been a separate word, it
would have been the ≈130th most common word, well below most of the function
words, as well as a number of content words such as go, say and people.
Discussion It is highly unlikely that a distribution of information across a mor-
pheme would lead to a word-final stem /t/ having a higher combined information
estimate than a /t/ that appears as part of the -ed suffix. Even words that appeared
18 The information estimates for English segments were evaluated using a unigram model, which means that the model could not use context that may be available to speakers, yielding higher information estimates than in a trigram model like the one used in the current study.
only once in the corpora used in §3.5.3 had an information estimate of 21.427. Fol-
lowing the assumption that the information estimate of a morpheme is distributed
between the linguistic units that comprise it, a word-final stem /t/ in those rare words
would have to get at least 9.952/21.427 = 0.464 or more than 45% of the amount of
information that the entire stem contains in order to resist deletion better than the
-ed suffix.
Under the assumptions that the information of a linguistic element (a word in this
case) is a weighted average of the information of its parts, and that the information of
morphemes is distributed among the segments that comprise the morpheme, MULE
predicts that an English stem-final /t/ would delete more than a /t/ in the -ed suffix.
The question that remains is therefore whether MULE has different predictions for
the Spanish plural -s suffix. In the following section I will attempt to answer this
question by measuring how much information the Spanish plural -s morpheme holds.
3.8.3 The information of Spanish plural -s morpheme
Introduction Paralleling the previous study, this study aims to establish the amount
of information contained in Spanish plural /-s/ and compare it to the amount of in-
formation contained in individual words.
Methods and materials The method used in this study is similar to the one used
in the previous study, with the following differences. Parts of speech were assumed to
be the ones used in CALLHOME Spanish Lexicon (Garrett et al., 1996). Data was
collected separately for nouns and adjectives. Predictability in context was established
using the Google Web 1T 5-gram, 10 European Languages corpus (Brants and Franz,
2009).
Results The information estimate of the -s morpheme using a two word context
was 1.856 for the plural morpheme of adjectives, and 2.652 for the plural morpheme
of nouns. By comparison, the information estimate of /s/ in Spanish is 1.997.19 Had
either of the Spanish plural -s adjectival and nominal morphemes been words, they
would have been the most frequent words in Spanish.
19 30% higher than the information estimate of /t/ in English, but still the third lowest in Spanish.
Discussion The information estimate of the nominal plural morpheme is higher
than the information estimate of the adjectival plural morpheme. Everything else
being equal, the estimates predict the relative likelihood of preserving the plural -s
marker reported by Poplack (1980).
Additionally, the information estimates of all plural -s in Spanish (< 3 bits of
information) are significantly lower than the information estimate of the English -ed
verbal suffix (≈ 10 bits of information). Since the information estimates of the different
plural -s morphemes are significantly lower than that of every word in Spanish, there are
many more ways to distribute the information of the entire word across the segments that
comprise it and still have a higher information estimate than that of the plural -s markers.
The median information estimate in the CALLHOME Spanish Lexicon corpus is 20.137.
The ratio between the information estimate of nominal plural morpheme (2.652) and
the median information estimate for words in Spanish is 2.652/20.137 = 0.132. There-
fore, if the stem-final /s/ shares more than 14% of the information estimate of the
stem it is part of, it can resist weakening and deletion better than the nominal plural
morpheme -s.
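The two thresholds quoted in this section can be reproduced with a few lines of Python. This is only a worked restatement of the arithmetic in the text, assuming (as the text does) that a word's information is distributed among its segments and taking 9.952 bits as the English -ed estimate from the preceding study. The function name required_share is introduced here for illustration.

```python
def required_share(morpheme_bits: float, word_bits: float) -> float:
    """Share of a word's information that a stem-final segment must carry
    to match the information estimate of the competing suffix."""
    return morpheme_bits / word_bits

# Figures quoted in the text, in bits.
english = required_share(9.952, 21.427)  # -ed suffix vs. a word observed once
spanish = required_share(2.652, 20.137)  # nominal plural -s vs. the median word

print(f"English stem-final /t/ needs a share of {english:.3f}")
print(f"Spanish stem-final /s/ needs a share of {spanish:.3f}")
```

The much lower Spanish threshold is what makes a reversal of the English pattern plausible under these assumptions.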
Though the data does not entail that Spanish would have the reversal of the
English case, which would reflect lower likelihood of stem s-deletion than morpheme
s-deletion, it does suggest that such a reversal is likely. Moreover, the correct order
of deletion likelihoods between nouns and adjectives is predicted.
3.8.4 MULE’s predictions are measurable
Labov (1994) argues convincingly that functional explanations for variably applied
rules such as word-final /t/-deletion in American English should hold cross-linguistically.
He argues that if a linguistic asymmetry such as the asymmetry between the deletion
of morphemes vs. stems is reversed across languages, the asymmetry cannot be used
as an argument for variable application of a linguistic rule. It is important for any
account that is based on functional forces to withstand such criticism, and in this
section I showed that MULE successfully does so.
In this section I showed that (a) MULE does not make the same predictions for
cases of word-final /t/-deletion in American English and word-final /s/-deletion in
Spanish, (b) that MULE does suggest that the deletion of past tense /-t/ morpheme
in English erases more information than the deletion of plural /s/ in Spanish, and
(c) that MULE correctly predicts that plural markers on Spanish nouns would delete
less frequently than plural markers on Spanish adjectives. Importantly, I sketched
a principled method for testing the predictions made by MULE in other cases of
variably applied rules cross-linguistically.
3.9 Conclusion
In this chapter I presented a new theoretical approach, MULE. MULE integrates
two well-known ideas in linguistic theory. The first claims that speakers’ linguistic
behavior is influenced by the amount of information linguistic elements contain,
and the second that speakers attempt to reduce their articulatory effort while still
providing their auditors perceptual cues to keep linguistic elements distinct. In MULE
the information utility of linguistic elements motivates keeping linguistic elements
distinct from one another and justifies the expenditure of effort to achieve that goal.
MULE has several implications, but this chapter focused on one – predicting
the distribution of weakening processes in language, addressing Weinreich et al.’s
(1968) actuation problem with respect to weakening processes. MULE predicts that
when the information utility of linguistic elements is not high enough to justify the
expenditure of effort that is required by their perceptually distinct pronunciation, such
elements would be under a pressure to weaken. MULE does not predict at what time
a pressure to weaken would evolve to an actual weakening process, nor what output
will be chosen for the weakening process (e.g. spirantization or tapping), but it does
predict which weakening processes are licensed in a given language at a given time.
As information utility promotes the preservation of linguistic elements, it lends
itself to be used as an approximation of faithfulness in OT. Likewise, effort can be used
to approximate markedness. I used OT implementations of MULE to account for the
language-specific distribution of weakening processes in American English and various
dialects of Arabic. The prediction of language-specific weakening processes pulls the
actuation of weakening processes from the extra-linguistic “wastebasket” (pragmatics
is just one of those) and reopens such processes for investigation in linguistic theory.
I presented several data-oriented studies that support MULE. I used a regression
study that showed that word-final obstruent deletion in American English is sensitive
to the informativity of the final obstruent, more so than to phonological factors such
as place of articulation, and information-theoretic factors such as predictability. I also
showed experimentally that a functional load approach to the same problem yields
wrong predictions. Similarly, I demonstrated that word-final deletion in American
English necessitates the use of informativity rather than predictability, as unpre-
dictable /k/ and /p/ deleted less word finally than unpredictable /t/. Together those
studies show that balancing effort and information estimates is more predictive than
other accounts.
While this chapter tackles only weakening processes, MULE is by no means limited
to predicting weakening. MULE’s key insights, that information is always useful, and
that balancing information with effort is crucial for the correct prediction of linguistic
phenomena, can be used to account for other linguistic phenomena in phonology,
morphology, syntax and psycholinguistics.
In phonology, MULE has several predictions that have not been discussed in this
chapter. First, MULE predicts that the linguistic elements in prominent positions
(prevocalic positions, stressed syllables) will provide more information than those that
are located in less prominent positions (codas, unstressed syllables). Like language-
specific weakening, such phenomena would be difficult to explain in other frameworks.
Second, phonological processes that result in added effort for perceptually motivated
reasons such as epenthesis and fortition are also expected to follow from an attempt to
preserve information (and add as little superfluous information as possible). Finally,
MULE provides a framework that can explain how a language that has a number of
coda-deletion processes may eventually lose all codas (Blevins, 2004), as any language
that severely restricts the inventory of segments that can appear in coda positions
decreases the information content in codas, which may eventually lead to the elision
of all codas.
All of MULE’s predictions apply to morphology and syntax in much the same way
as they apply to phonology. Affixation and cliticization are expected to follow from
low information content, and information contentful affixes and clitics are expected
to require better perceptual cues and allow a greater expenditure of effort than those
that provide less information, extending Haspelmath (2008). However, MULE has
an additional prediction that can be tested empirically. Information preservation in
MULE focuses on the amount of information speakers estimate linguistic elements
contain, based on exposure to such elements. Therefore MULE assigns information
only to observed linguistic elements. As such, unobserved linguistic elements such
as zero affixes and other inaudible elements would not be assigned information util-
ity. Therefore, MULE predicts an asymmetry between observed (marked) linguistic
elements and unobserved linguistic elements. A word with no plural agreement in En-
glish will be regarded as lacking plural marking, not as being zero-marked as singular.
Evidence for the predicted asymmetry may come from psycholinguistic experiments
or from environments in which an element cannot be marked for some property (mass
nouns and prototype readings for noun plurality, missing subject for verb agreement
etc.).
For psycholinguistics, MULE makes several predictions that differ from those made
by state-of-the-art information-theoretic accounts. First, MULE focuses on the
speaker, not on speaker-listener interaction. It therefore expects speakers to follow
its prediction (putting more effort into elements with high information utility) even
in the absence of a listener. Second, MULE assumes that speakers manipulate their
articulatory effort in response to the information utility of linguistic elements, while
several other information-theoretic accounts (Aylett and Turk, 2004; Levy and Jaeger,
2007; Jaeger, 2010) assume that speakers manipulate articulation time. The difference
is expected to emerge in linguistic elements that require longer articulation time to
begin with, and in slowed-down language production such as typing. Cohen Priva
(2010) showed that frequency effects do appear in typing, and it would be interesting
to see whether predictability and informativity effects emerge in typing as well.
Chapter 4
Lexicon, usage and information
4.1 Introduction
Chapter 3 showed that the balance between effort and information utility can lead to
the actuation of weakening processes. Segments whose information utility is not high
enough to justify the expenditure of effort that is required by their perceptually dis-
tinct pronunciation will be under a pressure to weaken. Chapter 3 focused on weaken-
ing processes and treated articulatory effort and perceptual distinctness as one factor.
But the three-way interaction between information utility, perceptual distinctness and
articulatory effort can take other forms. Perceptually prominent positions make ideal
locations for the placement of segments whose information utility is high. By placing
highly informative segments in perceptually prominent positions, a language can make
it easier to transmit the information carried across a communication channel. Any
positive correlation between perceptual prominence and information utility would in-
dicate that information utility affects not only performance-related phenomena such
as segment duration and deletion (chapter 2), or change and competence-related phe-
nomena such as the actuation of weakening processes (chapter 3), but also the lexicon
and usage patterns of language.
This chapter focuses on stressed syllables, a perceptually prominent position
(Beckman, 1998; Steriade, 1997; Smith, 2002; Giavazzi, 2010, among others). Stressed
syllables (stress domains in Giavazzi 2010) benefit from several phonetic differentia-
tions from unstressed syllables. They are louder, longer and their vowels have greater
sonority. In Beckman (1998) and Smith (2002) phonetic prominence is taken to be an
inherent property of the stressed syllable. One consequence of phonetic prominence is
that stressed syllables potentially exhibit more contrasts and resist phonetic neutral-
ization better than less prominent positions (Beckman, 1998). Phonetic prominence
also leads to phonological prominence effects, such as high nucleus sonority and low
edge sonority (Smith, 2002). For Giavazzi (2010) the goal of the phonetic differ-
entiation from unstressed syllables is to make stressed syllables more phonetically
prominent than unstressed syllables. Speakers put in this additional effort in order
to preserve the metrical structure of the language (Hayes, 1995).
For all current phonological accounts the phonetic prominence of stressed sylla-
bles is not conditioned on the syllable’s content nor on the information utility of its
components. However, since stressed syllable vowels allow more contrasts and block
neutralization processes, it is not surprising that stressed syllables tend to be longer
and provide more information about a word’s identity (Piantadosi et al., 2009). The
question posed in this chapter is a different one: While stressed syllable nuclei tend
to have more contrasts than unstressed syllables (opposite cases exist), it is rarely the
case that stressed syllable onsets allow more contrasts. Beckman (1998) mentions a
single case of neutralization in the onsets of stressed syllables. Giavazzi (2010, §3.4.2)
argues that such processes are more common, but the two cases she provides (from
Italian and Finnish) block neutralization following a stressed syllable. Therefore,
there are few phonologically grounded reasons to believe that the onsets of stressed
syllables hold more information than onsets of unstressed syllables. However, if the
onsets of stressed syllables are more perceptually prominent for metrical reasons, lan-
guage could make use of that fact and preferentially place highly informative segments
in those perceptually prominent positions.1 Adams et al. (2009) show that this pre-
diction holds for German – highly informative segments are more likely to be found
in the onsets of stressed syllables. The goal of this chapter is to ascertain whether
the prediction holds, namely that highly informative segments are preferentially placed
in the onsets of stressed syllables rather than in the onsets of unstressed syllables.
1 Chapter 2 shows that onsets of stressed syllables in American English tend to have longer duration and are less likely to delete.
I test the relationship between stressed syllables’ phonetic prominence and infor-
mation content in three languages, after controlling for possible phonological expla-
nations. I show that in American English, Egyptian Arabic and Spanish, information
utility is positively correlated with the likelihood to appear in stressed syllables. This
correlation shows that languages do indeed make use of stressed syllables’ prominence
to improve the transmission of information, and that information utility does have an
effect on the lexicon and the usage patterns of language. As Giavazzi (2010) predicts,
phonological factors such as place of articulation are not consistently correlated with
stressed syllables – even if they do increase the likelihood to appear in a stressed
onset in some languages, that tendency disappears or is even reversed in other lan-
guages. In fact, with the exception of stop / fricative contrast, no phonological factor
is consistently correlated with positional prominence.
The finding that highly informative segments are preferentially placed in the onsets
of stressed syllables has consequences outside of phonology: it shows that
information-theoretic factors shape the lexicon and usage patterns of language. As in
similar studies (Piantadosi et al., 2009, 2011), the adaptation of the lexicon of languages
to improve the transmission of information cannot be regarded as a by-product
of linguistic performance. I discuss this consequence at length in §4.4.
4.2 Methodology and sources of data
4.2.1 The choice of test cases
In the three studies described below I investigate whether American English, Egyptian
Arabic and Spanish preferentially place highly informative segments in the onsets of
stressed syllables as German does. The reason that any one language does not suffice
is that any one language may prefer to place segments of some kind in stressed
syllables. It is only if several languages behave the same way that the argument can
be convincing. If a property promotes prominence in one language but not in others,
it should be regarded as language-specific.
The choice of stressed syllable onsets rather than other perceptually prominent
environments (Beckman, 1998) attempts to sidestep the possible reversal of the causal
relationship between prominence and information. Some prominent positions tend
to have high information content due to the organization of language. Languages
tend to have more roots than affixes, and therefore everything else being equal roots
(or at least root-initial positions) are less frequent, less predictable and hence more
informative than affixes. Similarly, stressed syllables tend to allow more vowels than
unstressed syllables, and on similar grounds will tend to have more information than
unstressed syllables. For the onsets of stressed syllables no reversed-causality exists.
Additionally, in order to have relatively uniform conditions for the corpus studies I
focus on intervocalic segments, which may be followed by a stressed vowel (prominent)
or unstressed vowel (less prominent).
Languages in which a stressed / unstressed contrast can be tested must fulfill several
conditions. First, they must have a stress system. In addition, there must exist pho-
netically annotated spoken corpora that include stress-related data and an estimate of
word frequencies. Finally, the languages chosen should have relatively different phono-
tactics and phonological constraints from one another. The languages I chose were
Spanish, for which phonetically annotated corpora were available, the CALLHOME
Spanish Lexicon (Garrett et al., 1996), and Egyptian Arabic, for which a similar
corpus exists, the LDC Egyptian Colloquial Arabic Lexicon (Kilany et al., 1997). I
generated similar data for English, using the CMU dictionary (Weide, 1998) for pho-
netic annotation and the Switchboard (Godfrey and Holliman, 1997) and Buckeye
(Pitt et al., 2007) corpora for word counts.
4.2.2 Statistical models
For all three models presented in this chapter, the following choices were made. All
models used logistic regressions in which the predicted variable was whether the
segment appeared in a prominent environment (the ratio between prominent and
non-prominent environments). Every occurrence of every segment in every word was
considered, provided that it appeared in intervocalic contexts. Since the model tests
for usage preferences, each data point was weighted by its frequency in the language.
Thus, a segment in a very frequent word was taken to be more important than a
segment in an infrequent word.
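The role of frequency weighting can be illustrated with a minimal sketch of a weighted logistic regression. The models reported in this chapter were fitted in R; the code below is not that implementation. It assumes a plain gradient-ascent fit, a single hypothetical binary predictor ("is a stop"), and invented toy counts, and is meant only to show how a weight multiplies the contribution of each data point.

```python
import math

def fit_weighted_logistic(rows, n_iter=5000, lr=1.0):
    """Fit a logistic regression by gradient ascent on the weighted log-likelihood.

    rows: list of (features, label, weight); label is 1 for a prominent
    (stressed-onset) environment and 0 otherwise; weight is the word frequency.
    Returns [intercept, coefficient_1, ...].
    """
    n_feat = len(rows[0][0])
    beta = [0.0] * (n_feat + 1)
    total_weight = sum(w for _, _, w in rows)
    for _ in range(n_iter):
        grad = [0.0] * (n_feat + 1)
        for x, y, w in rows:
            z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = w * (y - p)  # a data point with weight w counts w times
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        beta = [b + lr * g / total_weight for b, g in zip(beta, grad)]
    return beta

# Toy data (hypothetical counts): ([is_stop], prominent, word frequency).
toy_rows = [
    ([1.0], 1, 40),
    ([1.0], 0, 10),
    ([0.0], 1, 5),
    ([0.0], 0, 45),
]
intercept, stop_coef = fit_weighted_logistic(toy_rows)
print(f"intercept={intercept:.3f}, stop={stop_coef:.3f}")
```

Weighting a row by its frequency is equivalent to repeating that row once per observed token, which is why a segment in a frequent word influences the fit more than a segment in a rare one.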
First, a purely phonological model was evaluated to establish a baseline. Giavazzi
(2010) convincingly shows that the range of processes that apply to the edges of a
stressed domain (the consonants that precede and follow the stressed nucleus) is very
limited, and can apply solely to features that are affected by the increase in sub-glottal
pressure in stressed syllables (and therefore affect not only the onsets of stressed
syllables but also the onsets of the following syllables). Giavazzi lists the features
spread glottis, constricted glottis, voicing, continuant, strident and delayed release
as sensitive to stress. Additionally, phonetic features such as consonant duration
and VOT are typically affected by stress. Stress is not supposed to affect features
such as place of articulation or nasalization of consonants. Not all of these effects
affect the onset of stressed syllables. Assibilation of Finnish /t/ in verbal affixes is
prevented in the onsets of unstressed syllables following the stressed syllable (Anttila,
2006). I did not exclude from the model phonological features which are not predicted
to be affected by stress. Previous accounts have described prominent positions as
licensing position-specific markedness constraints, and could therefore predict that
other marked values such as non-coronal place of articulation might also be more
frequent in stressed syllables.2 The pure phonological model might therefore be able
to distinguish between predictions made by an extension of the Beckman (1998)
account and Giavazzi (2010) account.
The baseline phonology models were selected using the step() function (Hastie
and Pregibon, 1992; Venables and Ripley, 2002) in R (R Development Core Team,
2012) using forward / backward search. The search used the variables in table 4.1
and the distance of the segment from the beginning of the word and from the end of
the word, measured in segments, and logged.3 Distance from word-edge is supposed
to provide a rough approximation for a language’s preference to place stress word-
finally or word-initially, and reduce the correlation between the amount of information
a segment holds and its distance from the word’s edges. The emphasized factors are
the ones that are predicted by Giavazzi (2010) to be affected by stress.
The logistic regression model is of the same family of models as MaxEnt OT
(Goldwater and Johnson, 2003). It allows its constraints to “gang up”: a strong
constraint can be overcome by several weaker constraints. Therefore, it uses a weak
2 Beckman (1998) listed stressed syllables as a prominent position, but did not consider the onsets of stressed syllables as a prominent environment in their own right. Subsequent accounts (Smith, 2002; Giavazzi, 2010) discussed onset-specific effects.
3 Notice that the variables in table 4.1 are not pure phonological variables in order to reduce collinearity. Only oral stops are ‘stops’, only voiced obstruents are ‘voiced’.
Table 4.1: Segment phonological properties*
Variable     Value    Segments                         Comment
Place        glottal  /ʔ,h/
             coronal  /t,d,ð,θ,s,z,t͡ʃ,d͡ʒ,l,r,ɾ,ɹ/
             labial   /p,b,f,v,w/
             dorsal   /k,g,x,ɲ,j,ʁ/
             radical  /q,ħ,ʕ/                          Standard Arabic /q/ is listed as radical following Kilany et al. (1997)
Strident     binary   /s,z,ʃ,t͡ʃ,d͡ʒ,sˠ/
Palatal      binary   /j,ʃ,t͡ʃ,d͡ʒ,ɲ/
Interdental  binary   /θ,ð,ðˠ/
Voicing      binary   /b,d,g,v,z,d͡ʒ,dˠ,ðˠ/
Glide        binary   /w,j/
Liquid       binary   /r,ɾ,ɹ,l/
Stop         binary   /p,t,k,b,d,dˠ,g,q,ʔ/             includes Spanish spirantized stops
Affricate    binary   /t͡ʃ,d͡ʒ/
Emphatic     binary   /sˠ,dˠ,tˠ,ðˠ/                    in Arabic
* Emphasized factors are the ones Giavazzi (2010) expects to be affected by stress domains
form of conjunction of constraints: violating two lower-ranked constraints may be
worse than violating a higher-ranked constraint, as opposed to standard OT (Prince
and Smolensky, 1993). However, no specific interaction terms were introduced into
the model, and the model cannot fit the conjunction of constraints independently.
Preference for voiceless stridents will surface as a product of preferences for stridents
and voiceless obstruents, and not as an independent feature.
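The "gang up" behavior described above can be made concrete with a small sketch. The weights below are invented for illustration and are not estimates from any model in this chapter; the feature names strong, weak_a and weak_b are hypothetical. The point is only that log-odds add, so two weaker preferences can jointly outweigh one stronger dispreference even though no interaction term is present.

```python
import math

def prob_prominent(weights, features):
    """Logistic model without interaction terms: the log-odds are a plain sum."""
    z = sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: one strong dispreference, two weaker preferences.
weights = {"strong": -2.0, "weak_a": 1.2, "weak_b": 1.2}

strong_only = prob_prominent(weights, {"strong": 1, "weak_a": 0, "weak_b": 0})
one_weak = prob_prominent(weights, {"strong": 1, "weak_a": 1, "weak_b": 0})
ganged_up = prob_prominent(weights, {"strong": 1, "weak_a": 1, "weak_b": 1})

print(strong_only, one_weak, ganged_up)
```

With these values a single weak preference leaves the outcome below even odds, while the two together push it above, which is the additive effect that strict ranking in standard OT does not allow.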
After a phonological model was established, I added two information theoretic
variables. First, I added the negative log predictability of the segment given all the
preceding segments from the beginning of the word, following van Son and van Santen
(2005) (see chapter 2). This measurement assesses how much information the listeners
gain by understanding what the segment is, given that they know what the preceding
segments in the same word are. Segments that are less predictable in the context they
appear in provide more information than segments that are predictable in context. In
the cherry-picked examples in (4.1), redundant (completely predictable from context;
they provide 0 bits of information) /p,t,k/ appear in the onset of stressed syllables. In
contrast, /p,t,k/ in (4.2) are very informative (unpredictable from context, Pr < 2⁻⁷;
they provide > 7 bits of information). The words in (4.1) are often reduced so that
they do not include the redundant stressed onsets (euro, rep, app) while such processes
do not affect the words in (4.2).
(4.1) Redundant /p,t,k/ (= 0 bits):
European (/p/), reputation (/t/), application (/k/)
(4.2) Informative /p,t,k/ (> 7 bits):
capacity (/p/), eternal (/t/), undercover (/k/)
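The measure illustrated in (4.1) and (4.2) can be sketched in a few lines. The code assumes a frequency-weighted lexicon and estimates the probability of a segment given the preceding segments of the same word by counting shared prefixes. The toy lexicon, the use of letters as stand-ins for segments, and the absence of smoothing are simplifying assumptions made here, not the estimation procedure of chapter 2.

```python
import math
from collections import defaultdict

def prefix_surprisal(lexicon):
    """lexicon: dict mapping a word (sequence of segments) to its frequency.

    Returns {(word, index): -log2 P(segment | preceding segments in the word)}.
    """
    prefix_count = defaultdict(float)
    for word, freq in lexicon.items():
        for i in range(len(word) + 1):
            prefix_count[word[:i]] += freq
    bits = {}
    for word in lexicon:
        for i in range(len(word)):
            # log2(count(prefix) / count(prefix + segment)) equals -log2 of the
            # conditional probability of the segment.
            ratio = prefix_count[word[:i]] / prefix_count[word[:i + 1]]
            bits[(word, i)] = math.log2(ratio)
    return bits

# Toy lexicon (hypothetical frequencies); each letter stands in for a segment.
toy_lexicon = {"tap": 4, "tab": 4, "tin": 8}
surprisal = prefix_surprisal(toy_lexicon)
for key in sorted(surprisal):
    print(key, surprisal[key])
```

A segment that every word sharing the prefix must contain receives 0 bits, as with the redundant stops in (4.1), while a segment that splits the remaining candidates evenly receives 1 bit.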
Additionally, I added the average (expected) value of the segment’s predictability
across the entire language. This is the segment’s informativity (see chapter 2). The
two variables are collinear, and I therefore residualized the contextual predictability
using informativity (the average value of contextual predictability). Adams et al.
(2009) found that informativity significantly predicts the distribution of onsets in
German, but did not control for predictability. If informativity affects the distribution
of consonants in the onset of stressed syllables, it would mean that segments that are
less predictable across the board are more likely to be found in the onsets of stressed
syllables, regardless of the amount of information they provide in that particular
context.
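The relation between the two variables can be sketched as follows. Informativity is computed as the weighted mean of a segment's contextual surprisal, and the contextual measure is then residualized on it. The text does not spell out the residualization procedure, so the weighted least-squares regression used here is an assumption, as are the toy values and the function names.

```python
def informativity(tokens):
    """tokens: list of (segment, surprisal_bits, weight).

    Returns the weighted mean surprisal of each segment type.
    """
    num, den = {}, {}
    for seg, bits, w in tokens:
        num[seg] = num.get(seg, 0.0) + w * bits
        den[seg] = den.get(seg, 0.0) + w
    return {seg: num[seg] / den[seg] for seg in num}

def residualize(y, x, w):
    """Residuals of a weighted least-squares regression of y on x (with intercept)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]

# Toy tokens (hypothetical): (segment, surprisal in bits, word frequency).
tokens = [("t", 0.0, 3), ("t", 4.0, 1), ("k", 2.0, 1), ("k", 6.0, 1)]
info = informativity(tokens)

surprisals = [bits for _, bits, _ in tokens]
weights = [w for _, _, w in tokens]
info_per_token = [info[seg] for seg, _, _ in tokens]
residual = residualize(surprisals, info_per_token, weights)
print(info, residual)
```

After residualization, the contextual variable only records how much more or less informative a token is than its segment type is on average, so the two predictors no longer compete for the same variance.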
In contrast to change processes which seem to affect every segment of a certain
type (all word-final /t/s), splitting the informativity and local predictability in the
case of positional prominence has no pre-theoretic justification. If language is sensitive
to segment identity when placing highly informative segments in prominent positions
(it matters that the /p/ in the word apart is a /p/), segment informativity is a
better approximation of the amount of information a segment carries. However, if
language is not sensitive to the identity of the segment in question (the /p/ in apart
just happens to be a /p/, only the amount of information it carries matters), then
segment predictability is a better approximation for the amount of information a
segment carries. This question can be answered experimentally, and
across a number of languages.
4.3 Studies
4.3.1 American English
Method and materials The study follows the outline listed in §4.2.2, with the
following adjustments. Word counts were collected from the Buckeye (Pitt et al.,
2007) and Switchboard (Godfrey and Holliman, 1997) corpora. The CMU dictionary
(Weide, 1998) was used to provide each word with its phonology. Only intervocalic
coronals, labials and dorsals were used for the study. The preceding vowel could
have any stress, but the following vowel was limited to primary stress and unstressed
vowels, excluding secondary stressed vowels since they do not have an equivalent in
Spanish and Arabic. This yielded 17,021 segments in the relevant contexts. Each
segment was weighted by the number of times it was observed in that context (the
number of times that word was used). Table 4.2 provides a sample of the data that
was used in the study.
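The selection of data points described above can be sketched as follows. The code assumes ARPAbet-style transcriptions in which vowels carry a stress digit (1 for primary stress, 2 for secondary, 0 for unstressed), and it measures distances in segments, counting from 1 at each word edge, a convention inferred here from the sample in table 4.2. The transcription given for eternal is an illustrative assumption, and the restriction to coronals, labials and dorsals is omitted.

```python
def is_vowel(phone):
    # In ARPAbet-style transcriptions only vowels end in a stress digit.
    return phone[-1].isdigit()

def intervocalic_segments(phones):
    """Return (consonant, stressed, start_dist, end_dist) for each consonant
    flanked by vowels, skipping consonants followed by secondary stress."""
    n = len(phones)
    rows = []
    for i in range(1, n - 1):
        prev, cur, nxt = phones[i - 1], phones[i], phones[i + 1]
        if is_vowel(cur) or not (is_vowel(prev) and is_vowel(nxt)):
            continue
        if nxt.endswith("2"):
            continue  # secondary stress is excluded from the study
        rows.append((cur, nxt.endswith("1"), i + 1, n - i))
    return rows

# Illustrative transcription (assumed, not quoted from the dictionary).
eternal = ["IH0", "T", "ER1", "N", "AH0", "L"]
rows = intervocalic_segments(eternal)
for row in rows:
    print(row)
```

For this word the sketch yields a stressed-onset /t/ and an unstressed-onset /n/ with the same distances as the first two columns of table 4.2. The regression itself uses the logarithm of these distances.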
Results In the pure phonology model, distance from the start of the word significantly
lowered the odds of the segment being in a stressed syllable onset (p < 10⁻¹⁵).
Dorsals and labials were significantly more likely to favor stressed onsets than coronals
(p < 10⁻¹⁵), as Dmitrieva and Anttila (2008) showed. Stridents, nasals, stops,
glides and liquids were more likely to be found in stressed syllables (p < 10⁻¹⁵), as
were affricates (p < 0.01). Dentals, palatals and voiced obstruents were less likely to
appear in stressed syllables (p < 10⁻¹⁰, p < 10⁻¹⁵ and p < 10⁻¹⁵ respectively). The
full model can be found in table 4.3.
After refitting the model to allow information theoretic measurements to account
for the variance, both high informativity and high negative log contextual predictability
increased the chance of the segment appearing in stressed syllables (p < 10⁻¹⁵ and
p < 0.0001 respectively). But their inclusion changed the directionality of place of
articulation. Labials were now less likely to be found in stressed syllables than coronals
and dorsals (p < 0.05). Some other variables were no longer significant: affricates and
Table 4.2: Sample American English stress data
word (a)                 eternal   eternal   undercover  reputation  reputation  undercover
segment (a)              /t/       /n/       /k/         /t/         /ʃ/         /v/
weight (frequency) (b)   1         1         5           43          43          5
stressed (c)             true      false     true        true        false       false
place                    coronal   coronal   dorsal      coronal     coronal     labial
stop                     true      false     true        true        false       false
liquid                   false     false     false       false       false       false
nasal                    false     true      false       false       false       false
glide                    false     false     false       false       false       false
voiced                   false     false     false       false       false       true
affricate                false     false     false       false       false       false
strident                 false     false     false       false       true        false
palatal                  false     false     false       false       true        false
dental                   false     false     false       false       false       false
start dist.              2         4         5           6           8           7
end dist.                5         3         4           5           3           2
informativity            1.4638    1.5211    2.4174      1.4638      4.0980      1.6006
predictability (resid.)  9.8404    -2.0960   5.1352      -2.0753     -3.0294     -2.1248
(a) Word and segment were not used in the regression.
(b) Word frequency was used to weigh each data point.
(c) Primary stress or no stress is the predicted value.
voiced obstruents did not have a significant trend in any direction. The full model
can be found in table 4.4.
Discussion As Adams et al. (2009) observed for German, a high amount of information
increases the likelihood of a segment appearing in a stressed syllable (a prominent
position). Several phonological factors did not improve the model following the inclu-
sion of information theoretic variables. This suggests that information, rather than
any of these variables, is indeed a driving force in deciding which segment will appear
in which context.
The significance of the distance from the beginning of the word follows from
American English stress often falling on the first syllable. Of the factors predicted by
Giavazzi (2010), stridents and stops were indeed more likely to be found in stressed
Table 4.3: American English pure phonology model
                                Estimate   Std. error  z value    Pr(>|z|)
intercept                       -0.07747   0.02373     -3.265     0.00110   **
distance from word start (log)  -2.47209   0.01469     -168.227   <2e-16    ***
poa is dorsal                   1.94869    0.01456     133.846    <2e-16    ***
poa is labial                   1.00560    0.01226     82.054     <2e-16    ***
dental                          -0.25121   0.03709     -6.772     1.27e-11  ***
palatal                         -1.78351   0.04603     -38.750    <2e-16    ***
strident                        2.04828    0.02184     93.807     <2e-16    ***
stop                            1.18192    0.01573     75.120     <2e-16    ***
nasal                           1.12059    0.01900     58.993     <2e-16    ***
glide                           1.57705    0.03121     50.523     <2e-16    ***
liquid                          0.93158    0.02194     42.452     <2e-16    ***
voiced                          -0.15663   0.01096     -14.290    <2e-16    ***
affricate                       0.17152    0.05791     2.962      0.00306   **
syllables. That marked places of articulation (labial and dorsal) were not more or less
likely to be found in stressed syllables than the less marked coronals also supports
Giavazzi’s account, as she predicts that place of articulation will not be affected by
stressed positions. That nasals, liquids and glides also prefer stressed syllables has
no theoretic motivation.
4.3.2 Spanish
Method and materials The study follows the outline listed in §4.2.2, with the
following adjustments. Word counts and dictionary representations were collected
from the CALLHOME Spanish Lexicon (Garrett et al., 1996). Only conversational
spoken Spanish (excluding news broadcasts) was used for calculating word counts and
predictability. The dictionary provided 45,147 intervocalic segments. Each segment
was weighted by the number of times it was observed in that context (the number of
times that word was used). Table 4.5 provides an example of the data used in the
study.
Table 4.4: American English information model
                                Estimate    Std. error  z value    Pr(>|z|)
intercept                       -0.533199   0.025331    -21.050    <2e-16    ***
distance from word start (log)  -2.403947   0.015785    -152.291   <2e-16    ***
poa is dorsal                   1.427377    0.016559    86.198     <2e-16    ***
poa is labial                   -0.045890   0.019453    -2.359     0.0183    *
dental                          -1.738853   0.041945    -41.456    <2e-16    ***
palatal                         -2.859653   0.035122    -81.420    <2e-16    ***
strident                        1.287754    0.023963    53.740     <2e-16    ***
stop                            0.604764    0.017107    35.352     <2e-16    ***
nasal                           0.817213    0.019038    42.924     <2e-16    ***
glide                           1.011187    0.031595    32.005     <2e-16    ***
liquid                          0.153127    0.023750    6.448      1.14e-10  ***
voiced                          -0.017785   0.011437    -1.555     0.1199
informativity                   0.543886    0.007705    70.588     <2e-16    ***
predictability                  0.009879    0.002286    4.321      1.55e-05  ***
Results In the pure phonology model, distance from the end of the word significantly
increased the odds that a segment appears in a stressed syllable (p < 10⁻¹⁵). Distance
from the beginning of the word decreased those odds, even though stress in Spanish
tends to fall on one of the final two syllables and should therefore be more likely
the further the segment is from the beginning of the word. Dorsals and labials were
significantly more likely to appear in stressed syllables than coronals (p < 10⁻¹⁵).
Stridents, liquids, voiced obstruents, nasals, stops, glides and palatals were more
likely to appear in stressed syllables (all at p < 10⁻¹⁵). Only affricates were less
likely to appear in stressed syllables (p < 10⁻¹⁵).
The full model can be found in table 4.6.
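The direction and size of these effects can be read off the estimates directly. The sketch below assumes that the estimates in table 4.6 are logistic regression coefficients on the log-odds scale, which the z values and the wording in terms of odds suggest; the dictionary and function names are illustrative and not part of the original analysis.

```python
import math

# A subset of the estimates copied from table 4.6 (Spanish pure phonology model).
# Assumption: these are log-odds coefficients from a logistic regression.
spanish_pure_phonology = {
    "distance from word end (log)": 1.339307,
    "poa is dorsal": 0.577772,
    "poa is labial": 0.250386,
    "affricate": -1.211360,
    "distance from word start (log)": -0.098767,
}


def odds_ratio(log_odds: float) -> float:
    """Convert a log-odds coefficient into a multiplicative change in odds."""
    return math.exp(log_odds)


for name, estimate in spanish_pure_phonology.items():
    direction = "raises" if estimate > 0 else "lowers"
    print(f"{name}: {direction} the odds of stress by a factor of "
          f"{odds_ratio(estimate):.2f}")
```

A factor above 1 corresponds to a positive estimate (more likely in stressed syllables), and a factor below 1 to a negative one.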
After refitting the model to allow information theoretic measurements to account
for the variance, both high informativity and high negative log contextual
predictability increased the chance that a segment appears in a stressed syllable
(at p < 10⁻⁷ and p < 10⁻¹⁵ respectively). Their inclusion removed the significance of distance from
the beginning of the word, possibly because of its collinearity with predictability. The
Table 4.5: Sample Spanish stress data
word                     amarillo   amarillo   amarillo   comedia    palabras
gloss                    ‘yellow’   ‘yellow’   ‘yellow’   ‘comedy’   ‘words’
phone                    /m/        /r/        /j/        /m/        /l/
weight                   24         24         24         5          156
prominent                false      true       false      true       true
place                    labial     coronal    dorsal     labial     coronal
stop                     false      false      false      false      false
liquid                   false      true       false      false      true
nasal                    true       false      false      true       false
glide                    false      false      true       false      false
voiced                   false      false      false      false      false
affricate                false      false      false      false      false
strident                 false      false      false      false      false
palatal                  false      false      true       false      false
start dist.              2          4          6          3          3
end dist.                6          4          2          5          6
informativity            2.9494     1.3772     2.7809     2.9494     2.8573
predictability (resid.)  2.7935     0.5244     -2.3598    -0.8619    2.4028
full model can be found in table 4.7.
Discussion As is the case in German and English, a high amount of information
increases the likelihood that a segment appears in a stressed syllable. No other
variable lost its significance, and the unexpected negative effect of distance from
the beginning of the word disappeared once the information theoretic variables were
included. It is interesting to note
that even in the pure phonological model, obstruent voicing had the opposite influence
from its equivalent in the pure phonological model for English stressed syllables.
Voiced obstruents were less likely to appear in the onsets of stressed American English
syllables, but more likely to appear in the onsets of stressed Spanish syllables. The
inclusion of obstruent voicing in either model is likely arbitrary, and signifies language-specific
trends, rather than linguistic, cognitive, articulatory or perceptual biases.
Table 4.6: Spanish pure phonology model
                                 Estimate    Std. error   z value   Pr(>|z|)
intercept                        -3.533203   0.034280     -103.07   <2e-16   ***
distance from word end (log)      1.339307   0.007038      190.30   <2e-16   ***
poa is dorsal                     0.577772   0.013278       43.51   <2e-16   ***
poa is labial                     0.250386   0.009618       26.03   <2e-16   ***
strident                          1.422171   0.031724       44.83   <2e-16   ***
affricate                        -1.211360   0.039524      -30.65   <2e-16   ***
liquid                            1.177949   0.031318       37.61   <2e-16   ***
distance from word start (log)   -0.098767   0.007710      -12.81   <2e-16   ***
voiced                            0.142255   0.011786       12.07   <2e-16   ***
nasal                             0.983487   0.030220       32.54   <2e-16   ***
stop                              0.898821   0.031334       28.68   <2e-16   ***
glide                             0.653140   0.041628       15.69   <2e-16   ***
palatal                           0.360191   0.025861       13.93   <2e-16   ***
4.3.3 Egyptian Arabic
Method and materials The study follows the outline listed in §4.2.2, with the fol-
lowing adjustments. Word counts and dictionary representations were collected from
the LDC Egyptian Colloquial Arabic Lexicon (Kilany et al., 1997). Geminates were
considered to be a single segment followed by a gemination symbol. Pharyngeals,
glottals and segments which are not part of the native Egyptian Arabic inventory
(/v, q, d͡ʒ/) were not included. Only conversational spoken Arabic was used for
calculating word counts and predictability. This procedure yielded 12,485 intervocalic
segments. Each segment was weighted by the number of times it was observed in that
context (the number of times that word was used).
Results In the pure phonology model, distance from word end position significantly
increased the likelihood of the segment to appear in stressed syllables, as stress in
Arabic tends to appear in the final or penultimate syllables (in some cases on the
antepenultimate syllable), and the model has no other way to express a dispreference
Table 4.7: Spanish information model
                                 Estimate    Std. error   z value   Pr(>|z|)
intercept                        -3.747105   0.042432     -88.309   <2e-16    ***
distance from word end (log)      1.306119   0.007197     181.476   <2e-16    ***
poa is dorsal                     0.579003   0.014450      40.069   <2e-16    ***
poa is labial                     0.218332   0.012108      18.032   <2e-16    ***
strident                          1.468424   0.035287      41.613   <2e-16    ***
affricate                        -1.193195   0.041861     -28.504   <2e-16    ***
liquid                            1.211737   0.033614      36.049   <2e-16    ***
distance from word start (log)    0.013644   0.009149       1.491   0.136
voiced                            0.163919   0.012085      13.564   <2e-16    ***
nasal                             1.047424   0.033670      31.109   <2e-16    ***
stop                              0.899327   0.033996      26.454   <2e-16    ***
glide                             0.706310   0.044466      15.884   <2e-16    ***
palatal                           0.281682   0.026142      10.775   <2e-16    ***
predictability                    0.049276   0.002172      22.691   <2e-16    ***
informativity                     0.035828   0.006269       5.715   1.1e-08   ***
for stressing the onsets of word-final light syllables. As in Spanish, distance
from the beginning of the word significantly decreased the likelihood that a segment
appears in a stressed syllable. Dorsals were less likely than coronals to appear in
stressed syllables, and labials more likely than coronals (both at
p < 10⁻¹⁵). Nasals, glides and voiced obstruents were less likely to appear in stressed
syllables (all at p < 10⁻¹⁵). Palatals, stops and emphatics were more likely to appear
in stressed syllables (at p < 10⁻¹⁵, p < 10⁻⁹ and p < 0.001 respectively). The full
model can be found in table 4.9.
After refitting the model to allow information theoretic measurements to account
for the variance, both high informativity and high negative log contextual
predictability increased the chance that a segment appears in a stressed syllable
(both at p < 10⁻¹⁵). The effect of distance from the beginning of the word was
reversed: it now promoted the likelihood that a segment would appear in a stressed
syllable (p < 10⁻¹⁵), though it is not
clear why. Finally, emphatic segments were no longer different from non-emphatics.
Table 4.8: Sample Egyptian Arabic stress data
word                     /abadan/   /abadan/   /ʕaʃara/   /ʕaʃara/   /gamila/      /gamila/
gloss                    ‘never’    ‘never’    ‘ten’      ‘ten’      ‘beautiful’   ‘beautiful’
phone                    /b/        /d/        /ʃ/        /r/        /m/           /l/
weight                   55         55         114        114        18            18
prominent                false      false      false      false      true          false
place                    labial     coronal    coronal    coronal    labial        coronal
stop                     true       true       false      false      false         false
liquid                   false      false      false      true       false         true
nasal                    false      false      false      false      true          false
glide                    false      false      false      false      false         false
voiced                   true       true       false      false      false         false
affricate                false      false      false      false      false         false
strident                 false      false      true       false      false         false
palatal                  false      false      true       false      false         false
emphatic                 false      false      false      false      false         false
start.dist               2          4          3          5          3             5
end.dist                 6          4          5          3          5             3
informativity            2.7182     2.3157     2.5958     2.0695     2.4511        1.5643
predictability (resid.)  2.1582     -2.2409    -1.2295    -2.1346    -0.5913       -1.5808
The full model can be found in table 4.10.
Discussion As is the case with German, English and Spanish, a high amount of
information increases the likelihood that a segment appears in a stressed syllable.
The distinct effect of emphatics disappeared with the inclusion of the information
theoretic variables, which suggests that the effect has no phonetic
justification.
Even in the pure phonological model, place of articulation had a different effect
than in Spanish and English, with dorsals being less likely to appear in stressed
syllables. This undermines any attempt to rely on place of articulation as a reason
for the distribution of segments in stressed syllables, in agreement with Giavazzi
(2010). Additionally, as in English and Spanish, Arabic stops were preferentially
placed in the onsets of stressed syllables. However, Arabic stridents did not have a
significant influence on a segment’s likelihood to appear in stressed syllables, unlike
Table 4.9: Egyptian Arabic pure phonology model
                                 Estimate   Std. error   z value   Pr(>|z|)
intercept                        -2.81070   0.06082      -46.215   <2e-16    ***
distance from word end (log)      1.81330   0.02890       62.750   <2e-16    ***
poa is dorsal                    -0.32763   0.03174      -10.323   <2e-16    ***
poa is labial                     1.00953   0.02771       36.430   <2e-16    ***
nasal                            -0.71157   0.02819      -25.244   <2e-16    ***
voiced                           -0.80284   0.03540      -22.676   <2e-16    ***
glide                            -0.91989   0.04091      -22.488   <2e-16    ***
palatal                           0.92086   0.03965       23.227   <2e-16    ***
distance from word start (log)   -0.30309   0.02522      -12.017   <2e-16    ***
stop                              0.18223   0.02837        6.424   1.33e-10  ***
emphatic                          0.18107   0.05126        3.532   0.000412  ***
in Spanish and English. Similarly, glides were significantly less likely to appear in
stressed syllables unlike Spanish and English glides, and liquids did not significantly
affect segments’ likelihood to appear in stressed syllables, making their inclusion in
the English and Spanish models language-specific, rather than cross-linguistic.
4.4 General discussion
All three languages, Egyptian Arabic, American English and Spanish, have shown
evidence for a preference to place highly informative segments in stressed syllables.
The three languages have different phonologies and phonotactics, yet every one of
them systematically promoted highly informative segments to perceptually prominent
positions. This preference emerged even though, from the phonological point of view,
there are very few constraints on the onsets of stressed and unstressed syllables, and
even though stops were the only phonological variable that consistently favored
stressed syllables. It is unlikely that a high amount of information was randomly
associated with prominent syllables in all three languages when only one of the
phonological variables showed similar persistence.
Table 4.10: Egyptian Arabic information model
                                 Estimate    Std. error   z value   Pr(>|z|)
intercept                        -3.261618   0.073370     -44.454   <2e-16   ***
distance from word end (log)      1.393080   0.030419      45.797   <2e-16   ***
poa is dorsal                    -0.547353   0.040644     -13.467   <2e-16   ***
poa is labial                     0.851916   0.033111      25.729   <2e-16   ***
nasal                            -0.673206   0.030772     -21.877   <2e-16   ***
voiced                           -0.943729   0.036552     -25.819   <2e-16   ***
glide                            -0.637876   0.042465     -15.021   <2e-16   ***
palatal                           0.809559   0.040657      19.912   <2e-16   ***
distance from word start (log)    0.255801   0.028254       9.054   <2e-16   ***
stop                              0.257621   0.028802       8.945   <2e-16   ***
predictability                    0.256013   0.005794      44.189   <2e-16   ***
informativity                     0.202002   0.018399      10.979   <2e-16   ***
What made each of the three languages correlate information with perceptual
prominence? The lexicon and the frequencies of its usage are formed by phonological
processes and usage choices for words, roots and affixes. Giavazzi (2010) claims that
stress affects very few features among stressed syllables, and the studies presented in
this chapter support her claim. For argument’s sake I will assume that a wide variety
of such processes may exist. In this case the correlation between prominence and
information could have emerged in three ways:
1. Phonological processes that preferentially reduce informative segments in non-
prominent positions. Duration and deletion studies such as the ones presented
in chapter 2 have not found such effects.
2. Phonological processes that preferentially reduce uninformative segments in
prominent positions. Again, duration and deletion studies found no such ef-
fects.
3. Word, root and affix-selecting processes prefer forms that fulfill the requirement
that highly informative segments would appear in stressed syllables.
All three alternatives require that linguistic (phonology) or psycholinguistic (lex-
ical access) processes be sensitive to the amount of information linguistic elements
hold. Similarly, all three accounts involve sensitivity to perceptual prominence, in
this case syllable stress. Both the sensitivity to information and the sensitivity to
perceptual prominence contribute to an ongoing discussion of the role information
theoretic factors play in language production.
In recent years there has been mounting evidence that predictability and fre-
quency affect the duration of linguistic elements in language production. Jurafsky
et al. (2001) showed that frequent words tend to have shorter duration when pre-
dictable in context, and Bell et al. (2009) have shown that frequent content words
tend to have shorter duration than infrequent content words. Similar effects have
been demonstrated below the level of the word, for syllables (Aylett and Turk, 2004),
morphemes (Pluymaekers et al., 2005) and intervocalic segments (van Son and van
Santen, 2005). Information theoretic effects are not limited to duration and were also
shown to affect the omission of linguistic elements at the level of syntactic planning
such as the case of that-omission (Levy and Jaeger, 2007; Jaeger, 2010). The studies
presented in chapter 2 extend this line of research.
However, the source of the information-theoretic effects is under dispute. Some
studies such as Aylett and Turk (2004), van Son and van Santen (2005) and Levy
and Jaeger (2007) (among others) support a view in which the amount of information
an element holds is correlated with its duration in order to improve communication.
Elements that hold little information require less time to transmit and can therefore
be reduced to improve the information rate. On the other hand, elements that hold a
lot of information may require longer to process and should therefore be provided with
longer duration in order to guarantee transmission. In essence, these accounts make
two claims. Linguistic performance is affected by the amount of information linguistic
elements hold, and there is a communicative goal to such effects (communication is
improved). Other accounts propose a different view.
Bell et al. (2009) argue that the reduced duration of frequent words emerges
from the faster access times these words have. They compare the longer duration
of infrequent words to the elongation of word duration when the following context
(or word) is not available (Fox Tree and Clark, 1997; Ferreira and Dell, 2000). Since
infrequent words take longer to access, the articulatory planning that will lead to
their articulation is slowed down and the words end up having longer duration. Bell
et al.’s account differs significantly from the communication-efficiency accounts as it
opposes both the claim that it is the amount of information these words hold that
affects their duration and the claim that the longer duration’s purpose is to improve
communication. Similarly, Bybee and Hopper (2001) view the reduction of duration
of frequent linguistic elements as a byproduct of practice, and therefore implicitly
reject the view of frequency as information and of shortening as means to improve
communication.
Some accounts may accept only one of the two premises. Jaeger (2010, pp. 50–
51) proposes that even if one rejects the idea that speakers attempt to transmit
information efficiently, the flow of information in their mind is still subject to the
same principles, and requires more time to allow more information to be transmitted
in the speakers’ mind. More informative linguistic elements therefore take longer
to process and as a byproduct they take longer to articulate. On the other hand
availability accounts such as Fox Tree and Clark (1997) assume that speakers attempt
to remain fluent (that is, communicative) without referring to information theoretic
effects. Therefore, Fox Tree and Clark assume communication-based biases, but not
necessarily information-based biases.
The results presented in this study require that language be sensitive to the
amount of information linguistic elements hold and to perceptual prominence (in
this case stress). Practice (Bybee and Hopper, 2001) or availability-based accounts
for the duration of words (Bell et al., 2009) do not predict sensitivity to the amount of
information linguistic elements hold. Similarly, the advantage of placing highly infor-
mative elements in perceptually prominent positions is not something that speaker-
internal mechanisms would benefit from. Therefore, the preference for placing highly
informative linguistic elements in stressed syllables shows that at least for some phe-
nomena, it is not possible to do without sensitivity to information theoretic factors,
nor without appealing to communication. This is not to say that in other domains
information theoretic-like effects cannot emerge through mechanisms that have little
to do with the amount of information linguistic elements hold or without appealing to
communication between speakers. However, excluding communication or information
theoretic effects from the range of possible solutions cannot be justified on the basis
of parsimony, as such motivations are necessary.
4.5 Conclusion
In this study I showed that in three languages, the lexicon and usage patterns of
each language make use of the perceptual prominence of stressed syllables to ease
the transmission of highly informative segments across the communication channel.
Only one phonological factor was as consistent in promoting appearance in stressed
syllables as the information theoretic factors. The limited interaction between stress
and the onsets of stressed syllables has been observed in previous research (Smith,
2002; Giavazzi, 2010), but it is not clear why even among the few factors that were
assumed to be influenced by the position of stress only one was persistently affected
by the location of stress. Information is therefore shown to affect phonology in ways
that conventional phonological factors cannot explain.
The findings also demonstrate that information theoretic factors and their ef-
fect on communication are necessary to explain linguistic phenomena, and that such
effects cannot be reduced to other psycholinguistic, cognitive and articulatory fac-
tors. Previous chapters have shown that information-theoretic considerations affect
performance-based phenomena (Aylett and Turk, 2004; van Son and van Santen,
2005) and license phonological processes (chapter 3). This chapter shows that the
amount of information linguistic elements hold affects the phonotactics of the lan-
guage through the lexicon and usage patterns of the lexicon, complementing previous
research (Piantadosi et al., 2009, 2011).
Chapter 5
Predicting segment distribution universals
5.1 Introduction
Zipf (1935, III: phonemes) shows that across a wide variety of languages (with a
few exceptions) complex segments are less frequent than their simple counterparts:
voiced stops are less frequent than voiceless stops and aspirated stops are less frequent
than unaspirated stops. Zipf argues for a negative correlation between frequency and
complexity. The more complex a segment is, the less frequent it would be in a human
language. The observed correlation is well-motivated by the principle of least effort
(Zipf, 1949) – if segment complexity corresponds to its articulatory effort, simpler
segments should be more frequent than complex ones.
However, taken to the extreme, the claim that simple segments should be more
frequent than complex ones predicts languages in which no complex segments exist.
This prediction is wrong, as none of the languages in Zipf’s survey lost its complex
segments or reduced them to infinitesimal frequencies. Zipf (1935, III.3.f) therefore
stipulates a set of constraints – upper thresholds of toleration for each segment. A
segment that becomes more frequent than its upper threshold of toleration would be-
gin to weaken. This set of stipulated constraints is necessary to describe the observed
data, but is not motivated except by the sensible requirement that languages main-
tain a minimal number of distinctions. The requirement for maintaining a minimal
number of distinctions can be met with simpler (though empirically inadequate) stip-
ulations such as a single upper limit on the frequency of any segment. It is therefore
necessary to understand what motivates the multiple upper thresholds in Zipf (1935).
An unconstrained principle of least effort predicts a language with no complex
segments. Information theory (Shannon, 1948) makes the opposite, and equally wrong,
prediction. If the efficiency of a language is measured by its ability to transmit a
given amount of information using as few segments as possible (that is, by having a
maximal information rate), the most efficient language encoding would have a uniform
distribution of segments. In the
most efficient language every segment would be as frequent as any other segment.
As Zipf (1935) shows, this is not the case, as languages have skewed distributions of
segments. From the information theoretic point of view it is necessary to understand
what keeps human languages from having a uniform distribution of segments.
In this chapter I propose that languages maximize the ratio between the expected
amount of information per segment (the entropy of the language) and the expected
amount of markedness (effort) per segment.1 This prediction is a simple corollary
of MULE, as described in chapter 3: languages attempt to maximize the amount
of information while minimizing the amount of markedness. This proposal correctly
predicts that languages would not lose their complex segments and will not have a
uniform distribution of segments either.
The proposal that languages maximize the ratio between the expected amount of
information per segment and the expected amount of effort correctly predicts the
skewed distributions of segment frequency in Zipf (1935) without stipulating upper
thresholds on the frequency of segments. More importantly, it provides a powerful
tool to assess the inverse question: what is the relative markedness of similar seg-
ments? In chapter 3 I stipulated that segments that exist in fewer languages are
more marked or effortful than segments that exist in more languages (an aggregate of
binary distinctions), a method which is not unlike the one used in Ohala (1981) who
matches articulatory and perceptual phonetic observations with the frequency of the
absence of segments in the world’s languages (Sherman, 1975). If my prediction that
languages maximize the ratio between information and markedness holds, it would
1 I use the word expected in the statistical sense: an average of alternative outcomes, weighted by their probability. If f(x_1) = 8, f(x_2) = 4 and Pr(x_1) = 0.25, Pr(x_2) = 0.75, then the expected value of f(x_i), E[f(x_i)], is 0.25 · 8 + 0.75 · 4 = 2 + 3 = 5.
be possible to assess the relative markedness of segments by observing their relative
frequencies in languages in which they do exist (an aggregate of proportions).
One prediction that follows from the proposed account is that the relative
markedness of /t/ is lower than that of /k/, which is lower than that of /p/. I show
that the relative frequency ranking of the three segments, P(t) > P(k) > P(p), holds
cross-linguistically, in agreement with the number of languages in which each segment
exists (Sherman, 1975; Maddieson, 1984). I further refine my prediction
by showing that markedness should factor in not only articulatory effort, but also
perceptual confusability.
5.2 Consistent asymmetries between complex and simple segments
As Zipf (1935, III) points out, it is not trivial to measure the articulatory effort that
the pronunciation of segments requires. However, for some pairs of segments, it can
be argued that one segment requires roughly the same articulatory gesture as the
other segment and an additional gesture. In this view both /p/ and /b/ require the
speaker to stop the flow of air using the lips, let pressure build and then release it.
/b/ also requires the speaker to cause the vocal folds to vibrate, which /p/ does not.
Zipf calls the segment that requires the additional gesture complex and the other
one simple. He identifies different types of complex stops: voiced vs. voiceless and
aspirated vs. unaspirated. Voiced stops require voicing: /b,d,g/ are the same as
/p,t,k/ except for the vibration of the vocal folds. Aspirated stops require aspiration
to follow the stop: /pʰ,tʰ,kʰ/ are the same as /p,t,k/ except that they require the
speaker to delay the VOT of following vowels. Zipf’s distinction between complex
and simple segments is phonemic. He is aware that the exact phonetic correlate of
complex and simple phonemes can vary across environments.
Zipf (1935) goes on to show that across several languages, and with few excep-
tions, complex segments are less frequent than their simple counterparts. He demon-
strates that for all segments that allow contrast between aspirated and unaspirated
obstruents in Mandarin Chinese spoken in Beijing (Peipingese Chinese in Zipf 1935),
Danish and Cantonese Chinese, unaspirated (simple) obstruents are more frequent
than their aspirated (complex) counterparts. Burmese has a three-way contrast be-
tween aspirated fortes, unaspirated fortes and lenes stops, and Zipf shows a similar
pattern between its unaspirated and aspirated fortes stops, with a single exception
– P(kʰ)>P(k). Zipf’s observations are not independent in the statistical sense – he
took multiple samples from each language, and many of the languages in his sample
are related (though they are not mutually intelligible). Overall, Zipf tests 17 cases of
relative frequencies, and in all but one he correctly predicts that aspirated obstruents
would be less frequent than their unaspirated counterparts.
Zipf’s test of the contrast between voiced and voiceless stops uses eleven
Indo-European languages as well as Hungarian. Using phonemically transcribed,
phonetically transcribed and alphabetic data, he shows that in Czech, Dutch, French,
Italian, English, Hungarian, Bulgarian, Russian, Spanish, Greek, Latin and Vedic
Sanskrit, the voiceless stops /p,t,k/ are more frequent than their voiced counterparts
/b,d,g/, with two exceptions – in Spanish P(d)>P(t) and in Hungarian P(b)>P(p).2
In this case too the observations were not statistically independent from one another.
Most of the languages in the sample were related to one another, and multiple samples
were taken from each language. Zipf’s prediction held in 34 out of 36 comparisons.
The observed relative frequencies are attributed to a negative correlation between
frequency and complexity: the more complex segments are, the less frequent they will
be. This pattern would seem to emerge from the principle of least effort (Zipf, 1949):
if more complex is treated as more effortful, effort-avoidance would lead to a desire
to avoid effortful sounds. However, Zipf is well aware that without counter-balancing
effort avoidance, very different languages would be predicted – languages in which no
complex segment exists. In order to avoid that outcome, Zipf (1935, III.3.f) stipulates
that languages must have for each segment an upper threshold of toleration that puts
a limit on the maximal frequency that segment may have, so that a sufficient number
of segments continue to exist in the language. A segment that passes that threshold
would begin to weaken.3 Zipf stipulates that each segment’s different upper threshold
matches the negative correlation between complexity and frequency.
2 In the data collected for chapter 3 using the Spanish CALLHOME Lexicon (Garrett et al., 1996), /t/ was more frequent than /d/, following Zipf’s prediction.
3 This is Zipf’s (1935) version of MULE, as presented in chapter 3. His version attempts to predict both weakening and fortition. In contrast, the Zipf (1929) account of weakening follows from frequent use and does not make similar predictions. It only predicts the weakening of frequent segments.
Multiple different upper thresholds are justified by the need to ensure that a
sufficient number of contrasts continues to exist in every language. But in order to
maintain the number of contrasts, simpler stipulations would have sufficed. A single
uniform upper threshold beyond which the relative frequency of any single segment
cannot rise would have forced each language to maintain distinctions (the lower the
threshold, the greater the number of distinctions). However, a single upper threshold
would have yielded languages in which the frequency of all simple segments is close
to the stipulated upper threshold. This is not the case. In every language that Zipf
tested, /t/ was more frequent than /k/, even though both are simple.4 Individual
upper thresholds are therefore required to correctly describe the
data that Zipf provides, but they are not motivated.
Unlike Zipf’s prediction of skewed distributions of segments, information theory
(Shannon, 1948) predicts that languages would be most efficient if every segment were
equally probable. This prediction follows from the definition of entropy. A uniform
distribution of segments would yield for a given number of segments a language with
maximal entropy. The entropy of a language is the expected (average) amount of
information every sign in the language contains. In a language that has lower entropy,
every sign provides on average less information, and more signs are needed to transmit
a given amount of information than in a language that has higher entropy. Since
deviating from uniform distribution of segments yields lower entropy, languages in
which the distribution of segments is skewed (is not uniform) would be less efficient.
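The entropy argument can be checked numerically. The following sketch compares a uniform and a skewed distribution over a hypothetical four-segment inventory; the probabilities are invented for illustration and are not taken from Zipf's data or from any language discussed here.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical segment distributions (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.55, 0.25, 0.15, 0.05]

print(f"uniform: {entropy(uniform):.3f} bits per segment")
print(f"skewed:  {entropy(skewed):.3f} bits per segment")
```

The uniform inventory carries two bits per segment, the maximum for four segments, while any skewed inventory of the same size carries less.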
But every language in Zipf (1935, III) does have a skewed distribution of phonemes.
If efficiency plays a role in shaping human language, it is necessary to explain why
the distribution of segments in languages is not uniform. Any functional pressure
on language would have caused a gradual convergence to a more efficient form. Even
if languages start off having non-uniform distributions, they are expected to slowly
move in the direction of a uniform distribution of segments (using the definition of
efficient communication used in Piantadosi et al. 2011). However, even though related
languages may undergo separate change processes, languages maintain the asymmetries
Zipf predicts despite undergoing many processes of phonological and phonetic
change. It seems that maximizing entropy is not a functional pressure on languages.
4 In Zipf’s data Cantonese has more /k/ than /t/ than /p/. However, Zipf relies on a dictionary in which /kʷ/ is regarded as a /k/ followed by a /w/, rather than a labialized dorsal.
If languages do yield to functional pressures over time, and if the functional
pressures can be shown to be motivated neither by the minimization of complexity nor
by the maximization of language entropy, the question is what languages optimize. I will
attempt to answer this question in the following section.
5.3 Solution: maximizing information per effort
In chapter 3, I introduced MULE, and showed that the balance between effort and in-
formation utility predicts the actuation of weakening processes in language. A simple
corollary of MULE provides a solution to the question above. I propose that lan-
guages maximize the ratio between their entropy and their expected articulatory and
perceptual effort (their markedness). This proposition follows from two observations.
First, the information theoretic prediction that all segments would be equally proba-
ble relies on the assumption that transmitting every segment is equally effortful. This
assumption does not hold in human language. Some segments require more effort to
produce, and some segments are more difficult to tell apart from other segments (Ste-
riade, 2008; Flemming, 2004, among others). Second, the principle of least effort has
to be grounded in the communicative goal of human language – the transmission of
information. Given that language is used to communicate and transmit information,
the principle of least effort has to be interpreted as the least amount of effort to
transmit a message of an arbitrary amount of information in the language.
The proposal that languages maximize the ratio between the information speakers
transmit and the markedness of their transmission (the effort required to transmit
information) predicts the skewed distributions of complex and simple segments in
Zipf (1935) without appealing to upper thresholds. Consider a language of linguistic
elements e, such that every linguistic element is assigned some markedness value which
is always greater than zero, markedness(e) > 0. I will consider three alternatives:
maximizing the entropy of the language, minimizing the expected markedness of the
language, and maximizing the ratio between the entropy of the language and the
expected markedness of the language, as proposed in this chapter. The languages can
optimize these different goals by changing the probability of each element Pr(e).
Languages that maximize the entropy of the language (5.1) have a uniform dis-
tribution of segments. They do require fewer elements to transmit a given amount of
information on average, but they overuse marked or effortful elements. The overall
amount of markedness that is required to transmit a given amount of information is
too high.
(5.1) $-\sum_{e} \Pr(e)\,\log_2 \Pr(e)$
Languages that minimize the frequency of marked elements (5.2) do not use highly
marked elements at all. As a result they have fewer contrasts, and significantly lower
entropy. These result in significantly longer messages.5 Even though such languages
use fewer marked or effortful elements, the product of the greater number of
elements and the lower amount of markedness per element is still high.
(5.2) $\sum_{e} \Pr(e)\,\mathrm{markedness}(e)$
Languages that maximize the ratio between the entropy of the language and the
expected amount of markedness per element (5.3) do not have a uniform distribution
of elements, and therefore their entropy is lower than that of the languages in (5.1).
They use unmarked segments more than marked ones, but the increase in message
length pays off in the reduced amount of markedness per segment. In such languages
the correlation between markedness and information is positive. The more marked
an element is, the more information it carries. In frequency terms more information
means lower frequency as Zipf predicts.
(5.3) $\dfrac{-\sum_{e} \Pr(e)\,\log_2 \Pr(e)}{\sum_{e} \Pr(e)\,\mathrm{markedness}(e)}$
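The contrast between the three objectives can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: the three markedness values are hypothetical, and the closed form Pr(e) = 2^(−λ·markedness(e)) is the standard information-theoretic solution for maximizing entropy per unit of cost, brought in here as an assumption rather than taken from the text.

```python
import math

def entropy(probs):
    """Entropy in bits, as in (5.1); zero-probability elements contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def expected_markedness(probs, markedness):
    """Expected markedness per element, as in (5.2)."""
    return sum(p * m for p, m in zip(probs, markedness))

def information_per_effort(probs, markedness):
    """Ratio between entropy and expected markedness, as in (5.3)."""
    return entropy(probs) / expected_markedness(probs, markedness)

def ratio_maximizing_distribution(markedness, tolerance=1e-12):
    """Return (lam, probs) with probs[i] = 2 ** (-lam * markedness[i]) summing to 1.

    Assumption (not from the text): this form maximizes the ratio in (5.3).
    All markedness values must be greater than zero.
    """
    low, high = 0.0, 1.0
    while sum(2 ** (-high * m) for m in markedness) > 1:
        high *= 2  # widen the bracket until the sum drops to 1 or below
    while high - low > tolerance:  # bisection on lam
        mid = (low + high) / 2
        if sum(2 ** (-mid * m) for m in markedness) > 1:
            low = mid
        else:
            high = mid
    lam = (low + high) / 2
    return lam, [2 ** (-lam * m) for m in markedness]

markedness = [1.0, 2.0, 3.0]    # hypothetical effort values for three elements
uniform = [1 / 3] * 3           # maximizes entropy alone
least_effort = [1.0, 0.0, 0.0]  # minimizes markedness alone, carries no information
lam, skewed = ratio_maximizing_distribution(markedness)

for name, probs in [("uniform", uniform), ("least effort", least_effort), ("skewed", skewed)]:
    print(name,
          round(entropy(probs), 3),
          round(expected_markedness(probs, markedness), 3),
          round(information_per_effort(probs, markedness), 3))
```

In this toy case the skewed distribution has lower entropy than the uniform one but transmits more information per unit of markedness, and the information carried by each element, −log2 Pr(e), grows linearly with its markedness.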
There are two crucial differences between this proposal and the one in Zipf (1935).
The first is that Zipf (1935) relies on complexity, rather than articulatory and percep-
tual phonetic properties of each sound. Zipf (1935) attempts to avoid the question of
5 Half the entropy results in messages that are twice as long for a given amount of information.
effort, as he could not measure the articulatory and perceptual effort associated with
using each segment. While it is still true that the articulatory effort and perceptual
confusability of segments cannot be measured, phonetic theory has evolved and does
provide some insight into what makes some segments more difficult to pronounce or
perceive. For instance, Ohala (1981) shows that the articulatory and perceptual prop-
erties of different stops match the number of times they are absent from phonological
systems (Sherman, 1975). It is more difficult to maintain voicing in dorsal positions,
and indeed /g/ is absent from more systems than /d/ and /b/. Similarly, /p/ has
lower amplitude than other voiceless stops, and is predictably absent from more sys-
tems than /t/ and /k/. Articulatory and perceptual difficulties have therefore been
related to the distribution of sounds in the world’s languages.6
The second difference between the current proposal and the account in Zipf (1935)
is that the current proposal does not stipulate upper thresholds on the frequencies of
segments. Those should emerge directly from the articulatory and perceptual phonetic
properties of each sound and its relationship to other sounds (its confusability with
other sounds, for instance).
It is important to note that, unlike the previous chapters, this chapter measures
the amount of information provided by each segment using its frequency in the
language (its negative log probability or uni-phone score) and not using informativity.
This decision follows from necessity. First, the data in Zipf (1935) contains segment
frequencies, not more elaborate information-theoretic measurements. Second, calcu-
lating segment informativity requires the use of corpora, preferably spoken corpora,
whereas for some languages only a limited amount of written data is available.
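The frequency-based measure used in this chapter is simple to compute. The following Python sketch shows the negative log probability (uni-phone score) of each segment; the counts are hypothetical and chosen only to make the arithmetic easy to follow, and the function name is illustrative.

```python
import math

def unigram_information(counts):
    """Map each segment to its negative log2 probability, in bits."""
    total = sum(counts.values())
    return {segment: -math.log2(n / total) for segment, n in counts.items()}

# Hypothetical segment counts (not taken from any of the surveyed languages).
toy_counts = {"t": 4, "k": 2, "p": 1, "b": 1}
info = unigram_information(toy_counts)

for segment, bits in info.items():
    print(segment, bits)
```

The rarer a segment is, the more bits it carries under this measure; unlike informativity, the score ignores the contexts in which the segment appears.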
In order to explore the differences between the two approaches, I conducted two
language surveys using existing corpora. The first survey attempts to tease apart
complexity and effort by replicating Zipf’s survey of voiced and voiceless oral stops.
The second survey focuses on voiceless stops – all simple in Zipf (1935) – to show
how markedness hierarchies emerge from the in-language frequencies of segments in
several languages.
6 On the other hand, complexity does seem to correspond to the formal definition of markedness in the form of constraints such as *Voiced. Most formal systems of markedness agree with Zipf’s definition of complexity by not having constraints of the form */g/, */p/ (Prince and Smolensky, 1993; de Lacy, 2002). In formal markedness systems the markedness of specific segments ought to follow from the conjunction of simpler markedness terms such as *Voiced:*Dorsal.
5.4 Survey 1: effort or complexity?
Introduction What do speakers minimize, effort or complexity? Ohala (1981)
presents two asymmetries among oral stops: labial /p/ is less audible than other
voiceless stops, and the voicing of dorsal /g/ is more difficult to maintain than the
voicing of other voiced stops. Languages such as Classical Arabic in which both
/p/ and /g/ are absent from the oral stop inventory are not rare. Sherman (1975)
shows that among voiceless stops /p/ is absent from more languages than any other
voiceless stop, and similarly /g/ is absent from more languages than any other voiced
stop. Though cross-linguistically more languages have voiceless stops than voiced
stops (Maddieson, 1984), it is not trivial that Zipf’s characterization of complexity
would be sufficient to describe the data. At the very least the contrast in frequencies
between /b/ and /p/ is expected to be smaller than the difference between other
voiced and voiceless stops.
Additionally, Zipf (1935) tested his prediction that complex segments would be less
frequent than simple segments using several related languages (mostly Indo-European
languages). The only exception, Hungarian, had the relative ratio between /p/ and
/b/ reversed, with P(b)>P(p). The exclusion of unrelated languages makes it difficult
to claim that the similarity between the languages is due to functional pressures
rather than retained similarity that is due to a common origin and shared vocabulary.
Another goal of this survey is therefore to replicate Zipf’s survey using languages that
do not share an origin or as much vocabulary.
Methods and materials I test the difference in ratios between /p,t,k/ and /b,d,g/
separately across several languages, which share very few lexical items, and have quite
different grammars.
1. Japanese, using the CALLHOME Japanese Lexicon (Kobayashi et al., 1996).
Gemination was treated as segment lengthening (for consonants). Palatalized
stops were counted with their non-palatalized counterparts. Voiceless affricates
were counted as allophones of /t/.
2. Spanish, using the CALLHOME Spanish Lexicon (Garrett et al., 1996).
3. Hungarian, using data from Zipf (1935).
4. Biblical Hebrew, using character frequency in the book of Genesis. Gemination
was treated as a segment lengthening, rather than two identical segments. The
distinctions between stop and spirantized variants of /p,t,k/ and /b,d,g/ were
collapsed together.
5. Indonesian, using character frequency in the book of Genesis.7 Digraphs were
treated as single characters.
6. Haitian Creole, using character frequency in the book of Genesis.8 Digraphs
and trigraphs were treated as single characters.
Results The segment probabilities are in Table 5.1. In all six languages, /t/ was
more frequent than /d/ and /k/ was more frequent than /g/, as predicted by Zipf
(1935). If we assume no prior knowledge (each direction is equally probable), the
probability that six (mostly independent) languages would all have more /t/ than /d/
is 0.5^6 ≈ 0.0156. It is equally unlikely that all six languages would have more /k/ than /g/.
However, Japanese, Hungarian, Biblical Hebrew and Indonesian all had more /b/
than /p/.9 Zipf (1935) correctly predicts that voiceless /t/ and /k/ would be more
frequent than /d/ and /g/ respectively, but fails to predict that /p/ would not be
more frequent than /b/. Another pattern emerges as well. The ratio between the
frequencies of /k/ and /g/ is greater in every language except Biblical Hebrew than
the ratio between the frequencies of /t/ and /d/ (p > 0.1).
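The probabilities reported above can be reproduced with a few lines of Python. This is a sketch under one assumption: the text does not name the test behind p > 0.1, so a one-sided sign test with each direction equally likely is used here as a plausible reading. The function name is illustrative.

```python
from math import comb

def prob_at_least(successes, trials, p=0.5):
    """Probability of at least `successes` outcomes in one direction
    out of `trials` independent languages, each with chance level p."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

all_six = prob_at_least(6, 6)      # all six languages pattern the same way
five_of_six = prob_at_least(5, 6)  # at least five of the six do

print(round(all_six, 4), round(five_of_six, 4))
```

Under this reading, six out of six agreeing languages is an unlikely outcome under chance, whereas five out of six does not fall below the 0.1 level.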
Discussion Zipf (1935) correctly predicts that languages would have consistently
skewed distributions. In all the languages in his survey and in this replication of
his survey, consistent hierarchies between /t/ and /d/ and between /k/ and /g/
were found. However, Zipf’s characterization of segments’ skewed distributions is
apparently imprecise. Had the additional gesture of obstruent voicing been the cause
for preference of voiceless /t/ and /k/ over voiced /d/ and /g/, a similar pattern
would have emerged for /p/ and /b/, but this is not the case. /b/ is more frequent
than /p/ in four out of the six languages used in the survey.
7 http://bibledatabase.net/, Indonesian new translation, retrieved August 2012
8 http://bibledatabase.net/, retrieved August 2012
9 In Biblical Hebrew the same hierarchy of frequencies held regardless of whether spirantized stops were included with their non-continuant counterparts.
Table 5.1: Voiceless and voiced segment probabilities
Japanese segment probabilities
          Voiceless  Voiced  Ratio
Labial    0.0009     0.0026  0.3
Coronal   0.0591     0.0248  2.4
Dorsal    0.0578     0.0099  5.9

Spanish segment probabilities
          Voiceless  Voiced  Ratio
Labial    0.0217     0.0202  1.1
Coronal   0.0359     0.0351  1.0
Dorsal    0.0337     0.0070  4.9

Hungarian segment probabilities
          Voiceless  Voiced  Ratio
Labial    0.0104     0.0171  0.6
Coronal   0.0718     0.0330  2.2
Dorsal    0.0572     0.0245  2.3

Biblical Hebrew segment probabilities
          Voiceless  Voiced  Ratio
Labial    0.0058     0.0256  0.2
Coronal   0.0187     0.0043  4.3
Dorsal    0.0144     0.0042  3.4

Indonesian segment probabilities
          Voiceless  Voiced  Ratio
Labial    0.0255     0.0308  0.8
Coronal   0.0472     0.0440  1.1
Dorsal    0.0635     0.0096  6.6

Haitian Creole segment probabilities
          Voiceless  Voiced  Ratio
Labial    0.0479     0.0219  2.2
Coronal   0.0672     0.0204  3.3
Dorsal    0.0379     0.0092  4.1
The data does support a view of a negative correlation between segment prob-
ability and effort. Ohala (1981) characterizes /p/ as a less preferred voiceless stop
due to perceptual reasons. This dispreference is evident in the absence of /p/ from
sound systems in the world’s languages (Sherman, 1975). The results of this survey
show that even in sound systems in which all six segments exist, the ratio between
the relative frequency of /p/ and /b/ would be smaller than the ratio between the
frequencies of /t/ and /d/ and between /k/ and /g/. Phonetic effort also predicts
that /g/ would also be dispreferred, since it is difficult to maintain voicing in dorsal
positions. Indeed, in five out of six languages the ratio between /k/ and /g/ was
greater than the ratio between /t/ and /d/, though this trend is not significant (p >
0.1).
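The counts cited in this discussion can be recomputed from the probabilities in Table 5.1. The Python sketch below copies the table values as printed; the variable and function names are illustrative. Because the printed probabilities are rounded, recomputed ratios may differ from the printed ratios in the last digit.

```python
# (voiceless, voiced) probabilities per place of articulation, from Table 5.1.
TABLE_5_1 = {
    "Japanese":        {"labial": (0.0009, 0.0026), "coronal": (0.0591, 0.0248), "dorsal": (0.0578, 0.0099)},
    "Spanish":         {"labial": (0.0217, 0.0202), "coronal": (0.0359, 0.0351), "dorsal": (0.0337, 0.0070)},
    "Hungarian":       {"labial": (0.0104, 0.0171), "coronal": (0.0718, 0.0330), "dorsal": (0.0572, 0.0245)},
    "Biblical Hebrew": {"labial": (0.0058, 0.0256), "coronal": (0.0187, 0.0043), "dorsal": (0.0144, 0.0042)},
    "Indonesian":      {"labial": (0.0255, 0.0308), "coronal": (0.0472, 0.0440), "dorsal": (0.0635, 0.0096)},
    "Haitian Creole":  {"labial": (0.0479, 0.0219), "coronal": (0.0672, 0.0204), "dorsal": (0.0379, 0.0092)},
}

def voicing_ratio(language, place):
    """Voiceless-to-voiced probability ratio for one place of articulation."""
    voiceless, voiced = TABLE_5_1[language][place]
    return voiceless / voiced

# Languages in which /b/ is more frequent than /p/.
more_b_than_p = [lang for lang in TABLE_5_1 if voicing_ratio(lang, "labial") < 1]

# Languages in which the /k/:/g/ ratio exceeds the /t/:/d/ ratio.
dorsal_ratio_larger = [lang for lang in TABLE_5_1
                       if voicing_ratio(lang, "dorsal") > voicing_ratio(lang, "coronal")]

print(more_b_than_p)
print(dorsal_ratio_larger)
```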
5.5 Survey 2: voiceless stops
Introduction A striking fact about the results of the previous survey is that
with the exception of Indonesian, all languages had more /t/ than /k/ than /p/:
P(t)>P(k)>P(p). Indonesian had a different ranking, P(k)>P(t)>P(p). In the Zipf
(1935) survey Mandarin Chinese, Danish, Burmese (lenes), Czech, Dutch, French,
Italian, English, Bulgarian, Russian, Greek and Latin all have a P(t)>P(k)>P(p)
ranking. The only exception is Vedic Sanskrit, with P(t)>P(p)>P(k). The frequent pattern
P(t)>P(k)>P(p) matches the number of languages in which one of the voiceless stops
is missing (Sherman, 1975). However, it is not predicted by the account presented in
Zipf (1935). Zipf provides no reason to treat any of the voiceless stops as more
complex than the other stops. Complexity therefore cannot cause the P(t)>P(k)>P(p)
order to be more frequent than the other orders.
On the other hand, MULE does predict that /t/ would be more frequent than /k/
and /p/ cross-linguistically, since /k/ and /p/ are considered more marked in phonol-
ogy, and MULE predicts a negative correlation between frequency and markedness.
Furthermore, since MULE bases markedness on phonetic grounds, /k/ is predicted to
be more frequent than /p/ on phonetic grounds. The goal of this survey is therefore
to test whether the prediction that MULE makes holds cross-linguistically.
Methods and materials I use the six languages from the previous survey, as well
as Mandarin Chinese, using the data from Zipf (1935), verified against the
CALLHOME Mandarin Chinese Lexicon (Huang et al., 1996). Mandarin Chinese has a
P(t)>P(k)>P(p) order in both data sources.
There are six possible permutations for ordering /p/, /t/ and /k/, and they are
not equally distant from one another. The permutation that is frequent in all the
languages mentioned in the introduction to this survey, P(t)>P(k)>P(p), is more
closely related to the exception rankings P(k)>P(t)>P(p) and P(t)>P(p)>P(k) than
to the other three, unattested rankings: each exception requires a single reranking
of the relative orders, and not two or three. This relationship is demonstrated in figure 5.1.
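The number of rerankings that separates each permutation from P(t)>P(k)>P(p) can be counted mechanically as the number of segment pairs whose relative order is reversed. The Python sketch below does this for all six permutations; the function name is illustrative.

```python
from itertools import combinations, permutations

def pairwise_disagreements(order, reference):
    """Count the segment pairs whose relative order differs between two rankings."""
    position = {segment: i for i, segment in enumerate(order)}
    # combinations() yields pairs (a, b) with a ranked above b in the reference.
    return sum(1 for a, b in combinations(reference, 2) if position[a] > position[b])

reference = ("t", "k", "p")
distances = {order: pairwise_disagreements(order, reference)
             for order in permutations(reference)}

for order, distance in sorted(distances.items(), key=lambda item: item[1]):
    print(">".join(order), distance)
```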
Figure 5.1: Distance of total orders from /t/>/k/>/p/
/t/ > /k/ > /p/                       no wrong orders (widely attested)
/k/ > /t/ > /p/    /t/ > /p/ > /k/    one wrong order (attested)
/p/ > /t/ > /k/    /k/ > /p/ > /t/    two wrong orders (not attested)
/p/ > /k/ > /t/                       three wrong orders (not attested)

Is it a coincidence that the observed rankings emerge? Can they be randomly
generated in a system in which each of the three voiceless stops can be more or less
frequent than any other voiceless stop? I used Kendall’s W (Kendall’s coefficient of
concordance), which provides a score between 0 and 1 to a set of rankings such that 0
means that the rankings are completely independent from one another, and 1 means
that all the rankings were exactly the same. To test the significance of Kendall’s W,
I created 10 million random samples of seven rankings each, and scored each sample
using Kendall’s W. I then calculated Kendall’s W for the seven languages in the survey:
six languages whose ranking was P(t)>P(k)>P(p), and one language whose ranking
was P(k)>P(t)>P(p) (Indonesian). I calculated the p-value as the proportion of
random samples whose Kendall’s W was greater than or equal to the observed value.
I used R (R Development Core Team, 2012) and R’s package concord (Lemon and Fellows, 2007).
Results Kendall’s W score for the seven languages was 0.878. Out of the 10 million
random samples of seven orderings, 0.031% had equal or greater Kendall’s W, which
places the p-value of getting the observed rankings by chance at p < 0.001 with a
sample of only seven languages.
Had one of the P(t)>P(k)>P(p) languages been replaced by a P(t)>P(p)>P(k)
language (e.g., by replacing Spanish with Vedic Sanskrit), Kendall’s W would have
been 0.735, with only 0.36% of the 10 million samples having an equal or greater
Kendall’s W. The p-value in this sample of languages would have been p < 0.01.
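The Kendall’s W values above can be recomputed with a short Python sketch. It is not the R code used for the survey: instead of drawing 10 million random samples it enumerates all 6^7 equally likely sets of seven rankings, so the resulting proportions are exact rather than simulated. All names are illustrative, and rankings are assumed to contain no ties.

```python
from collections import Counter
from itertools import permutations

def rank_sum_deviation(rank_sums, n_rankings):
    """Sum of squared deviations of the rank sums from their mean."""
    mean = n_rankings * (len(rank_sums) + 1) / 2
    return sum((r - mean) ** 2 for r in rank_sums)

def kendalls_w(rankings):
    """Kendall's W for untied rankings; each ranking gives the rank of every item."""
    m, n = len(rankings), len(rankings[0])
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    return 12 * rank_sum_deviation(rank_sums, m) / (m ** 2 * (n ** 3 - n))

def exact_p_value(rankings):
    """Proportion of all equally likely sets of rankings with W >= the observed W."""
    m, n = len(rankings), len(rankings[0])
    observed = rank_sum_deviation([sum(r[i] for r in rankings) for i in range(n)], m)
    # Build the distribution of rank-sum vectors one random ranking at a time.
    distribution = Counter({(0,) * n: 1})
    for _ in range(m):
        extended = Counter()
        for sums, count in distribution.items():
            for perm in permutations(range(1, n + 1)):
                extended[tuple(s + p for s, p in zip(sums, perm))] += count
        distribution = extended
    # W grows with the squared deviation, so comparing deviations compares W.
    as_extreme = sum(count for sums, count in distribution.items()
                     if rank_sum_deviation(sums, m) >= observed)
    return as_extreme / sum(distribution.values())

# Ranks are listed in the order (/p/, /t/, /k/); rank 1 is the most frequent stop.
T_K_P = (3, 1, 2)  # P(t)>P(k)>P(p)
K_T_P = (3, 2, 1)  # P(k)>P(t)>P(p)
T_P_K = (2, 1, 3)  # P(t)>P(p)>P(k)

survey = [T_K_P] * 6 + [K_T_P]          # the seven surveyed languages
variant = [T_K_P] * 5 + [K_T_P, T_P_K]  # one language replaced by a P(t)>P(p)>P(k) language

print(round(kendalls_w(survey), 3), exact_p_value(survey))
print(round(kendalls_w(variant), 3), exact_p_value(variant))
```

The exact proportions come out at 90 and 1,014 of the 279,936 possible sets, about 0.032% and 0.36% respectively, in line with the simulated figures.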
Discussion The orders of the frequencies between labial, coronal and dorsal voice-
less stops cannot be random. Even with small and diverse samples of languages, a
strong preference for particular orderings appears.
The recurrent orders of frequencies cannot be explained by the minimization of
complexity (Zipf, 1935). Without stipulating arbitrary segment-specific upper thresh-
olds of toleration, this theory would not predict that some simple segments would be
consistently more frequent than other simple segments. However, a negative
correlation between frequency and effort does predict the observed patterns.
5.6 Markedness as effort
5.6.1 Missing pieces in the puzzle
Survey 1 showed that phonetic grounding of markedness predicts the negative
correlation between markedness and frequency better than the notion of complexity.
Survey 2 showed that some voiceless stops tend to be less frequent than other voiceless stops,
in agreement with their absence from segment inventories in the world’s languages.
Two questions remain. First, MULE is based on ranking phonological markedness us-
ing articulatory and perceptual effort. Therefore, when irregular patterns emerge, it
would be beneficial if the irregularity could be explained on phonetic grounds. Second,
both Zipf’s approach and this study rely on languages’ ability to optimize segment
inventories over time. Piantadosi et al. (2011) showed that languages shape their
usage patterns and lexicons to correlate informativity with word length. I showed in
chapter 4 that languages are able to optimize their lexicons and usage patterns in
order to place highly informative segments in perceptually prominent positions. Is
there evidence that such changes are possible in response to the relationship between
segment frequency and phonetic effort?
In the following sections I focus on two of the irregular orders found in the
previous section. I attempt to answer what causes languages to deviate from a
P(t)>P(k)>P(p) order, and examine whether languages can optimize their inven-
tories to have particular orders. Finally, I try to answer what forms of phonetic effort
determine the observed patterns.
5.6.2 Sanskrit stop frequencies
In Zipf’s data, Sanskrit has a P(t)>P(p)>P(k) ranking. Is that order a stable one?
What causes Sanskrit to deviate from the frequent P(t)>P(k)>P(p) order? The
Digital Corpus of Sanskrit (DCS) provides counts for the different types of syllables
of Sanskrit and their distribution over time.10 I processed the data to yield the
different distributions of labial, dental, retroflex, palatal and guttural (velar) voiceless
unaspirated stops over time in (5.4). The unusual P(t)>P(p)>P(k) order only holds
in the early and epic periods. In classical, medieval and late Sanskrit the more
frequent P(t)>P(k)>P(p) is found.11
(5.4) Sanskrit voiceless unaspirated stop frequencies
            p        t̪        ʈ        c        k
early       14870    37738    1694     6773     9972
epic        209583   566986   29491    134040   194392
classical   131327   334920   21212    75075    135034
medieval    83316    231220   18176    47541    106276
late        51709    137705   10715    29688    63852
The change from a P(t)>P(p)>P(k) order to the more common P(t)>P(k)>P(p)
order shows that languages can and do change segment frequencies over time. But
what caused the less frequent order to surface in the first place?
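The order in each period can be read off the counts in (5.4) directly. The Python sketch below copies the counts for /p/, dental /t/ and /k/ from the table and sorts them; the names are illustrative.

```python
# Counts from (5.4); "t" stands for the dental stop used in the comparison.
SANSKRIT_COUNTS = {
    "early":     {"p": 14870,  "t": 37738,  "k": 9972},
    "epic":      {"p": 209583, "t": 566986, "k": 194392},
    "classical": {"p": 131327, "t": 334920, "k": 135034},
    "medieval":  {"p": 83316,  "t": 231220, "k": 106276},
    "late":      {"p": 51709,  "t": 137705, "k": 63852},
}

def frequency_order(counts):
    """Return the segments from most to least frequent, e.g. 't>k>p'."""
    return ">".join(sorted(counts, key=counts.get, reverse=True))

orders = {period: frequency_order(counts) for period, counts in SANSKRIT_COUNTS.items()}

for period, order in orders.items():
    print(period, order)
```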
Whitney (1879, II.42–44) claims that the palatal series, which is pronounced
today as alveolo-palatal affricates, was originally derived from the velar series and
was pronounced as true palatals.12 If so, the early stages of Sanskrit may represent a
transient period in which the frequency of /k/ still suffered from the phonemic split
between /k/ and /c/, which was later corrected by the language.
10 http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/data/syllables/syllables.htm, retrieved January 2012.
11 This analysis uses the dental /t̪/ rather than the retroflex /ʈ/ for the cross-linguistic comparison since it is more frequent than the retroflex /ʈ/. Additionally, Zipf’s comparison of the frequencies of /t/-sounds was always based on dental and alveolar /t/s.
12 One important support for this claim is that in the database used above, the palatal nasal /ñ/ is followed by non-palatals exactly three times (across all periods included in the database). In contrast, /ñ/ is followed by a palatal stop 18,063 times. Nasal assimilation to following alveolo-palatal affricates would be to the alveolar stop part of the affricate, a /t/, rather than to the palatal release.
Another explanation is also possible. At its early stages, Sanskrit had two dorsal
series – palatal and velar, but only one labial series. Effort in MULE is related both to
articulatory effort and perceptual confusability. If listeners had to tell apart different
dorsals, it would increase the effort that is associated with faithfully transmitting
dorsals. Labials in such a language would be less effortful relative to dorsals. This
explanation predicts that in Hindi, in which there are two series of coronal stops
and one series of dorsal stops, the effort of /t/ might be relatively higher than it
usually is, possibly higher than that of /k/. This prediction is borne out. In Hindi a
P(k)>P(t)>P(p) order is found (5.5) as /t̪/ is less frequent than /k/.13
(5.5) Hindi character probabilities
/p/     0.0266
/t̪/     0.0289
/ʈ/     0.0057
/t͡ʃ/    0.0116
/k/     0.0714
If we take Hindi to be a later stage of Sanskrit, the various stages of the change of
ranking from P(t)>P(p)>P(k) through P(t)>P(k)>P(p) and finally P(k)>P(t)>P(p)
demonstrate two aspects of the correspondence between effort and segment frequency.
First, languages can and do change the frequencies of segments in response to the effort
associated with the articulation and confusability of segments. Second, contrary to Zipf’s
assumptions, perceptual difficulty and not only articulatory complexity (or effort)
motivates the change in segment frequencies.
5.6.3 Indonesian stop frequencies
The data from Sanskrit in the previous section showed how segment frequencies may
change in response to phonetic change. Similarly, Indonesian had P(k)>P(t)>P(p)
ranking which deviates from the frequent P(t)>P(k)>P(p) ranking. Is it possible to
account for this ranking?
13 Using character counts from http://www.sttmedia.com/characterfrequency-hindi, retrieved January 2012.
One way to analyze this deviation is to adopt the approach used in the previous
section for Sanskrit. Indonesian has only one series of stops in each place of articu-
lation – labial, coronal and dorsal. If Indonesian originally had an additional series
of coronal stops, the need to avoid confusion between the two series of coronals may
have caused the frequency of /t/ to drop. While there’s evidence that can support
this hypothesis I will not pursue it here, as that hypothetical series does not exist
today.14 Instead, I will try to understand whether the over-representation of /k/ is
stable, or whether the language is currently undergoing processes that may eventually
restore a P(t)>P(k)>P(p) ranking.
Chapter 3 suggests that phonological processes may follow from segments requiring
too much effort with respect to the information they provide. If so, Indonesian may
provide evidence for /k/-weakening processes. Such a process does exist. O’Brien
(2012) shows, based on data from Lapoliwa (1981), that Indonesian has a /k/→[ʔ]
process in coda positions, which does not apply to other stops in the same position. The
existence of this process can be predicted by MULE if the P(t)>P(k)>P(p) order is
the expected order in languages in which there is a single series of labials, coronals and
dorsals.
5.7 Conclusion
In this chapter I argued that the frequencies of segments in human language are
negatively correlated with their effort. The more effort their articulation requires and
the more confusable they are with other segments, the less frequent they will be.
This correlation is not predicted from effort avoidance alone, nor from information
theoretic constraints alone. Instead, I argue that what languages maximize is the
amount of information transmitted per unit of effort, that is, the
ratio between the entropy of the language and its markedness.
This proposal has significant predictive power. It predicts the relationship between
the perceptual confusability of segments and their frequency as in the case of Sanskrit
14 Indonesian has three series of stops – labial, alveolar and velar. In addition, it has a fourth retroflex nasal /ɳ/, as well as voiceless and voiced alveolo-palatal affricates. If the affricates and the retroflex /ɳ/ were once a part of a retroflex series of stops, then like Hindi, that stage of Indonesian had two series of coronal stops.
and Hindi. Expanding on chapter 3, it predicts the propensity of segments to weaken
when they are too frequent as in the case of Indonesian.
The use of this proposal has implications for the study of markedness, as it provides
a tool that allows linguists to compare the markedness of segments without the use
of hundreds of languages, by using the frequencies of segments in the languages in
which they do appear. Given both cross-linguistic frequencies and a great number
of languages, this proposal may allow us to approximate not only the ranking of the
markedness of segments, but also the actual value of the markedness of a segment in
a system with a given set of phonemes.
Chapter 6
Conclusions
Finding the dividing line between what keeps languages similar to one another and
different from one another is one of the key challenges of linguistics. It is some-
times the case that it is the differences between languages that reveal the similarities.
This is the case in this thesis. The challenge in chapters 2 and 3 was to understand
language-specific patterns – the reason languages deviate from cross-linguistic tenden-
cies. Chapter 2 attempted to understand why a language such as American English
would so carefully preserve segments that are less likely to be found in the sound
inventories of the world’s languages, such as /p/ and /g/, while reducing sounds that
are almost never absent from sound systems, such as /t/. Chapter 3 investigated
the reason behind parallel weakening processes of /t/ in English and of /q/ in Ara-
bic. The answer in both cases was that speakers are willing to put in more effort to
guarantee the transmission of higher amounts of information.
MULE assumes speakers have two goals: transmit information, and do so using as
little effort as possible. This observation allows MULE to move from language-specific
patterns to predicting cross-linguistic patterns. Chapter 4 showed that languages
distribute perceptual resources as they distribute effort – highly informative sounds
are more likely to appear in the onsets of stressed syllables. In chapter 5 I showed how
MULE predicts the observed cross-linguistic similarity in the frequency of different
segments without any additional stipulations. Thus, the need to explain language-
specific patterns predicts cross-linguistic universals. Both language-specific patterns
and cross-linguistic universals stem from a single linguistic principle.
In MULE I propose a joint treatment for phonetic and information theoretic con-
straints in linguistics. As such, it is easy to extend and build on MULE to provide
explanations for phenomena that require the integration of both factors. Such expla-
nations need not be limited to phonology, as the definition of effort and information
can and should be adapted to other domains.
Bibliography
Adams, Matthew E., Schweitzer, Katrin, and Cohen Priva, Uriel, 2009. Crosslinguis-
tic evidence for phone informativity: a corpus study of German. Talk delivered
at the 83rd Annual Meeting of the Linguistic Society of America, San Francisco,
January 10.
Akaike, Hirotugu. 1974. A new look at the statistical model identification. Institute
of Electrical and Electronics Engineers. Transactions on Automatic Control, 19(6):
716–723.
Al-Nassir, Abdulmunim A. 1993. Sibawayh the phonologist: a critical study of the
phonetic and phonological theory of Sibawayh as presented in his treatise al-Kitab.
Kegan Paul Internat., London [u.a.].
Anttila, Arto. 1997. Deriving variation from grammar: a study of Finnish genitives. In
Hinskens, Frans, Hout, Roeland van, and Wetzels, Leo, editors, Variation, change
and phonological theory, pages 35–68. John Benjamins, Amsterdam.
Anttila, Arto. 2006. Variation and opacity. Natural Language and Linguistic Theory,
24(4):893–944.
Aylett, Matthew and Turk, Alice. 2004. The smooth signal redundancy hypothesis: a
functional explanation for relationships between redundancy, prosodic prominence,
and duration in spontaneous speech. Language and Speech, 47(1):31–56.
Aylett, Matthew and Turk, Alice. 2006. Language redundancy predicts syllabic du-
ration and the spectral characteristics of vocalic syllable nuclei. Acoustical Society
of America Journal, 119:3048–3058.
Baayen, R. H. 2011. languageR: Data sets and functions with “Analyzing Linguistic
Data: A practical introduction to statistics”. URL http://cran.r-project.org/package=languageR.
R package version 1.4.
Baayen, R. H., Piepenbrock, R., and Gulikers, L., 1995. The CELEX lexical database
[Release 2].
Baayen, R.H., Davidson, D.J., and Bates, D.M. 2008. Mixed-effects modeling with
crossed random effects for subjects and items. Journal of Memory and Language,
59(4):390–412.
Bates, Douglas, Maechler, Martin, and Bolker, Ben. 2011. lme4: Linear mixed-effects
models using S4 classes. URL http://cran.r-project.org/package=lme4. R
package version 0.999375-42.
Beckman, Jill N. 1998. Positional Faithfulness. PhD thesis, University of Mas-
sachusetts Amherst. ROA 234.
Bell, Alan, Brenier, Jason, Gregory, Michelle, Girand, Cynthia, and Jurafsky, Daniel.
2009. Predictability effects on durations of content and function words in conver-
sational English. Journal of Memory and Language, 60(1):92–111.
Blevins, Juliette. 2004. The Mystery of Austronesian Final Consonant Loss. Oceanic
Linguistics, 43(1):208–213. URL http://www.jstor.org/stable/3623380.
Boersma, Paul. 1998. Functional Phonology. PhD thesis, University of Amsterdam.
Boersma, Paul. 2003. The odds of eternal optimization in Optimality Theory. In Holt,
Eric D., editor, Optimality Theory and Language Change, pages 31–66. Kluwer,
Dordrecht.
Boersma, Paul and Hayes, Bruce. 2001. Empirical tests of the Gradual Learning
Algorithm. Linguistic Inquiry, 32(1):45–86.
Brants, Thorsten and Franz, Alex, 2006. Web 1T 5-gram Corpus [Version 1.1].
Google, Inc.
Brants, Thorsten and Franz, Alex, 2009. Web 1T 5-gram, 10 European Languages
Version 1. Linguistic Data Consortium, Philadelphia.
Bresnan, Joan and Nikitina, Tatiana. 2009. The gradience of the dative alternation.
In Uyechi, Linda and Hee Wee, Lian, editors, Reality exploration and discovery:
pattern interaction in language and life. Festschrift for K.P. Mohanan. CSLI Pub-
lications, Stanford.
Buck, Carl Darling. 1904. A grammar of Oscan and Umbrian: with a collection
of inscriptions and a glossary. Ginn & Company. http://www.archive.org/
details/agrammaroscanan00goog.
Bybee, Joan and Hopper, Paul. 2001. Introduction to frequency and emergence of
linguistic structure. In Bybee, Joan and Hopper, Paul, editors, Frequency and
the Emergence of Linguistic Structure, pages 1–24. John Benjamins Publishing
Company, Amsterdam.
Carter, M. G. 2004. Sibawayhi. I.B. Tauris, London; New York.
Cieri, Christopher, Miller, David, and Walker, Kevin. 2004. The Fisher Corpus: a
Resource for the Next Generations of Speech-to-Text. In Proceedings of the 4th
International Conference on Language Resources and Evaluation, pages 69–71.
Cieri, Christopher, Graff, David, Kimball, Owen, Miller, Dave, and Walker, Kevin,
2005. Fisher English Training Part 2, Transcripts. Linguistic Data Consortium,
Philadelphia.
Cohen Priva, Uriel. 2008. Using information content to predict phone deletion. In
Abner, Natasha and Bishop, Jason, editors, Proceedings of the 27th West Coast
Conference on Formal Linguistics, pages 90–98, Somerville, MA. Cascadilla Pro-
ceedings Project.
Cohen Priva, Uriel. 2010. Constructing typing-time corpora: A new way to answer
old questions. In Ohlsson, S. and Catrambone, R., editors, Proceedings of the 32nd
Annual Conference of the Cognitive Science Society, pages 43–48, Austin, TX.
Cognitive Science Society.
Cohen Priva, Uriel and Jurafsky, Dan, 2008. Phone Information Content Influences
Phone Duration. A poster presented at Prosody08, Cornell University. http:
//www.stanford.edu/~urielc/files/Prosody08Poster.pdf.
Dmitrieva, Olga and Anttila, Arto, 2008. The gradient phonotactics of English CVC
syllables. Poster presented at LabPhon11, Wellington, New Zealand, June 30.
Ferreira, Victor S. and Dell, Gary S. 2000. Effect of Ambiguity and Lexical Avail-
ability on Syntactic and Lexical Production. Cognitive Psychology, 40(4):296–340.
Flemming, Edward. 2004. Contrast and perceptual distinctiveness. In Hayes, B.,
Kirchner, R., and Steriade, D., editors, Phonetically-Based Phonology, pages 232–
276. Cambridge University Press. An online version is available at http://web.
mit.edu/flemming/www/paper/CandP13.pdf.
Fox Tree, J. E. and Clark, H. H. 1997. Pronouncing the as thee to signal problems
in speaking. Cognition, 62:151–167.
Garrett, Susan, Morton, Tom, and McLemore, Cynthia, 1996. CALLHOME Spanish
Lexicon. Linguistic Data Consortium, Philadelphia.
Gesenius, Heinrich Friedrich Wilhelm. 1910. Gesenius’ Hebrew Grammar. The
Clarendon Press.
Giavazzi, Maria. 2010. The Phonetics of Metrical Prominence and its Consequences
for Segmental Phonology. PhD thesis, Massachusetts Institute of Technology.
Godfrey, John J. and Holliman, Edward, 1997. Switchboard-1 Release 2. Linguistic
Data Consortium, Philadelphia.
Goldwater, Sharon and Johnson, Mark. 2003. Learning OT constraint rankings using
a maximum entropy model. In Proceedings of the Stockholm workshop on variation
within Optimality Theory, pages 111–120.
Gurevich, Naomi. 2004. Lenition and contrast: the Functional Consequences of
Certain Phonetically Conditioned Sound Changes. Routledge, New York.
Guy, Gregory. 1991. Explanation in variable phonology: an exponential model of
morphological constraints. Language Variation and Change, 3(1):1–22.
Hahn, Reinhard F. and Ibrahim, Ablahat. 1991. Spoken Uyghur. University of
Washington Press.
Harris, James W. 1969. Spanish Phonology. MIT Press, Cambridge, Mass.
Haspelmath, Martin. 2006. Against markedness (and what to replace it with). Journal
of Linguistics, 42(01):25–70. doi: 10.1017/S0022226705003683.
Haspelmath, Martin. 2008. Creating economical morphosyntactic patterns in lan-
guage change. In Good, Jeff, editor, Linguistic Universals and Language Change,
pages 185–214. Oxford University Press, Oxford.
Hastie, T. J. and Pregibon, D. 1992. Generalized linear models. Wadsworth and
Brooks / Cole.
Hayes, Bruce. 1995. Metrical Stress Theory: Principles and Case Studies. University
of Chicago Press, Chicago.
Hickey, Raymond. 2009. Weak segments in Irish English. In Minkova, Donka, editor,
Phonological weakness in English: from Old to present-day English, pages 116–129.
Palgrave Macmillan, Basingstoke, England; New York.
Hochberg, Judith G. 1986. Functional compensation for /s/ deletion in Puerto Rican
Spanish. Language, 62:609–621.
Hockett, Charles Francis. 1955. A manual of phonology (International Journal of
American Linguistics, 21: 4, Part 1, Memoir 11). Waverly Press, Baltimore.
Horn, Laurence R. 1984. Toward a new taxonomy for pragmatic inference: Q-
based and R-based implicature. In Schiffrin, Deborah, editor, Meaning, form, and
use in context: linguistic applications, pages 11–42. Georgetown University Press,
Washington, D.C.
Huang, Shudong, Bian, Xuejun, Wu, Grace, and McLemore, Cynthia, 1996. CALL-
HOME Mandarin Chinese Lexicon. Linguistic Data Consortium, Philadelphia.
Hume, Elizabeth. 2004. Deconstructing markedness: A predictability-based approach.
In Proceedings of the Berkeley Linguistic Society, volume 30, pages 182–198.
Hume, Elizabeth. 2008. Markedness and the language user. Phonological Studies, 11.
Ito, Junko and Mester, Armin. 2003. On the sources of opacity in OT: coda processes
in German. In Féry, Caroline and van de Vijver, Ruben, editors, The Syllable in
Optimality Theory, pages 271–303. Cambridge University Press.
Jaeger, T. Florian. 2010. Redundancy and reduction: Speakers manage syntac-
tic information density. Cognitive Psychology, 61(1):23–62. ISSN 0010-0285.
doi: 10.1016/j.cogpsych.2010.02.002. URL http://www.sciencedirect.com/
science/article/B6WCR-4YYVCTH-1/2/5ec8d0317cdf485174bb2a87031dd506.
Jurafsky, Daniel, Bell, Alan, Gregory, Michelle L., and Raymond, William D. 2001.
Probabilistic relations between words: Evidence from reduction in lexical produc-
tion. In Bybee, Joan L. and Hopper, Paul, editors, Frequency and the Emergence
of Linguistic Structure, pages 229–254. Benjamins, Amsterdam.
Kahn, Daniel. 1976. Syllable-based Generalizations in English Phonology. PhD thesis,
Massachusetts Institute of Technology.
Kaplan, Abby. 2010. Phonology Shaped by Phonetics: The Case of Intervocalic
Lenition. PhD thesis, University of California Santa Cruz. ROA 1077.
Kaye, Alan S. and Rosenhouse, Judith. 1997. Arabic dialects and Maltese. In Hetzron,
Robert, editor, The Semitic languages, pages 263–311. Routledge, New York.
Kilany, Hanaa, Gadalla, H., Arram, H., Yacoub, A., El-Habashi, A., Shalaby, A.,
Karins, K., Rowson, E., MacIntyre, R., Kingsbury, P., and McLemore, C., 1997.
LDC Egyptian Colloquial Arabic Lexicon. Linguistic Data Consortium, University
of Pennsylvania.
Kiparsky, Paul, 1993. An OT perspective on phonological variation. Handout from
Rutgers Optimality Workshop 1993, also presented at NWAVE 1994, Stanford
University. Available online at http://www.stanford.edu/~kiparsky/Papers/
nwave94.pdf.
Kiparsky, Paul, 1994. Remarks on markedness. Handout of talk presented at TREND-
2. Available online at http://www.stanford.edu/~kiparsky/Papers/trend.pdf.
Kiparsky, Paul. 1995. The phonological basis of sound change. In Goldsmith, John A.,
editor, The Handbook of Phonological Theory, pages 640–670. Blackwell Publishers,
Cambridge, MA.
Kirchner, Robert Martin. 1998. An Effort-Based Approach to Consonant Lenition.
PhD thesis, University of California Los Angeles. ROA 276 http://roa.rutgers.
edu/view.php3?roa=276.
Kisseberth, Charles W. 1970. On the functional unity of phonological rules. Linguistic
Inquiry, 1(3):291–306.
Kobayashi, Megumi, Crist, Sean, Kaneko, Masayo, and McLemore, Cynthia, 1996.
CALLHOME Japanese Lexicon. Linguistic Data Consortium, Philadelphia.
Labov, W. 1994. Principles of Linguistic Change: Internal Factors. Wiley-Blackwell.
de Lacy, Paul. 2002. The Formal Expression of Markedness. PhD thesis, University
of Massachusetts Amherst.
Lapoliwa, H. 1981. A Generative Approach to the Phonology of Bahasa Indonesia.
Department of Linguistics, Research School of Pacific Studies, Australian National
University.
Lemon, Jim and Fellows, Ian. 2007. concord: Concordance and reliability. R package
version 1.4-9.
Levenshtein, Vladimir Iosifovich. 1966. Binary codes capable of correcting deletions,
insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.
Levy, Roger and Jaeger, T. Florian. 2007. Speakers optimize information density
through syntactic reduction. In Scholkopf, Bernhard, Platt, John, and Hofmann,
Thomas, editors, Advances in Neural Information Processing Systems (NIPS), vol-
ume 19, pages 849–856, Cambridge, MA. MIT Press.
Lombardi, Linda, 1995. Why place and voice are different. Rutgers Optimality
Archive (ROA) 105.
Maddieson, Ian. 1984. Patterns of sounds. Cambridge studies in speech science and
communication. Cambridge University Press, Cambridge.
Mathisen, Anne Grethe. 1999. Sandwell, West Midlands: Ambiguous perspectives
on gender patterns and models of language change. In Foulkes, Paul and Docherty,
Gerard J., editors, Urban Voices: Accent Studies in the British Isles, pages 107–123.
Arnold Publishers.
McCarthy, John J. 1994. The phonology and phonetics of Semitic pharyngeals.
Phonological structure and phonetic form: papers from Laboratory Phonology III,
pages 191–234.
McCarthy, John J. and Prince, Alan. 1995. Faithfulness and reduplicative identity.
University of Massachusetts Occasional Papers in Linguistics, 18:249–384.
Miller, George A. 1995. WordNet: a lexical database for English. Commun. ACM,
38(11):39–41. ISSN 0001-0782. doi: 10.1145/219717.219748.
O’Brien, Jeremy. 2012. An Experimental Approach to Debuccalization and
Supplementary Gestures. PhD thesis, University of California Santa Cruz.
Ohala, John J. 1981. The origin of sound patterns in vocal tract constraints. In
Macneilage, P., editor, The Production of Speech. Springer Verlag, New York.
Ohala, John J. 2003. Phonetics and historical phonology. In Joseph, Brian D. and
Janda, Richard D., editors, The Handbook of Historical Linguistics, pages 669–686.
Blackwell.
Piantadosi, Steven T., Tily, Harry J., and Gibson, Edward. 2009. The communicative
lexicon hypothesis. In The 31st annual meeting of the Cognitive Science Society
(CogSci09), pages 2582–2587. URL http://web.mit.edu/piantado/www/papers/
piantadosiTilyGibson2009.pdf.
Piantadosi, Steven T., Tily, Harry J., and Gibson, Edward. 2011. Word lengths are
optimized for efficient communication. Proceedings of the National Academy of
Sciences.
Pierrehumbert, Janet. 2001. Exemplar dynamics: Word frequency, lenition and
contrast. In Bybee, Joan and Hopper, Paul, editors, Frequency and the Emergence
of Linguistic Structure, pages 137–157. John Benjamins Publishing Company.
Pitt, M.A., Dilley, L., Johnson, K., Kiesling, S., Raymond, W., Hume, E., and Fosler-
Lussier, E., 2007. Buckeye Corpus of Conversational Speech (2nd release). Depart-
ment of Psychology, Ohio State University.
Pluymaekers, Mark, Ernestus, Mirjam, and Baayen, R. Harald. 2005. Articulatory
planning is continuous and sensitive to informational redundancy. Phonetica, 62:
146–159.
Poplack, Shana, 1980. The Notion of the Plural in Puerto Rican Spanish: Competing
Constraints on (s) Deletion.
Pouplier, Marianne. 2003. The dynamics of error. In Proceedings of the 15th Inter-
national Congress of Phonetic Sciences.
Prince, Alan S. and Smolensky, Paul. 1993. Optimality Theory: Constraint Interac-
tion in Generative Grammar. Blackwell, Malden, MA.
R Development Core Team. 2011. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http:
//www.r-project.org.
R Development Core Team. 2012. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http:
//www.r-project.org.
Hickey, Raymond. 2004. Irish English: Phonology, volume 1 of Varieties of English,
pages 68–97. Mouton de Gruyter, Berlin; New York.
Raymond, William D., Dautricourt, Robin, and Hume, Elizabeth. 2006. Word-medial
/t,d/ deletion in spontaneous speech: Modeling the effects of extra-linguistic, lexi-
cal, and phonological factors. Language Variation and Change, 18.
Shannon, Claude Elwood. 1948. A mathematical theory of communication. The Bell
System Technical Journal, 27:379–423.
Shannon, Claude Elwood. 1951. Prediction and entropy of printed English. Bell
System Technical Journal, 30:50–64.
Sherman, D. 1975. Stop and fricative systems: a discussion of paradigmatic gaps and
the question of language sampling. Working Papers on Language Universals, 17:
1–33.
Smith, Jennifer. 2002. Phonological Augmentation in Prominent Positions. PhD
thesis, UMass Amherst.
Smolensky, Paul. 1993. Harmony, markedness, and phonological activity. Paper
presented at Rutgers Optimality Workshop 1.
Smolensky, Paul, 1995. On the internal structure of constraint component con of
UG. Talk given at UCLA, April 7. ROA 86.
Steriade, Donca. 1997. Phonetics in phonology: the case of laryngeal neutralization.
Ms., UCLA.
Steriade, Donca. 2008. The phonology of perceptibility effects: the P-map and its
consequences for constraint organization. In Hanson, Kristin and Inkelas, Sharon,
editors, The Nature of the Word: Studies in Honor of Paul Kiparsky, pages 151–
180. MIT, Cambridge, Mass.; London.
Surendran, Dinoj and Niyogi, Partha. 2006. Quantifying the functional load of
phonemic oppositions, distinctive features, and suprasegmentals. In Nedergaard
Thomsen, Ole, editor, Competing Models of Linguistic Change: Evolution and Be-
yond. In commemoration of Eugenio Coseriu (1921-2002). Benjamins, Amsterdam
and Philadelphia.
Clark, Urszula. 2004. The English West Midlands: Phonology, volume 1 of Varieties
of English, pages 134–162. Mouton de Gruyter, Berlin; New York.
Son, R. J. J. H. van and Pols, L. C. W. 2003. How efficient is speech? Proceedings
of the Institute of Phonetic Sciences, 25:171–184.
Son, R.J.J.H. van and Santen, J.P.H van. 2005. Duration and spectral balance of
intervocalic consonants: a case for efficient communication. Speech Communication,
47:100–123.
Venables, W. N. and Ripley, B. D. 2002. Modern Applied Statistics with S. Springer,
4th edition.
Weber, David. 1989. A grammar of Huallaga (Huanuco) Quechua, volume 112 of
University of California publications in linguistics. University of California Press,
Berkeley.
Weide, R., 1998. The CMU Pronunciation Dictionary, release 0.6. Carnegie Mellon
University.
Weinreich, Uriel, Labov, William, and Herzog, Marvin I. 1968. Empirical foundations
for a theory of language change. In Lehmann, Winfred P. and Malkiel, Yakov, ed-
itors, Directions for Historical Linguistics, pages 95–18. University of Texas Press,
Austin.
Whitney, William Dwight. 1879. A Sanskrit grammar; including both the classical lan-
guage, and the older dialects, of Veda and Brahmana. Breitkopf and Hartel, Leipzig.
URL http://www.archive.org/details/sanskritgrammari00whituoft.
Zhao, Yuan and Jurafsky, Dan. 2009. The effect of lexical frequency and Lombard
reflex on tone hyperarticulation. Journal of Phonetics, 37(2):231–247. ISSN 0095-
4470. doi: 10.1016/j.wocn.2009.03.002. URL http://www.sciencedirect.com/
science/article/pii/S0095447009000175.
Zipf, George K. 1935. The Psycho-biology of Language: an Introduction to Dynamic
Philology. Houghton, Mifflin.
Zipf, George Kingsley. 1929. Relative frequency as a determinant of phonetic change.
Harvard Studies in Classical Philology, 15:1–95.
Zipf, George Kingsley. 1949. Human Behavior and the Principle of Least Effort: an
Introduction to Human Ecology. Hafner Publisher Company, New York.
Zue, Victor W. and Laferriere, Martha. 1979. Acoustic study of medial /t,d/ in
American English. The Journal of the Acoustical Society of America, 66(4):1039–
1050. doi: 10.1121/1.383323. URL http://link.aip.org/link/?JAS/66/1039/1.