SIGN AND SIGNAL

DERIVING LINGUISTIC GENERALIZATIONS FROM

INFORMATION UTILITY

A DISSERTATION

SUBMITTED TO THE DEPARTMENT OF LINGUISTICS

AND THE COMMITTEE ON GRADUATE STUDIES

OF STANFORD UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

Uriel Cohen Priva

August 2012

http://creativecommons.org/licenses/by-nc-nd/3.0/us/

This dissertation is online at: http://purl.stanford.edu/wg646gh4444

© 2012 by Uriel Cohen Priva. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Daniel Jurafsky, Primary Adviser


Arto Anttila


Paul Kiparsky


Christopher Manning


Meghan Sumner

Approved for the Stanford University Committee on Graduate Studies.

Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Why do languages have such different phonological processes even though all speakers

share the same cognitive, articulatory and perceptual constraints? American English

preserves sounds such as /p/ and /g/ even though they are absent from the sound

systems of many of the world’s languages, but reduces sounds such as /t/ even though

it is one of the most frequently used sounds cross-linguistically. In contrast, Romance

languages reduce /s/, which American English preserves. What makes American

English have this particular set of phonological processes and not processes that

affect other languages?

I show that by assuming that speakers attempt to maximize the amount of infor-

mation they transmit while minimizing the amount of effort required to transmit that

information, it is possible to determine which sounds are more likely to be affected

by reduction processes and which sounds are more likely to be preserved in each lan-

guage. Unlike cognitive, perceptual and articulatory constraints, which are the same

for speakers of all languages, the amount of information languages assign to linguistic

elements, such as individual sounds, varies markedly. The more information a sound

carries, the more effort speakers are willing to expend to transmit it faithfully to

listeners. The trade-off between maximizing information and minimizing effort forms

the basis for a new framework I call MULE (Most information Utility, Least Effort).

MULE predicts preservation and reduction patterns in English and Arabic at the

levels of performance, competence and change, thereby providing a partial answer to

the actuation problem (Weinreich et al. 1968). MULE also predicts cross-linguistic

generalizations. I show that in American English, Egyptian Arabic and Spanish,

highly informative sounds are more likely to benefit from the perceptual prominence

of the onsets of stressed syllables. Similarly, the balance between effort and informa-

tion successfully predicts cross-linguistic asymmetries between the frequencies of less

effortful sounds and more effortful sounds. As such, MULE enhances the explanatory

power of linguistic theory, and provides a disciplined way to integrate phonetics and

information theoretic considerations.

To my parents

Acknowledgments

There are no single authors in academic research. New research builds on previous

work and on the authors’ interaction with others: those who taught them how to do

research, and if they are lucky, those who taught them how to become better people.

I was fortunate to be a graduate student at Stanford Linguistics for the past few

years, the most collaborative, inspiring, and nurturing academic environment I know.

I owe many thanks to my professors, colleagues, administrative staff and friends at

Stanford for helping me overcome the perils of being an expatriate in an American

graduate school.

I am grateful to my committee for their intellectual support and patience. Each

of them has had an important role in my development as a graduate student. One

of the greatest benefits of going to Stanford was a chance to work with my adviser,

Dan Jurafsky. Dan’s curiosity, intellect and inherent dislike for pretheoretical consid-

erations are responsible for the most rapid exchanges of ideas that I have ever had.

Dan’s trust in me and his willingness to support me and to keep me on the right track

never ceased to astonish me. My graduate school experience and my work would have

suffered greatly if it were not for Dan’s intellectual influence and personal example.

Paul Kiparsky, a true Renaissance man, introduced me to many of the theoretical

concepts and challenges that have ultimately shaped my work. Our long conversa-

tions about linguistics, politics and life’s mysteries never took a predictable course,

and every turn revealed new insights, problems and solutions. Arto Anttila’s precise

approach to research encouraged me to better ground my arguments. Arto read the

greatest number of versions of any of my manuscripts, and I can see the effect of

his detailed feedback in almost every page. Chris Manning’s ability to understand

immediately any topic I wished to discuss and to provide me with useful input and

criticism is admirable. Even more admirable is his ability to teach me not only how

to argue for my ideas, but more importantly how not to argue for them. Finally,

Meghan Sumner showed me the beauty of phonetics and is responsible for my desire

to ground phonological arguments in phonetics. Meghan sets herself as an example

at the academic and personal level. In doing so she inspires others to follow. Thank

you!

One of the things I am most grateful for in my experience as a graduate student is

the sense of having an adoptive family, of a home away from home. While many took

part in making the department warm and welcoming, a few deserve special thanks.

Beth Levin’s door was always open for students to wander in, and I often sought her

counsel. Beth would listen, advise and help in whichever way she could. The fact

that she did all that and at the same time made sure I met every necessary deadline

is remarkable. Penny Eckert listened, encouraged, and offered advice. I made many

more random trips to the departmental kitchen in hope of catching a word with her.

Ivan Sag made sure I remained inspired and provided humor-infused theoretical and

personal insights. Vera Gribanova made post-graduation work seem reachable, and

provided more support than I dared to wish for. I remember and appreciate your

support.

I thank my many friends at Stanford for helping me survive graduate school with

a smile on my face. I had a wonderful and supportive cohort. Matthew Adams has

been my fellow phonologist, counselor and friend from my first weeks at Stanford,

and shared with me the ups and downs of graduate school. Roey Gafter miraculously

managed to know how I feel and what I was about to do well before I did. I consider

myself fortunate to have you two as my friends. Inbal Arnon, Elisabeth Norcliffe

and Hal Tily successfully lured me into San Francisco to prove that there’s life that

does not involve Stanford even if it does involve linguistics (and the sounds of a

band playing at Revolution). I would not have known Laura Smith and Fabian

Goppelsroeder had I not gone to Stanford, and that would have been a terrible thing

indeed. To my Israeli friends, thank you for making sure you were just a phone call

away.

Ariela Raviv has unwittingly undertaken the task of standing for sanity and com-

mon sense in my life through graduate school, a task she managed with ease, equipped

with wickedly sharp humor, a down-to-earth approach and very little patience for

delusions of any sort. She proposed once that by doing so she put me through med

school even though I am not a real doctor. She has a point there, and there is a few

years’ worth of chat history to prove that. For all of her support and for her stout

belief in me, I am grateful.

Finally, I dedicate this thesis to my parents, who wanted me to go abroad and

pursue what was good for me, even when it was clearly hard for them that I was away.

I am grateful for their encouragement to keep challenging myself, and for always being

happier with my accomplishments than anyone else.

Contents

Abstract
Acknowledgments

1 Introduction
1.1 Explaining the phonology of a language
1.2 Language specific patterns
1.3 Predicting language-specific patterns
1.4 Balancing effort and information: MULE
1.5 Sign and signal
1.6 Cross-linguistic generalizations
1.7 The explanatory power of MULE

2 Information content affects performance
2.1 Introduction
2.2 Stop deletion and duration paradox
2.3 Previous accounts
2.3.1 Phonetic accounts
2.3.2 Markedness accounts
2.3.3 Frequency accounts
2.3.4 Local predictability accounts
2.4 Informativity
2.5 Segment duration and deletion studies
2.5.1 Studies overview
2.5.2 Intervocalic consonant duration
2.5.3 Intervocalic segment deletion
2.5.4 Postvocalic segment duration
2.5.5 Postvocalic segment deletion
2.6 Conclusion
2.7 Segment Alignment Process
2.8 Calculating information theoretic measurements
2.9 Deletion and duration models
2.9.1 Intervocalic segment duration model
2.9.2 Intervocalic segment deletion model
2.9.3 Postvocalic segment duration model
2.9.4 Postvocalic segment deletion model

3 Faithfulness as Information Utility
3.1 Introduction
3.2 Parallel weakening processes
3.2.1 Weakening patterns are not arbitrary
3.2.2 Same language, same segments, multiple processes
3.2.3 Same language, different dialects, similar processes
3.2.4 The challenge of explaining parallel weakening
3.3 The proposed account
3.3.1 Outline – replacing markedness hierarchies
3.3.2 Using information utility and effort to predict parallel weakening
3.4 Implementation
3.4.1 Implementation overview
3.4.2 Measuring information utility
3.4.3 Integrating effort into MULE
3.5 MULE in OT
3.5.1 Effort and information utility in OT
3.5.2 Binary comparisons in OT
3.5.3 Real-valued comparisons in OT
3.6 The necessity of multiple scales and language-specificity
3.6.1 The difference between MULE and current theories
3.6.2 Parallel weakening in standard OT
3.6.3 No single scale can replace markedness
3.6.4 Universal scales do not suffice
3.7 Information-theoretic accounts
3.7.1 Information-theoretic explanations
3.7.2 Functional load as entropy
3.7.3 Frequency
3.7.4 Predictability
3.7.5 Informativity accounts
3.7.6 Why information-theoretic accounts do not suffice
3.8 Variable deletion rates of stems and affixes
3.8.1 The contrast between American English and Puerto Rican Spanish
3.8.2 The information of English verbal -ed morpheme
3.8.3 The information of Spanish plural -s morpheme
3.8.4 MULE’s predictions are measurable
3.9 Conclusion

4 Lexicon, usage and information
4.1 Introduction
4.2 Methodology and sources of data
4.2.1 The choice of test cases
4.2.2 Statistical models
4.3 Studies
4.3.1 American English
4.3.2 Spanish
4.3.3 Egyptian Arabic
4.4 General discussion
4.5 Conclusion

5 Predicting segment distribution universals
5.1 Introduction
5.2 Consistent asymmetries between complex and simple segments
5.3 Solution: maximizing information per effort
5.4 Survey 1: effort or complexity?
5.5 Survey 2: voiceless stops
5.6 Markedness as effort
5.6.1 Missing pieces in the puzzle
5.6.2 Sanskrit stop frequencies
5.6.3 Indonesian stop frequencies
5.7 Conclusion

6 Conclusions

List of Tables

2.1 Buckeye word-medial stop relative durations
2.2 Buckeye word-medial stop deletion probabilities
2.3 Buckeye word-medial stop duration, deletion and probability
2.4 Buckeye word-medial stop duration, deletion and informativity
2.5 Segment properties
2.6 Environment properties
2.7 Variables of interest
2.8 Dictionary and surface alignment penalties
2.9 Buckeye to CMU valid substitution
3.1 Egyptian Arabic Informativity-based information estimates
3.2 English Informativity-based information estimate
3.3 Spanish Informativity-based information estimate
3.4 Post-vocalic pre-consonantal word-final obstruent deletion controls
3.5 Post-vocalic pre-consonantal word-final obstruent deletion fixed effects
3.6 Functional load of English with different final consonant deletion, scaled by a factor of 100
4.1 Segment phonological properties∗
4.2 Sample American English stress data
4.3 American English pure phonology model
4.4 American English information model
4.5 Sample Spanish stress data
4.6 Spanish pure phonology model
4.7 Spanish information model
4.8 Sample Egyptian Arabic stress data
4.9 Egyptian Arabic pure phonology model
4.10 Egyptian Arabic information model
5.1 Voiceless and voiced segment probabilities

List of Figures

5.1 Distance of total orders from /t/>/k/>/p/

Chapter 1

Introduction

1.1 Explaining the phonology of a language

One of the goals of linguistic theory is to explain which languages and linguistic

processes are possible, and which languages and linguistic processes cannot exist. The

same goal applies in the context of phonology. Phonology should be able to explain

the existence of observed phonological processes and be able to exclude phonological

processes that do not exist. Consider the case of word-final deletion. Puerto Rican

Spanish variably deletes word-final /s/ (Poplack, 1980). American English does not

delete word-final /s/ but variably deletes word-final /t/ (Guy, 1991, among others). In

contrast, cases of word-final consonant epenthesis are exceptionally rare. Phonology

as a theory should define the set of possible languages such that it includes languages

that delete word-final /s/, languages that do not delete word-final /s/ (but possibly

delete word-final /t/), and excludes languages that epenthesize consonants word-

finally.

But describing the set of possible languages does not suffice. Phonology should

have another goal as well: to be able to explain the correspondence between the

phonology of a language and other properties of the language such as its lexicon

and its usage patterns. The first goal requires phonological theory to describe the

phonology of a language that does not delete word-final /s/ but does delete word-

final /t/. The second goal requires phonological theory to explain why American

English has that particular phonology, rather than a phonology in which word-final

/t/ is preserved and word-final /s/ is deleted. I label the second goal of phonological

theory the correspondence problem. The correspondence problem takes many forms

and applies at all levels of phonological representation. At the level of linguistic

performance, answering the correspondence problem can take the form of explaining

the different deletion rates of segments in a language. At the level of competence and

change, the goal of the correspondence problem overlaps in part with the actuation

problem in Weinreich et al. (1968) – what causes some language at some point in time

to undergo a particular change process. Timing is not the focus of the correspondence

problem, but rather the set of processes that are likely to affect some languages but

not other languages.

1.2 Language specific patterns

Chapters 2 and 3 deal with several cases that demonstrate that the phonology of a

language is not random, but corresponds to other properties of the language. Chapter

2 explains what leads the consonants of American English to have the duration and

deletion rates that they have. I show that the durations and deletion rates of American

English consonants do not follow from the phonetic properties of its segments. /g/

has the longest duration of all voiced stops, even though Ohala (1981) argues that

maintaining voicing is more difficult for dorsals. Similarly, Ohala claims that /p/ is

the least audible voiceless stop. Less audible stops are more likely not to be perceived

by listeners, which may lead to increased deletion rates, yet in American English /p/ is

the least likely to delete of all voiceless stops. The phonetic markedness of /g/ and /p/

is supported by cross-linguistic evidence. /g/ is absent from more sound inventories

than any other voiced stop, and /p/ is absent from more sound inventories than any

other voiceless stop (Sherman, 1975). Therefore, the long duration of /g/ and the

low likelihood of /p/ deletion do not follow from universal tendencies, but rather from

the specific properties of American English. What causes American English to have

such unusual patterns?

Chapter 3 explains why languages can be affected by multiple parallel weakening

processes that target specific segments but not other segments. Many varieties of

English have some form of /t/-weakening in intervocalic and word-final positions. In


Irish English alone, different varieties spirantize, tap or debuccalize intervocalic /t/

(1.1, data from Raymond 2004).

(1.1)  Variety             ‘butter’
       Northern varieties  [bʌṱəɹ]
       Southern varieties  [bʌɾəɻ]
       Vernacular Dublin   [bʊʔɐ]

Tapping varieties are incompatible with debuccalizing varieties. Tapping changes the

manner of articulation of /t/ but preserves its place of articulation. Debuccalization

loses the place of articulation of /t/ but preserves its manner of articulation. The

common ancestor of tapping and debuccalizing varieties is therefore a variety in which

/t/ is a coronal stop, albeit a weakening-prone coronal stop. What makes /t/ prone

to undergo weakening in English and not in other languages?

It is not possible to argue that English /t/ is prone to weaken because it is a /t/.

Cross-linguistically, /t/-weakening processes are not very common as the language

surveys in Kirchner (1998) and Gurevich (2004) show. Moreover, other languages

have a similar plethora of processes that target a specific segment, except that segment

is not /t/. Arabic, for instance, has multiple /q/-weakening processes (1.2, data from

Kaye and Rosenhouse 1997), but does not similarly weaken /t/.

(1.2)  Dialect     baqara ‘cow’ (MSA)
       Druze       [baqara]
       Nazareth    [bakara]
       Jerusalem   [baʔara]
       NW Jordan   [bagara]

What causes English to have many /t/-weakening processes, and Arabic to have many

/q/-weakening processes? Even if phonology can easily describe all the processes

listed here, it should also be able to explain why English is the target of several

/t/-weakening processes, and Arabic of several /q/-weakening processes.

1.3 Predicting language-specific patterns

What are the factors that cause some language to have a particular set of phonological

processes? Linguistic theory focuses on the tension between universal constraints


and their language-specific interactions. Universal constraints apply to all languages

equally, but in each language they interact in a different way. Different interactions

lead to differences between languages. Optimality Theory (Prince and Smolensky,

1993) is a typical example of that approach; constraints are taken to be universal, and

their relative ranking is language-specific. There is a strong motivation to consider

all constraints universal. The articulatory, perceptual and psychological abilities and

limitations involved in communication are common to the speakers of every human

language. Solutions to the correspondence problem therefore cannot follow from the

introduction of language-specific constraints. Instead, such solutions should constrain

the possible rankings of universal constraints, such that the ranking that yields a

specific outcome is more likely to be found in some languages but not in others. The

question is therefore how to motivate different rankings in different languages.
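
The division of labor between universal constraints and language-specific rankings can be illustrated with a minimal sketch. It assumes a toy evaluation in which each candidate has a count of violations per constraint, and the winner is the candidate whose violations are smallest when compared in ranking order. The label `Max` follows common usage for a constraint against deletion; `NoFinalC`, the candidate descriptions and the function name `optimal` are hypothetical and serve only this illustration.

```python
# Violation counts per candidate for two constraints.
# "NoFinalC" is a hypothetical markedness label; "Max" penalizes deletion.
CANDIDATES = {
    "retain final consonant": {"NoFinalC": 1, "Max": 0},
    "delete final consonant": {"NoFinalC": 0, "Max": 1},
}


def optimal(ranking, candidates=CANDIDATES):
    """Return the candidate that wins under a ranking (highest-ranked first).

    Violation profiles are compared lexicographically: a violation of a
    higher-ranked constraint outweighs any number of lower-ranked ones.
    """
    return min(
        candidates,
        key=lambda cand: tuple(candidates[cand][con] for con in ranking),
    )


# Same constraints, two rankings, two different "languages".
preserving = optimal(["Max", "NoFinalC"])
deleting = optimal(["NoFinalC", "Max"])
```

The constraint set is identical in both calls; only the ranking differs, and with it the winning candidate. The correspondence problem asks what makes one of these rankings more likely in a given language.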

Phonological markedness is often motivated by the phonetic properties of human

language such as the articulation and perception of sounds (Kirchner, 1998; Flem-

ming, 2004; Steriade, 2008, among many others). Some segments are considered more

effortful to produce than others, and some distinctions are more difficult to perceive

in certain phonetic environments. Phonetic constraints cannot be language-specific,

as they apply to every language in the exact same way. If it is more difficult to

tell whether pre-consonantal and word-final obstruents are voiced or not, it would be

equally difficult to do so in every language with a similar inventory of sounds. Thus,

phonetics-based reasoning plays an important role in defining the set of constraints

that universally forbid the existence of certain languages and processes.

Phonological theory posits another type of markedness, namely marked faithful-

ness (Kiparsky, 1994; de Lacy, 2002, among others). Marked faithfulness signifies the

greater pressure to preserve marked elements. Marked faithfulness can be used to

explain the case of word-final /t/-deletion in American English, from which the more

marked /k/ and /p/ are exempt. In such an analysis, /k/ and /p/ can resist the

weakening process that affects /t/ because the language ranks the pressure to preserve

them higher than it ranks the pressure to preserve /t/. Marked faithfulness is difficult

to motivate on phonetic grounds. If some segment is more difficult to articulate or

perceive, what phonetic principle would encourage speakers to preserve it while not

preserving less effortful segments?


1.4 Balancing effort and information: MULE

I propose that speakers attempt to maximize the amount of information they transmit

(Most information Utility) and minimize the amount of articulatory and perceptual

effort communication requires (Least Effort), or MULE. If the goal of communication

is to transmit information between speakers and listeners, speakers should be willing

to put in more effort in order to transmit more information. In MULE, marked

faithfulness is motivated by the preservation of information in human language, and

markedness is motivated by the reduction of perceptual and articulatory effort.

One theoretical benefit of using information as the basis for marked faithfulness is

that unlike the articulatory and perceptual properties of speech, different languages

assign varying amounts of information to different linguistic elements. Therefore, the

pressure to preserve linguistic elements is not the same cross-linguistically.

1.5 Sign and signal

What role does information play in human language? The reduction of frequent

words was observed as early as the 8th century by the Arabic grammarian Sibawayhi

(Al-Nassir, 1993; Carter, 2004). Zipf (1929) expected frequently used linguistic el-

ements to undergo reduction, which increases the ease of articulation. Zipf (1935)

described over-frequent sounds in a hypothetical language in terms of lack of informa-

tion: “completely unessential to a perfectly adequate conveyance of any word.” Zipf’s

latter description is very close to the terms that an account based on information the-

ory (Shannon, 1948) would use. In information theory, the amount of information

that an event carries is the negative log probability of observing the event. Everything

else being equal, frequent linguistic elements carry less information than infrequent

ones, and can therefore be reduced.
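
The relation between frequency and information can be sketched in a few lines. The counts below are invented for illustration and come from no corpus, the function name `surprisal` is an assumption of this sketch, and base 2 is chosen so that the result is expressed in bits.

```python
import math


def surprisal(count, total):
    """Information in bits: the negative log2 probability of an event."""
    return -math.log2(count / total)


# Hypothetical counts out of 1000 word tokens, for illustration only.
counts = {"the": 500, "mansion": 4, "abode": 1}
total = 1000

bits = {word: surprisal(c, total) for word, c in counts.items()}
```

Under these assumed counts the most frequent word carries exactly one bit, and the rarer a word is, the more bits it carries.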

Information theory allows Zipf’s predictions to be extended from frequency to

predictability. Some linguistic elements can be infrequent, yet predictable in context.

The English word abode is less frequent than the word mansion, yet in the context

of my humble —, abode is more predictable than mansion, and therefore carries less

information. If speakers were careless in their pronunciation of abode in this context,

listeners would still recover the speakers’ intention with relative ease. Recent research has linked


predictability in context to phonetic and syntactic reduction (Jurafsky et al., 2001;

van Son and Pols, 2003; Aylett and Turk, 2004; Pluymaekers et al., 2005; Levy and

Jaeger, 2007; Jaeger, 2010, among many others). These studies show that linguistic

elements that are predictable in context are more likely to be omitted or otherwise

reduced.
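
The difference between frequency and predictability in context can be made concrete with a small sketch. The toy corpus is invented, context is reduced to the single preceding word, and the function names are hypothetical. The studies cited above use much richer estimates.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "my humble abode",
    "my humble abode",
    "my humble mansion",
    "a grand mansion",
    "a grand mansion",
    "the old mansion",
    "the old mansion",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

total = sum(unigrams.values())


def unigram_bits(word):
    """-log2 P(word): information from frequency alone."""
    return -math.log2(unigrams[word] / total)


def context_bits(prev, word):
    """-log2 P(word | previous word), estimated from bigram counts.

    Dividing by the unigram count of `prev` is valid here because
    `prev` never occurs sentence-finally in the toy corpus.
    """
    return -math.log2(bigrams[(prev, word)] / unigrams[prev])
```

With these invented counts, abode is the less frequent word overall, so `unigram_bits("abode")` exceeds `unigram_bits("mansion")`. After humble the order reverses: `context_bits("humble", "abode")` is lower than `context_bits("humble", "mansion")`.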

There are many possible reasons that could lead linguistic elements with high

frequency or high predictability to be reduced or omitted. Some of the reasons are

speaker-internal and may or may not affect communication. Other reasons follow from

communication principles. It is easier to recover frequent and predictable words,

which may lead speakers to be lax in their pronunciation (Aylett and Turk, 2004,

among others). In this view frequent and highly predictable linguistic elements are

redundant, and their reduction makes language more efficient. A related view assumes

that speakers attempt to transmit as much information as language allows without

causing miscommunication (Levy and Jaeger, 2007; Jaeger, 2010). In this view trans-

mitting too much information or too little information is to be avoided, leading to the

omission and reduction of frequent and predictable linguistic elements, but possibly

also to the temporal elongation of unpredictable and infrequent linguistic elements.

MULE builds on research that regards the amount of information linguistic ele-

ments hold as an important factor in human communication, but differs in two im-

portant ways. First, there is ample research in phonology and phonetics that shows

that phonological markedness is related to articulatory and perceptual effort. For in-

stance, vowel inventories show a balance between effort and low confusability (Flem-

ming, 2004). Languages do not have vowel inventories in which higher effort does not

decrease confusability. Therefore, it is not necessary to assume that low information

leads to reduction. It suffices to assume that low-information (frequent, predictable)

linguistic elements do not resist effort reduction as well as high-information linguistic

elements do.

The other difference between MULE and the research it builds on is the impor-

tance it gives to the difference between language and abstract communication systems.

There is a tension between the optimization of a signal in a communication channel,

and the role of individual signs in the signal. Languages seem to differ from abstract


communication systems in what they regard as the information-carrying unit. In ab-

stract communication systems, the identity of each sign in a communication channel

is irrelevant. A /t/ in some word is completely unrelated to a /t/ in another word.

The fact that both /t/ sounds are phonetically similar to one another is coincidental.

But language does seem to care about the identity of linguistic elements. Cohen Priva

(2008) showed that speakers of American English omit consonants that are usually

predictable even in contexts in which they are not predictable, and preserve conso-

nants that are usually unpredictable even in contexts in which they are predictable.

Languages seem to optimize the communication signal while taking individual lin-

guistic elements (or signs) as the relevant level for optimization.

These two differences are the focus of this dissertation. In chapter 2 I show that

the durations and deletion rates of American English consonants correspond to the

amount of information consonants carry. The duration of /g/ is longer than the dura-

tion of other voiced stops because it carries an unusually high amount of information,

even though it takes greater articulatory effort to maintain the voicing of a dorsal

stop. The deletion rates of /p/ are lower than the deletion rates of other voiceless

stops because it is highly informative as well. The amount of information each con-

sonant carries is an aggregate of every instance of that segment in the language, and

applies even when a usually predictable segment is unpredictable, or when a usually

unpredictable segment happens to be predictable. The correct predictions for the

relative durations and deletion rates of American English require speakers to carry

over the amount of information each consonant usually holds in the language to every

individual instance of that consonant.

Similar principles solve the puzzle of chapter 3 by predicting the accumulation

of weakening processes that target specific segments in a given language. I show

that in English, /t/ is usually very predictable, which leads it to carry an unusually

low amount of information compared to other languages. In Arabic, /q/ carries less

information than less marked segments, and weakens because its phonetically marked

articulation cannot be justified by its relatively low information. It is not possible to

claim that /t/ is more effortful in English than it is in other languages. It is similarly

impossible to argue that Arabic reduces /q/ because it is too frequent, when more


frequent stops are not reduced. In both languages the key to predicting language-

specific weakening lies in the balance between information and effort, and in the

ability to attribute the average amount of information a linguistic element carries in

a language to each individual instance of that element.

1.6 Cross-linguistic generalizations

The balance between the preservation of information and effort-reduction applies dif-

ferently in every language because the amount of information a linguistic element

carries varies among languages. However, since the same principles apply cross-

linguistically, it is possible to move from language-specific patterns to predicting

cross-linguistic generalizations. This is the focus of chapters 4–5.

Chapter 4 shows that the assumption that language attempts to preserve infor-

mation predicts the distribution of segments in perceptually salient positions. I show

that in English, Spanish and Egyptian Arabic, segments that carry a lot of infor-

mation are more likely to appear in the onsets of stressed syllables. In the previous

two chapters, more informative linguistic elements justified the expenditure of additional

articulatory effort. In the case of stressed syllables, perceptual prominence

is distributed unevenly between more informative and less informative

linguistic elements. The more informative a linguistic element is, the more likely it

is to benefit from perceptual prominence. Thus language is shown to distribute its

resources such that the preservation of information is guaranteed. Moreover, chapter

4 shows that the considerations that apply to linguistic performance (chapter 2) as

well as linguistic competence and change (chapter 3) apply at the level of the lexicon

and the usage patterns of the language.

In chapter 5 I demonstrate how the balance between information and effort pre-

dicts that the frequency of some segments such as /t/ will usually be higher than

the frequency of other segments such as /p/. A pure information-theoretic account

predicts that all segments would be equally frequent, while a pure effort-avoidance

account predicts that all effortful segments would be avoided. Neither prediction

holds in any language. I show that the frequency of segments in each language

matches their phonetic properties (Ohala, 1981) and their absence from the sound


systems of the world’s languages (Sherman, 1975). Thus, most voiceless stops are

usually more frequent than their voiced counterparts, but /p/ is not necessarily more

frequent than /b/, just as their phonetic properties and their absence from sound

systems cross-linguistically would predict. I discuss a new prediction that /t/ will be

more frequent than /k/ and /k/ will be more frequent than /p/. The cases in which

this prediction does not hold illuminate the need to factor both articulatory effort and

perceptual confusability into phonological markedness. Together, these predictions

show how even though the amount of information linguistic elements hold is different

in every language, all languages attempt to optimize their sound systems and lexicons

to achieve a balance between information and effort.

1.7 The explanatory power of MULE

MULE is the idea that transmitting information is a goal in human language, and

that languages are biased to allow the expenditure of linguistic resources in order to

transmit information. When more information is being transmitted, the expenditure

of additional resources is justified. The following chapters show how MULE solves

four separate cross-linguistic puzzles.

Chapter 2

Information content affects

performance

2.1 Introduction

Speakers vary the duration of segments and occasionally delete (omit) segments. Such

phonetic properties are not taken to be governed by speakers’ linguistic competence.

Some speakers’ /t/ may be longer than their /k/, or they may delete /p/ more

frequently than they delete /k/, but such tendencies will not lead to a conclusion

that they have different competence grammars.1 However, in this chapter I show that

the duration and deletion patterns of segments in American English are systematic.

Speakers of American English do typically have longer and shorter segments, and

some segments are omitted more frequently than other segments. I demonstrate that

the key factor that leads to the different durations and occasional deletion rates is

the information content of segments. Everything else being equal, the higher the

information content of a segment, the longer its duration will be and the less likely

it is to be deleted. This finding shows that information content affects linguistic

performance, and foretells the generalization of such tendencies into competence-

related phonological rules.

In evaluating the information content of segments, I consider well-known factors

such as frequency and predictability, but show that informativity (Cohen Priva, 2008;

1 I exclude here languages in which segment duration is contrastive.


Piantadosi et al., 2011), the average or expected predictability of each segment, plays

a key role in explaining segment duration and deletion patterns. Informativity is the

amount of information a linguistic element usually has, across the entire language,

and can therefore explain why segments that are usually unpredictable have longer

duration and are less likely to delete even when they are predictable in context.

I begin this chapter by showing that American English voiceless, voiced and nasal

stops each have different duration and deletion rates among places of articulation –

/b/ is shorter and more likely to delete than /g/, but /k/ is shorter and more likely

to delete than /p/. This pattern highlights some of the shortcomings of markedness-

based theories (Prince and Smolensky, 1993; de Lacy, 2002), and is not predictable

from articulatory and perceptual factors (Ohala, 1981). I show why frequency (Zipf,

1929) and predictability-based explanations (Aylett and Turk, 2004; van Son and van

Santen, 2005) are also insufficient and necessitate the introduction of informativity. I

evaluate the relative effect information content measurements and in particular infor-

mativity have on segment duration and occasional deletion in several corpus studies

that weigh the effect each predictor has while factoring out other predictors. Finally,

I discuss the implications that the move to informativity has on the way information

theoretic constraints interact with linguistic performance and competence.

2.2 Stop deletion and duration paradox

In American English, the duration and the likelihood of deletion of word-medial nasal

and oral stops are not consistent across places of articulation. While the voiced labial

stop /b/ has a shorter duration and is more likely to delete than the voiced dorsal stop

/g/, the nasal and voiceless labial stops /m/ and /p/ have longer duration and are

less likely to delete than the nasal and voiceless dorsal stops /N/ and /k/. Table

2.1 provides the mean duration of each stop, and Table 2.2 provides the deletion

probabilities of each stop.

It is important to notice that the question asked here has to do with the rela-

tionship between underlying phonemes and their surface form. The question is not

what is the duration of glottalized /t/ (/t/→[P]), but rather what is the duration of

the phoneme /t/, which may happen to be glottalized in this context. Glottalization


in this interpretation is a process that can affect the duration of /t/. This duration

change should be explained, rather than taken as given.

I use data calculated using the invaluable Buckeye Corpus (Pitt et al., 2007),

which contains transcribed and time-aligned interviews with speakers of American

English in Columbus, Ohio, collected and annotated by researchers at the Ohio State

University. I matched the corpus’ underlying (dictionary) word representations with

their actual pronunciation. Words that did not have the same number of vowels as

their CMU dictionary (Weide, 1998) equivalents were excluded. Segments that had

no surface equivalent were considered to be deleted. This means that an articulatory

merger of two segments is regarded as a deletion of one of the segments.2 3 The

duration of the segments was multiplied by the rate of speech of the speaker, yielding

a more robust assessment of duration, in essence the proportion between the actual

duration of the segment and the mean duration of all segments in that utterance.4

The difference in mean duration is significant among all voiced, voiceless, and nasal

stops, with the exception of /n/ and /N/, which do not differ from one another

(Welch Two-Sample t-test p < 0.05).5 Similarly, the difference in likelihood to delete

is significant among all voiced, voiceless, and nasal stops (Fisher’s Exact Test p <

10^−5).6 The data includes several speakers that were not used in Cohen Priva (2008).
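The speech-rate normalization described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `relative_durations` and the example durations are hypothetical, and speech rate is computed here per utterance as the number of segments divided by their total duration, which may differ in detail from the procedure actually applied to the corpus.

```python
def relative_durations(durations_s):
    """Convert raw segment durations (in seconds) of one utterance into "segments".

    Speech rate is measured in segments per second, so duration * rate is the
    ratio between a segment's duration and the mean segment duration of the
    utterance.
    """
    if not durations_s:
        return []
    total = sum(durations_s)
    rate = len(durations_s) / total  # segments per second (assumed per utterance)
    return [d * rate for d in durations_s]


# Hypothetical utterance of four segments; the durations are invented.
example = [0.05, 0.10, 0.15, 0.10]
print(relative_durations(example))
```

Because the rate is the inverse of the mean segment duration, the relative durations of an utterance average to 1, which is what makes values above or below 1 interpretable as longer or shorter than a typical segment.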

What motivates the different duration and deletion ratios across places of ar-

ticulation? In the next sections I consider several possible explanations including

articulatory and perceptual factors, markedness, predictability and frequency, and

show that they do not suffice. I then reintroduce the concept of segment informa-

tivity, the expected value of segment predictability, and show how it provides the

2 The deleted segment was considered to be the one less similar to the segment present in the output form, using a similarity metric detailed in §2.7.

3 See §2.7 for a more complete description of the alignment process.

4 When segment duration, which is measured in seconds, is multiplied by speech rate, which is measured in segments per second, the resulting unit of measurement is “segments”. Measuring a segment’s duration in segments in essence compares the segment to other segments. Stressed vowels tend to have longer durations than unstressed vowels, and most consonants have shorter durations still. Therefore, the duration in segments of diphthongs is > 1, and the duration in segments of short consonants such as /b/ is < 1.

5 Welch Two-Sample t-test for stop durations: /d/ < /b/ (p < 10^−15), /b/ < /g/ (p < 0.05), /t/ < /k/ (p < 10^−15), /k/ < /p/ (p < 10^−15), /N/ < /n/ (p > 0.97), /n/ < /m/ (p < 10^−15)

6 Fisher’s Exact Test for stop deletion: /d/ > /b/ (p < 10^−15), /b/ > /g/ (p < 10^−12), /t/ > /k/ (p < 10^−15), /k/ > /p/ (p < 10^−15), /n/ > /N/ (p < 10^−7), /N/ > /m/ (p < 10^−6)


Table 2.1: Buckeye word-medial stop relative durations

Place      Voiceless Stops       Voiced Stops          Nasal Stops
           Segment  Duration     Segment  Duration     Segment  Duration
Labial     /p/      1.123        /b/      0.805        /m/      0.881
Dorsal     /k/      1.032        /g/      0.829        /N/      0.773
Coronal    /t/      0.775        /d/      0.587        /n/      0.773

Table 2.2: Buckeye word-medial stop deletion probabilities

Place      Voiceless Stops        Voiced Stops           Nasal Stops
           Segment  Del. Prob.    Segment  Del. Prob.    Segment  Del. Prob.
Labial     /p/      0.013         /b/      0.113         /m/      0.025
Dorsal     /k/      0.020         /g/      0.054         /N/      0.046
Coronal    /t/      0.160         /d/      0.175         /n/      0.072

missing link to accounting for the different durations and deletion ratios.

2.3 Previous accounts

2.3.1 Phonetic accounts

Ohala (1981) lists at least two differences that may account for the asymmetry be-

tween the hierarchy of voiced and voiceless stops. First, it is more difficult to maintain

voicing in /g/ than in /b/ due to the smaller amount of space between the vocal folds

and the closure. In contrast, /p/ is the least audible voiceless stop. Both asymmetries

emerge in the cross-linguistic frequencies of gapped inventories – inventories in which

one or more of the stops does not exist. There are more gapped inventories in which

/g/ does not exist than gapped inventories in which /b/ does not exist, and more

gapped inventories in which /p/ does not exist than gapped inventories in which /k/

does not exist (Sherman, 1975).


While both asymmetries are well-established, their effect on the duration and

deletion ratios of American English stops is expected to be the opposite of what is

observed. If it is more difficult to maintain voicing in /g/ than in /b/, the duration

of /g/ should be shorter than the duration of /b/, but in American English the

duration of /b/ is shorter than the duration of /g/. If /p/ is less audible than other

voiceless stops, listeners will record more instances in which they did not perceive a

/p/ in places in which speakers have articulated a /p/, and will therefore have more

exemplars of (perceived) p-deletion which they may feel more comfortable repeating,

but in American English /k/ deletes more than /p/ does.

It seems that the difference between the durations and deletion rates of English

stops is not due to articulatory and perceptual reasons. I will now consider other

possible explanations: phonological markedness, frequency and predictability.

2.3.2 Markedness accounts

Markedness hierarchies, as in Prince and Smolensky (1993) and de Lacy (2002), are

used to provide an explanation for phone-specific patterns in phonology. In this

section I spell out such an attempt for the relevant data. Since coronals, which

are considered less marked than labials and dorsals, delete more frequently, we may

suggest that more marked segments delete less frequently than less marked ones.

Since coronals are taken to be less marked than labials and dorsals (Prince and

Smolensky, 1993; de Lacy, 2002), a markedness-based account can solve the contrast

between coronals and non-coronals (dorsals and labials). Word medially, coronals

have shorter duration and have higher deletion ratios than labials and dorsals, with

the exception of /N/, which has shorter duration and deletes more than /n/. The

tableau in (2.1) sketches a system in which coronals delete in coda positions, but

labials and dorsals do not delete, by using marked-faithfulness constraints of the

form Max{K} to preserve dorsals and Max{P} to preserve labials (I am not arguing

here for the existence of such constraints).


(2.1) Only coronals delete in codas

     /tat.tap.tak/    Max{K}    Max{P}    *NoCoda    Max
     tat.tap.tak                          ***!
  ☞  ta.tap.tak                           **         *
     tat.ta.tak                 *!        **         *
     tat.tap.ta       *!                  **         *
     ta.ta.tak                  *!        *          **
     tat.ta.ta        *!        *         *          **
     ta.tap.ta        *!                  *          **
     ta.ta.ta         *!        *                    ***
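The evaluation in tableau (2.1) can be reproduced mechanically. The sketch below is an illustration only: it assumes strict domination in the order Max{K}, Max{P}, *NoCoda, Max, assumes that every underlying syllable is CVC and that candidates differ only in whether a coda is deleted, and uses hypothetical names (`violations`, `CANDIDATES`). Comparing violation tuples lexicographically is one common way to model strict ranking.

```python
def violations(candidate, underlying="tat.tap.tak"):
    """Return violation counts ordered by ranking: (Max{K}, Max{P}, *NoCoda, Max)."""
    max_k = max_p = no_coda = max_all = 0
    for und, cand in zip(underlying.split("."), candidate.split(".")):
        coda = und[2]  # assumption: each underlying syllable is CVC
        if len(cand) == 3:
            no_coda += 1  # the coda surfaces and violates *NoCoda
        else:
            max_all += 1  # the coda was deleted and violates Max
            max_k += int(coda == "k")
            max_p += int(coda == "p")
    return (max_k, max_p, no_coda, max_all)


CANDIDATES = [
    "tat.tap.tak", "ta.tap.tak", "tat.ta.tak", "tat.tap.ta",
    "ta.ta.tak", "tat.ta.ta", "ta.tap.ta", "ta.ta.ta",
]

# Tuples compare lexicographically, so a violation of a higher-ranked
# constraint always outweighs any number of lower-ranked violations.
winner = min(CANDIDATES, key=violations)
print(winner)  # ta.tap.tak
```

Under this ranking only the coronal coda deletes: removing /p/ or /k/ incurs a fatal violation of the higher-ranked marked-faithfulness constraints.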

However, the binary distinction between coronals and non-coronals does not ac-

count for the different durations and deletion ratios between dorsals and labials. If

the least marked (coronals) delete more than more marked stops (labials and dorsals),

then the data suggests that /b/ should be less marked than /g/ since it deletes more

frequently than /g/: Max{K} ≫ Max{P}. However, /k/ deletes more frequently

than /p/ does, and /k/ should be less marked than /p/: Max{P} ≫ Max{K}. One

way to solve this problem is to further specify the Max constraints so that they specify

not only place but also voicing: Max{K,voiced} ≫ Max{P,voiced} and Max{P} ≫ Max{K}.

Should the markedness of place of articulation differ between voiced and voiceless

stops? Indeed, in the UPSID database (Maddieson, 1984) more languages have /k/

than /p/ (403 : 375), and more languages have /b/ than /g/ (287 : 253), even though

this difference is not significant (Fisher test, p>0.05). If it were significant, it could be

used to argue that /g/ is more marked than /b/ and that /p/ is more marked than /k/.

Using that markedness hierarchy to argue for marked faithfulness constraints that

specify both place of articulation and voicing would explain the difference between

the durations and deletion ratios of voiced and voiceless oral stops using the following

hierarchy of Max constraints: Max{K, voiced} ≫ Max{P, voiced}, and Max{P} ≫ Max{K}.

However, using a similar procedure for nasal stops yields incorrect predictions.


More languages in UPSID have /m/ than /N/ (425 : 237). In comparison with

voiceless stops, this difference is significant (Fisher test, p < 0.001). If these data

are used to conclude that Max{K,nasal} ≫ Max{P,nasal}, the prediction would be

that /m/ should be shorter and more likely to delete than /N/, but this is not

the case in American English.

This proposal has several additional complications at the theoretical and functional

level. First, de Lacy (2002, §6.4.2) argues specifically against marked faithfulness

accounts for the family of Max constraints. Second, while constraint conjunction

is used liberally in markedness accounts to account for the conjunction of several

markedness features as in Ito and Mester (2003), the above-mentioned proposal uses

the conjunction of marked-faithfulness constraints, which may have other undesirable

consequences. Finally, if marked segments are less frequent in the world’s languages,

it is not clear why speakers would preserve marked segments rather than unmarked

ones, going against the cross-linguistic tendency.

As de Lacy (2002, §6.4.2) argues, it is problematic to use marked-faithfulness ac-

counts to predict that coronals should delete in cases where labials and dorsals do

not delete. In this section I have shown that this approach is equally problematic in

accounting for the variable deletion rates of other stops, as it necessitates the con-

junction of marked faithfulness constraints, and even if such constraints are allowed

in the system, the predictions that follow from such accounts are incorrect.

2.3.3 Frequency accounts

There is evidence that information theoretic concerns (Shannon, 1948) affect linguistic

behavior. Zhao and Jurafsky (2009) report that associating word frequency with word

reduction goes back to observations made by Sibawayhi, an Arabic grammarian of the

8th century (Al-Nassir, 1993; Carter, 2004). Zipf (1929) claims that the reduction of

frequent linguistic elements follows from usage – frequent elements are under a greater

pressure to become efficient. Greater efficiency implies simplification and reduced

duration. Perhaps when the duration of segments is reduced too much, deletion also

follows.

Reducing frequent segments can also be seen as beneficial from a stricter information

theoretic perspective. Given no other information, frequent linguistic


elements are more likely to occur than less frequent elements, and can therefore be

more easily recovered by listeners. If some information is available, for instance if

the listeners know that they heard a voiceless stop, but do not know which one they

heard, guessing that it was a /t/ makes a better guess than /k/ or /p/, since /t/ is

more frequent than /k/ or /p/.
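As a worked illustration of this "best guess" reasoning, assume the segment probabilities reported in Table 2.3 (0.074 for /t/, 0.031 for /k/ and 0.015 for /p/), and assume that the listener knows only that a voiceless stop was produced. Renormalizing over the three voiceless stops is a simplification, but it gives a rough estimate:

```latex
\[
\Pr(t \mid \text{voiceless stop}) \approx
  \frac{0.074}{0.074 + 0.031 + 0.015} = \frac{0.074}{0.120} \approx 0.62
\]
```

Under these assumptions, guessing /t/ would be correct about 62% of the time, compared with roughly 26% for /k/ and 12.5% for /p/.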

A simple way to measure the frequency of a segment σk in a language is to count

the number of times that segment was encountered in a subset of the language (2.2).

It is then possible to compare each segment’s frequency with the frequency of all

segments in the same subset (2.3). Since (2.3) is the same for all segments σi, it

is possible to always divide (2.2) by (2.3) to yield the maximum likelihood estimate

(MLE) of the probability of seeing σk, (2.4).

(2.2) The frequency of σk

\[ \#(\sigma_k) \]

(2.3) The frequency of all segments, summed

\[ \sum_i \#(\sigma_i) \]

(2.4) The probability of σk

\[ \Pr(\sigma_k) \approx \frac{\#(\sigma_k)}{\sum_i \#(\sigma_i)} \]
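The estimate in (2.4) amounts to dividing each segment count by the total count. The following is a minimal sketch of that computation; the function name `mle_probabilities` and the toy sequence of segments are hypothetical and are not drawn from any of the corpora.

```python
from collections import Counter


def mle_probabilities(segments):
    """Maximum likelihood estimate of each segment's probability, as in (2.4)."""
    counts = Counter(segments)    # (2.2): the count of each segment
    total = sum(counts.values())  # (2.3): the counts of all segments, summed
    return {seg: n / total for seg, n in counts.items()}


# Hypothetical toy "corpus" of eight segments, for illustration only.
toy_corpus = list("tatapkat")
probs = mle_probabilities(toy_corpus)
print(probs["t"])  # 0.375
```

The estimated probabilities sum to 1 over the observed segments; segments that never occur in the sample receive no estimate at all, which is a known limitation of unsmoothed maximum likelihood estimates.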

If frequency effects are extended from the reduction of words to the reduction of

segments, they can provide a good explanation for the differences among voiced stops

and voiceless stops. Table 2.3 shows the frequency of the nine stops across the Fisher

(Cieri et al., 2004, 2005), Switchboard (Godfrey and Holliman, 1997) and Buckeye

(Pitt et al., 2007) corpora of spoken English (see §2.8), side by side with their relative

duration and deletion ratios (as shown in tables 2.1 and 2.2). /b/ and /k/ are more

frequent than /g/ and /p/ respectively, and as predicted by the proposed frequency

account, their duration is shorter and they delete more frequently than /g/ and /p/.

However, this explanation does not account for the contrast between /m/ and /N/.

/m/ is more frequent than /N/, yet its duration is longer, and its relative deletion

ratio is lower than the deletion ratio of /N/.


Table 2.3: Buckeye word-medial stop duration, deletion and probability

Place      Voiceless Stops               Voiced Stops                  Nasal Stops
           σ     Dur.   Del.   Prob.     σ     Dur.   Del.   Prob.     σ     Dur.   Del.   Prob.
Labial     /p/   1.123  0.013  0.015     /b/   0.805  0.113  0.017     /m/   0.881  0.025  0.035
Dorsal     /k/   1.032  0.020  0.031     /g/   0.829  0.054  0.010     /N/   0.773  0.046  0.011
Coronal    /t/   0.775  0.160  0.074     /d/   0.587  0.175  0.039     /n/   0.773  0.072  0.062

2.3.4 Local predictability accounts

A local predictability-based account takes the “best guess” story one step further.

Consider the case of the -tion suffix in English in the word ‘explanation’. When

speakers reach the sequence [S@] they already know that [n] will follow. In other words,

hearing [n] after the sequence of segments that precede it does nothing except confirm

the speaker’s expectations that [n] will follow. In itself, /n/ provides no information.

It is reasonable, therefore, that speakers could delete it, without harming the listener’s

ability to comprehend the word.7 In information theoretic terms, hearing /n/ does

not provide any new information.

I will define the amount of information a segment σ provides in the context c

that it appears in using conditional probability as in (2.5). In order to transform the

conditional probability of seeing σ in the context c to amounts of information, I take

the negative log of (2.5) to yield (2.6), which is measured in bits. In the example

above, listeners are positive that [n] is going to follow, which makes their estimate of

Pr (n|[#Ekspl@neIS@) equal 1, and the amount of information provided by seeing /n/

in that context zero, as the log of 1 is 0 – no information is gained.

The same principle applies when seeing a segment in some context is very improb-

able or very probable. The higher the conditional probability of seeing a segment in

its context is, the less information it provides, and vice versa. From the functional

perspective, speakers can reduce or omit predictable segments with less harm to their

listeners, who can easily recover the reduced or omitted segments.

7 In order to perform this kind of accommodation, the speaker is not required to have complete knowledge of the listener’s state of mind. Minimal accommodation models in which speakers model themselves as listeners suffice.


(2.5) The conditional probability of seeing σ in context c

\[ \Pr(\sigma \mid c) \]

(2.6) The negative log probability of seeing σ in context c

\[ -\log_2 \Pr(\sigma \mid c) \]
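The transformation in (2.6) can be checked numerically. The sketch below is illustrative only; the function name `information_bits` is hypothetical, and the value 0.011 is the probability given for /t/ in ‘notice’ in (2.16).

```python
import math


def information_bits(p):
    """Information in bits carried by an event of conditional probability p, as in (2.6)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # Subtract from 0.0 so that p == 1 yields 0.0 rather than -0.0.
    return 0.0 - math.log2(p)


for p in (1.0, 0.5, 0.011):
    print(p, information_bits(p))
```

A fully predictable segment (probability 1) carries 0 bits, a segment with probability 0.5 carries exactly 1 bit, and a segment with probability 0.011 carries roughly 6.5 bits.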

Different theories may define the context c differently or not use it at all. It is pos-

sible to set c to provide varying levels of information. The simplest approach is to set c

not to provide any information. This is a simple transformation of segment probabil-

ity often labeled uniphone (2.7). A common approach, taken for instance in Raymond

et al. (2006), is to take one or two preceding segments as the context, yielding specific

measurements often labeled biphone (2.8) and triphone (2.9) respectively. Biphones

and triphones approximate a very local context that can approximate phonotactics

to a certain degree.8 Another approach, taken in van Son and Pols (2003), is to try

to approximate the amount of information that is associated with a segment at a

word-prediction level, by taking as context all the preceding segments (2.10). This

measurement is the one used above to describe the case of the final /n/ in the word

‘explanation’.

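(2.7) Uniphone

\[ -\log_2 \Pr(\sigma) \]

(2.8) Biphone

\[ -\log_2 \Pr(\sigma_i \mid \sigma_{i-1}) \]

(2.9) Triphone

\[ -\log_2 \Pr(\sigma_i \mid \sigma_{i-2}\sigma_{i-1}) \]

(2.10) Information given all previous segments

\[ -\log_2 \Pr(\sigma_i \mid \sigma_0 \ldots \sigma_{i-1}) \]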

8 In English /Ä/ follows unstressed vowels in very few words, making words such as adventurer atypical of English phonotactics.


In order to estimate the probability of seeing a segment in context, it is common to

use counts, following the same maximum likelihood approach that was used above

for calculating segment probabilities. The probability of seeing a phone in context is

estimated to be (2.11), the number of times the speakers encountered σ in the context

c over the number of times they encountered the context c.

(2.11) Estimate for conditional probability of seeing σ in the context of c

\[ \Pr(\sigma \mid c) \approx \frac{\#(\sigma, c)}{\#(c)} \]
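The count-based estimate in (2.11) can be sketched for the different context definitions in (2.7)–(2.10). Everything named below (`context_counts`, `conditional_probability`, `TOY_WORDS`) is hypothetical, the four-word lexicon is invented, and no smoothing is applied, so an unseen context raises a division-by-zero error.

```python
from collections import Counter


def context_counts(words, n_previous=None):
    """Count (context, segment) pairs and contexts over a list of words.

    n_previous = 0, 1, 2 gives uniphone, biphone and triphone contexts;
    None uses all preceding segments in the word, as in (2.10).
    Words are tuples of segments; "#" marks the beginning of a word.
    """
    pair_counts, ctx_counts = Counter(), Counter()
    for word in words:
        padded = ("#",) + tuple(word)
        for i in range(1, len(padded)):
            start = 0 if n_previous is None else max(0, i - n_previous)
            ctx = padded[start:i]
            pair_counts[(ctx, padded[i])] += 1
            ctx_counts[ctx] += 1
    return pair_counts, ctx_counts


def conditional_probability(segment, context, pair_counts, ctx_counts):
    """Estimate (2.11): the count of (segment, context) over the count of context."""
    context = tuple(context)
    return pair_counts[(context, segment)] / ctx_counts[context]


# Hypothetical toy lexicon, one tuple of segments per word token.
TOY_WORDS = [("t", "a", "p"), ("t", "a", "k"), ("t", "a", "p"), ("k", "a", "t")]

pairs, ctxs = context_counts(TOY_WORDS)  # all preceding segments
print(conditional_probability("p", ("#", "t", "a"), pairs, ctxs))

pairs, ctxs = context_counts(TOY_WORDS, n_previous=1)  # biphone
print(conditional_probability("p", ("a",), pairs, ctxs))
```

In this toy lexicon /p/ is more predictable given the whole preceding word (2 out of 3) than given only the preceding segment (2 out of 4), which illustrates how the choice of context changes the estimate.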

Local predictability in its various forms has been shown to affect linguistic per-

formance. Aylett and Turk (2006) show that syllable nuclei are shorter when they

are locally predictable from context. Similar studies demonstrate the same for other

levels of linguistic representation, such as consonants, morphemes and words (Pluy-

maekers et al., 2005; van Son and Pols, 2003; van Son and van Santen, 2005; Jurafsky

et al., 2001). Other studies link local predictability to syntactic planning (Levy and

Jaeger, 2007; Jaeger, 2010) and use it to provide a basis for markedness (Hume, 2008).

Adapting this view to segment duration and deletion ratios yields an expectation that

more predictable segments would have shorter duration and delete more frequently

than less predictable ones. This view is parallel to the view presented above for fre-

quent segments. Raymond et al. (2006) have shown such effects for the deletion of

word-medial /t/ and /d/ at syllable onsets (but surprisingly enough, the opposite

held in codas).9

Local predictability accounts for a broad range of phenomena, but it cannot be

completely correlated with the propensity of a segment to delete. While it is generally

true that segments are more likely to delete the more predictable they are, some

segments delete even when they are not locally predictable, while other segments do

not delete even when they provide no information. Consider the cases of /t/-deletion

in examples (2.12) and (2.13), and the cases of /d/-deletion in examples (2.14) and

(2.15) from the Buckeye corpus, where /t/ and /d/ delete even though they are rather

surprising.

9 The experiments conducted in this study showed no such reversal of predictability, but they did not focus on /t/ and /d/.


(2.12) ‘notice’ → [noU@s]

(2.13) ‘battle’ → [bæl"]

(2.14) ‘sudden’ → [s@n"]

(2.15) ‘order’ → [6@~]

The deleted /t/s and /d/s are unpredictable if context is measured from the

beginning of the word as in (2.16).10 The same holds for a context of two previous

segments (triphone), as in (2.17).

(2.16) Word Prob.

‘notice’: Pr(t|[#noU) = 0.011

‘battle’: Pr(t|[#bæ) = 0.0002

‘sudden’: Pr(d|[#s@) = 0.007

‘order’: Pr(d|[#6r) = 0.00032

(2.17) Word Prob.

‘notice’: Pr(t|noU) = 0.011

‘battle’: Pr(t|bæ) = 0.00029

‘sudden’: Pr(d|s@) = 0.01

‘order’: Pr(d|6r) = 0.0008

On the other hand, /p/ is the only segment that appears after [w@~kS6] in ‘workshops’,

and /m/ is the only segment that can appear after [@kæd@] in ‘academy’,11

and even locally they are not very surprising (2.18).

(2.18) Word Prob.

‘workshops’: Pr(p|S6) = 0.398

‘academy’: Pr(m|d@) = 0.048

Similarly, the corpus contains just a single case of /m/-deletion out of 286 instances

of /m/ in words that begin with ‘home’, even though /m/ follows [hoU] in a ratio of 2:5,

10 The # symbol stands for the beginning of a word.

11 That is, they are fully recoverable from previous context.


and despite the high frequency of the word ‘home’.12 Local predictability does not

predict the existence of /t/ and /d/ deletion processes in English, and the absence of

/p/ and /m/ deletion processes (though they still delete, but at lower rates; see table

2.2).

That predictability does not always coincide with reduced duration and higher

deletion rates is further demonstrated by the fact that deletion processes may re-

move segments that do carry information, contrary to expectation. Consider the

diachronic case of the deletion of French plural markers that led to the conflation of

the plural form of words such as pommes with their singular forms (pomme), or the

case of Puerto Rican Spanish (Hochberg, 1986) where /s/-deletion removes agreement

markers and conflates second and third person verb forms. While such local loss of

information may lead to compensation elsewhere (for instance, it may increase the

use of pronouns), it is hard to claim that deletion processes in language only serve to

improve communication.

Neither frequency-based explanations nor local predictability effects can adequately

explain the duration and deletion ratios of English stops. Frequency accounts cannot

explain why /N/’s duration is shorter and why it deletes more than /m/, and local

predictability accounts do not explain why unpredictable /t/s delete more frequently

than predictable /p/s. The next section will propose a way to bridge the gap be-

tween the frequency and local predictability accounts by introducing the concept of

segment informativity. Like frequency, segment informativity does not depend on lo-

cal context, and like local predictability, it emphasizes the importance of the varying

“usefulness” of various segments from an information theoretic perspective.

2.4 Informativity

There is a tension between frequency-based explanations and local predictability ac-

counts. Local predictability accounts view phonological changes as driven by func-

tional concerns: speakers delete or reduce uninformative elements. Frequency, in this

view, is only an approximation due to lack of knowledge about the context. When

speakers do not know what the context is, frequency functions as zero context “local”

12 For discussions of reduction in frequent words see Bell et al. (2009).


predictability. But careful examination of the data shows that frequent segments may

also delete in cases where they remove information that is not locally reconstructible,

as is the case with /s/-deletion in Puerto Rican Spanish. Similarly, frequency-based

accounts fail to predict that /N/, which is usually very predictable from prior context

in English, will delete as frequently as it does, since it is a relatively infrequent seg-

ment. In other words, what both accounts fail to capture is that predictable segments

behave as predictable even when they are not: they “carry over” their being usually

predictable to contexts where they are unpredictable. This is the gap that segment

informativity tries to bridge.

I propose that language users record how useful and informative a segment usually

is in the language. This knowledge then forms an expectation of the utility of each

segment, which in turn affects the duration and deletion ratios of those segments.

This model predicts that less informative segments will have shorter durations and

be more likely to delete even when they are unpredictable given local context, other

things being equal, and that more informative segments will have longer durations

and be less likely to delete even when they are predictable given local context. Thus,

/d/ in ‘sudden’ may be deleted because /d/ is expected to provide less information,

even though it is locally unpredictable (and does provide information), as is the case

in (2.14), repeated here as (2.19).

(2.19) ‘sudden’ → [sən̩]

To approximate segment informativity, the segment’s negative log predictability

given some definition of context is taken (2.20), and averaged across every case in

which that segment appeared with any context, by summing over contexts, with each

context weighted by its co-occurrence with that segment (2.21). This averaging yields

the expected value of that segment’s negative log predictability.

(2.20) The local predictability of σ in context c

$$-\log_2 \Pr(\sigma \mid c)$$


(2.21) The informativity of σ

$$-\sum_{c} \Pr(c \mid \sigma) \, \log_2 \Pr(\sigma \mid c)$$
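As a rough illustration of (2.21), the following Python sketch computes informativity over a hypothetical three-word toy lexicon. The function name, the toy words and their counts are assumptions made for this example, as is the choice to treat the empty prefix as the context of a word-initial segment. It is not the procedure used to produce the values reported in this chapter.

```python
import math
from collections import Counter


def segment_informativity(lexicon):
    """Return {segment: informativity} for a lexicon of {word: token count}.

    Each word is a tuple of segments. The context of a segment is taken to be
    every preceding segment in the same word (an empty tuple word-initially).
    """
    joint = Counter()          # count(context, segment)
    context_total = Counter()  # count(context)
    segment_total = Counter()  # count(segment)
    for word, count in lexicon.items():
        for i, segment in enumerate(word):
            context = word[:i]
            joint[(context, segment)] += count
            context_total[context] += count
            segment_total[segment] += count

    scores = Counter()
    for (context, segment), n in joint.items():
        p_segment_given_context = n / context_total[context]   # Pr(segment | c)
        p_context_given_segment = n / segment_total[segment]   # Pr(c | segment)
        scores[segment] += -p_context_given_segment * math.log2(p_segment_given_context)
    return dict(scores)


# Hypothetical toy lexicon: /pa/ x1, /pi/ x1, /ta/ x2.
toy_lexicon = {("p", "a"): 1, ("p", "i"): 1, ("t", "a"): 2}
scores = segment_informativity(toy_lexicon)
for segment in sorted(scores):
    print(segment, round(scores[segment], 3))
```

In this toy lexicon /a/ is fully predictable after /t/ and only partly predictable after /p/, so its informativity (one third of a bit) is lower than that of /i/ (one bit), which occurs only where it is not fully predictable.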

In this chapter I set local context to be every preceding segment in the same

word, adopting the view presented in van Son and Pols (2003), which points to word

recognition as the relevant task for the contribution a segment makes. When matched

against segment duration and deletion ratios, the different deletion hierarchies across

different stop types are approximated by each segment’s informativity: /p/ and /m/

are more informative and delete less than /k/ and /ŋ/, but /b/ is less informative

and deletes more than /g/, as table 2.4 shows. Informativity was calculated using

the same corpora used to calculate segment frequency (see §2.8).

Table 2.4: Buckeye word-medial stop duration, deletion and informativity

         Voiceless Stops              Voiced Stops                 Nasal Stops
Place    σ    Dur.   Del.   Info.    σ    Dur.   Del.   Info.    σ    Dur.   Del.   Info.
Labial   /p/  1.123  0.013  3.656    /b/  0.805  0.113  3.923    /m/  0.881  0.025  2.437
Dorsal   /k/  1.032  0.020  2.261    /g/  0.829  0.054  4.693    /ŋ/  0.773  0.046  0.276
Coronal  /t/  0.775  0.160  1.357    /d/  0.587  0.175  1.632    /n/  0.773  0.072  1.720

The informativity of segments predicts the asymmetry in the duration and deletion

rates of English stops. However, much of what informativity explains has been

addressed before by current accounts. Local predictability may explain why /ŋ/ deletes

more than /m/, which none of the other attempts could achieve. Moreover, seg-

ment informativity is highly correlated with other factors such as segment frequency,

segment local predictability and place of articulation. How can we tell whether infor-

mativity is significant in its own right given other explanatory variables? To ascertain

its contribution it is necessary to control for phonetic, phonological and other infor-

mation theoretic factors. In the next sections I test the effect informativity and other

information theoretic measurements have on segment duration and deletion ratios

using multivariate regression studies.


2.5 Segment duration and deletion studies

2.5.1 Studies overview

The puzzle of stop duration and deletion ratios is a convenient example for demon-

strating that the durations and occasional deletion ratios of segments are systematic –

some segments have shorter duration and delete more frequently than other segments.

Moreover, the observed durations and deletion ratios show that systematic duration

and deletion ratios of American English segments cannot be explained using only

phonetic explanations, as less audible segments such as /p/ (Ohala, 1981) delete less

frequently than the more audible voiceless stop /k/. Similarly, the duration of /g/ is

longer than the duration of /b/, even though it is more difficult to maintain voicing

for dorsal stops. Accounts which are based on segment frequency and segment local

predictability do not predict the observed durations and deletion ratios, as /m/ is

more frequent than /ŋ/, but has longer duration and deletes less frequently than /ŋ/,

and /t/ and /d/ delete even when they are unpredictable in context,

while predictable segments such as /p/ often do not delete.

But in order to know how each factor affects segment duration and deletion ratios

it is necessary to weigh the effect each predictor has against the possible effect of other

factors. If informativity, frequency and local predictability have the expected effect

even after other factors have been controlled for, it would validate their explanatory

power beyond the American English stops puzzle.

Cohen Priva (2008) studied the residual effect informativity has on the deletion of

word-medial onsets and codas while controlling for other phonological, phonetic and

information theoretic factors. The studies were performed only on part of the Buckeye

corpus (Pitt et al., 2007), and suffered from multiple collinearities. Additionally, those

studies were not applied to the effect segment informativity has on the duration of

word-medial segments (though some of the segment duration data was explored in

Cohen Priva and Jurafsky 2008).

The following studies are designed to complement and address the shortcomings

of the previous studies in several ways:

• The information theoretic measurements are measured using a significantly

larger collection of spoken English (see §2.8).


• Data from all forty speakers in the Buckeye corpus is used to measure seg-

ment duration and deletion ratios, whereas previous studies used only twenty

speakers.

• The duration of segments is measured alongside deletion ratios.

• The collinearities between information theoretic measurements are removed.

• Raymond et al. (2006) and Cohen Priva (2008) tested the effect various factors

have on onsets and codas separately. This chapter takes a different approach.

Since the concepts ‘coda’ and ‘onset’ group together different phonological en-

vironments, this chapter studies the role of information theoretic variables in

intervocalic and postvocalic preconsonantal positions. Thus, there are fewer

phonological features to control for in each data set, and risk of collapsing the

difference between different phenomena is reduced.

The two environments are subject to different weakening pressures. Delet-

ing postvocalic preconsonantal segments simplifies syllable clusters by chang-

ing VCCV sequences to VCV sequences, simplifying CCV(C) syllables to less

marked CV(C) syllables and CVC syllables to unmarked CV syllables. In con-

trast, deleting intervocalic consonants complicates syllable structure as it cre-

ates marked onset-less syllables. Intervocalic consonants are subject to lenition

processes such as spirantization which do not always affect postvocalic positions

(American English tapping is one such case, Kahn 1976).

• Liquids and glides are excluded as there are too few of them for the number of

controls that are required to describe the differences among them. Therefore

the chance of overfitting the data is decreased.

2.5.2 Intervocalic consonant duration

Introduction As summarized above, intervocalic positions are often the locus of

various lenition processes such as spirantization and sonorization. In American En-

glish one such outcome is tapping. However, it is not a typical environment for

deletion, as the data set used in these studies shows: 4.2% of the segments in

intervocalic positions delete, while 5.9% of the segments in postvocalic preconsonantal


positions delete (Fisher’s Exact Test, p < 0.001). Intervocalic positions are therefore

a good test case to investigate what effect information theoretic variables have on

segment duration, and subsequently on segment deletion.

Method and materials The underlying representation of every word in the Buck-

eye corpus was matched with its actual pronunciation as described in §2.7. The

duration of matched segments was recorded, and kept alongside the word they were

part of and their phonological environment. Rate of speech was calculated as the

number of underlying segments per second. I excluded all deleted segments and seg-

ments with unusually long or short duration (top and bottom 2.5% of each segment).

This procedure yielded 27,353 observations.

I used linear regression to model the log duration of segments. This means that

the regression attempts to fit the weights $\beta_0 \ldots \beta_n$ in the formula in (2.22), where y is

a vector of observed durations, $x_1 \ldots x_n$ are vectors of predictors, and $\varepsilon$ stands for possible

noise. The formula can be rewritten as (2.23) and (2.24). The formula in

(2.24) shows that the regression is a regression of multipliers: a binary feature whose

coefficient equals 0.1 does not indicate that the duration is 0.1 milliseconds longer

when this feature is ‘true’, but rather that the duration is $e^{0.1}$ times as long when

the feature is ‘true’ as it would have been had this feature been ‘false’. A zero

coefficient therefore would not affect the duration since $e^{0} = 1$. Significance is tested

by estimating how likely it is that the coefficient of a variable is really positive or

negative (that it is not zero).

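(2.22)

$$\log(y) \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \varepsilon$$

(2.23)

$$e^{\log(y)} \approx e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n + \varepsilon}$$

(2.24)

$$y \approx e^{\beta_0} \cdot e^{\beta_1 x_1} \cdot e^{\beta_2 x_2} \cdot \ldots \cdot e^{\beta_n x_n} \cdot e^{\varepsilon}$$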
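The multiplicative reading of the coefficients can be checked on invented data. The Python sketch below fits a one-predictor regression to log durations using the closed-form least squares solution. The function name and the durations are hypothetical, chosen so that the binary feature multiplies duration by e^0.1; the studies in this chapter were fitted in R with many more predictors.

```python
import math


def fit_log_linear(xs, durations):
    """Least squares fit of log(duration) = beta0 + beta1 * x for one predictor."""
    logs = [math.log(d) for d in durations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_log) for x, y in zip(xs, logs))
    beta1 = sxy / sxx
    beta0 = mean_log - beta1 * mean_x
    return beta0, beta1


# Invented data: 50 ms when the feature is false, 50 * e^0.1 ms when it is true.
xs = [0, 0, 1, 1]
durations = [50.0, 50.0, 50.0 * math.exp(0.1), 50.0 * math.exp(0.1)]

beta0, beta1 = fit_log_linear(xs, durations)
multiplier = math.exp(beta1)  # how many times longer the duration is when x = 1
print(round(beta1, 3), round(multiplier, 3))
```

The fitted coefficient is 0.1 and the corresponding multiplier is about 1.105, so the feature lengthens the segment by roughly 10.5% rather than by 0.1 milliseconds.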

I used the phonological control variables in table 2.5 to control for the base proper-

ties of the segment in question. In addition, I used the phonological control variables


in table 2.6 to control for the properties of the neighboring vowels and word-level variables.

Phrases were taken to be the Buckeye Corpus speakers’ turns (as divided by the

Buckeye Corpus annotators).

Table 2.5: Segment properties

Variable   Value            Segments
Manner     stop             /p, t, k, b, d, g/
           affricate        /tʃ, dʒ/
           nasal            /m, n, ŋ/
           fricative        /f, v, θ, ð, s, z, ʃ, ʒ, h/
           liquid           /l, r/
           glide            /w, j/
Place      glottal          /h/
           labial           /p, b, v, f, m/
           dorsal           /k, g, ŋ, j/
           coronal          all others
Subplace   dental           /θ, ð/
           post-alveolar    /tʃ, dʒ, ʃ, ʒ/
           ∅                all others
Voicing    voiced (binary)  /b, v, ð, d, z, dʒ, ʒ, g/

I used the step() function (Hastie and Pregibon, 1992; Venables and Ripley,

2002) in R (R Development Core Team, 2012) to allow the best non information

theoretic model to be chosen automatically, and then added four information theo-

retic variables of interest: word frequency, segment probability (uniphone), segment

informativity and the local predictability of the segment, all defined in table 2.7, and

estimated from spoken corpora following a procedure detailed in §2.8. It is important

to note that informativity is residualized using uniphone as the baseline and that

local predictability is residualized using both uniphone probability and informativity.13

Thus, these factors will only be significant if they improve the model beyond

the (unconstrained) effect of the variables they are residualized over.

13 This means that residual informativity is the original value of informativity minus an approximation of informativity using frequency in a linear regression of the form informativity(segment) ≈ intercept + uniphone(segment), and local predictability is similarly residualized using a formula of the form negative log predictability(segment in context) ≈ intercept + frequency(segment) + informativity(segment).

Table 2.6: Environment properties

Variable          Value
Stress            neighboring vowel has primary, secondary or no stress
POS duration      the median duration of all segments with the same part of speech
Phrase distance   distance from the end of the phrase in words, logged
Start position    distance from the beginning of the word in segments, logged
End position      distance from the end of the word in segments, logged
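The residualization of informativity over uniphone probability can be sketched as follows. This is a minimal Python illustration of the single-baseline case only (residualizing local predictability over two baselines would require a two-predictor fit); the function name and the four values per variable are invented for the example and are not taken from the corpora.

```python
def residualize(target, baseline):
    """Return target minus its least squares approximation from baseline.

    Fits target ~ intercept + slope * baseline and returns the residuals,
    i.e. the part of target that baseline cannot linearly account for.
    """
    n = len(target)
    mean_b = sum(baseline) / n
    mean_t = sum(target) / n
    sbb = sum((b - mean_b) ** 2 for b in baseline)
    sbt = sum((b - mean_b) * (t - mean_t) for b, t in zip(baseline, target))
    slope = sbt / sbb
    intercept = mean_t - slope * mean_b
    return [t - (intercept + slope * b) for t, b in zip(target, baseline)]


# Invented values for four hypothetical segments.
toy_uniphone = [3.0, 4.0, 5.0, 6.0]
toy_informativity = [2.0, 3.5, 3.0, 4.5]

residual_informativity = residualize(toy_informativity, toy_uniphone)
print([round(r, 3) for r in residual_informativity])
```

By construction the residuals are uncorrelated with the baseline, which is why a residualized variable can only reach significance by explaining variance that the baseline leaves unexplained.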

Table 2.7: Variables of interest

Variable                       Value
Word frequency                 the frequency of the word, logged
Segment probability            the negative log unigram probability of observing the segment (2.7)
Segment informativity          the informativity of the segment (2.21) using all earlier segments in the
                               same word as context, residualized using segment probability
Segment local predictability   the negative log local predictability of the segment (2.10) using all
                               earlier segments in the same word as context, residualized using segment
                               probability and segment informativity

The model was reevaluated using a mixed effects model with the identity of

the word as a random effect using R’s lme4 package (Bates et al., 2011). The

pMCMC-values reported below were evaluated using the function pval.fnc() from

R’s languageR package (Baayen, 2011). pMCMC values are computed using Markov

chain Monte Carlo sampling of the data and are more conservative than p values. See

Baayen et al. (2008) for further details.


Results All four variables of interest affected the duration of intervocalic conso-

nants. High word frequency predicted shorter segment duration (pMCMC < 0.001),

as predicted by many previous studies (Zipf, 1949; Bell et al., 2009, among oth-

ers). High uniphone score (negative log segment probability) was correlated with

longer duration (pMCMC < 0.001). Similarly, residual informativity was correlated

with longer duration (pMCMC < 0.001), and so was residual local predictability

(pMCMC < 0.001).14

Among the control variables, the duration of segments that followed stressed vow-

els was shorter (pMCMC < 0.001), but the duration of segments that preceded

stressed vowels was longer (pMCMC < 0.001), an asymmetry that is explained by

American English tapping patterns – /t,d/ tap in intervocalic contexts that follow

stressed syllables and precede unstressed syllables (Kahn, 1976), and the duration of

taps ([ɾ]) is shorter than the duration of [t] and [d]. The duration of segments was

significantly affected by their part of speech (pMCMC < 0.01). Finally, segments

were shorter the further they were from phrase-final position, as predicted by end-of-

phrase lengthening. For a complete list of control variables and their effect, please

see §2.9.1.

Discussion The results show the role of information theoretic measurement in af-

fecting the duration of segments. After controlling for phonological features such as

place of articulation and phonetic properties such as rate of speech, the duration of

segments that had high uniphone score (low probability), high informativity and that

were unpredictable in their context was relatively longer. The durational modulation

of segments is affected by both segment-level information theoretic factors such as

informativity and by word-level information theoretic factors such as word frequency.

These results establish the importance of segment informativity as they show

that informativity affects not only deletion ratios, but also the duration of segments.

Reduced duration may in turn lead to deletion, which is the focus of the following

section.

14 Word frequency is measured in log number of words in several corpora. Higher frequency yields higher log frequency. Frequent, predictable and low-informativity segments provide less information and therefore have lower uni/bi/triphone or local predictability or informativity scores. They are measured using negative log (conditional) probability.


2.5.3 Intervocalic segment deletion

Introduction The previous study shows that the duration of segments is affected

by information theoretic factors. But diminished duration does not have to lead to

deletion. Deletion may happen when the duration of a segment is reduced beyond

the minimal duration that would allow the articulators to pronounce it. For different

segments the threshold beyond which deletion would occur may be different. Deletion

may therefore be independent of durational modulation, and independently driven.

Another important property of deletion processes is that durational modulation

is not considered to be part of competence grammar.15 Though occasional deletion of

segments is not considered to be part of competence grammar either, it is very likely

that what eventually becomes part of competence grammar begins as frequent

occasional deletion that subsequently gets grammaticalized and preserved in the grammar.

Thus, though word-medial /t/-deletion is not part of English grammar, word-final

/t/-deletion is allowed in many English dialects. In contrast, similar processes such

as /k/-deletion are not allowed in either environment. It is not far-fetched to claim

that the occasional deletion of word-medial /t/ is related to word-final deletion, and

perhaps driven underlyingly by similar factors.

It is therefore important to verify that information theoretic factors affect the

deletion ratios of segments in the same environment in which durational effects were

found.

Method and materials The corpus used in this study was aligned using the same

procedure described in the previous study and in §2.7. The control variables used in

the previous study were used in this study as well. I used the phonological control

variables in table 2.5 to control for the base properties of the segment in question.

Finally, I used the phonological control variables in table 2.6 to control for the properties

of both vowels and word-level variables. Since deleted segments were not excluded

the same procedure yielded 30,052 observations.

I used logistic regression to estimate how each factor affects the likelihood of seg-

ments to delete. The best model that excludes information theoretic factors was

15 Several exceptions to this generalization do exist. Consonant and vowel duration is contrastive in many languages. This set of studies focuses on the duration of obstruents, which is not contrastive in English.


chosen using the step() function in R. Subsequently, I added the same four informa-

tion theoretic variables of interest, as defined in table 2.7, and estimated from spoken

corpora using the CMU dictionary (Weide, 1998) for phonemic (underlying) repre-

sentation and the Switchboard (Godfrey and Holliman, 1997), Fisher (Cieri et al.,

2004, 2005) and Buckeye (Pitt et al., 2007) corpora for word counts.16 The model

was reevaluated using a mixed effects model with the identity of the word as a ran-

dom effect using R’s lme4 package (Bates et al., 2011). pMCMC-values (Baayen

et al., 2008) are not reported here since lme4 does not allow them to be computed

for logistic regressions. Instead, p-values are reported.

Results As expected, high word frequency increased the likelihood to delete (p

< 0.001), and high informativity score decreased the likelihood to delete. Uniphone

segment probability (its negative log probability) did not affect the segment’s likelihood

to delete, and neither did its residual contextual predictability.

Among the control variables, previous stress did not affect the segments’ likelihood

to delete but the following stress did – segments that were followed by stressed vowels

were less likely to delete (p < 0.001). Segments’ likelihood to delete was significantly

affected by their part of speech (p < 0.001).17 Finally, segments in words that were

more distant from phrase-final positions were more likely to be deleted (p < 0.001).

For a complete list of control variables and their effect, please see §2.9.2.

Discussion The lack of effect for the segment’s uniphone probability and local pre-

dictability may be due to the greater number of observations required in a logistic

regression than in a linear regression, which may also have caused the effect of pre-

ceding stressed vowel to disappear. However, there may be other reasons. As I noted

before, deletion and reduction in duration do not necessarily have identical causes. It

is therefore important that in both the duration study and the deletion study, both

the amount of information the segment carries and the amount of information the

word carries affected segment duration and deletion in the expected direction: the

16 For further details, see §2.8.

17 The coefficient for part of speech has to be ≥ 0. The variable contains for each segment the log mean duration of segments that have the same part of speech. If the part of speech were irrelevant to segment duration, the coefficient would have been close to 0.


more information the word or the segment held, the longer their duration was and

the less likely they were to be deleted.

2.5.4 Postvocalic segment duration

Introduction The goal of this study and the subsequent study is to verify that

the information theoretic effects found for intervocalic consonants persist in other

environments. This environment replaces the coda environment used in Raymond

et al. (2006), Cohen Priva (2008) and Cohen Priva and Jurafsky (2008), even though

some of the segments in such positions are actually the first consonant in a complex

onset. For instance, CELEX (Baayen et al., 1995) treats the /s/ in estrange and the

/p/ in appreciate as the first consonant of the second syllable in both words.

The expectation is that the correlation between duration and information that was

observed in intervocalic context would be replicated in postvocalic preconsonantal

environment as well.

Method and materials The procedure used in this study is very similar to the one

used in the intervocalic duration study described in §2.5.2. The procedure described in

§2.7 was used to align the dictionary representations of the Buckeye corpus with their

actual pronunciation. Segment duration, rate of speech and phonological properties

were collected as detailed above. I excluded all deleted segments and segments with

unusually long or short duration (top and bottom 2.5% of each segment). This

procedure yielded 35,081 data points.

As in the previous duration study, I used linear regression to calculate the log du-

ration of segments. I used the phonological control variables in table 2.5 to control for

the base properties of the segment in question. I used the same phonological control

variables for the following consonant, as well as a variable that indicates whether the

two consonants share a place of articulation. Finally, I used the phonological control

variables in table 2.6 to control for the properties of the preceding vowel and word-level

variables.

I used the step() function to allow the best non information theoretic model to be

chosen automatically, and then added the same four information theoretic variables

of interest: word frequency, segment probability (uniphone), segment informativity


and the local predictability of the segment, all defined in table 2.7, and estimated

from spoken corpora following a procedure detailed in §2.8.

The model was reevaluated using a mixed effects model with the identity of the

word as a random effect using R’s lme4 package. The pMCMC-values reported below

were evaluated using the function pval.fnc() from R’s languageR package.

Results The results were similar to the intervocalic consonant duration case. All

four variables of interest affected the duration of postvocalic preconsonantal consonants in the

predicted direction. High word frequency predicted shorter segment duration (pMCMC

< 0.001), high uniphone score was correlated with longer duration (pMCMC <

0.001), and so were informativity (pMCMC < 0.001) and local predictability (pMCMC

< 0.05).

The control variables followed a similar pattern to the intervocalic duration study.

The duration of segments that followed stressed vowels was shorter (pMCMC <

0.001). The duration of segments was significantly affected by their part of speech

(pMCMC < 0.001). Segments were shorter the further they were from phrase-final

position, as predicted by end-of-phrase lengthening. For a complete list of control

variables and their effect, please see §2.9.3.

Discussion The results provide further support for the intervocalic consonant du-

ration study. The same variables had an identical effect, which shows that the corre-

lation between duration and information was not environment-specific, but rather a

fundamental property of American English segments.

2.5.5 Postvocalic segment deletion

Introduction As with the intervocalic case, it is not necessary that reduced du-

ration would lead to deletion. As I argued above, high ratios of occasional deletion

patterns are arguably a necessary step before optional and obligatory deletions are

encoded in speakers’ competence grammar.

I replicate the intervocalic study of consonant deletion in postvocalic preconsonan-

tal positions to see whether in this environment, too, high information is correlated

with reduced likelihood to delete.


Method and materials The corpus used for this study is identical to the one used

for the duration study in the same environment, except that deleted segments were

not excluded. The control variables used in the previous study were used in this

study as well. I used the phonological control variables in table 2.5 to control for

the base properties of the segment in question. I used the same phonological control

variables for the following consonant, as well as a variable that indicates whether

the two consonants share a place of articulation. Finally, I used the phonological

control variables in table 2.6 to control for the properties of the preceding vowel and

word-level variables. Since deleted segments were not excluded, the same procedure yielded

39,265 observations.

I used logistic regression to estimate how each factor affects the likelihood of seg-

ments to delete. The best model that excludes information theoretic factors was

chosen using the step() function in R. Subsequently, I added the same four informa-

tion theoretic variables of interest, as defined in table 2.7. The model was reevaluated

using a mixed effects model with the identity of the word as a random effect using

R’s lme4 package. pMCMC-values are not reported here since lme4 does not allow

them to be computed for logistic regressions. Instead, p-values are reported.

Results Three of the information theoretic variables affected segments’ likelihood

to be deleted. High word frequency predicted higher probability to be deleted (p <

0.001), high uniphone score decreased the likelihood to be deleted (p < 0.05), and

high informativity decreased segments’ likelihood to be deleted (p < 0.05). However,

there was no residual effect for local predictability.

The control variables followed a pattern similar to the previous study. Segments

that followed a vowel with primary stress were more likely to be deleted (p < 0.001)

than those that followed unstressed vowels, but those that followed vowels with

secondary stress did not differ from those that followed unstressed vowels. Segments’ likelihood to be

deleted was significantly affected by their part of speech (p < 0.001).

Segments were more likely to be deleted the further they were from phrase-final position.

For a complete list of control variables and their effect, please see §2.9.4.

Discussion Segment deletion in postvocalic positions followed a similar pattern to

segment duration in the same environment for both the information theoretic factors


and the various controls. The repeated similarity to duration studies suggests that

the effect information theoretic variables have on segment duration and deletion ratios

is consistent. High amount of information at both the word level and segment level

leads to increased duration and preservation whereas low amount of information is

associated with reduction in duration and ultimately deletion.

The significance of most variables was lower across the board compared with the

previous study. The absence of significance for some variables may be due to the

loss of predictive power in logistic regressions and not due to fundamentally different

factors.

2.6 Conclusion

Both the theoretical analysis of oral and nasal stop deletion ratios in English and

the data-driven experiments of medial consonant deletion demonstrate the appeal

of an approach that uses a context-independent measurement for phone usefulness

in explaining the typology of consonant deletion processes. However, this out-of-

context measurement emerges from contextual considerations: speakers are biased by

how useful a phone usually is, regardless of the context in which it appears. There

is no clear functional justification for using this aggregate instead of the in-context

predictability, which suggests that informativity becomes part of the knowledge kept

about each phone. This provides us with a relatively rare view into the relationship

between functional and non-functional considerations in human language. Functional

considerations (in this case local predictability) shape a mental representation (in this

case phone informativity), which is then reflected back in language usage.

Previous accounts cannot provide a comprehensive explanation for the in-language

typology of consonant deletion. The different durations and deletion ratios of Ameri-

can English segments cannot be explained by phonetic or phonological reasons such as

markedness or articulatory and perceptual biases. Frequency and local predictability

predict some of the variance that phonological theories do not account for, but do

not explain why phones can be both infrequent and likely to delete, or completely

predictable yet stable. At the same time, previous information-theoretic measures

such as local predictability cannot be dismissed, as the various models supported the


use of local predictability and segment frequency. Informativity complements current

accounts by providing a mechanism that accounts for the shortcomings of current

theories.

This chapter revisits and expands on the findings of Cohen Priva (2008) and Co-

hen Priva and Jurafsky (2008). It establishes the importance of information content

and in particular segment informativity to performance-related phonetic and phono-

logical phenomena, the duration and deletion ratios of segments. As already hinted

in this chapter, the next step is to see how the same factors lead to competence-based

phenomena such as the actuation of phonological processes, and subsequently how

such processes are encoded in the lexicon and reflected in usage preferences.

2.7 Segment Alignment Process

The Buckeye corpus (Pitt et al., 2007) provides for each word several values: the

speaker of the word, the duration of the word, the word in English (2.25), the dictio-

nary (idealized) phonetic form (2.26), the word’s actual pronunciation (2.27) and its

part of speech (2.28).

(2.25) notice: notice

(2.26) n ow t ih s: /no͡ʊtɪs/

(2.27) n ow ah s: [noʊəs]

(2.28) VBP: present tense verb, not 3rd person

Part of the challenge in using the corpus is to have a disciplined way to

understand that in (2.27), /t/ was dropped by the speaker, that /ɪ/ surfaced as [ə], and

that the other segments remained unchanged. One way to align two strings together

is to minimize a metric of edit-distance between the two strings, and use the list of

edits that yielded the minimal edit-distance. One such way is to minimize the

Levenshtein distance (Levenshtein, 1966). A simple calculation of edit distance between

strings assigns an identical penalty to each of the three edit operations (substitution, deletion and insertion).

Thus, the (minimal) edit distance between


the strings “bests” and “guest” is 3: ‘b’ was replaced with ‘g’, ‘u’ was inserted and

‘s’ was deleted.18
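The equal-penalty calculation described above can be sketched as follows. This is a minimal illustration rather than the code used for the dissertation; the function name `levenshtein` is chosen here for convenience.

```python
def levenshtein(source, target):
    """Minimal number of substitutions, deletions and insertions
    needed to turn `source` into `target` (every edit costs 1)."""
    # previous[j] holds the distance between the source prefix processed
    # so far and the first j symbols of target.
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]  # deleting the first i source symbols
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j - 1] + (s != t),  # substitution (free if identical)
                previous[j] + 1,             # deletion of s
                current[j - 1] + 1,          # insertion of t
            ))
        previous = current
    return previous[-1]


print(levenshtein("bests", "guest"))  # 3
print(levenshtein("ab", "bc"))        # 2
```

The function returns only the distance. Recovering the list of edits, which is what the alignment needs, additionally requires keeping the full table and tracing back through it.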

It is not advisable to use equal penalties for the three edit operations because some

substitutions are phonologically plausible and other substitutions are not motivated.

The penalties for each insertion, deletion and substitution were therefore modified

to reflect phonological plausibility. I used the penalties in table 2.8 to align the

underlying representations with their surface forms. The results of the alignment

process (the number of each segment aligned with each other segment, excluding

vowels) can be found at the end of this section.

Table 2.8: Dictionary and surface alignment penalties

Source            Target                   Penalty
segment           same segment             0.0
segment           ∅ (deletion)             1.0
∅                 segment (insertion)      1.0
vowel             another vowel            0.3
nasal             another nasal            0.3
dorsal stop       another dorsal stop      0.3
sibilant          another sibilant         0.3
/t,d,θ,ð/         another /t,d,θ,ð/        0.3
/l,l̩/             another /l,l̩/            0.3
/r,ɚ/             another /r,ɚ/            0.3
/t,d,ʔ,ɾ/         another /t,d,ʔ,ɾ/        0.3
/p,b,f,v/         another /p,b,f,v/        0.3
/m,b/             another /m,b/
vowel             non-vowel                5.0
non-vowel         vowel                    5.0
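A weighted alignment of this kind can be sketched as below, using example (2.26)/(2.27). Several details are assumptions made only for illustration and are not taken from the text: the segment classes are a partial rendering of table 2.8 in Buckeye-style labels (with `tq` and `dx` assumed to stand for the glottal stop and the tap), the vowel set is incomplete, and `DEFAULT_SUBSTITUTION` is a hypothetical penalty for consonant pairs that the table does not list.

```python
import math

# Assumed, partial label sets; not the full inventory used in the dissertation.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "ey", "ih", "iy", "ow", "uh", "uw"}
SIMILAR_CLASSES = [
    {"m", "n", "ng"},        # nasals
    {"k", "g"},              # dorsal stops
    {"s", "z", "sh", "zh"},  # sibilants
    {"t", "d", "th", "dh"},
    {"l", "el"},
    {"r", "er"},
    {"t", "d", "tq", "dx"},
    {"p", "b", "f", "v"},
]
DEFAULT_SUBSTITUTION = 1.0  # hypothetical: table 2.8 gives no value for other pairs


def substitution_penalty(a, b):
    if a == b:
        return 0.0
    if (a in VOWELS) != (b in VOWELS):
        return 5.0  # vowel aligned with non-vowel
    if a in VOWELS:
        return 0.3  # vowel aligned with another vowel
    if any(a in cls and b in cls for cls in SIMILAR_CLASSES):
        return 0.3
    return DEFAULT_SUBSTITUTION


def align(source, target, indel=1.0):
    """Return (total penalty, aligned pairs); None marks a deletion or insertion."""
    n, m = len(source), len(target)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * indel
    for j in range(1, m + 1):
        cost[0][j] = j * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + substitution_penalty(source[i - 1], target[j - 1]),
                cost[i - 1][j] + indel,   # delete source[i - 1]
                cost[i][j - 1] + indel,   # insert target[j - 1]
            )
    # Trace back from the end; ties are broken arbitrarily (substitution first).
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and math.isclose(
            cost[i][j],
            cost[i - 1][j - 1] + substitution_penalty(source[i - 1], target[j - 1]),
            abs_tol=1e-9,
        ):
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and math.isclose(cost[i][j], cost[i - 1][j] + indel, abs_tol=1e-9):
            pairs.append((source[i - 1], None))
            i -= 1
        else:
            pairs.append((None, target[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


total, pairs = align(["n", "ow", "t", "ih", "s"], ["n", "ow", "ah", "s"])
print(pairs)
```

For this input the cheapest route deletes /t/ (1.0) and substitutes one vowel for another (0.3), which matches the reading of (2.27) given earlier.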

In order to get reliable predictability scores, the Switchboard (Godfrey and
Holliman, 1997) and Fisher (Cieri et al., 2004, 2005) corpora were used to provide word
counts in addition to the Buckeye corpus. For many of these words the Buckeye
corpus did not provide dictionary representations and the CMU dictionary was used
instead. The Buckeye dictionary representation is similar to the CMU representation,
but the two are not identical. When a word’s two representations differed only by the
substitutions in table 2.9, the word was allowed to have its CMU representation. Any
other substitution meant that the word was excluded from the data.

18 If two routes have the same penalty, it is not defined which one is better. The (minimal) edit-distance between “ab” and “bc” is 2, but there are two ways to get to that value: substitute ‘b’ for ‘a’ and ‘c’ for ‘b’ (1+1) or delete ‘a’ and insert ‘c’ (1+1). In these cases the algorithm arbitrarily decides between the two.

Table 2.9: Buckeye to CMU valid substitution

CMU                       Buckeye
any vowel                 any vowel
/s/                       /ʃ/
/ʃ/                       /s/
/s/                       /z/
/z/                       /s/
/t/                       /d/
/d/                       /t/
/ɹ/                       /ɚ/
/ɚ/                       /ɹ/
any vowel + /l,m,n,ɹ/     /l̩,m̩,n̩,ɚ/ respectively

Finally the word-level files listed above had to be aligned with the segment-level

files which contained segment duration. Segments in either file that did not have an

equivalent in the other file were removed from the data.


phone aa ae ah ao aw ay b ch d dh dx eh ehn el en er

b 0 0 0 0 0 0 3877 0 0 0 0 0 0 0 0 0

ch 0 0 0 0 0 0 0 634 0 0 0 0 0 0 0 0

d 0 0 0 0 0 0 1 1 2683 10 1361 0 0 0 0 0

dh 0 0 0 0 0 0 0 0 0 1282 29 0 0 0 0 0

f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

hh 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

jh 0 0 0 0 0 0 0 15 1 0 1 0 0 0 0 0

k 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

l 0 0 0 0 0 0 0 0 0 1 1 0 0 345 0 0

m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

n 0 0 0 0 0 0 0 0 4 0 1 0 0 0 648 0

ng 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

p 0 0 0 0 0 0 112 0 0 0 0 0 0 0 0 0

r 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 364

s 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0

sh 0 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0

t 0 0 0 0 0 0 1 434 516 2 2820 0 0 0 0 0

th 0 0 0 0 0 0 3 0 5 137 3 0 0 0 0 0

v 0 0 0 0 0 0 10 0 0 0 1 0 0 0 0 0

w 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

z 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

zh 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

phone f g hh jh k l m n ng nx p r

b 0 0 0 0 0 0 11 0 0 0 11 2

ch 0 0 0 0 1 0 0 0 0 0 0 0

d 0 2 0 15 0 1 2 10 0 2 0 2

dh 0 0 0 0 0 1 0 0 0 0 0 0

f 1862 0 0 0 0 0 0 0 0 0 4 0

g 0 1446 0 0 18 0 0 0 3 0 0 0

hh 0 0 324 0 0 0 0 0 0 0 0 0

jh 0 0 0 760 0 0 0 0 0 0 0 0

k 0 96 4 0 8208 0 0 0 0 0 0 0

l 0 0 0 0 0 10415 1 2 0 1 0 1

m 0 0 0 0 0 0 5226 14 0 8 2 0

n 0 0 0 0 0 1 100 17020 42 3082 0 2

ng 0 1 0 0 1 0 0 62 2777 1 0 0

p 13 0 0 0 0 0 0 0 0 0 4680 0

r 0 0 1 1 0 1 0 0 0 0 0 12763

s 0 0 0 0 1 0 0 0 0 0 0 1

sh 0 0 0 0 0 0 0 0 0 0 0 0

t 0 0 1 1 0 1 0 5 0 3 2 2

th 3 0 0 0 0 1 1 3 0 0 3 0

v 36 0 0 0 0 0 0 0 0 0 0 2

w 0 0 0 0 0 0 0 0 0 0 0 0

y 0 0 0 0 0 0 0 0 0 0 0 0

z 0 0 0 0 0 0 0 0 0 0 0 0

zh 0 0 0 0 0 0 0 0 0 0 0 0


phone s sh t th tq v w y z em zh

b 0 0 0 0 0 91 5 0 0 0 0

ch 0 103 4 0 0 0 0 1 0 0 1

d 0 0 23 0 11 0 0 0 0 0 2

dh 0 0 0 2 0 0 0 0 0 0 0

f 0 0 0 0 0 11 0 0 0 0 0

g 0 0 0 0 0 0 0 1 0 0 0

hh 0 0 0 0 0 0 0 0 0 0 0

jh 0 8 0 0 0 0 0 0 3 0 50

k 0 0 1 0 1 0 0 0 0 0 0

l 0 0 0 1 0 0 10 0 0 0 0

m 0 0 0 0 0 0 1 0 0 180 0

n 1 0 35 0 2 0 0 0 0 6 0

ng 0 0 0 0 0 0 0 0 0 0 0

p 0 0 0 0 1 31 0 0 0 0 0

r 1 0 0 0 0 0 0 1 0 0 1

s 9141 213 2 0 0 0 0 0 107 0 8

sh 4 1513 0 0 0 0 0 0 2 0 4

t 7 13 6537 2 576 0 0 1 3 0 1

th 0 0 44 1181 20 0 0 0 1 0 0

v 0 0 0 0 1 4092 2 0 0 0 0

w 0 0 0 0 0 0 1658 0 0 0 0

y 0 0 0 0 0 0 0 1089 0 0 0

z 311 2 0 0 0 0 0 0 1617 0 14

zh 0 4 0 0 0 0 0 0 2 0 84

2.8 Calculating information theoretic measurements

In order to calculate the frequency, predictability and informativity of each segment,

I used several corpora of spoken American English. I collected word counts from the

Switchboard (Godfrey and Holliman, 1997), Fisher (Cieri et al., 2004) and Buckeye

(Pitt et al., 2007) corpora. Each word was assumed to have its phonetic representation

in the CMU dictionary (Weide, 1998). The following information theoretic variables

were assessed using maximum likelihood estimates from the corpora.

• Word frequency is the number of times the word appeared in the corpora.

• Segment probability is estimated from the number of times the segment appeared in the
dictionary representation of a word that appeared in the corpora (ignoring deletions,
lenitions etc.), relative to the occurrences of all segments. The negative log (base 2)
of segment probability is taken. This is the number of bits of information that the
segment holds if no other information is known.

• Segment predictability is the number of times a segment appeared following
the segments that precede it from the beginning of the word, over the number
of times those preceding segments appeared with any following
segment (or ended the word with no segment following). In (2.29) the corpus contains
just three words. The predictability of /s/ in the word talks is 0.25, as /s/ follows
the talk- prefix once for every four occurrences of the prefix. The negative log
(base 2) of predictability is taken. This is the number of bits of information
that the segment holds if no other information is known except the preceding
segments.

(2.29) Sample word counts:

talk 200

talks 100

talking 100

The predictability of /s/ in the word talks: 100 / (200 + 100 + 100) = 0.25

The negative log predictability of /s/ in the word talks is − log2 0.25 = 2.

• Segment informativity is the weighted average of the negative log predictability

of all the occurrences of a segment. In (2.30) the corpus consists of six words,

and /s/ appears twice, once in the word talks in which it follows talk- with

probability of 0.25 (− log2 0.25 = 2), and once in the word walks in which

it follows walk- with probability of 0.5 (− log2 0.5 = 1). /s/ appears in talks

100 times, and in walks 300 times. The informativity of /s/ in this corpus is

(100 ∗ 2 + 300 ∗ 1) / (100 + 300) = 1.25.

(2.30) Sample word counts:

talk 200

talks 100

talking 100

walk 150

walks 300

walking 150


The predictability of /s/ in the word walks: 300 / (150 + 300 + 150) = 0.5

The negative log predictability of /s/ in the word walks is − log2 0.5 = 1.

• Residual segment informativity: informativity is highly collinear with frequency.
In order to remove that collinearity, segment informativity is residualized using
segment probability. Suppose that over all the observations the informativity
of the segments is $\vec{y}$ and their (negative log) probability is $\vec{x}$. A linear
regression is performed, which fits $\vec{y} \approx a\vec{x} + b$, where the values of $a$ and $b$
are chosen so that the predictions of the regression fit $\vec{y}$ as closely as possible.
The predicted value of this regression is $\mathrm{predicted}(\vec{y}) = a\vec{x} + b$. Rather than
use $\vec{y}$ to approximate informativity in subsequent regressions,
$\vec{y} - \mathrm{predicted}(\vec{y})$ is used, thereby leaving only that part
of informativity which is not explained by segment probability.

• Residual segment predictability: both informativity and frequency are highly
collinear with negative log predictability. In order to remove that collinearity,
negative log predictability is residualized using negative log probability and
informativity. Suppose that over all the observations the negative log predictability
is $\vec{z}$, the informativity is $\vec{y}$ and the negative log probability is $\vec{x}$.
A linear regression is performed, which fits $\vec{z} \approx a\vec{x} + b\vec{y} + c$, where the
values of $a$, $b$ and $c$ are chosen so that the predictions of the regression fit $\vec{z}$
as closely as possible. The predicted value of this regression is
$\mathrm{predicted}(\vec{z}) = a\vec{x} + b\vec{y} + c$. Rather than use $\vec{z}$ to approximate
negative log predictability in subsequent regressions, $\vec{z} - \mathrm{predicted}(\vec{z})$
is used, thereby leaving only that part of negative log segment predictability
which is not explained by informativity or negative log probability.
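The measurements defined in this list can be sketched on the toy corpus of (2.30). This is an illustrative sketch, not the code used for the dissertation: the segment labels in `WORD_COUNTS` are assumed CMU-style transcriptions, all function names are chosen here, and `residualize` covers only the single-predictor case used for residual informativity (residual predictability would need a second predictor).

```python
import math
from collections import defaultdict

# Toy corpus of (2.30); each word is a tuple of assumed CMU-style segment labels.
WORD_COUNTS = {
    ("t", "ao", "k"): 200,
    ("t", "ao", "k", "s"): 100,
    ("t", "ao", "k", "ih", "ng"): 100,
    ("w", "ao", "k"): 150,
    ("w", "ao", "k", "s"): 300,
    ("w", "ao", "k", "ih", "ng"): 150,
}


def prefix_counts(word_counts):
    """Token count of every word-initial segment sequence, whole words included."""
    counts = defaultdict(int)
    for word, n in word_counts.items():
        for end in range(len(word) + 1):
            counts[word[:end]] += n
    return counts


def neg_log_predictability(word, index, counts):
    """Bits carried by word[index] given the segments that precede it in the word."""
    return -math.log2(counts[word[:index + 1]] / counts[word[:index]])


def informativity(segment, word_counts, counts):
    """Token-weighted average of the segment's negative log predictability."""
    bits, tokens = 0.0, 0
    for word, n in word_counts.items():
        for index, seg in enumerate(word):
            if seg == segment:
                bits += n * neg_log_predictability(word, index, counts)
                tokens += n
    return bits / tokens


def residualize(y, x):
    """Residuals of y after a least-squares fit y ≈ a*x + b (one predictor)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    a = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    b = mean_y - a * mean_x
    return [yi - (a * xi + b) for xi, yi in zip(x, y)]


counts = prefix_counts(WORD_COUNTS)
talks = ("t", "ao", "k", "s")
print(neg_log_predictability(talks, 3, counts))  # 2.0
print(informativity("s", WORD_COUNTS, counts))   # 1.25
```

In an analysis of this kind, `residualize` would be called with one informativity value and one negative log probability value per observation, and its output would replace raw informativity in the later regressions.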


2.9 Deletion and duration models

2.9.1 Intervocalic segment duration model

Factor Estimate MCMCmean HPD95lower HPD95upper pMCMC Pr(> |t|)

Intercept -2.7052 -2.6631 -2.8509 -2.4613 0.0001 0.0000

rate of speech (log) -0.3285 -0.3304 -0.3483 -0.3126 0.0001 0.0000

manner: nasal -0.6809 -0.6673 -0.6894 -0.6451 0.0001 0.0000

manner: stop -0.5599 -0.5366 -0.5561 -0.5175 0.0001 0.0000

manner: affricate -0.0217 -0.0167 -0.0590 0.0279 0.4460 0.3736

primary stress vowel next 0.1755 0.1857 0.1619 0.2091 0.0001 0.0000

secondary stress vowel next 0.0946 0.1091 0.0825 0.1383 0.0001 0.0000

seg. is voiced -0.3690 -0.3746 -0.3929 -0.3566 0.0001 0.0000

seg. POA: coronal 0.7533 0.7475 0.6748 0.8228 0.0001 0.0000

seg. POA: dorsal 1.0556 1.0412 0.9690 1.1181 0.0001 0.0000

seg. POA: labial 0.8054 0.7900 0.7199 0.8619 0.0001 0.0000

seg. subplace: dental -0.5154 -0.5162 -0.5586 -0.4747 0.0001 0.0000

seg. subplace: palatal -0.2114 -0.2172 -0.2743 -0.1551 0.0001 0.0000

base duration for POS (log) 0.0580 0.0785 0.0267 0.1325 0.0042 0.0660

primary stress vowel precedes -0.1309 -0.1264 -0.1462 -0.1050 0.0001 0.0000

secondary stress vowel precedes -0.2240 -0.2178 -0.2444 -0.1900 0.0001 0.0000

distance from word end (log) 0.0510 0.0468 0.0262 0.0660 0.0001 0.0000

distance from end of phrase (log) -0.0205 -0.0206 -0.0246 -0.0166 0.0001 0.0000

distance from word start (log) -0.0010 -0.0090 -0.0330 0.0181 0.4842 0.9447

distance from start of phrase (log) 0.0140 0.0138 0.0094 0.0182 0.0001 0.0000

word frequency (log) -0.0081 -0.0089 -0.0133 -0.0045 0.0001 0.0047

seg. probability (log) 0.1343 0.1400 0.1261 0.1532 0.0001 0.0000

seg. informativity 0.0999 0.0967 0.0865 0.1071 0.0001 0.0000

seg. predictability (log) 0.0083 0.0080 0.0043 0.0116 0.0001 0.0001


2.9.2 Intervocalic segment deletion model

Factor Estimate Std. Error z value Pr(> |z|)

Intercept -9.1568 0.8661 -10.5722 0.0000

rate of speech (log) 2.5127 0.1672 15.0255 0.0000

primary stress vowel next -2.0467 0.1557 -13.1488 0.0000

secondary stress vowel next -1.1507 0.1836 -6.2666 0.0000

manner: nasal 1.9423 0.1772 10.9623 0.0000

manner: stop 1.8325 0.1711 10.7124 0.0000

manner: affricate 11.7542 189.5645 0.0620 0.9506

base deletion ratio for POS (log) 0.6699 0.0544 12.3206 0.0000

seg. POA: coronal -3.9470 0.4161 -9.4858 0.0000

seg. POA: dorsal -5.1174 0.4343 -11.7838 0.0000

seg. POA: labial -4.0775 0.3813 -10.6943 0.0000

seg. is voiced 0.6088 0.1154 5.2745 0.0000

distance from word start (log) 1.0311 0.1109 9.2935 0.0000

distance from end of phrase (log) 0.2736 0.0348 7.8542 0.0000

distance from word end (log) 1.0864 0.0941 11.5499 0.0000

seg. subplace: dental 1.1162 0.2910 3.8355 0.0001

seg. subplace: palatal -13.4598 189.5624 -0.0710 0.9434

word frequency (log) 0.2603 0.0223 11.6516 0.0000

seg. probability (log) -0.1632 0.1116 -1.4625 0.1436

seg. informativity -0.2852 0.0907 -3.1439 0.0017

seg. predictability (log) -0.0646 0.0278 -2.3268 0.0200


2.9.3 Postvocalic segment duration model

Factor Estimate MCMCmean HPD95lower HPD95upper pMCMC Pr(> |t|)

Intercept -0.9082 -0.9158 -1.1088 -0.7221 0.0001 0.0000

rate of speech (log) -0.5007 -0.5001 -0.5174 -0.4829 0.0001 0.0000

manner: nasal -0.4795 -0.4710 -0.4952 -0.4467 0.0001 0.0000

manner: stop -0.3601 -0.3534 -0.3831 -0.3238 0.0001 0.0000

manner: affricate 0.2568 0.2645 0.1349 0.3927 0.0002 0.0003

base duration for POS (log) 0.0837 0.0967 0.0425 0.1503 0.0001 0.0046

distance from end of phrase (log) -0.0861 -0.0857 -0.0900 -0.0814 0.0001 0.0000

primary stress vowel precedes -0.1404 -0.1417 -0.1567 -0.1261 0.0001 0.0000

secondary stress vowel precedes -0.1437 -0.1412 -0.1718 -0.1102 0.0001 0.0000

seg.: voiced -0.4137 -0.4128 -0.4436 -0.3827 0.0001 0.0000

next seg.: voiced 0.1785 0.1807 0.1628 0.2001 0.0001 0.0000

next seg. manner: glide 0.2221 0.2303 0.1909 0.2685 0.0001 0.0000

next seg. manner: liquid 0.2561 0.2471 0.2194 0.2763 0.0001 0.0000

next seg. manner: nasal 0.0118 0.0135 -0.0355 0.0601 0.5892 0.6595

next seg. manner: stop -0.0713 -0.0743 -0.0914 -0.0569 0.0001 0.0000

next seg. manner: affricate -0.0726 -0.0758 -0.1208 -0.0275 0.0016 0.0071

seg. POA: dorsal -0.0434 -0.0429 -0.0809 -0.0039 0.0334 0.0476

seg. POA: labial -0.0872 -0.0970 -0.1311 -0.0649 0.0001 0.0000

distance from word end (log) -0.0669 -0.0706 -0.0899 -0.0532 0.0001 0.0000

seg. subplace: dental -0.5014 -0.5176 -0.6524 -0.3831 0.0001 0.0000

seg. subplace: palatal -0.1672 -0.2036 -0.3175 -0.1008 0.0004 0.0063

next seg. has same POA -0.0035 0.0071 -0.0150 0.0291 0.5188 0.7754

distance from word start (log) -0.0480 -0.0496 -0.0735 -0.0269 0.0001 0.0003

word frequency (log) -0.0121 -0.0123 -0.0163 -0.0082 0.0001 0.0000

seg. probability (log) 0.0854 0.0944 0.0745 0.1140 0.0001 0.0000

seg. informativity 0.0454 0.0515 0.0391 0.0631 0.0001 0.0000

seg. predictability (log) 0.0041 0.0041 -0.0001 0.0080 0.0470 0.0627


2.9.4 Postvocalic segment deletion model

Factor Estimate Std. Error z value Pr(> |z|)

Intercept -3.8959 0.5217 -7.4675 0.0000

rate of speech (log) 1.3780 0.1138 12.1109 0.0000

base deletion ratio for POS (log) 0.8591 0.0666 12.8927 0.0000

next seg. manner: glide -1.8102 0.3099 -5.8413 0.0000

next seg. manner: liquid 0.8470 0.0841 10.0758 0.0000

next seg. manner: nasal 0.4439 0.1596 2.7820 0.0054

next seg. manner: stop 0.1320 0.0649 2.0330 0.0421

next seg. manner: affricate 0.7641 0.1555 4.9152 0.0000

manner: nasal 1.4454 0.0919 15.7216 0.0000

manner: stop 1.2202 0.1100 11.0919 0.0000

manner: affricate -0.9003 204.4568 -0.0044 0.9965

seg. POA: dorsal -0.4479 0.1932 -2.3181 0.0204

seg. POA: labial 0.5075 0.1272 3.9911 0.0001

seg. is voiced 1.7014 0.1211 14.0458 0.0000

next seg. is voiced -0.9739 0.0651 -14.9563 0.0000

distance from word start (log) 0.7052 0.0758 9.3054 0.0000

seg. subplace: dental 4.3805 0.3735 11.7274 0.0000

seg. subplace: palatal -9.1966 131.8608 -0.0697 0.9444

next seg. has same POA 0.1213 0.0928 1.3072 0.1911

distance from end of phrase (log) 0.0798 0.0254 3.1454 0.0017

primary stress vowel precedes 0.1084 0.0600 1.8071 0.0708

secondary stress vowel precedes -0.0346 0.1466 -0.2358 0.8136

word frequency (log) 0.0519 0.0115 4.5175 0.0000

seg. probability (log) -0.6912 0.0840 -8.2335 0.0000

seg. informativity -0.3961 0.0438 -9.0479 0.0000

seg. predictability (log) 0.0356 0.0160 2.2248 0.0261

Chapter 3

Faithfulness as Information Utility

3.1 Introduction

Across languages, weakening processes such as lenition, deletion and neutralization

may target any given segment of speech. Currently, linguistic theory does not have

means of predicting why some segments are more likely to undergo weakening in

some languages rather than in others. However, it is clear that in a number of
unrelated languages (e.g. English, Arabic and Huallaga Quechua) a segment or a set
of segments is targeted by parallel weakening processes: several weakening processes
that are unrelated to one another. In some of these cases certain segments are
weakened in multiple environments of a single language variety, while in other cases
certain segments are weakened

in a single environment, but the weakening pattern differs among several varieties of

the language. When parallel weakening processes can be shown to be structurally

or functionally unrelated to one another, they present a significant challenge to lin-

guistic theory, as there is currently no disciplined way to attribute a tendency to

weaken to a subset of segments in a language. The possibility of accounting for a

segment’s tendency to weaken touches upon the famous actuation problem presented

in Weinreich et al. (1968), as such an account would shed light on some of the reasons

that lead to actuation of structural changes in human language. Here, I show how

characterizing linguistic faithfulness as a pressure to preserve information – as quan-

tified in information theory – can lend insight into why language-specific pressures

weaken particular linguistic elements. Combined with extant effort-based accounts of


markedness, I show it is possible not only to describe, but also to explain and predict

language-specific weakening patterns.

Despite the cross-linguistic rarity of weakening processes that target coronals

exclusively (Gurevich, 2004), American English has at least two unrelated /t,d/-

weakening processes: intervocalic tapping (Kahn 1976 and Zue and Laferriere 1979

among others) and word-final deletion (Guy 1991 among others). Similarly, in
Huallaga Quechua (Weber, 1989), /q/ surfaces as [g] or [ɣ] in intervocalic contexts and
as [x] before voiceless obstruents; it deletes word-finally. In both languages
intervocalic and word-final weakening share only the segment being targeted, as they occur

in different environments and result in different outcomes. Both American English

and Huallaga Quechua therefore exhibit a case of parallel weakening processes of a

segment in a single variety.

Dialects of Arabic and UK English provide an even more puzzling case of
parallel weakening processes, as in both languages there are different, parallel weakening
processes of the same segment in different varieties: /t/-debuccalization, tapping
and spirantization in UK English dialects (Mathisen 1999; Urszula 2004; Raymond
2004 among many others) as in the Irish English varieties in (3.1; data from Hickey
2009), and /q/→[g], /q/→[ʔ] and /q/→[k] weakening in Arabic dialects as in the
dialects in (3.2; data from Kaye and Rosenhouse 1997), all spoken in close geographical
proximity.

(3.1) Variety ‘butter’

Northern varieties [bʌtˆəɹ]

Southern varieties [bʌɾəɻ]

Vernacular Dublin [bʊʔɐ]

(3.2) Dialect baqara ‘cow’ (MSA)

Druze [baqara]

Nazareth [bakara]

Jerusalem [baʔara]

NW Jordan [bagara]

It is important to note that the processes in both UK English and Arabic are parallel
weakening processes and cannot be regarded as more extreme applications of weakening
along a single cline. In English, the tapping of /t/ retains the place but not the manner

of its articulation while the debuccalization of /t/ retains the manner but not the

place of its articulation. In Arabic, debuccalizing varieties can be shown not to have

undergone a /q/→[k] process, as /k/ remains a distinct phoneme in such dialects.


Currently, linguistic theory cannot explain why several varieties of a single language

would undergo such similar yet independent structural changes.

Some of the current theoretical explanations for the typology of possible and

impossible structural changes in language (Weinreich et al.’s constraints problem)

rely on finding an equilibrium between two or more universal forces. For example,

Optimality Theory (Prince and Smolensky, 1993) balances markedness constraints

with faithfulness constraints. The principle of least effort (Zipf, 1949) balances the

speaker’s economy with the auditor’s economy – speakers’ desire to reduce their ef-

fort is bounded by their need to make themselves understood. Modern phonological

theories that balance effort with perceptual confusability (Flemming, 2004; Boersma,

2003) can also be viewed as balancing speakers’ economy with listeners’ economy.1

By balancing two opposing forces such theories rule out certain linguistic configu-

rations or structural changes. For instance, a change that is undesirable on both

scales (such as increasing effort as well as confusability) is predicted not to occur.

Such theories allow for multiple possible equilibria when a change is desirable on one

scale and undesirable on another (such as increasing effort while decreasing confus-

ability or decreasing effort while increasing confusability). However, since multiple

equilibria are available for universal functional forces it is impossible to predict which

equilibrium each language will choose.2 Consequently, no theory that is based on

balancing universal forces can predict which segments are likely to be weakened in

some language.

To compensate for these limitations, I introduce a new functional force, the preser-

vation of information utility. Information utility represents the amount of information

speakers attribute to linguistic elements, and its preservation applies variably: that

is, the more information a linguistic element carries, the stronger the desire to pre-

serve it. I show that like articulatory effort and confusability, information utility can

be estimated by speakers over language use. However, unlike articulatory effort and

confusability, the information utility of linguistic elements differs across structurally

1 In pragmatics the Q and R Principles of Horn (1984) can be viewed as balancing the same two forces, and Horn does indeed make this comparison.

2 Eternal optimization (Boersma, 2003) relies on the inability to choose among multiple good universal equilibria.


similar languages. Therefore, while the desire to preserve information utility is univer-

sal, a particular linguistic element may be subject to stronger or weaker preservation

forces in some languages as opposed to others. Indeed, other things being equal, lan-

guages in which a given linguistic element (such as /t/) has relatively low information

utility will be less likely to preserve that element.

Building on both effort and information utility, I show that the distribution of

weakening processes in different languages emerges from the interaction between two

opposing forces: avoiding effort and maximizing information utility. Effort avoidance

follows from the principle of least effort (Zipf, 1949), and applies variably: that is, the

more effort the articulation of a given linguistic element requires, the less speakers

will be willing to make that effort. Information utility maximization applies variably

as well – the higher the information utility of a linguistic element, the stronger the

pressure to correctly transmit it to listeners. Under this framework, weakening pro-

cesses are actuated when the effort required by the articulation of a linguistic element

exceeds the effort justified by its information utility. The proposed account suggests,

then, that the actuation of weakening processes is explicable in information-theoretic

terms.

Optimality Theory is well-suited to describe the way a grammar may balance

leveling and preserving forces. In OT structural leveling is motivated by marked-

ness constraints, while preservation is motivated by faithfulness constraints. To bet-

ter subject the proposed constraints to empirical scrutiny, in what follows, I model

markedness in terms of effort, and faithfulness in terms of information utility. I then

show how these two opposing forces can be used to aptly predict the distribution of

weakening processes across languages, once given this quantitative characterization.

3.2 Parallel weakening processes

3.2.1 Weakening patterns are not arbitrary

One of the main goals of every theory in phonology is to describe phonological

alternations. Both diachronic sound change and synchronic alternations demonstrate

that segments are mutable under certain conditions, and various theoretical frame-

works describe how a particular structural change can spread through the sound


system of language communities and speakers (Kiparsky 1995; Pierrehumbert 2001

among many others). However, such theories do not attempt to predict what would

lead a given language to undergo particular structural changes rather than others.

Pierrehumbert (2001), for instance, acknowledges that “in a complete model of his-

torical change it will be necessary to offer some explanation of why certain languages

at certain times begin to permit particular leniting changes while not permitting
others.” Ohala (2003, §3 p. 684) goes further by suggesting that it is impossible to

predict the actuation of structural changes: “as far as the initiation of sound change

is concerned, this question may be unanswerable and not worth pursuing.”

In linguistic theory, unanswerable problems typically arise either when a given

change exhibits a (seemingly) random pattern or when that change is caused by

extra-linguistic sources, such as language contact. However, as I aim to show, the

accumulation of weakening processes that target a single segment or set of segments

in one language can be regarded neither as random, nor as solely explicable in terms

of extra-linguistic processes. On the contrary, I contend that there is ample room in

linguistic theory to try and predict which linguistic elements are likely to undergo

weakening in a given language.

3.2.2 Same language, same segments, multiple processes

Current theoretical frameworks are unable to give a consistent account for the ac-

cumulation of language-specific weakening processes that target a set of segments.

American English, which demonstrates a range of /t,d/-weakening and optional
deletion processes, proves illustrative. In American English, final /t/ and /d/ delete at
varying rates (Guy 1991 among others), intervocalic /t/ and /d/ are tapped (Kahn
1976; Zue and Laferriere 1979 among others) and even word-medial /t/ and /d/ are
more likely to delete than other stops (chapter 2). Yet no principled account has been
proposed for the multitude of processes that target /t/ and /d/.

Moreover, extant theory fails to explain the selective nature of such processes.

For instance, final consonant deletion in English targets /t/ and /d/ and not other

stops. Similarly, only /t/ and /d/ tap intervocalically, while other oral stops do not

become sonorants in the same environment. While they share a set of segments,

the two processes are otherwise unrelated, as they arise from different pressures that


apply in different environments. In American English the difference between the two

environments is evident in the different outcomes of the two processes: sonorization

in intervocalic environments, and deletion word-finally.

Many theories including Optimality Theory (Prince and Smolensky, 1993; Mc-

Carthy and Prince, 1995) can explain these two processes independently of one an-

other, but might just as easily account for a minimally different English, k-English, in

which /t,d/-tapping remains unchanged, /t,d/-deletion does not occur, but /k/ does

delete word-finally. There is thus no principled reason on offer for why English, as

commonly spoken, and not k-English, is the observed outcome of these processes.

Moreover, difficulties arise in accounting for similar processes independently from

one another. The chief problem is the likelihood of finding such a system. In a cross-

linguistic comparison of intervocalic lenition processes, Kaplan (2010) reports that out

of the 136 languages and dialects in Gurevich (2004) that also have a full consonant

inventory, only American English exclusively weakens coronals intervocalically. In a

survey of lenition processes in 272 languages and dialects in Kirchner (1998), obliga-

tory word-final deletion applies exclusively to coronal stops only in Umbrian (mostly

to /d/), but Buck (1904, p. 146, footnote 1) reports that final /k/ was also “weakly

sounded” and “frequently omitted in writing.” Additionally, excluding dialects of

English, only two languages (Middle Egyptian, Limbu) have some other word-final

coronal-only lenition process.3 Since both processes are rare cross-linguistically, the

odds of having both occur in a single language by chance alone are very small.
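As a rough illustration of this point, the survey counts above can be combined under two simplifying assumptions that the text itself does not make: that the two processes arise independently of one another, and that the sample proportions approximate their cross-linguistic rates. Taking one language out of 136 for exclusive intervocalic coronal weakening, and three out of 272 (Umbrian, Middle Egyptian and Limbu) for word-final coronal-only weakening:

```latex
\[
P(\text{both}) \approx \frac{1}{136} \times \frac{3}{272}
             = \frac{3}{36992}
             \approx 8.1 \times 10^{-5}
\]
```

On these assumptions, roughly one language in twelve thousand would be expected to show both patterns by chance.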

The case of multiple processes targeting the same set of segments is not unique

to American English. In Huallaga Quechua (Weber 1989, also summarized in
Kirchner 1998), /q/ surfaces as [g] or [ɣ] in intervocalic contexts and as [x] before
voiceless obstruents, and it deletes word-finally. Only /q/ displays such a range of
processes in Huallaga Quechua.

In both American English and Huallaga Quechua, the weakening processes contrast
sharply in both the environment in which they apply (intervocalically, word-finally)
and the outcome of the process (lenition, deletion). However, one
can argue that even when the outcome of the process remains constant across
environments, any case in which environment-specific weakening spreads to environments

3 Word-finally, Limbu has a /t/→[ʔl] process and Middle Egyptian has a /t/→[ʔ] process.


that do not motivate the application of the original process may be regarded as a

case of parallel processes.

Such is the case of Uyghur (Hahn and Ibrahim 1991, summarized in Kirchner
1998), which has a /ɢ/→[ʁ] process that applies both in intervocalic contexts and
word-initially. While in intervocalic contexts Uyghur has additional spirantization
processes that affect all velar and uvular stops as well as /b/, only /ɢ/ also weakens
word-initially (only the weakening of /ɢ/ “drifted” to another position). Word-initial
environments do not typically lead to spirantization, suggesting that the
spirantization of /ɢ/ in word-initial contexts does not follow from the same cause as the
segment’s intervocalic spirantization. Therefore, the word-initial application of
ɢ-spirantization does not merely represent a wider environment for a single process,
but rather a separate, parallel process with an identical outcome.

The accumulation of weakening processes that target a specific set of segments

appears in several unrelated languages. However, current linguistic theory does not

predict that such accumulations can be linguistically motivated. The next section

discusses a closely related phenomenon – parallel weakening across several language

varieties.

3.2.3 Same language, different dialects, similar processes

The accumulation of language-specific weakening processes that target a set of
segments appears not only in several environments in a single language variety, but also
across several varieties of a single language.

In several varieties of British English, /t/ is the target of several different
socially-conditioned weakening processes in intervocalic environments. The most famous and
widespread pattern is debuccalization (as in Mathisen 1999, among others), in which
/t/ surfaces as a glottal stop. Other processes can also be found. In Irish English
varieties /t/ surfaces as [ɾ], [ʔ], [h], ∅ (deletion) and [t„], an apico-alveolar fricative
(Raymond, 2004).4 Similarly, in West Midlands English varieties /t/ may surface as
[t] (unchanged), [ʔ] and [ɾ] (Urszula, 2004).

The weakening processes that lead to the emergence of [ʔ] and [ɾ] realizations of

/t/ are incompatible with one another. When /t/ glottalizes as in (3.3a), its place of

4 I follow Raymond (2004) in using a t-based symbol for that fricative.


articulation is not preserved, but it is still a voiceless stop. When /t/ is tapped as in

(3.3b), its place of articulation is preserved, but the manner of articulation is not, as

/t/ becomes an approximant.5

(3.3) (a) ‘water’ [wOP@]

(b) ‘water’ [wORÄ]

Glottalization must therefore evolve from a stage in which intervocalic /*t/ is still

a stop, and tapping must evolve from a stage in which intervocalic /*t/ is still a

coronal. These are therefore parallel weakening clines (3.4). Each individual step on

a weakening cline is a parallel weakening process.

(3.4) Two weakening clines are considered parallel when both clines target an

identical set of linguistic elements, and the input form of either cline could not

have been the output form of the other cline.

Consider an ancestor of P-varieties and R-varieties in which /*t/ is still a voiceless

coronal stop, the t-variety. Since neither change has occurred yet, the speakers of

the t-variety have no evidence that a change is about to happen. However, when

the varieties eventually split, they undergo similar parallel weakening processes. The

varieties that would end up as P-varieties debuccalize or coarticulate a glottal stop

with the voiceless coronal stop, whereas the varieties that would end up as R-varieties

tap, voice or sonorize the voiceless coronal stop.

Parallel weakening processes can be explained in one of three ways. The first

approach is to assume that the parallel processes have indeed developed independently

of one another, by unrelated grammatical changes that happened to achieve similar

results in both P-varieties and R-varieties. The second approach is to assume that even

though the processes are independent of one another, parallel processes are a result

of the ongoing contact between the two families of varieties (a shared innovation).

The third approach is to search for some property of English that would have led

both processes to occur, something that would make the English /t/ more prone to

undergo weakening than other segments and therefore more vulnerable to change.

5 A similar argument can be made for varieties in which /t/ surfaces as a fricative [t„].


The argument against the first approach – the independent change account – is

easily stated, and indeed is similar to the argument mentioned in the previous section,

against the independent co-occurrence of multiple different processes that target the

same segments in different environments. In short, the chances of such similar changes

co-occurring independently are quite small. Note that while varieties in which /t/

weakens share that weakening as a common property, the particular property of /t/

that is not being preserved is different in each case. For instance, while P-varieties

require that /t/ cannot have a place of articulation in intervocalic contexts, they

do preserve the place of articulation of other segments. R-varieties and t„-varieties,

meanwhile, require a segment to be sonorant or continuant in intervocalic contexts,

while allowing other stops to faithfully surface as stops in similar contexts. Since the
grammars of all varieties evolved from a grammar in which /t/ surfaced as a coronal
stop, and since the odds of having any intervocalic weakening of exclusively coronal
stops are quite small (in the Kaplan 2010 survey of 136 languages and dialects only
dialects of English weaken exclusively coronal stops in intervocalic positions), it is
highly improbable that both processes would occur in different dialects by chance
alone.

Let us turn then to the second approach. Could contact between two families

be a good reason for the appearance of parallel processes? Hypothetically, a speaker

of a /t/-variety that is exposed to a P-variety (for instance) might perceive that /t/

is prone to weakening in intervocalic positions and is therefore subject to change.

At a glance, this proposal seems plausible. However, there is a problem with this

suggestion. The exposure to a P-variety would also include exposure to a particular

solution to that problem: glottalization. Were the speaker to embark on a different

weakening process, it would require the speaker to heed only one of these cues, the

weakness of /t/, and not the particular solution, glottalization.

Parallel weakening clines are not unique to English. Classical Arabic /*q/ (Mod-

ern Standard Arabic /q/) has many different surface forms across spoken Arabic

dialects. Kaye and Rosenhouse (1997) report that /*q/ tends to weaken to [P], [k],

[g] and less frequently to [G], as seen in the dialects presented in (3.2) repeated here

as (3.5).


(3.5)  Dialect      ‘cow’ (Modern Standard Arabic baqara)
       Druze        /baqara/
       Nazareth     /bakara/
       Jerusalem    /baPara/
       NW Jordan    /bagara/

One could give a trivial account for the many forms /q/ takes in Arabic’s different

dialects by claiming that each form is a stage on a single cline, namely the one in

(3.6), in which it is possible to imagine successive weakening processes leading from

/q/ to /P/.

(3.6) q → G → g → k → P

However, the proposed series of changes in (3.6) has to be rejected, since it would

collapse phonemic distinctions that remain intact in dialects such as Egyptian Arabic

(Kilany et al., 1997). In Egyptian Arabic /*q/ surfaces as [P], at the very end of

the hypothetical cline in (3.6). If /q/ weakening had followed the series of changes

proposed in (3.6), then q → G → g would have collapsed /q/ and /g/, and g → k

would have collapsed /g/ and /k/. Yet in Egyptian Arabic all three remain distinct.6

The /q/→[g] and /q/→[k] processes are therefore independent of the /q/→[P] process

that is found in Egyptian Arabic, providing another case of parallel weakening.
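The merger argument can be made concrete with a small sketch. It assumes, purely for illustration, that each step of the hypothetical cline in (3.6) is an unconditional sound change applied in order to every instance of its input segment. The function name `apply_cline` and the three-segment toy inventory are hypothetical and are not part of the analysis itself.

```python
# Hypothetical illustration: treat each step of the cline in (3.6) as an
# unconditional change that rewrites every instance of its input segment.
CLINE = [("q", "G"), ("G", "g"), ("g", "k"), ("k", "P")]


def apply_cline(segment, steps):
    """Return the reflex of `segment` after applying `steps` in order."""
    for source, target in steps:
        if segment == source:
            segment = target
    return segment


# Three segments that remain distinct in Egyptian Arabic.
inventory = ["q", "g", "k"]

# Reflexes if /q/ had travelled the whole cline one step at a time.
reflexes = {seg: apply_cline(seg, CLINE) for seg in inventory}
distinct_after_cline = len(set(reflexes.values()))

# Reflexes if /q/ debuccalized directly, leaving /g/ and /k/ untouched.
direct = {seg: apply_cline(seg, [("q", "P")]) for seg in inventory}
distinct_after_direct = len(set(direct.values()))

print(reflexes, distinct_after_cline)
print(direct, distinct_after_direct)
```

Under these assumptions the stepwise cline leaves a single reflex for all three segments, whereas the direct change keeps them apart, which is the situation attested in Egyptian Arabic.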

Both Arabic and UK English thus exhibit parallel clines of a single segment in

different dialects. In both cases, it is unreasonable to claim that the shared target of

the weakening process is merely due to chance. At the same time, dialectal contact

must also be ruled out. What remains to be seen is whether the third approach might

prove fruitful: whether, perhaps, UK English has some linguistic property that leads

it to weaken /t/ preferentially over other segments, while Arabic has some linguistic

property that leads it to preferentially weaken /q/. This question is the focus of the

rest of this chapter.

6 [g] and [k] are the surface forms of Classical Arabic /*Ã/ and /*k/, respectively. /*g/ is the predecessor of /*Ã/ in Proto-Semitic, and may represent an earlier stage of Arabic.


3.2.4 The challenge of explaining parallel weakening

The accumulation of parallel weakening processes in a given language suggests the

existence of language-specific conspiracies (Kisseberth, 1970). Taken together, the

cases of English, Arabic and Huallaga Quechua provide illustrative examples of
parallel weakening at opposite ends of the scale of voiceless stops. While /t/ is
unmarked and frequent in the world’s languages (Maddieson, 1984), /q/ is marked

and infrequent. Thus, any universal scale that could apply to both /t/ and /q/ could

also apply to any other subset of voiceless stops (and the standard treatments of

markedness hierarchies can target other subsets of voiceless stops; see discussion in

§3.6). In a similar vein, there is no obvious language-specific functional scale that

could group the English /t/ and Arabic /q/ together and exclude the other voice-

less stops. In both languages /t/ is the most frequent voiceless stop (data from the

studies in §3.5), and the articulatory, acoustic and perceptual effects both /t/s have

are similar, yet English /t/ is subject to parallel weakening processes and the Ara-

bic /t/ is not. It is not clear what linguistic factors can account for the different

language-specific weakening patterns.

The goal of this chapter is not merely to provide an explanation or a formal
description for each process separately, but to find a common cause behind these
parallel weakening patterns: in short, to explain weakening, rather than describe it.

3.3 The proposed account

3.3.1 Outline – replacing markedness hierarchies

Optimality Theory’s standard treatment of linguistic phenomena balances marked-

ness on the one hand and faithfulness on the other. Markedness represents the various

leveling forces that affect linguistic performance, and does not apply uniformly. It

is standard practice to assume markedness hierarchies that correspond to linguistic

universals. For instance, dorsal segments are considered more marked than coronal

segments, and therefore any language that allows dorsals to appear in some position

(complex onsets, syllable codas) should also allow coronals to appear in the same


position. Markedness is opposed by faithfulness, which represents the various pre-

serving forces that operate in language. Faithfulness is assumed to follow the same

hierarchy as markedness, and does not apply uniformly either. In §3.6 I show that if

the two scales are treated as identical, or if both scales follow universal hierarchies,

language-specific weakening cannot be predicted.

In recent years markedness has often been identified with phonetic factors such

as articulatory effort avoidance and increased confusability (Kirchner, 1998; Steriade,

2008; Boersma, 2003; Flemming, 2004). Following this practice, markedness hier-

archies may be identified with various levels of effort and confusability. However,

phonetic factors alone cannot capture why faithfulness is not applied uniformly. For

instance, though no one claims that word-final /t/ should require more effort than

word-final /k/, it is notable that American English allows word-final deletion of /t/

but not (correspondingly) of /k/.

Here, I build on previous work on information-theoretic (Shannon, 1948) ap-

proaches to language (van Son and Pols 2003; Aylett and Turk 2004; Levy and Jaeger

2007; Jaeger 2010 and others) to introduce a new universal functional force, infor-

mation utility, which motivates the preservation of linguistic elements. Infrequent,

unpredictable and informative linguistic elements provide more information than fre-

quent, predictable and uninformative linguistic elements, and therefore have higher

information utility. I propose that high information utility necessitates low confusabil-

ity and justifies the expenditure of effort to achieve low confusability. Accordingly,

I propose that languages place linguistic elements with high information utility in

prominent positions, and that speakers match their willingness to preserve linguis-

tic elements with the amount of information they expect those elements to provide.

Importantly, the information utility of linguistic elements is language-specific. For

example, American English /s/ may provide more information than Spanish /s/, and

may therefore be subject to a greater preserving force than Spanish /s/. The differ-

ent amounts of information emerge from language use and are therefore available to

speakers, much like universal properties such as effort.

The proposed account combines the novel treatment of information utility as a

preserving force with current effort-avoidance accounts. I assume that speakers
attempt to preserve as much information as possible (Most information Utility) while
avoiding effort (Least Effort), a combination I abbreviate as MULE. More specifically,
the higher an element’s information utility is, the less confusable it should be, and
by necessity, the more effort its articulation justifies.7 The link between information
utility and effort is monotonic: as the articulatory effort increases, the minimal
amount of information utility that could justify that effort also increases (or at
least does not diminish). The exact threshold function of minimal information utility
per effort is language-specific.

Weakening is predicted to occur when a linguistic element’s information utility is not

high enough to justify its faithful pronunciation.

In a sense, MULE is an extension of Haspelmath (2006), who shows that marked-

ness is not a uniform concept and proposes to replace it with frequency. Similarly,

Hume (2004) proposes to replace markedness with predictability. However, there are

key differences. In MULE, the information-theoretic measurements represent faithful-

ness rather than markedness, and information estimates (mainly informativity, Cohen

Priva 2008; Piantadosi et al. 2011) are used to evaluate information utility rather than

frequency or predictability. The use of information estimates is discussed in §3.4.

3.3.2 Using information utility and effort to predict parallel

weakening

In MULE linguistic elements will tend to weaken when their information utility is too

small to justify the articulatory effort associated with their faithful pronunciation.

The graph in (3.7) represents a sample language L1 in some linguistic environment

V (for instance, syllable coda). There are four linguistic elements in L1 marked as

α, β, γ and δ. Their position in the graph is determined by the effort that their

pronunciation requires in V (3.8) and by the information utility (3.9) each of them

provides. The diagonal line passing through the graph is the monotonically increasing

relationship between effort and information utility in L1. The gray zone represents

the area in which an element’s information utility is not high enough to justify its

faithful articulation. In (3.7), α and γ are outside the gray zone but β and δ are

inside the gray zone. This means that the information utility of β and δ is not high

enough to justify their faithful pronunciation, and they are more likely to be targeted

7 Information utility is just one in a range of possible utility accounts which include pragmatic, semantic, syntactic, morphological and sociolinguistic factors.


by a weakening process in the V environment of L1 than α and γ are.

(3.7) [Figure: α, β, γ and δ in L1, plotted by effort and information utility]

(3.8)

effort (α in V ) < effort (β in V ) < effort (γ in V ) < effort (δ in V )

(3.9)

information-utility (β) < information-utility (α) <

information-utility (δ) < information-utility (γ)

Notably, comparing effort and information utility is not straightforward. First,

there is currently no account that assigns a quantity of effort to a linguistic element.

Instead, effort is evaluated by comparison: it is possible to say that voicing /b/ is

easier than voicing /g/ (Ohala, 1981), but not that /b/ requires some amount of

effort and /g/ requires twice as much effort. Second, even if it were possible to assign

numeric values to both effort and information utility, each measurement would have

different units, and it might not be possible to know beforehand what kind of scaling

would allow a comparison between the two measurements. Moreover, it is not unlikely

that the comparison between the different measurements is language-specific: some

languages might require a higher amount of information utility for the same amount

of effort. In sample language L2 (3.10) the information utility and articulatory effort

of pronouncing α, β, γ and δ in environment V are the same as in L1, but a different

function links information utility and effort, making α and β rather than β and δ

prone to undergo weakening in the environment V in L2.


(3.10) [Figure: α, β, γ and δ in L2, plotted by effort and information utility]
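The contrast between L1 and L2 can be sketched numerically. The numbers below are hypothetical and were chosen only so that they respect the orderings in (3.8) and (3.9); the two threshold functions (`threshold_l1`, `threshold_l2`) are likewise assumptions, standing in for the language-specific link between effort and information utility.

```python
# Hypothetical numeric values chosen only to respect the orderings in
# (3.8) and (3.9); the source assigns no numbers to effort or utility.
EFFORT = {"alpha": 1.0, "beta": 2.0, "gamma": 3.0, "delta": 4.0}
UTILITY = {"beta": 1.0, "alpha": 2.0, "delta": 3.0, "gamma": 4.0}


def threshold_l1(effort):
    """Assumed threshold for L1: required utility equals effort."""
    return effort


def threshold_l2(effort):
    """Assumed threshold for L2: a flatter but still increasing line."""
    return 2.5 + 0.1 * effort


def prone_to_weakening(threshold):
    """Elements whose utility falls below the threshold for their effort."""
    return {e for e in EFFORT if UTILITY[e] < threshold(EFFORT[e])}


print(sorted(prone_to_weakening(threshold_l1)))  # ['beta', 'delta']
print(sorted(prone_to_weakening(threshold_l2)))  # ['alpha', 'beta']
```

The same effort and utility values yield different sets of weakening-prone elements once the threshold function changes, which is the point of the comparison between the two sample languages.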

Thus, assessing which element is more likely to undergo weakening relies on com-

paring both the articulatory effort required by each linguistic element and the infor-

mation utility it provides. The simplest case is binary comparison: if some linguistic

element e1 requires more articulatory effort than another element e2 and provides less

information utility than e2 does, then e1 will necessarily be more prone to undergo
weakening than e2, since no monotonically increasing function linking effort and
information utility could place e2 in the gray zone without also placing e1 there.
For instance, the linguistic element

β in (3.7) and (3.10) requires more articulatory effort than α in environment V and

provides less information utility, and indeed there is no way to put α in the gray zone

without also putting β in the gray zone. §3.5.2 demonstrates how binary comparison

can be used to account for the debuccalization of /q/ in Egyptian Arabic.
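The reasoning behind binary comparison can be spelled out as a short derivation. The symbols $f$ (the language-specific threshold function), $u$ (information utility) and $a$ (articulatory effort in the relevant environment) are notation introduced here for illustration only, and $f$ is assumed to be monotonically non-decreasing.

```latex
\begin{align*}
&\text{Assume } a(e_1) > a(e_2) \text{ and } u(e_1) < u(e_2). \\
&\text{If } e_2 \text{ is in the gray zone, then } u(e_2) < f(a(e_2)). \\
&\text{Hence } u(e_1) < u(e_2) < f(a(e_2)) \le f(a(e_1)).
\end{align*}
```

So whenever $e_2$ falls in the gray zone, $e_1$ does as well, whatever the particular choice of $f$.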

Another alternative is to compare the real-valued distance between the information

utility of different linguistic elements while controlling for the effect effort may have.

For instance, if several languages have a similar segment inventory but differ with

respect to the information utility of each segment, it is possible to compare the

information utility that segments have in each language, since the articulatory effort

and confusability of the segments should remain constant. Consider the two languages

in (3.11–3.12). Both languages have two oral stops: /t/ and /k/, but the information

utility of /t/ is lower in (3.11) than in (3.12) while the information utility of /k/ is

comparable. If we assume that the effort required to pronounce /t/ and /k/ is similar

in both languages, and just compare information utility across the two languages,

it becomes apparent that it is easier to link effort and utility so that /t/ does not

provide enough information utility in (3.11) than in (3.12). §3.5.3 shows how real-

valued comparison can be used to account for multiple /t,d/-weakening processes in


American English.

(3.11) [Figure: /t/ and /k/ plotted by effort and information utility]

(3.12) [Figure: /t/ and /k/ plotted by effort and information utility]

3.4 Implementation

3.4.1 Implementation overview

In order to correctly test the predictions made by MULE, it is necessary first to know

how to quantify its different components: articulatory effort, perceptual confusability

and information utility.

3.4.2 Measuring information utility

As a measure, information utility relies on the amount of information speakers esti-

mate a linguistic unit would hold. In this chapter I approximate information utility

by using the informativity of linguistic units (Cohen Priva, 2008; Piantadosi et al.,

2011), as discussed in chapter 2.

The amount of information some linguistic element e provides when the context c
it appears in is already known is the predictability of e given c (3.13), which can be
estimated using counts as in (3.14). Under this definition, a higher value means that
e is less predictable given that c is already known, and therefore provides more
information.

(3.13) $-\log \Pr(e \mid c)$

(3.14) $-\log \dfrac{\mathrm{count}(e, c)}{\mathrm{count}(c)}$
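The count-based estimate in (3.14) can be sketched as follows. The toy lexicon, the function names, the choice of the preceding segments within the word as the context, and the use of base-2 logarithms are all assumptions made for illustration; the source does not fix them at this point.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: word -> usage count. Letters stand in for segments.
LEXICON = {"ab": 3, "ac": 1, "bb": 4}


def collect_counts(lexicon):
    """Count (element, context) pairs and contexts, weighted by word usage.

    Assumption: the context of a segment is everything preceding it in the word.
    """
    pair_counts, context_counts = Counter(), Counter()
    for word, n in lexicon.items():
        for i, element in enumerate(word):
            context = word[:i]
            pair_counts[(element, context)] += n
            context_counts[context] += n
    return pair_counts, context_counts


def predictability(element, context, pair_counts, context_counts):
    """Estimate -log2 Pr(e|c) as in (3.14); base 2 is an assumption."""
    return -math.log2(pair_counts[(element, context)] / context_counts[context])


pairs, contexts = collect_counts(LEXICON)
print(predictability("c", "a", pairs, contexts))  # 2.0
print(predictability("b", "a", pairs, contexts))
```

In this toy lexicon "c" follows "a" in one token out of four, so it carries more information in that context than "b", which follows "a" in the remaining three.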

Importantly, the predictability of a linguistic element can change markedly from

one context to another. The segment /m/ provides almost no information in some

context (after ‘acade–’) but provides much information in another (/m/ in ‘dim’

provides a lot of information since /m/ rarely follows /dI/). Predictability is taken

to be an important factor in a number of language processing accounts (Levy and

Jaeger 2007; Raymond et al. 2006 to name a few). As a measure, it relies on the

amount of information a given element provides in a particular context.

However, predictability alone cannot successfully capture the exceptionless prop-

erties of sound change. For instance, in a language that licenses a weakening process

that targets some segment, that segment may weaken regardless of how much in-

formation it conveys in a particular context. The conspiracies discussed above all

constitute cases in which a segment is prone to undergo weakening uniformly in some

linguistic environment, not merely in words in which it is predictable. To better

account for this, I use informativity, the average (or expected) information value of

linguistic elements which remains constant across contexts. A linguistic element that

usually provides little information will have low informativity, regardless of how much

information it provides in the actual context it appears in. For example, English /N/

appears mostly as the second segment of the ‘–ing’ suffix, in very predictable contexts.

Therefore, /N/ has low informativity even in the few contexts in which it does provide

a lot of information. Informativity (3.15) is defined as the weighted average of the

predictability of a linguistic element given all possible contexts, c ∈ C. I will discuss

the use of informativity in §3.7, and compare it to some of its alternatives such as

frequency and predictability.

(3.15) $-\sum_{c \in C} \dfrac{\Pr(e, c) \log \Pr(e \mid c)}{\Pr(e)}$
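A minimal sketch of how (3.15) might be computed from a word-count lexicon is given below. As before, the toy lexicon, the function name `informativity`, the preceding-segments context and the base-2 logarithm are assumptions for illustration, and count(e, c) / count(e) is used as the estimate of Pr(e, c) / Pr(e).

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy lexicon: word -> usage count. Letters stand in for segments.
LEXICON = {"ab": 3, "ac": 1, "bb": 4}


def informativity(lexicon):
    """Compute (3.15) for every element, with preceding segments as context."""
    pair_counts, context_counts, element_counts = Counter(), Counter(), Counter()
    for word, n in lexicon.items():
        for i, element in enumerate(word):
            pair_counts[(element, word[:i])] += n
            context_counts[word[:i]] += n
            element_counts[element] += n

    totals = defaultdict(float)
    for (element, context), n in pair_counts.items():
        surprisal = -math.log2(n / context_counts[context])
        # count(e, c) / count(e) estimates Pr(e, c) / Pr(e).
        totals[element] += (n / element_counts[element]) * surprisal
    return dict(totals)


scores = informativity(LEXICON)
for element, value in sorted(scores.items(), key=lambda item: item[1]):
    print(element, round(value, 3))
```

Because the weights for a given element sum to one, each score is an average of that element's predictability values across the contexts it occurs in, so an element that mostly occurs in predictable contexts receives a low score even if it is occasionally surprising.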


3.4.3 Integrating effort into MULE

In line with current phonetic theories, MULE treats articulatory effort and perceptual

confusability as the functional forces that motivate markedness. Importantly, previ-

ous work has shown that effort cannot be modeled without taking the phonological

context into consideration (Pouplier, 2003). I therefore model effort and confusability

jointly in MULE. Doing so is necessary because effort has to be evaluated in each

linguistic environment separately. For instance, intervocalic voicing of obstruents

decreases their distinctness from the surrounding vowels and may be considered a

reduction in effort and an increase in confusability (Boersma, 2003). However, main-

taining a distinction in voicing in codas is considered difficult (Steriade, 2008). The

effort associated with intervocalic voiceless obstruents should therefore be higher than

the effort associated with voiced obstruents. Conversely, the effort associated with

voiceless obstruents in codas should be lower than that of voiced obstruents in codas.

In order to model articulatory effort and perceptual confusability jointly I assume

that given some amount of articulatory effort, speakers choose the least confusable

form allowed for that amount of effort. This assumption follows Flemming (2004). For

any given segment an increase in the articulatory effort is motivated by an attempt to

decrease confusability; increased effort that does not result in decreased confusability

(or support additional distinctions) is ruled out.8

For the most part, MULE only requires comparison of functional pressures: i.e., an

estimate of which surface form requires more effort than other surface forms. There-

fore, for the purposes of this chapter, whenever two similar segments are directly

compared, the standard OT markedness hierarchy (de Lacy, 2002) is used to rank

effort, and whenever two dissimilar segments are compared, their respective
frequencies in the world’s languages (using Maddieson’s 1984 UPSID) are used to the
same end.

For future applications of MULE it would be desirable to use more fine-grained

comparisons, as would follow from the study of phonetics, integrating articulatory

effort (Ohala 1981 among others), and confusability (Steriade 2008 among others).

8 The organization of the language allows additional mechanisms that will not be discussed here, such as moving linguistic elements that require low confusability to prominent positions.


3.5 MULE in OT

3.5.1 Effort and information utility in OT

MULE deals with the optimization of speakers’ performance across two axes:
maximizing the information utility of linguistic elements while minimizing the effort
required for their articulation.

OT provides a mechanism to distinguish between the two: markedness constraints

lead to simplification, and faithfulness constraints lead to preservation. I therefore

propose an OT implementation of MULE that bases faithfulness on information util-

ity, and markedness on effort and reduced perceptibility. The OT version of MULE

allows integration with current phonological theories.

As in standard OT with marked faithfulness, the ranking of the faithfulness
hierarchy is predetermined and based on the distribution of segments in the language,
and so is the ranking of the markedness hierarchy. Faithfulness and markedness might
not be ranked in the same order, but both are observable by speakers through their
exposure to the language. Therefore, the ranking in a MULE OT grammar is no more
difficult to learn than standard OT ranking.

Faithfulness is expressed using information utility, and has a total order ranking

that is predetermined by the distribution of linguistic elements in the language: the

higher the information estimate of some element, the more the language tries to

preserve it.9 If E is a type of linguistic element (feature, segment, morpheme), then for

every two elements of type E, e1, e2, faithfulness to e1 will be greater than faithfulness

to e2 if and only if the information estimate of e1 is greater than the information

estimate of e2, as shown in (3.16).

(3.16) $\forall e_1, e_2 \in E.\ \text{info-estimate}(e_1) > \text{info-estimate}(e_2) \leftrightarrow \text{faithfulness}(e_1) \gg \text{faithfulness}(e_2)$

9 It is quite possible that speakers would not be certain whether one linguistic element has higher information utility than some other linguistic element. In that case the alternatives diverge based on the theoretical framework. If the framework allows two constraints to remain unranked with respect to one another, they will remain unranked. In a multiple grammars framework, speakers may have both possible rankings. Other frameworks may collapse the distinction between the two elements.


Segment and feature markedness is expressed likewise, by a total order ranking on

the effort of articulating a segment or a feature so that they may be correctly perceived

in the environments they appear in, and resist environment-optimizing pressures. If

E is a type of linguistic element and V is an environment (syllable coda, unstressed

syllable), then for every two elements of type E, e1, e2, the markedness of e1 will be

greater than the markedness of e2 if and only if the effort required to pronounce e1 in

the environment V so that it is perceptually identifiable is greater than the effort that

it takes to pronounce e2 in the environment V so that it is perceptually identifiable,

as shown in (3.17).

(3.17) $\forall e_1, e_2 \in E.\ \text{effort}(e_1 \text{ in } V) > \text{effort}(e_2 \text{ in } V) \leftrightarrow \text{*}V(e_1) \gg \text{*}V(e_2)$

In order to rank markedness of segments across all environments it is possible to cal-

culate an aggregate measure of effort that would rank the effort of linguistic elements

without an environment or at a special ∅ environment.
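The way (3.16) and (3.17) fix the two hierarchies can be illustrated with a short sketch. The element names, the numeric values and the helper `hierarchy` are hypothetical; only the relative order of the values matters, and the sketch assumes there are no ties.

```python
# Hypothetical values for three elements in one environment V; the source
# gives no such numbers, and only their relative order matters here.
INFO_ESTIMATE = {"e1": 3.2, "e2": 2.1, "e3": 2.9}
EFFORT_IN_V = {"e1": 0.4, "e2": 0.9, "e3": 0.6}


def hierarchy(scores, template):
    """Order constraints from highest-ranked to lowest-ranked by score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [template.format(e) for e in ordered]


faithfulness = hierarchy(INFO_ESTIMATE, "Ident({})")  # follows (3.16)
markedness = hierarchy(EFFORT_IN_V, "*V({})")         # follows (3.17)

print(" >> ".join(faithfulness))  # Ident(e1) >> Ident(e3) >> Ident(e2)
print(" >> ".join(markedness))    # *V(e2) >> *V(e3) >> *V(e1)
```

In this toy case the two hierarchies do not mirror each other: the element that is hardest to pronounce in V is also the one whose preservation is valued least.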

3.5.2 Binary comparisons in OT

Binary comparison constraint ranking

Without the use of real-valued weights, the proposed OT implementation can only

implement binary comparison (§3.3.2). Consider a language with two linguistic el-

ements es and ew such that the effort required to pronounce es in environment V

is lower than the effort required to pronounce ew in the same environment, and the

information estimate of es is greater than the information estimate of ew as in (3.18).

es will weaken in V only if ew also weakens in V . The reason is that ew is more

marked than es (3.19), but faithfulness to es is greater than faithfulness to ew (3.20).


(3.18) [Figure: es and ew plotted by effort and information utility]

(3.19) $\text{*}V(e_w) \gg \text{*}V(e_s)$

(3.20) $\mathrm{Ident}(e_s) \gg \mathrm{Ident}(e_w)$

Any environment V in which a markedness constraint targeting es outranks faithful-

ness to es (3.21) will by transitivity be an environment in which the parallel marked-

ness constraint targeting ew outranks faithfulness to ew (3.22).

(3.21) $\text{*}V(e_s) \gg \mathrm{Ident}(e_s)$

(3.22) $\text{*}V(e_w) \gg \text{*}V(e_s) \gg \mathrm{Ident}(e_s) \gg \mathrm{Ident}(e_w)$

The ranking between faithfulness constraints and the reversed ranking between

markedness constraints in the case described above allows exactly three configura-

tions: both ew and es weaken as in (3.23), only ew weakens as in (3.24), and both ew

and es are preserved as in (3.25). There is no ranking in which es weakens, but ew

does not. In the examples weakened segments are struck out.

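(3.23) Both weaken

    ew.es      | *V(ew) | *V(es) | Ident{es} | Ident{ew}
    -----------+--------+--------+-----------+----------
      ew.es    |   *!   |   *    |           |
      ew.e̶s̶    |   *!   |        |     *     |
      e̶w̶.es    |        |   *!   |           |     *
    + e̶w̶.e̶s̶    |        |        |     *     |     *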


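(3.24) Only ew weakens

    ew.es      | *V(ew) | Ident{es} | *V(es) | Ident{ew}
    -----------+--------+-----------+--------+----------
      ew.es    |   *!   |           |   *    |
      ew.e̶s̶    |   *!   |     *     |        |
    + e̶w̶.es    |        |           |   *    |     *
      e̶w̶.e̶s̶    |        |     *!    |        |     *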

(3.25) Both preserved

    ew.es      | Ident{es} | Ident{ew} | *V(ew) | *V(es)
    -----------+-----------+-----------+--------+-------
    + ew.es    |           |           |   *    |   *
      ew.e̶s̶    |     *!    |           |   *    |
      e̶w̶.es    |           |     *!    |        |   *
      e̶w̶.e̶s̶    |     *!    |     *     |        |
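The claim that exactly three configurations are possible can be checked by brute force. The sketch below enumerates every total ranking of the four constraints that respects (3.19) and (3.20) and records the winning candidate under strict domination. The encoding of candidates as pairs of booleans and all function names are assumptions made for illustration.

```python
from itertools import permutations, product

CONSTRAINTS = ("*V(ew)", "*V(es)", "Ident(es)", "Ident(ew)")


def violations(ew_weakened, es_weakened):
    """Constraints violated by a candidate; markedness penalizes preserved segments."""
    return {
        "*V(ew)": not ew_weakened,
        "*V(es)": not es_weakened,
        "Ident(es)": es_weakened,
        "Ident(ew)": ew_weakened,
    }


def winner(ranking):
    """Pick the candidate with the best violation profile under strict ranking."""
    candidates = list(product([False, True], repeat=2))  # (ew_weakened, es_weakened)
    return min(candidates, key=lambda c: [violations(*c)[k] for k in ranking])


def admissible(ranking):
    """Keep only rankings that respect (3.19) and (3.20)."""
    pos = {c: i for i, c in enumerate(ranking)}
    return pos["*V(ew)"] < pos["*V(es)"] and pos["Ident(es)"] < pos["Ident(ew)"]


outcomes = {winner(r) for r in permutations(CONSTRAINTS) if admissible(r)}
print(sorted(outcomes))
```

Each outcome is a pair (ew weakened, es weakened). The pair in which es weakens while ew is preserved never appears among the winners, in line with the three tableaux above.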

Following this example, a simple procedure can determine whether a segment is
expected to be prone to weakening (3.26):

(3.26) (a) Understand what causes the pressure to weaken.
       (b) Establish a comparison set: a list of segments that are under a similar
           pressure to weaken, i.e., all the segments that could be affected by the
           pressure to weaken in (a).
       (c) Rank the segments by effort.
       (d) Rank the segments by information estimates.
       (e) Find mismatches between the effort and information estimate rankings.
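Steps (c) to (e) of the procedure can be sketched as a search for mismatched pairs. The comparison set, its segment labels and its numbers are hypothetical placeholders, and `mismatches` is an illustrative helper rather than part of the original method; steps (a) and (b) are assumed to have been carried out by hand.

```python
from itertools import combinations

# Hypothetical comparison set: segment -> (effort rank, information estimate).
# A larger effort rank means more effort in the relevant environment.
COMPARISON_SET = {
    "s1": (1, 3.0),
    "s2": (2, 2.0),
    "s3": (3, 3.5),
}


def mismatches(comparison_set):
    """Return (weaker, stronger) pairs: weaker needs more effort, informs less."""
    found = []
    for a, b in combinations(comparison_set, 2):
        (effort_a, info_a), (effort_b, info_b) = comparison_set[a], comparison_set[b]
        if effort_a > effort_b and info_a < info_b:
            found.append((a, b))
        elif effort_b > effort_a and info_b < info_a:
            found.append((b, a))
    return found


print(mismatches(COMPARISON_SET))  # [('s2', 's1')]
```

Here s2 requires more effort than s1 but has a lower information estimate, so binary comparison predicts that s2 is under a stronger pressure to weaken than s1. No prediction follows for s3, which is both more effortful and more informative than the other two.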

Binary comparison in MULE can predict a stronger pressure to weaken when the

effort and information estimate hierarchies are mismatched: there is some segment

that requires more effort to preserve in the relevant environment, but has a lower


information estimate. A counter-example to binary comparison in MULE would be

a case in which a segment that is both easier to maintain in some environment and

has a high information estimate weakens before segments that are more difficult to

maintain in that environment and have lower information estimates.

Binary comparison account for Egyptian Arabic /q/-weakening

Binary comparison can be used to account for the parallel weakening of /*q/ in

Arabic. §3.2.3 introduced the puzzle of parallel weakening clines of Classical Arabic’s

/*q/ to different surface forms in modern dialects: [g], [P], [k], and [G]. Since some

of the weakening clines can be shown to have emerged in parallel, it seems that there

was some driving force in the stage of Arabic in which /*q/ was still a [q] that was

biased towards the possibility of having /*q/ weakening. The prediction in MULE is

that for some reason the information estimate of /q/ does not justify the amount of

effort that is associated with its faithful articulation.

In order to evaluate whether binary comparison justifies the weakness of /*q/, it

is necessary to understand in which environment /*q/ weakens, what is the pressure

that its weakening yields to (3.26a), which segments are under a similar pressure

to weaken (are in the comparison set, following 3.26b), rank the comparison set by

effort and information estimates (3.26c–d), and see whether the information estimate

of /*q/ is lower in that environment than that of other segments in its comparison

set (3.26e).

Most available corpora of Arabic are in Modern Standard Arabic, which is highly

influenced by literary standards rather than articulatory pressures. A notable ex-

ception is the corpus used in this study, LDC Egyptian Colloquial Arabic Lexicon

(Kilany et al., 1997), which contains the phonemic representation of words as they

are spoken in Egyptian Arabic, as well as usage counts for each word.

In Egyptian Arabic /*q/ surfaces as [P] (/q/→[P]), a debuccalization process that

applies in all environments. This means that the pressure /*q/ yielded to is akin to

OT’s *Place, forbidding segments with a place of articulation from expressing that

place. The comparison set consists of all the segments except the laryngeals (/h/

and /P/) and the pharyngeals (/Q/ and /è/). The laryngeals are excluded because

they are the outcome of debuccalization processes, and the pharyngeals are excluded


since their place of articulation is grouped together with laryngeals rather than with

other places of articulation starting with feature geometry (McCarthy, 1994) and

subsequently in OT (Lombardi, 1995).10 11

In order to evaluate the information estimate of each segment, the predictability

of each segment occurrence was calculated using the phonemic/phonetic transcrip-

tion and word counts as they appear in the LDC Egyptian Colloquial Arabic Lexicon

(Kilany et al., 1997), using the formula in (3.14). Geminate consonants were treated

as plain consonants followed by a special “gemination” symbol, rather than as two

occurrences of the same segment. Informativity was calculated as the weighted average
of the predictability of all occurrences using the formula in (3.15). The informativity
of each consonant is listed in Table 3.1, which includes both core and
periphery segments, and distinguishes between underlying /P/, which
tends to get deleted (/P/→[∅]), the underlying /q/ that surfaces as [P] (/q/→[P]), and
the underlying /q/ that is lexically specified to surface as [q] (/q/→[q]).12
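As a rough illustration of the two-step calculation described above, the following Python sketch computes contextual predictability and informativity from a lexicon with usage counts. It assumes that (3.14) is a negative log conditional probability given all the preceding phones in the word, and that (3.15) averages those values over all occurrences of a segment, weighted by word counts. The toy lexicon, the function name, the base-2 logarithm and the absence of an end-of-word symbol are illustrative assumptions, not the LDC data or necessarily the exact procedure used in this study.

```python
import math
from collections import defaultdict


def informativity(lexicon):
    """Return {segment: informativity} for a {phones tuple: count} lexicon.

    Assumption: predictability is -log2 P(segment | preceding phones in
    the word), and informativity is its count-weighted mean over all
    occurrences of the segment.
    """
    context_total = defaultdict(float)  # tokens in which a context is continued
    joint = defaultdict(float)          # tokens of (context, segment)
    for phones, count in lexicon.items():
        for i, seg in enumerate(phones):
            context = phones[:i]
            context_total[context] += count
            joint[(context, seg)] += count

    weighted_sum = defaultdict(float)
    seg_total = defaultdict(float)
    for (context, seg), count in joint.items():
        surprisal = -math.log2(count / context_total[context])
        weighted_sum[seg] += count * surprisal
        seg_total[seg] += count
    return {seg: weighted_sum[seg] / seg_total[seg] for seg in seg_total}


# Hypothetical three-word lexicon with usage counts.
toy_lexicon = {("t", "a"): 3, ("t", "i"): 1, ("k", "a"): 4}
info = informativity(toy_lexicon)
```

In this toy lexicon /i/ follows /t/ in only a quarter of the tokens that begin with /t/, so it comes out as more informative than /a/, which is fully predictable after /k/.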

Table 3.1: Egyptian Arabic Informativity-based information estimates

Phone      Informativity    Phone      Informativity    Phone    Informativity
/n/        1.488            /Q/        2.591            /f/      3.241
/l/        1.564            /S/        2.596            /z/      3.267
/R/        2.070            /j/        2.668            /g/      3.420
/t/        2.102            /b/        2.718            /è/      3.537
/dG/       2.119            /s/        2.889            /K/      3.811
/d/        2.316            /w/        2.957            /tG/     4.172
/m/        2.451            /k/        2.961            /DG/     4.594
/h/        2.506            /sG/       3.044            /x/      4.773
/P/→[∅]    2.561            /q/→[P]    3.169

The second step is to evaluate whether /q/ is more or less effortful than other

segments, and similarly to determine how it compares in terms of information utility.

10 McCarthy (1994) also characterizes the pharyngeals as approximants, which suggests that the effort-related pressures that target them are substantially different than those that target other segments.

11 Uvular fricatives (but not stops) may pattern with pharyngeals and laryngeals.

12 When Modern Standard Arabic words that contain /q/ are used in Egyptian Arabic, the /q/ may surface as [q], rather than as [P].


Based on cross-linguistic evidence, it is safe to assume that /q/ requires more effort

to articulate than all other voiceless oral stops, as the uvular place of articulation is

cross-linguistically marked. The information estimate (measured as informativity) of
/q/ is higher than that of all other voiceless stops, meaning that we cannot perform binary

comparison with other voiceless stops: /q/ is both more useful and requires more

effort than other voiceless stops. However, /q/ has lower information utility than a

number of fricatives (/z/, /f/ and /x/) and the voiced velar stop /g/, all of which are

more frequent than /q/ cross-linguistically: /q/ has only 52 occurrences in UPSID

(Maddieson, 1984), while /z/, /f/, /g/ and /x/ have 62, 180, 253 and 94 occurrences,
respectively.13 I can therefore assume that at least

a few of these segments require less articulatory effort to maintain their place of

articulation than /q/ does. The prediction is therefore that /q/ is more likely to
undergo debuccalization than these less effortful segments are.

It is important to note that binary comparison does not predict that /q/ must

weaken. Instead, it predicts that /q/-weakening is likely to occur in many grammars

in which weakening does occur, as /q/ is weaker than many other segments.

The notion of grammars here is in the Kiparsky (1993) sense of the word: multiple

competing grammars in the mind of speakers. In a multiple-grammar analysis, having

more grammars in which /q/ weakens represents a stronger pressure to weaken, as in

Anttila (1997).

Other accounts would find it difficult to explain /q/-lenition:

• /q/ is not the most frequent nor the least frequent consonant in Egyptian Ara-

bic. Basing an explanation on frequency alone would predict that the most

frequent or least frequent segment would weaken first.

• /q/ is not the most marked segment in Egyptian Arabic, nor the least marked.

There are other uvulars in Egyptian Arabic that do not weaken, and the em-

phatic stops are more marked cross-linguistically. The fact that more marked

segments occur in Egyptian Arabic but may not weaken even in dialects in which

/q/ weakens makes it difficult to use a strict markedness-based account without

marked-faithfulness: more marked segments should have weakened first.

13 Since stops are more frequent than fricatives cross-linguistically and voiceless stops are more frequent than voiced stops, the difference is even more significant.


MULE has an advantage over the alternative accounts in not requiring /q/ to be on

the extreme edge of any single scale. It suffices that the information utility of /q/ is
low relative to the effort it requires for MULE to predict that it is a likely candidate for weakening.

3.5.3 Real-valued comparisons in OT

Real-valued comparison in real-valued models of OT

Implementing real-valued comparison (§3.3.2) in OT has to appeal to versions of OT

that can model variation using continuous values, such as Stochastic OT (Boersma,

1998; Boersma and Hayes, 2001) and MaxEnt OT (Goldwater and Johnson, 2003).

Currently, an OT grammar with constraint conjunction and marked faithfulness

(de Lacy, 2002, ch. 6) which is also real-valued would fit weights to each constraint

separately: the markedness of each environment, the markedness of each feature in

that environment, faithfulness to each feature and conjunction of features, etc. In

real-valued frameworks, fitting each constraint with its own weight takes the place

of strict constraint ordering: if the weight $w_1$ of some constraint $c_1$ is significantly
bigger than the weight $w_2$ of another constraint $c_2$, it means that $c_1$ outranks $c_2$. A
MULE version of real-valued OT frameworks can be stricter than current real-valued
frameworks: rather than fit each and every segment or feature with its own weight
$w_{1 \ldots n}$, the learning algorithm can assign a single weight $w_{\text{info-estimate}}$ to the information
estimate of all linguistic elements, reducing the number of weights that have
to be learned.14 Faithfulness to some segment $\sigma_1$ with $\text{info-estimate}(\sigma_1) = x$ will be
$w_{\text{info-estimate}} \cdot x$, and faithfulness to another segment $\sigma_2$ with $\text{info-estimate}(\sigma_2) = y$ will
be $w_{\text{info-estimate}} \cdot y$. In essence, the single weight $w_{\text{info-estimate}}$ will only scale information
estimates, and would not allow a reranking of faithfulness in any other way.
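To make the single-weight proposal concrete, here is a minimal sketch of a two-candidate MaxEnt evaluation in which faithfulness to every segment is the same weight multiplied by that segment's information estimate. The informativity values are the English ones reported in Table 3.2. The two weights, the single markedness constraint and the function name are hypothetical choices made only for illustration; they are not fitted values.

```python
import math

# Hypothetical weights: one for the markedness constraint that penalizes
# keeping the segment, and one shared weight that scales every
# segment's information estimate into a faithfulness penalty.
W_MARKEDNESS = 3.0
W_INFO_ESTIMATE = 1.0

# English informativity values from Table 3.2.
INFO_ESTIMATE = {"t": 1.526, "d": 1.565, "k": 2.579, "p": 3.803}


def weakening_probability(segment):
    """MaxEnt probability of the weakened candidate out of two candidates."""
    # The faithful candidate violates markedness; the weakened candidate
    # violates faithfulness, weighted by w_info-estimate * info-estimate.
    penalty_faithful = W_MARKEDNESS
    penalty_weakened = W_INFO_ESTIMATE * INFO_ESTIMATE[segment]
    score_faithful = math.exp(-penalty_faithful)
    score_weakened = math.exp(-penalty_weakened)
    return score_weakened / (score_faithful + score_weakened)


probabilities = {seg: weakening_probability(seg) for seg in INFO_ESTIMATE}
```

Because only one weight is learned, changing `W_INFO_ESTIMATE` (while keeping it positive) can stretch or compress the differences between segments, but it cannot reverse their order.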

Real-valued comparison account for American English /t,d/-weakening

Introduction Real-valued comparison can rely on the distance between the information
estimates of different linguistic elements. If segments that require comparable

14 A single weight suggests a parsimonious model in which information utility is linearly correlated with preservation. A less parsimonious model would allow other monotonically increasing functions, but both alternatives are more easily falsifiable than models that fit each weight separately, as they fit fewer parameters.


effort across different languages provide less information, relative to other segments, in one
language than they do in other languages, they are expected to be weaker in that language.
This is the case with English /t/ and /d/. American English has multiple
/t,d/-weakening processes (§3.2.2), and UK English has parallel /t/-weakening processes
(§3.2.3). A MULE real-valued comparison account would expect /t/ and /d/
to provide less information in English than they do in other languages.

In order to evaluate the prediction that /t/ and /d/ are indeed weaker in English

than in other languages, the information estimates of the different consonants in English,
Spanish and Egyptian Arabic were evaluated. In all cases the informativity of

the segments was used to assess their information estimate. For the English table

(Table 3.2) informativity was calculated using the word counts from the Switchboard

corpus (Godfrey and Holliman, 1997) and the Buckeye corpus (Pitt et al., 2007).

Phonetic representation was taken to be the one used in the CMU pronunciation dic-

tionary (Weide, 1998). Informativity was calculated using the same method used to

calculate informativity for Arabic segments in §3.5.2. For the Spanish table (Table

3.3) the CALLHOME Spanish Lexicon (Garrett et al., 1996) dictionary was used for

both word counts and phonetic representation, after collapsing a number of phonetic

distinctions: voiced oral fricatives were collapsed with voiced oral stops. Both con-

versational and read (radio transcripts) word counts from the corpus were used to

calculate informativity. The Egyptian Arabic table (Table 3.1) uses the same data

listed in the previous section.

Table 3.2: English Informativity-based information estimate

Phone    Informativity    Phone    Informativity    Phone    Informativity
/N/      0.231            /s/      2.464            /j/      4.083
/z/      1.220            /k/      2.579            /S/      4.229
/t/      1.526            /m/      2.964            /f/      4.307
/d/      1.565            /D/      3.350            /h/      4.604
/n/      1.675            /w/      3.713            /T/      4.648
/Z/      1.899            /p/      3.803            /g/      4.735
/v/      1.977            /b/      3.843            /Ã/      4.758
/l/      2.217            /Ù/      3.849


Table 3.3: Spanish Informativity-based information estimate

Phone    Informativity    Phone    Informativity    Phone    Informativity
/R/      1.551            /l/      3.042            /z/      4.165
/n/      1.809            /m/      3.156            /Ù/      4.303
/s/      1.997            /N/      3.389            /h/      4.748
/t/      2.188            /p/      3.445            /f/      5.487
/d/      2.586            /w/      3.484            /r/      5.917
/j/      2.933            /b/      3.543            /S/      6.071
/k/      3.010            /g/      3.570

As with binary comparison, the first step is to understand what the relevant effort-increasing
forces that cause the weakening are, in order to establish a comparison set:
the phones that successfully resist the same functional pressure to lenite and delete.
For intervocalic positions the relevant comparison set is all oral stops, since spirantization
and sonorization affect different subsets of oral stops in intervocalic environments: all

stops except glottal stop and emphatic stops in Biblical Hebrew (Gesenius, 1910), and

all voiced oral stops in Spanish (Harris, 1969). I follow Kirchner (1998) in treating

intervocalic lenition as reduction in effort (but see also objections to that approach

in Kaplan 2010).

Binary comparison makes no predictions regarding /t/- and /d/-weakening, since

they require less effort and have lower information utility than other oral stops. Real-

valued comparison predicts that if a phone has lower information utility in some

language than in other languages, it will be weaker in that language than in other

languages it may appear in. The information estimates of /t/ and /d/ in English are

lower than their information estimates in Spanish and Egyptian Arabic.15 Moreover,

the difference between the information estimate of /t/ and /d/ and the information

estimate of /k,p/, /g,b/ is greater in English (3.27) than it is in Spanish (3.28) or

Egyptian Arabic (3.29). Stochastic OT and MaxEnt OT adapted to fit MULE OT

15 Spanish and Egyptian Arabic /t/ and /d/ are dental, but I assume that the associated effort is approximately the same as their alveolar counterparts.


models are therefore more likely to rank faithfulness to /t/ and /d/ low enough that

they weaken more readily in English than in Spanish or Egyptian Arabic.
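The size of these differences can be checked directly against the tables. The sketch below takes the informativity values of the oral stops from Tables 3.1, 3.2 and 3.3 and computes, for each language, the difference between the mean of /t, d/ and the mean of the remaining plain oral stops. The use of means and the omission of the Egyptian Arabic emphatic stops and /q/ are simplifying assumptions for this illustration, not part of the analysis itself.

```python
# Informativity of plain oral stops, copied from Tables 3.1, 3.2 and 3.3.
STOPS = {
    "English": {"t": 1.526, "d": 1.565, "k": 2.579, "p": 3.803, "b": 3.843, "g": 4.735},
    "Spanish": {"t": 2.188, "d": 2.586, "k": 3.010, "p": 3.445, "b": 3.543, "g": 3.570},
    "Egyptian Arabic": {"t": 2.102, "d": 2.316, "b": 2.718, "k": 2.961, "g": 3.420},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def coronal_gap(stops):
    """Mean informativity of the other stops minus that of /t/ and /d/."""
    coronals = [stops["t"], stops["d"]]
    others = [value for seg, value in stops.items() if seg not in ("t", "d")]
    return mean(others) - mean(coronals)


gaps = {language: coronal_gap(stops) for language, stops in STOPS.items()}
for language, gap in gaps.items():
    print(f"{language}: {gap:.3f}")
```

Under these assumptions the gap comes out at roughly 2.2 for English, against roughly 1.0 for Spanish and 0.8 for Egyptian Arabic.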

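(3.27) English: [diagram placing /t/, /d/, /k/, /p/, /b/ and /g/ along the information-estimate scale; values in Table 3.2]

(3.28) Spanish: [diagram placing /t/, /d/, /k/, /p/, /b/ and /g/ along the information-estimate scale; values in Table 3.3]

(3.29) Egyptian Arabic: [diagram placing /t/, /d/, /b/, /k/ and /g/ along the information-estimate scale; values in Table 3.1]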

It is possible to test whether the low information estimate of English /t/ and /d/ is
the reason for their greater likelihood of weakening, by using quantifiable models
of variation such as MaxEnt OT. MaxEnt OT models (Goldwater and Johnson, 2003)
can compare several different outputs, but in this case it is possible to use a simpler
model which contrasts just two alternatives: a word-final or intervocalic consonant is
kept or weakens. Reducing the number of possible outcomes to two allows us to use
logistic regressions, which estimate the ratio between the two forms. If two out of
every three word-final /t/s weaken in a particular environment then the weakening
ratio the regression would try to fit is 2 : 1 (see Bresnan and Nikitina 2009 for more

details on interpreting logistic regressions in the context of linguistic variation). Since

word-final deletion is easier to define and measure than intervocalic weakening, I used

a logistic regression to evaluate word-final deletion of obstruents in American English.

Methods and materials The model was evaluated using a mixed effects logistic

regression. The predicted value was whether a particular word-final consonant was

deleted or not. The comparison set chosen was word-final obstruents, as using only the

oral stops is not desirable in statistical modeling, since it would lead to overfitting

constraints such as place of articulation (three places of articulation or degrees of

freedom for just six phonemes). In order to make the environment more uniform,

the context was limited to post-vocalic, word-final obstruents that were followed by


a consonant in the following word. Function words were excluded since their word-final
deletion processes may be different from those of content words. Since Blevins

(2004, pp. 208–209) demonstrates that different manners of articulation are subject

to different pressures to delete, the phonological properties of each segment are used

as control variables, and model the manner-specific pressures to delete word-final

obstruents.

Word-final obstruents were collected from the Buckeye corpus (Pitt et al., 2007),

and an edit-distance program determined whether they were deleted. Each word was

assumed to have its CMU dictionary (Weide, 1998) representation. Words that did

not appear in the CMU dictionary were removed, leaving approximately 9000 data

points. Word counts were established using both the Buckeye corpus and the Switch-

board corpus (Godfrey and Holliman, 1997) in order to evaluate how predictable the

final consonant was in its context.
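The text does not describe the edit-distance program itself, so the following is only a sketch of one way such a check could work. It aligns the dictionary pronunciation with the transcribed pronunciation by minimum edit distance and reports whether the last dictionary phone is left without a counterpart. The function name, the unit costs, the tie-breaking rule and the ARPAbet-style phone labels in the examples are all assumptions made for illustration.

```python
def final_consonant_deleted(canonical, realized):
    """True if the last canonical phone has no counterpart in `realized`.

    Both arguments are sequences of phone labels. The decision is read
    off a minimum edit-distance alignment, preferring a match or
    substitution over a deletion when the costs tie.
    """
    n, m = len(canonical), len(realized)
    if n == 0:
        return False
    # dist[i][j]: edit distance between canonical[:i] and realized[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if canonical[i - 1] == realized[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # canonical phone deleted
                dist[i][j - 1] + 1,         # extra phone in the realization
            )
    # Trace back from the end until the last canonical phone is accounted for.
    j = m
    while j > 0:
        cost = 0 if canonical[n - 1] == realized[j - 1] else 1
        if dist[n][j] == dist[n - 1][j - 1] + cost:
            return False  # aligned with a realized phone
        if dist[n][j] == dist[n][j - 1] + 1:
            j -= 1        # skip a trailing inserted phone
            continue
        return True       # only a deletion explains this cell
    return True           # realization exhausted: the phone was deleted
```

For example, `final_consonant_deleted(["w", "ao", "k", "t"], ["w", "ao", "k"])` treats the final /t/ as deleted, whereas a realization ending in a different consonant counts as a substitution rather than a deletion under this sketch.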

The baseline OT model controlled for a number of phonological and
phonetic features. The phonological features in Table 3.4 were included for the word-final
segment and for the first segment of the following word. Information utility at

the word-level was controlled for by including the negative logged probability of the

word in which the consonant appeared (when conditional probabilities are not used

and context is ignored, the logged frequency, the logged probability and informativity

of a linguistic element are all the same). The identity of the speaker and the word in

which the consonant appeared were used as random effects.

On top of the baseline model, the following information-utility constraints were

included in the model:

(a) The predictability of the obstruent given the context it appeared in (3.13; page

63), using all the previous phones in the same word as context. The value used

was the difference between the predictability of the segment and its informa-

tivity, as the original values are highly collinear (informativity is the expected

value of the predictability).

(b) The informativity of the obstruent, calculated as in (3.15; page 64), using all

the previous phones in the same word as context. The actual values can be

found in Table 3.2 on page 74.


Table 3.4: Post-vocalic pre-consonantal word-final obstruent deletion controls

Control                                                Applies to                           Values
Place of articulation                                  word-final segment                   labial, coronal, dorsal
Voiced                                                 word final and following segments    true, false
Affricate                                              word final and following segments    true, false
Dental                                                 word final and following segments    true, false
Palato-alveolar                                        word final and following segments    true, false
Nasal                                                  following segment                    true, false
Liquid / Glide                                         following segment                    true, false
Rate of speech                                                                              number of lexical segments per second (logged)
Shares place of articulation with following segment    word-final segment                   true, false
Syllable stress                                        word-final segment                   primary, secondary, no stress

The model was fit using R (R Development Core Team, 2011). Due to the large

number of control variables, redundant variables were removed based on their AIC

(Akaike, 1974) using backward-elimination.
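The model itself was fit in R. Purely as an illustration of the selection procedure, the Python sketch below shows AIC-based backward elimination in the abstract. The `aic` argument stands for whatever routine refits the model on a given set of predictors and returns its AIC; the toy scoring function and its numbers are invented for the example and do not correspond to the actual model.

```python
def backward_eliminate(predictors, aic):
    """Drop predictors one at a time while doing so lowers the AIC."""
    current = list(predictors)
    best_aic = aic(current)
    while current:
        # Refit once per candidate removal and keep the best-scoring one.
        trials = []
        for dropped in current:
            reduced = [p for p in current if p != dropped]
            trials.append((aic(reduced), dropped))
        trial_aic, dropped = min(trials)
        if trial_aic >= best_aic:
            break  # no removal improves the model
        current.remove(dropped)
        best_aic = trial_aic
    return current, best_aic


def toy_aic(predictors):
    """Invented stand-in for refitting: AIC = 2k + deviance."""
    gain = {"informativity": 10.0, "voiced": 6.0, "dorsal": 0.5}
    deviance = 100.0 - sum(gain[p] for p in predictors)
    return 2 * len(predictors) + deviance


kept, score = backward_eliminate(["informativity", "voiced", "dorsal"], toy_aic)
```

With these invented numbers the weakly contributing predictor is dropped and the other two are retained, because removing either of them would raise the AIC.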

Results The full results are in Table 3.5. As expected, several phonological and

phonetic factors influenced the deletion rates of word-final obstruents. As the rate

of speech increased, speakers were more likely to omit word-final obstruents (p <

0.005). Obstruents at the coda of syllables that had primary stress were less likely to

delete (p < 0.01), but secondary stress was not significantly different than no stress.

Voiced obstruents were less likely to delete than voiceless ones (p < 0.001) and dental

fricatives were more likely to delete than other obstruents (p < 0.05).

Final obstruents in low-probability words were less likely to delete (p < 0.05),

signifying that information-utility operates at several levels, not just that of the seg-

ment. The relatively high p-value of word probability is due to the use of words as


a random effect. Refitting the model without including words as a random effect decreased
the p-value of word probability (p < $10^{-8}$), which means that word frequency
is very important.

Importantly, the informativity of the obstruent significantly influenced deletion

likelihood (p < $10^{-4}$). Highly informative segments were significantly less likely to

delete. No residual effect of contextual predictability of the segment was found, sug-

gesting that the choice to emphasize the role of informativity in evaluating information

estimates is justified. Additionally, it is interesting to note that the control variables

that treated stops, fricatives and affricates as different from one another did not im-

prove the explanatory power of the model and were removed from the model in the

backward-elimination process. The difference between stops and fricatives becomes

significant if informativity is removed from the model (p < $10^{-13}$), which means that

informativity modeled that difference between stops and fricatives in the actual best

model. Similarly, in the final model dorsals were not different than coronals, and

labials were more likely to delete than coronals (p < 0.005), despite the fact that

coronals delete more than both labials and dorsals, as shown in §3.7.4. The lack of

difference between dorsals and coronals, and the increased likelihood of labial deletion

also follow from the inclusion of informativity.

Table 3.5: Post-vocalic pre-consonantal word-final obstruent deletion fixed effects

Variable                Estimate    Std. Error    z value    Pr(>|z|)
(Intercept)             -1.12257    0.98621       -1.138     0.25501
Primary stress          -0.72394    0.28007       -2.585     0.00974
Secondary stress        -0.53362    0.56354       -0.947     0.34368
Informativity           -1.05090    0.24197       -4.343     1.41e-05
Word-final is voiced    -0.99143    0.28257       -3.509     0.00045
Word-final is dorsal    -0.13582    0.46377       -0.293     0.76963
Word-final is labial     1.49029    0.45980        3.241     0.00119
Word-final is dental     2.90575    1.20661        2.408     0.01603
Rate of speech           0.46919    0.15641        3.000     0.00270
Word frequency          -0.09606    0.04489       -2.140     0.03236
Following voiced        -0.22264    0.13117       -1.697     0.08964
Following affricate     -0.99216    0.79798       -1.243     0.21375
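One way to read the estimates in Table 3.5: in a logistic regression each estimate is a change in the log-odds of deletion, so exponentiating it gives an odds ratio, and dividing it by its standard error gives the z value. The short sketch below does this for a few rows of the table. The dictionary and variable names are illustrative only.

```python
import math

# (estimate, standard error) for selected rows of Table 3.5.
FIXED_EFFECTS = {
    "Informativity": (-1.05090, 0.24197),
    "Word-final is voiced": (-0.99143, 0.28257),
    "Word-final is labial": (1.49029, 0.45980),
    "Rate of speech": (0.46919, 0.15641),
}

summary = {}
for name, (estimate, std_error) in FIXED_EFFECTS.items():
    odds_ratio = math.exp(estimate)  # multiplicative change in the odds of deletion
    z_value = estimate / std_error   # Wald z statistic
    summary[name] = (odds_ratio, z_value)
    print(f"{name}: odds ratio {odds_ratio:.2f}, z = {z_value:.3f}")
```

An odds ratio below 1 for informativity means that, in this model, each additional unit of informativity multiplies the odds of deletion by roughly 0.35.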


Discussion As predicted by MULE, the experimental model shows that as the in-

formation estimate of a segment increases, speakers are less likely to yield to the

pressure to delete. The information estimate proved to be significant after controlling

for possible phonological factors, many of which have been less influential in predict-

ing word-final deletion. This finding demonstrates the advantage of MULE over its

alternatives in providing the right bias that would explain why in English /t/ and /d/

undergo multiple weakening processes, even though both segments appear in many

languages, are unmarked, require relatively little effort, and are usually very frequent.

3.6 The necessity of multiple scales and language-specificity

3.6.1 The difference between MULE and current theories

The treatment of parallel weakening offered by MULE combines the universal scales

of effort and confusability with a language-specific scale of information utility. In this

section I show why any explanation for parallel weakening has to combine several

scales, and why at least one of those scales has to be language-specific.

3.6.2 Parallel weakening in standard OT

Current prominent theories in phonology cannot easily express what would make a

segment prone to undergo weakening in one language and not in others. While there

are clear ways to account for each and every existing process separately, expressing

proneness to undergo weakening is not trivial. For instance, the standard OT frame-

work (Prince and Smolensky, 1993; McCarthy and Prince, 1995) makes no prediction

with respect to the probability of weakening processes since the standard treatments

of markedness in OT allow targeting any subset of segments for weakening processes.

De Lacy (2002, §6.3) provides many examples of gapped inventories: languages in
which some subset of possible segments undergoes weakening. De Lacy assumes a
markedness hierarchy in which dorsals (K) outrank labials (P), which outrank coronals (T),
which outrank glottals (ʔ). Some of the languages weaken only the most


marked or only the least marked segments, but some weaken only intermediately

marked segments.

In some sound systems such as US English only unmarked segments weaken:

/t/ weakens but /k/ does not. Such sound systems necessitate marked faithfulness:

constraints that are violated when marked segments are weakened, but not when

unmarked segments are weakened. Conversely, the selective weakening of marked

segments in marked environments in neutralization processes requires constraint con-

junction (Smolensky 1995 who cites Smolensky 1993) as in Ito and Mester (2003). A

conjunction of two markedness constraints outranks its conjuncts, but may outrank

or be outranked by faithfulness constraints.

An OT grammar that allows both constraint conjunction and marked faithfulness

can describe weakening processes that target any subset of segments. The following

tableaux demonstrate selective coda weakening. In (3.30) only dorsals lenite, in (3.31)

only labials lenite and in (3.32) only coronals lenite.16 All other combinations in the

power set of {P,T,K} are likewise possible.

(3.30) No dorsal codas

/tat.tap.tak/    NoCoda&*K  Ident{K}  Ident{P}  NoCoda&*P  Ident  NoCoda&*T
  tat.tap.tak    *!                             *                 *
  taʔ.tap.tak    *!                             *          *
  tat.taʔ.tak    *!                   *                    *      *
☞ tat.tap.taʔ               *                   *          *      *
  taʔ.taʔ.tak    *!                   *                    **
  tat.taʔ.taʔ               *         *!                   **     *
  taʔ.tap.taʔ               *                   *          **!
  taʔ.taʔ.taʔ               *         *!                   ***

16 Max is not shown and outranks all other constraints, disallowing deletion.


(3.31) No labial codas

/tat.tap.tak/    Ident{K}  NoCoda&*K  NoCoda&*P  Ident{P}  Ident  NoCoda&*T
  tat.tap.tak              *          *!                          *
  taʔ.tap.tak              *          *!                   *
☞ tat.taʔ.tak              *                     *         *      *
  tat.tap.taʔ    *!                   *                    *      *
  taʔ.taʔ.tak              *                     *         **!
  tat.taʔ.taʔ    *!                              *         **     *
  taʔ.tap.taʔ    *!                   *                    **
  taʔ.taʔ.taʔ    *!                              *         ***

(3.32) No coronal codas

/tat.tap.tak/    Ident{K}  Ident{P}  NoCoda&*K  NoCoda&*P  NoCoda&*T  Ident
  tat.tap.tak                        *          *          *!
☞ taʔ.tap.tak                        *          *                     *
  tat.taʔ.tak              *!        *                     *          *
  tat.tap.taʔ    *!                             *          *          *
  taʔ.taʔ.tak              *!        *                                **
  tat.taʔ.taʔ    *!        *                               *          **
  taʔ.tap.taʔ    *!                             *                     **
  taʔ.taʔ.taʔ    *!        *                                          ***
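The effect of reranking in (3.30)–(3.32) can be reproduced mechanically. The sketch below is a minimal strict-domination evaluator, assuming that the only candidates are those in which each coda of /tat.tap.tak/ either surfaces faithfully or debuccalizes to a glottal stop (written ʔ here), and that violations are assigned as in the tableaux. The constraint names are spelled as plain strings, and the helper names are hypothetical.

```python
from itertools import product

INPUT_CODAS = ("t", "p", "k")  # codas of /tat.tap.tak/
PLACE = {"t": "T", "p": "P", "k": "K"}
GLOTTAL = "ʔ"
CONSTRAINTS = ("NoCoda&*K", "NoCoda&*P", "NoCoda&*T",
               "Ident{K}", "Ident{P}", "Ident")


def violations(output_codas):
    """Violation counts for one candidate, keyed by constraint name."""
    marks = {name: 0 for name in CONSTRAINTS}
    for underlying, surface in zip(INPUT_CODAS, output_codas):
        if surface == GLOTTAL:
            marks["Ident"] += 1  # any change of place
            if PLACE[underlying] in ("K", "P"):
                marks["Ident{" + PLACE[underlying] + "}"] += 1
        else:
            marks["NoCoda&*" + PLACE[surface]] += 1
    return marks


def winner(ranking):
    """Strict domination: compare violation vectors lexicographically."""
    candidates = product(*[(coda, GLOTTAL) for coda in INPUT_CODAS])
    best = min(candidates,
               key=lambda cand: [violations(cand)[c] for c in ranking])
    return ".".join("ta" + coda for coda in best)


RANKINGS = {
    "(3.30)": ["NoCoda&*K", "Ident{K}", "Ident{P}", "NoCoda&*P", "Ident", "NoCoda&*T"],
    "(3.31)": ["Ident{K}", "NoCoda&*K", "NoCoda&*P", "Ident{P}", "Ident", "NoCoda&*T"],
    "(3.32)": ["Ident{K}", "Ident{P}", "NoCoda&*K", "NoCoda&*P", "NoCoda&*T", "Ident"],
}
winners = {label: winner(ranking) for label, ranking in RANKINGS.items()}
```

The same six constraints yield a different winner under each ranking, which is the sense in which such a grammar can single out any subset of {P,T,K}.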

The reason that every subset of segments can be targeted by weakening is that

currently, markedness in OT has a single universal scale: in every language marked-

ness and marked-faithfulness are identical and share the exact same order (in classical

OT) or constrain the same sets of elements (Kiparsky, 1994; de Lacy, 2002). There-

fore, an OT grammar with a single universal markedness hierarchy can always order

the faithfulness and markedness constraints targeting a particular level of marked-

ness independently of the relative ranking of constraints targeting another level of

markedness. Thus the markedness constraints forbidding dorsals can be ranked above

or below the faithfulness constraint preserving dorsals independently of the relative


ranking between the markedness constraint forbidding labials and the faithfulness

constraints preserving labials. Since both theories allow every subset of segments

to undergo lenition or deletion, they can make no predictions with respect to which

segments are more likely to undergo weakening.

3.6.3 No single scale can replace markedness

OT uses a single scale for both markedness and marked-faithfulness, and it is the

reliance on a single scale that brings about the lack of predictive power. It is important

to understand that it is very unlikely that any functionally motivated scale could

substitute for the use of markedness, even if that scale were allowed to be language-specific.
The contrast between parallel weakening in Arabic and English demonstrates

that problem well.

In Arabic /q/ attracts multiple weakening processes and in English /t/ attracts

multiple weakening processes. It is quite difficult to find any single scale that would

apply similarly to /t/ and /q/. /t/ is unmarked cross-linguistically, pronounced using
the tip of the tongue and is rather frequent in English. /q/, on the other hand, is

marked cross-linguistically, pronounced using the back of the tongue and is relatively

infrequent in Arabic (it is not the least frequent segment either, as it is one of the

most frequent root radicals). I am not aware of any other functionally motivated scale

that would treat the English /t/ and the Arabic /q/ as more similar to one another

than to other segments that do not undergo weakening. Any of the proposed scales
(coronal/dorsal contrast, cross-linguistic frequency and in-language frequency) would
predict that one segment should attract weakening and the other should not. OT’s

solution is to offer a configuration in which in some cases the high end of the scale is

targeted and in other cases the low end of the scale is targeted, but as I showed in the

previous section, that solution can describe any weakening pattern, and is therefore

not predictive or explanatory.

3.6.4 Universal scales do not suffice

Functional accounts in linguistics are often based on balancing two or more univer-

sal scales. The principle of least effort in Zipf (1949) argues that speakers attempt


to minimize their effort in articulation. But if that desire were not constrained by

some other functional force, speakers would arguably say nothing at all, saving
all effort. Therefore, Zipf argues that speakers’ economy is bounded by their desire to

make themselves understood by their listeners. The speakers’ effort is therefore con-

strained by the listener’s capacity to understand what is said without putting in too

much effort. Similarly, several phonetically motivated theories in phonology (Flem-

ming, 2004; Boersma, 2003) balance articulatory effort with perceptual confusability.

Speakers’ attempt to reduce their effort may hinder the ability of the listener to tell

different segments apart, which limits the speakers’ effort reduction.

In theories that balance different functional forces such as the ones mentioned

above, it is easy to predict that a change that gains on all scales will always be

desirable, and a change that loses on all scales will always be undesirable. But

as Boersma (2003) shows, it is perfectly possible to gain on one scale while losing

on another, leading to a change that is neither desirable nor undesirable. Therefore,

such theories allow for multiple possible equilibria between the functional forces being

balanced. Since articulatory effort and perceptual confusability are based on human

physiology and psychology, they are expected to be universal, and have a similar effect

in all languages. Since multiple equilibria are available for every language to choose

from, each language may assign different importance to each functional force, leading

to different yet linguistically plausible outcomes in each language.

However, if all equilibria are equally plausible, language-specific patterns of par-

allel weakening cannot be predicted. Theories that are based on universal functional

forces would make similar predictions for languages that have similar contrasts among

segments. Languages that have a similar inventory of oral stops such as English, Span-

ish and Modern Hebrew would be expected to choose from a similar set of equilibria

when undergoing change. Spanish and Modern Hebrew would therefore be as
likely as English to have a weakening-prone /t/, but this is not the case. Thus,

the theoretical solution for parallel weakening cannot rely exclusively on universal

forces.


3.7 Information-theoretic accounts

3.7.1 Information-theoretic explanations

MULE builds on several existing advances in using information theory (Shannon,
1948) to account for phonological weakening processes. In this section I review a
number of these accounts, showing their respective strengths and shortcomings.
Understanding their properties is an important step in understanding why I found
it necessary not to rely solely on information-theoretic accounts in order to account
for parallel weakening processes, and why I focused on using informativity as the
core value of information estimates rather than on more commonly used information-theoretic
measurements such as frequency, predictability and entropy.

3.7.2 Functional load as entropy

Theoretical outline of functional load and entropy

Hockett (1955) proposes an information-theoretic approach to Martinet’s prediction

that languages would not collapse phonemic distinctions which would result in the

loss of too much information. The information-theoretic approximation was later

extended by Surendran and Niyogi (2006), who also showed that such approximations
of functional load make the wrong prediction with respect to a number of

complete neutralization cases.

The basic measurement in the quantification of functional load in Hockett (1955)

and Surendran and Niyogi (2006) is entropy. In a linguistic context the entropy of

a language is the expected (mean) predictability of each linguistic element given the

information that is already known to a listener. Consider the partial sentence in

(3.33):

(3.33) An ap. . .

The search engine I am using suggests that what I am really trying to search when

I type (3.33) is an apple a day, but other completions are certainly possible. If we

kept playing this game, guessing one word or one sound at a time, we would be able

to estimate how predictable each word is. If we average the predictability across all


the cases, the end product would be an estimate of the entropy of English. Shannon

(1951) applies this strategy to the evaluation of the entropy of characters in printed

English.

When a language distinguishes between two or more classes that are treated as

identical in other languages its entropy increases, as the guessing game becomes

harder. In a language in which there is no gender marking, it is easier to predict

what the next pronoun is going to be in a context such as (3.34a), since there is no

need to distinguish between (3.34b) and (3.34c).

(3.34) (a) I have seen the new professor_i but I haven’t talked to . . .

(b) I have seen the new professor_i but I haven’t talked to her_i

(c) I have seen the new professor_i but I haven’t talked to him_i

The quantification of functional load using entropy relies on the difference in entropy

between a language as it currently is, and a minimally different language in which

some distinction is eliminated from the language. The more the entropy of a language

drops by eliminating a distinction, the more important that distinction is, making it

unlikely that the language would lose that distinction.17 For example, the functional

load of the difference between /k/ and /g/ in English is estimated by deducting the

entropy of a slightly modified English in which every /k/ and every /g/ are replaced

by a single different phoneme, /σ/, from the entropy of real English.
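The calculation just described can be sketched in a few lines. The example below assumes a unigram model over word forms, omits the normalization by the entropy of the unmodified language, and uses an invented three-word lexicon with ARPAbet-style labels. The function names are hypothetical and the counts are not taken from any corpus.

```python
import math
from collections import Counter


def entropy(counts):
    """Unigram entropy (in bits) of word forms given their counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def collapse(lexicon, a, b, merged="σ"):
    """Replace every `a` and `b` with one phoneme; homophones pool counts."""
    collapsed = Counter()
    for phones, count in lexicon.items():
        new_form = tuple(merged if p in (a, b) else p for p in phones)
        collapsed[new_form] += count
    return collapsed


def functional_load(lexicon, a, b):
    """Entropy of the language minus its entropy after collapsing a and b."""
    return entropy(lexicon) - entropy(collapse(lexicon, a, b))


# Hypothetical counts for "coat", "goat" and "cat".
toy_lexicon = {("k", "ow", "t"): 4, ("g", "ow", "t"): 4, ("k", "ae", "t"): 8}
load_kg = functional_load(toy_lexicon, "k", "g")
load_td = functional_load(toy_lexicon, "t", "d")
```

In this toy lexicon collapsing /k/ and /g/ merges "coat" with "goat" and so lowers the entropy, whereas collapsing /t/ and /d/ merges no word forms and has a functional load of zero.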

Measuring entropy differences as functional load has two important implications:

• It is not possible to evaluate the amount of information of a single linguistic

element, but it is possible to evaluate the information of the difference be-

tween linguistic elements. Changes that do not lead to the loss of information

(that do not collapse distinctions) have a zero cost. Many weakening processes

do not collapse distinctions, making functional load unable to predict which

information-preserving processes would be more likely to occur.

17 Hockett (1955) and Surendran and Niyogi (2006) also divide the difference by the entropy of the unmodified language, but since for a given language the division operation only scales the measurements, I omit it for simplicity’s sake.


• Collapsing two infrequent linguistic elements does not cost as much as collaps-

ing frequent ones. This is caused by counting observed events together with

unobserved events: each case in which a distinction is not lost counts towards

making that distinction unnecessary. In a language in which the ratio of /t/ to /d/ and of /k/ to /g/ is 3 : 1, and the ratio of /t/ to /k/ is
2 : 1, an entropy-based account would predict that collapsing /t/ and /d/ is

worse than collapsing /k/ and /g/. The predictability-based accounts presented

in the following sections predict that the distinction between /t/ and /d/ and

between /k/ and /g/ would be equally important.
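The entropy-based prediction in the second implication can be checked with a small worked example. The sketch below assumes a hypothetical inventory containing only /t/, /d/, /k/ and /g/, with counts 6, 2, 3 and 1 chosen to satisfy the 3 : 1 and 2 : 1 ratios, and it uses the standard Shannon entropy. The function names and the merged symbol "X" are illustrative and do not come from the original study.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)


def collapse(counts, a, b, merged="X"):
    """Return new counts in which segments a and b are replaced by one symbol."""
    out = {seg: c for seg, c in counts.items() if seg not in (a, b)}
    out[merged] = counts[a] + counts[b]
    return out


def entropy_drop(counts, a, b):
    """Entropy lost when the distinction between a and b is eliminated."""
    return entropy(counts) - entropy(collapse(counts, a, b))


# Hypothetical counts: t:d = 3:1, k:g = 3:1, t:k = 2:1
counts = {"t": 6, "d": 2, "k": 3, "g": 1}
drop_td = entropy_drop(counts, "t", "d")
drop_kg = entropy_drop(counts, "k", "g")
print(f"/t/-/d/: {drop_td:.4f} bits, /k/-/g/: {drop_kg:.4f} bits")
```

Under these assumed counts the entropy drop for /t/ and /d/ is twice the drop for /k/ and /g/, although the ratio within each pair is identical. This is the frequency sensitivity that the second implication describes.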

As Surendran and Niyogi (2006) observed, functional load makes the wrong pre-

dictions in a number of cases. In the following section I present a corpus-based study

which shows that the same holds for the prediction of final consonant deletion in

English.

Applying functional load to English final consonant deletion

Introduction Deletion processes are a case in which functional load could come

into effect, since deletion may collapse two word forms together. Obligatory final

/t/-deletion could collapse walk and walked, Ben and bent and so forth. Obligatory

final /k/-deletion would affect fewer words, but would also collapse words: make and

may. I attempted to follow the predictions made by the functional load approach and

apply them to the case of final consonant deletion in English.

Materials and Methods It is possible to estimate which segment is more likely

to delete under the functional load account by measuring the difference between the

entropy of normal English, and the entropy of minimally different languages in which

only final /k/, only final /t/, only final /p/ and so forth have been deleted. I used word

counts from the Switchboard corpus (Godfrey and Holliman, 1997) and the Buckeye

Corpus (Pitt et al., 2007) and the representation of these words as it appeared in the

CMU dictionary (Weide, 1998). The entropy was evaluated using the frequency of

each word in Switchboard and Buckeye (a unigram language model). The entropy of

each language was evaluated as in (3.35).


(3.35) $-\sum_{\text{word}} \log_2 \frac{\text{word occurrences}}{\text{all word occurrences}}$
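The procedure described above can be sketched in a few lines. The example below is an illustration under stated assumptions, not the code used in the study: the word counts and the simplified pronunciations are invented, the entropy is the standard Shannon entropy of a unigram model (each term weighted by the word's relative frequency), and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a unigram distribution given as counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c > 0)


def form_counts(word_counts, prons, deleted_final=None):
    """Count word forms, optionally deleting one word-final segment everywhere."""
    forms = Counter()
    for word, count in word_counts.items():
        pron = prons[word]
        if deleted_final is not None and pron[-1] == deleted_final:
            pron = pron[:-1]  # forms that become identical are merged
        forms[pron] += count
    return forms


def functional_load(word_counts, prons, segment):
    """Entropy of the original language minus entropy after final deletion."""
    before = entropy(form_counts(word_counts, prons))
    after = entropy(form_counts(word_counts, prons, segment))
    return before - after


# Hypothetical counts and simplified pronunciations (not corpus data)
word_counts = {"make": 40, "may": 25, "walk": 30, "walked": 20, "Ben": 5, "bent": 3}
prons = {
    "make": ("m", "ey", "k"), "may": ("m", "ey"),
    "walk": ("w", "ao", "k"), "walked": ("w", "ao", "k", "t"),
    "Ben": ("b", "eh", "n"), "bent": ("b", "eh", "n", "t"),
}
loads = {seg: functional_load(word_counts, prons, seg) for seg in ("t", "k", "z")}
for seg, load in loads.items():
    print(seg, round(100 * load, 4))  # scaled by 100, as in the reported table
```

In this toy lexicon no word ends in /z/, so deleting final /z/ merges nothing and its load is zero, while deleting final /t/ or /k/ merges word pairs and lowers the entropy.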

Results The results are the differences in entropy between English and the language

in which each consonant was deleted word-finally. See Table 3.6, in which the values

are scaled by a factor of 100.

Table 3.6: Functional load of English with different final consonant deletion, scaled by a factor of 100

Phone  Delta Entropy    Phone  Delta Entropy    Phone  Delta Entropy
/z/    7.5248           /p/    1.7043           /r/    0.0458
/n/    5.2305           /k/    1.3200           /ŋ/    0.0317
/s/    4.6838           /l/    0.4695           /g/    0.0275
/v/    4.4194           /þ/    0.2652           /b/    0.0202
/d/    3.4752           /f/    0.1725           /ʃ/    0.0055
/t/    2.5521           /tʃ/   0.1477           /ð/    0.0027
/m/    2.5133           /dʒ/   0.0855           /ʒ/    0.0006

Among the stops, functional load predicts that the greatest amount of information

would be lost if /d/ and /t/ are deleted word-finally, followed by /p/, /k/, /g/ and

/b/. Moreover, deleting word-final /d/s causes the loss of at least 170 times as much

information as the deletion of /b/s.

Discussion The predictions made by functional load are incorrect as /d/ and /t/

are more likely to delete word-finally than other stops. The reason for the incorrect

prediction lies in the way information is measured in functional load accounts. Since

unobserved events are as important as observed events, the fact that very few words

are conflated by the deletion of word-final /b/ or /g/ predicts that word-final /b/ and

/g/ could be deleted without a significant loss of information. Every case in which a word is
not conflated with some other word by /b/ or /g/ deletion counts towards making
final /b/ and /g/ redundant. The amount of information conveyed by the /b/s and

/g/s that do appear is drowned by the number of cases in which they do not appear.


3.7.3 Frequency

Zipf (1929) expects frequently used segments to weaken more than infrequent ones.

Zipf’s prediction is based on the expectation that when words, morphemes and segments are used more frequently than others, their articulation is under greater pressure to become efficient. A highly efficient form would usually be shorter, reduced, and require less articulatory effort. Following the terms used in this chapter, Zipf’s proposal can be viewed as suggesting that frequent segments should be more prone to

undergo weakening than infrequent segments. This view is also expressed in Haspel-

math (2006, 2008), who relies on Zipf’s (1949) notion of frequency-dependent effort

to predict historical morphological change.

The intuitive prediction that frequent linguistic elements should require less effort

goes a long way in predicting weakening processes in American English. In the Buck-

eye corpus (Pitt et al., 2007) /t/ and /d/ are indeed more frequent than any other

oral stop, and they do delete more frequently word-finally and weaken in intervocalic

contexts. Chapter 2 shows that frequency goes deeper than that, as frequency also

successfully contrasts word-medial deletions in spontaneous speech for all oral stops,

as /k/ is more frequent and deletes more frequently than /p/ word-medially, and /b/

is more frequent and deletes more frequently than /g/ word-medially.

While intuitive and appealing, the prediction that frequent elements should weaken

more than infrequent elements has its shortcomings. First, frequency does not manage to capture the asymmetry between nasal stops: /ŋ/ is less frequent than /m/,

but deletes more frequently word-medially. Second, it is not the case that the most

frequent segments are the first to weaken cross-linguistically. While in Egyptian Ara-

bic (Kilany et al., 1997) /t/ and /b/ are more frequent than other oral stops, it is

/q/, one of the less frequent oral stops, that undergoes regular weakening to /ʔ/.

As discussed in §3.2.2, coronal-targeting weakening processes are a minority among

all weakening processes, but coronals are often among the most frequent segments a

language has, and appear in more positions than other segments.

Third, purely frequency-based accounts (by definition) do not take effort into

account, but rather expect effort to follow from frequency. Therefore, the proposed

Zipfian account makes sense only if two linguistic elements require the same amount

of effort and differ only by frequency. When both frequency and effort differ, an


effort-reduction account may well predict that the less frequent linguistic element

should be reduced. Under the assumption that the effort is weighted by frequency,

the combined effort can be approximated by multiplying a linguistic element’s effort

with its frequency. In example (3.36), σ1 and σ2 are segments, σ1 is two times more

frequent than σ2, and requires one third the effort required by σ2 to pronounce. σ2

requires greater weighted effort than σ1 as the effort associated with σ1 multiplied by

its frequency is smaller than the effort associated with σ2 multiplied by the frequency

of σ2. Therefore, σ2 would be under greater pressure to lenite, even though it is less

frequent.

(3.36) Segment Effort Frequency Weighted effort

σ1 e 2f 2ef

σ2 3e f 3ef

The amount of effort weighted by frequency therefore has an optimum that could

be violated by weakening the more frequent element. Zipf’s more general principle

of least effort does not predict that the more frequent segment would weaken first in

this case.
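The arithmetic behind (3.36) can be made explicit. The sketch below assumes arbitrary units in which e = 1 and f = 1; the function name is illustrative.

```python
def weighted_effort(effort, frequency):
    """Effort weighted by frequency, approximated as a simple product."""
    return effort * frequency


e, f = 1, 1  # arbitrary units (assumption made for illustration)
sigma1 = weighted_effort(e, 2 * f)   # twice as frequent, one third the effort
sigma2 = weighted_effort(3 * e, f)   # less frequent, three times the effort
print(sigma1, sigma2)
```

Under this approximation the less frequent segment carries the larger weighted effort (3ef against 2ef), so it is the one under greater pressure to lenite.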

Finally, it is also possible that Zipf’s prediction, which was exemplified using words,
would not hold for highly frequent linguistic elements such as segments because of ceiling
effects on practice, unlike the contrast between frequent and infrequent words.

While it is true that frequent segments are articulated more frequently than infrequent

segments, a native speaker has had a chance to articulate infrequent segments a

substantial number of times, possibly enough to lead to optimized pronunciation for

all segments.

In summary, I argue that while frequency can be demonstrated to be a powerful

predictor of weakening processes, it cannot be used as an exclusive driving force in

actuating such processes, and has to be integrated with other functional forces such as

articulatory effort. One could argue that a minimally different version of MULE could

rely on frequency rather than on informativity in assessing information utility. The

differences between such theories should withstand empirical examinations similar to

the one performed in §3.5.3.


3.7.4 Predictability

Theoretical outline of predictability accounts

A third family of models uses predictability as a force that derives weakening and

other structural changes. A substantial amount of research demonstrates that pre-

dictable linguistic elements tend to have shorter duration than unpredictable ones as

in Jurafsky et al. (2001); van Son and Pols (2003); Aylett and Turk (2004); van Son

and van Santen (2005); Bell et al. (2009); Pluymaekers et al. (2005); Raymond et al.

(2006), to name a few. Hume (2004, 2008) suggests that high predictability leads to
instability, which in turn leads to a greater likelihood of weakening. A segment that is more predictable is
therefore predicted to be more prone to undergo weakening than less predictable segments.

Final /t/ and /d/ have been demonstrated to delete more often when they are pre-

dictable from context than when they are not, as in the data presented by Guy (1991,

table 1). Predictability-based accounts correctly predict higher deletion rates for the

/t/ of ‘kept’ than for the /t/ of ‘walked’ since the past form is already conveyed by

the strong-verb inflection of ‘keep’ in ‘kept’, making the /t/ that follows ‘kep–’ more

predictable.

The advantage of predictability over frequency is in shifting the focus from the

mechanics of communication alone to the goal of communication. Like frequency, pre-

dictability can be interpreted as reflecting on the psychological reality of language, for

instance if predictable linguistic elements are easier to retrieve than less predictable

linguistic elements. Unlike frequency, predictability can also be interpreted as an

abstract quantification of the information speakers try to transmit across a commu-

nication channel (Aylett and Turk, 2004; Levy and Jaeger, 2007; Jaeger, 2010). If

speakers and listeners share probabilistic knowledge about the language and the con-

text of the utterance, speakers may reduce the utterance time of predictable (and

therefore redundant) information. The reduction in utterance time can in turn lead

to weakening.

The criticism applied in the previous section with respect to Zipf’s prediction that

frequent segments should weaken before less frequent segments can be used to argue

against relying exclusively on predictability as well. The problem is in determining

the optimal point beyond which no further optimization is necessary. Some segments

are shorter and less salient than others due to their phonetic properties. When these


short segments are also predictable in some context, it might be that they should not

be further reduced.

Another issue is that predictability-based accounts expect speakers to strive to

achieve the most efficient communication at the level of each and every utterance:

an unpredictable /t/ should be kept while a predictable /p/ should be eliminated.

But this assumption is contradicted by the exceptionless properties of sound change

– since American English has an optional rule that deletes word-final /t/s, but does

not have an equivalent rule that deletes word-final /p/, unpredictable /t/s may be

deleted in cases where predictable /p/ would be kept. This prediction is tested in the

following section.

Predictability in American English final stop deletion and duration

Introduction Under predictability accounts segment deletion is motivated by its

redundancy. In other words predictable segments are deleted. If segment deletion is

motivated by that segment’s predictability alone, then predictable /t/s should delete

just as frequently as predictable /p/s and /k/s.

Methods and materials Post-vocalic word-final stops were collected from the

Buckeye corpus (Pitt et al., 2007), and an edit-distance program determined whether

they were deleted. Each word was assumed to have its CMU dictionary (Weide, 1998)

representation. Words that did not appear in the CMU dictionary were removed.

Word counts were established using both the Buckeye corpus and the Switchboard

corpus (Godfrey and Holliman, 1997) in order to evaluate how predictable the final

consonant was in its context. Cases in which the word-final stop was followed by a

homorganic segment in the following word were excluded. Since very few words had

final /b/ or /g/, all voiced stops were excluded as well.

The resulting set was further limited in one case to completely redundant stops

(that is, segments that are completely predictable from the preceding segments) such

as the /k/ in alcoholic, the /p/ in gossip and the /t/ in closet. In another case, the

resulting set was restricted to segments that were less than 1 : 25 likely to follow the

preceding context such as the /k/ in kick, the /p/ in hop and the /t/ in bat. The latter

set will be labeled the surprising set in order to avoid confusion.


The number of segments that were transcribed as missing was compared with

the number of the segments that were transcribed as non-missing (though possibly

reduced). The resulting contingency tables were evaluated using Fisher’s exact test

(similar to χ², but necessary because some of the cells have low counts). The null

hypothesis is that there is no difference between /k/, /p/ and /t/. Rejecting the

hypothesis would mean that the deletion ratios of the stops differ significantly.

Results

(3.37) Redundant (predictable) voiceless stops:

Stop Not deleted Deleted

p 29 4

t 195 12

k 128 4

The null hypothesis could only be rejected with marginal significance (p < 0.1).

The deletion rates of the redundant voiceless stops /k/, /p/ and /t/ are marginally

different from one another.

(3.38) Surprising (unpredictable) voiceless stops:

Stop Not deleted Deleted

p 28 0

t 22 5

k 71 0

The null hypothesis was rejected with p < 0.0005. The deletion rates of the

surprising voiceless stops /k/, /p/ and /t/ differ from one another.
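For readers who want to rerun the comparison, the following sketch applies an exact test to the two 3 × 2 tables in (3.37) and (3.38). It implements the Freeman-Halton generalization of Fisher's exact test by enumerating every table with the observed margins. This is an assumption about the procedure, since the text does not say which implementation was used, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_rx2(table):
    """Exact test for an r x 2 table given as (not_deleted, deleted) rows.

    Sums the probabilities of all tables with the same margins that are
    no more probable than the observed table.
    """
    row_totals = [a + b for a, b in table]
    total_deleted = sum(b for _, b in table)
    denom = comb(sum(row_totals), total_deleted)

    def weight(deleted):
        # Numerator of the multivariate hypergeometric probability (an integer)
        w = 1
        for n, d in zip(row_totals, deleted):
            w *= comb(n, d)
        return w

    def tables(i, remaining):
        # Enumerate deleted-counts per row that sum to the column total
        if i == len(row_totals) - 1:
            if remaining <= row_totals[i]:
                yield [remaining]
            return
        for d in range(min(row_totals[i], remaining) + 1):
            for rest in tables(i + 1, remaining - d):
                yield [d] + rest

    observed = weight([b for _, b in table])
    extreme = sum(w for w in map(weight, tables(0, total_deleted)) if w <= observed)
    return extreme / denom


redundant = [(29, 4), (195, 12), (128, 4)]   # p, t, k from (3.37)
surprising = [(28, 0), (22, 5), (71, 0)]     # p, t, k from (3.38)
p_redundant = fisher_exact_rx2(redundant)
p_surprising = fisher_exact_rx2(surprising)
print(f"redundant: p = {p_redundant:.4f}")
print(f"surprising: p = {p_surprising:.6f}")
```

Integer weights are compared exactly before the final division, which avoids floating-point ties when deciding which tables count as at least as extreme as the observed one.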

Discussion It is impossible to conclude that the three voiceless stops have differ-

ent deletion rates when they are completely redundant, though there is a trend that

suggests that /t/ deletes more than /p/ and /k/ in these cases. All redundant stops

delete, as predicted by predictability-based accounts. However, the difference emerges

among surprising (unpredictable) voiceless stops. While, as predicted by predictability accounts, /p/ and /k/ do not delete when they are surprising in the
context they appear in, /t/ does delete even when it is surprising, as follows from the

exceptionless properties of sound change. The difference between surprising stops is

not predicted by predictability-based accounts. It is therefore difficult to attribute

the difference in the deletion ratios of word-final stops to predictability alone.

Furthermore, a post-hoc analysis of deletion rates of /p/, /t/ and /k/ across the
redundant (3.37) and surprising (3.38) conditions shows that /p/ and /k/ do not
differ significantly across the two conditions (p > 0.1 and p > 0.25, respectively), but
/t/ deletes more frequently when it is surprising than when it is redundant (p < 0.05).

That /t/ would delete more when it is surprising than when it is redundant is not

predicted by predictability-based accounts (but see Raymond et al. 2006 who found

a similar effect for word-medial /t/ and /d/ in coda positions).

One of the interesting differences between (3.37) and (3.38) is the frequency of /t/

in each of these tables. While among the redundant voiceless stops /t/ is the most

frequent, it is the least frequent among the surprising voiceless stops. If speakers were

recording how redundant a segment is across all contexts, they could easily note that

/t/ is more frequently redundant than surprising. This observation lies at the heart

of informativity accounts, discussed in the following section.

3.7.5 Informativity accounts

In an attempt to account for the exceptionless tendencies of word-medial consonant

deletion, in chapter 2, I use a different information-theoretic approach. Rather than

rely on the actual predictability of a segment (3.39), speakers assess how important

a segment is by relying on how predictable it is on average (using the expected value

of the predictability), that is, its informativity (3.40).

(3.39) $-\log \Pr(\text{segment} \mid \text{context})$

(3.40) $E\left(-\log \Pr(\text{segment})\right)$


In information theory (Shannon, 1948) predictability is equated with providing

little unknown information, and being unpredictable means providing more informa-

tion. A segment that is usually predictable is therefore said to have low informativity

and a segment that is usually unpredictable has high informativity. The prediction

is that segments that are less informative would lenite more than segments that are

highly informative, everything else being equal.
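The difference between local predictability and informativity can be illustrated with a toy computation. The sketch below assumes that informativity is the average, over all tokens of a segment, of the segment's negative log probability in the context of each token. The contexts, counts and function name are invented for illustration and do not come from the corpora used in this chapter.

```python
import math
from collections import Counter


def informativity(tokens):
    """Average surprisal (in bits) of each segment over the contexts it occurs in.

    tokens: list of (context, segment) pairs, one per observed token.
    """
    context_counts = Counter(context for context, _ in tokens)
    segment_counts = Counter(segment for _, segment in tokens)
    pair_counts = Counter(tokens)

    totals = {}
    for (context, segment), n in pair_counts.items():
        # Local predictability: -log2 Pr(segment | context)
        surprisal = -math.log2(n / context_counts[context])
        totals[segment] = totals.get(segment, 0.0) + n * surprisal
    return {seg: total / segment_counts[seg] for seg, total in totals.items()}


# Hypothetical data: after context "s", t is very likely; after "a", all are equal
tokens = (
    [("s", "t")] * 8 + [("s", "p")] * 1 + [("s", "k")] * 1
    + [("a", "t")] * 2 + [("a", "p")] * 2 + [("a", "k")] * 2
)
info = informativity(tokens)
for seg in sorted(info):
    print(seg, round(info[seg], 3))
```

In this toy data /t/ is equally surprising as /p/ and /k/ after "a", but most of its tokens occur where it is highly predictable, so its average (its informativity) comes out lowest.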

This prediction is borne out in American English, as /t/ is the least informative

oral stop, and /d/ is the second least informative oral stop. Like frequency, informativity captures the asymmetries between /k/ and /p/ and between /b/ and /g/ (see
§3.7.3) but also manages to capture the asymmetry between /ŋ/ and /m/: /ŋ/ is less

informative and deletes more than /m/, despite being less frequent than /m/. Pi-

antadosi et al. (2011) show that word informativity approximates word length better

than word frequency in a range of languages.

The main problem with predicting weakening using informativity alone is that

a weakening process may be biased to target informative segments, not only unin-

formative ones. Calculating the informativity of segments in Egyptian Arabic using

the LDC Egyptian Colloquial Arabic Lexicon (Kilany et al., 1997) shows that /q/

– a target of many weakening processes in Arabic dialects – is one of the more in-

formative segments in Egyptian Arabic (see Table 3.1 on page 71). The problem

is not Arabic-specific and is likely to reoccur whenever informativity is applied to

infrequent segments, since informativity is highly correlated with logged frequency:

frequent segments are more predictable, other things being equal. Like frequency,

informativity has to be evaluated while taking effort into account.

3.7.6 Why information-theoretic accounts do not suffice

The fundamental problem shared by frequency, predictability and informativity ac-

counts is the existence of optimal amounts of weakening that should not be exceeded.

It is possible to weaken an element beyond what its frequency, predictability or in-

formativity would predict. Predictability and informativity accounts (and some fre-

quency accounts) rely on covert or overt assumptions about either recoverability or

redundancy avoidance (Aylett and Turk, 2004; Levy and Jaeger, 2007). The recov-

erability of predictable elements is greater than the recoverability of unpredictable


elements, other things being equal, since predictable elements make a “better guess”

than unpredictable ones (Haspelmath, 2008, §6.1). Likewise in information-theoretic

terms predictable elements provide less information than unpredictable elements, and

are therefore more redundant, other things being equal. However, if such pressures

have always existed, they should already be reflected in today’s sound systems: pre-

dictable and uninformative segments would already have weaker cues and shorter

durations before any new weakening process applies. In the Buckeye corpus (Pitt

et al., 2007) the mean duration of onset /t/s that surface as [t]s (/t/→[t]) is shorter

than the mean duration of /p/→[p] and /k/→[k], and the mean duration of /d/→[d]

is shorter than the mean duration of /b/→[b] and /g/→[g]. In this setting, /t/ should

only weaken more if it is not short enough. Beyond that point there is no functional

reason for information-theoretic accounts to predict weakening.

In summary, information-theoretic accounts do not suffice for predicting parallel

weakening processes. The inability of information-theoretic accounts to predict which

segments would be prone to undergo weakening is evident in the fact that segments

that undergo weakening can be frequent, predictable and have low informativity such

as English /t/, but can also be infrequent, unpredictable and have high informativity

such as Arabic /q/. However, using information-theoretic measurements to account

for the functional pressure to preserve information and withstand effort-avoidance

does have predictive power, as the experiments discussed in this chapter have shown.

3.8 Variable deletion rates of stems and affixes

3.8.1 The contrast between American English and Puerto

Rican Spanish

MULE uses information utility as a functional force that motivates the preservation

of linguistic elements. It therefore faces criticism similar to that leveled at other functional accounts.

Labov (1994, ch. 19), for instance, criticizes functional approaches on a case-by-case

basis.

One of Labov’s concerns is the contrast between the case of final /t/-deletion in

American English as described in Guy (1991) and the case of final /s/-deletion in


Puerto Rican Spanish as described in Poplack (1980). In American English word-

final /t,d/-deletion varies with respect to the functional properties of the /t/ and /d/

in question: stem-final /t,d/ delete more frequently than past-morpheme /t,d/. For

semi-weak verbs in which both the stem changes and an affix is added such as ‘kept’,

deletion rates are higher than for the past morpheme, but lower than for stem-final

/t,d/. In Puerto Rican Spanish stem-final /s/ deletes less frequently than affixed /s/.

Additionally, among /s/ affixes, the deletion rates of plural markers vary – plural

markers on nouns delete less frequently than plural markers on adjectives. Labov

claims that functional arguments should predict the same pattern of weakening for

American English and Puerto Rican Spanish. Instead, in American English affixed

/t/ deletes less frequently than stem-final /t/, but the pattern is reversed in Puerto

Rican Spanish, in which stem-final /s/ deletes less frequently than affixed /s/.

I will not address all the concerns raised in Labov (1994), but I will show why

MULE does not predict an identical pattern for affix vs. stem deletion patterns in

different languages, and contrast the predictions made by MULE in the English and

Spanish cases.

In MULE the actuation of segment-deleting weakening processes is motivated by

balancing effort with information utility. For both the American English final /t/-

deletion case and the Spanish final /s/-deletion processes, the segments in question are

the same across all conditions (/t/ in American English and /s/ in Spanish). There is

therefore no need to take effort into account, and information utility has to account for

the variable deletion rates across the different conditions. Up until now, I considered

two sources for information utility – the information of the segment (modeled using

informativity and local predictability) and the information of the word (modeled using

word frequency). It is not unreasonable to assume that different morphemes carry

their own information. Under this proposal, speakers assign information to stems,

past tense markers and plural markers.

I adopt here a simple approach for combining the different sources of information

that contribute to the amount of information a linguistic element holds. I assume

that the information a linguistic element holds is a weighted sum of all the sources

contributing to its information. The information estimate of a /t/-morpheme is there-

fore a weighted average of the information the phoneme holds and information the


morpheme holds.

In order to establish whether stem-final /t/ holds more information than the past

morpheme /t/ in American English, the information estimate of the past tense mor-

pheme has to be evaluated across the usage patterns of American English. Similarly,

for each stem, the information estimate of the stem should be calculated and divided

among its segments. For instance, the information estimate of the /t/ in ‘walked’

will be estimated along the lines of (3.41), and the stem-final /t/ in the word ‘just’

would be estimated along the lines of (3.42). In the following examples $w_s$ is a weight
assigned to the segment information estimate, and $w_m$ is a weight assigned to the morpheme

information estimate.

(3.41) $w_s \cdot \text{info-estimate}(\text{/t/}) + w_m \cdot \text{info-estimate}(\text{past tense morpheme})$

(3.42) $w_s \cdot \text{info-estimate}(\text{/t/}) + w_m \cdot \frac{\text{info-estimate}(\text{the word ‘just’})}{4}$

The question is therefore whether the information estimate of the /t/ in a particular

word is bigger or smaller than the information estimate of past tense /t/.

The study in §3.5.3 showed that as the amount of information contained in a

particular word increases, so do the odds of avoiding word-final deletion in that word,

in line with the expectation that a high word information estimate would decrease

the likelihood of deleting a segment in that word. However, it is still necessary to

establish how much information the past morpheme encodes and compare it to other

words in English. In order to compare the deletion rates of the English past tense

morpheme to the Spanish plural markings, it is necessary to establish how much

information the Spanish plural morpheme encodes.

3.8.2 The information of English verbal -ed morpheme

Introduction The goal of this study is to establish the amount of information

contained in the past tense morpheme in American English so that it can be compared

with the amount of information contained in individual words.


Methods and materials In order to estimate how much information the past

tense morpheme encodes, it is necessary to establish how predictable it was in the

context it appeared in. In order to do that, I used -ed suffixes that appeared in

words that WordNet (Miller, 1995) listed as exclusively verbs. The word walked is

listed exclusively as a verb and was therefore included, but fixed is also listed as an

adjective, and was excluded from the estimates. The estimate uses the two preceding

words and the stem (walk for walked) as context, and evaluates how unpredictable

the /-ed/ suffix was in the context it appeared in. The list of the three words and

their counts was taken from the Google Web 1T 5-gram Corpus (Brants and

Franz, 2006), based on the 3-gram files (three words and their counts across the web).

For the sample data in (3.43), the calculation was (3.44). The information estimate

of the past tense morpheme was estimated to be the weighted average of all the cases

in which it appeared (its informativity).

(3.43)

Data                  Context            Suffix   Count
outside and walk      outside and walk   ∅        3450
outside and walked    outside and walk   ed       1523
outside and walking   outside and walk   ing      522

(3.44) $-\log_2 \Pr(\text{-ed} \mid \text{outside and walk}) = -\log_2 \frac{1523}{522 + 1523 + 3450}$
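The calculation in (3.44) can be reproduced directly from the sample counts in (3.43). The sketch below is only an illustration of that single step; the dictionary layout and the function name are assumptions, and the empty string stands for the bare stem (∅).

```python
import math

# Sample counts from (3.43) for the context "outside and walk"
suffix_counts = {"": 3450, "ed": 1523, "ing": 522}


def suffix_surprisal(counts, suffix):
    """-log2 Pr(suffix | context), with the probability estimated from counts."""
    total = sum(counts.values())
    return -math.log2(counts[suffix] / total)


ed_bits = suffix_surprisal(suffix_counts, "ed")
print(f"{ed_bits:.2f} bits")
```

For this one context the -ed suffix carries a little under two bits. The estimate reported for the morpheme as a whole averages such values over all the contexts in which the suffix appears, weighted by how often each occurs.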

Results The information estimate of the -ed morpheme using a two word and stem

context was 9.952 bits. By comparison, the information estimate of /t/ is only 1.526

(calculated as in §3.5.3, table 3.2).18 Had the -ed suffix been a separate word, it

would have been the ≈130th most common word, well below most of the function

words, as well as a number of content words such as go, say and people.

Discussion It is highly unlikely that a distribution of information across a mor-

pheme would lead to a word-final stem /t/ having a higher combined information

estimate than a /t/ that appears as part of the -ed suffix. Even words that appeared

18 The information estimates for English segments were evaluated using a unigram model, which means that the model could not use context that may be available to speakers, yielding higher information estimates than in a trigram model like the one used in the current study.


only once in the corpora used in §3.5.3 had an information estimate of 21.427. Fol-

lowing the assumption that the information estimate of a morpheme is distributed

between the linguistic units that comprise it, a word-final stem /t/ in those rare words

would have to get at least 9.952/21.427 = 0.464 or more than 45% of the amount of

information that the entire stem contains in order to resist deletion better than the

-ed suffix.
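The comparison in this paragraph can be restated as a short calculation. The sketch below uses the three estimates reported in the text (1.526 for /t/, 9.952 for -ed, 21.427 for a word that appears once). The equal weights of 0.5 and the even split of a word's information over four segments, as in (3.42), are assumptions made purely for illustration, as is the function name.

```python
T_INFO = 1.526          # information estimate of /t/ (from the text)
ED_INFO = 9.952         # information estimate of the -ed morpheme (from the text)
RARE_WORD_INFO = 21.427  # estimate for a word that appeared once (from the text)


def combined_estimate(segment_info, morpheme_info, w_s=0.5, w_m=0.5):
    """Weighted sum of segment and morpheme information; weights are hypothetical."""
    return w_s * segment_info + w_m * morpheme_info


# Share of a rare word's information that a stem-final /t/ would need
required_share = ED_INFO / RARE_WORD_INFO

# Affixal /t/ against a stem-final /t/ in a hypothetical four-segment rare word
affix_t = combined_estimate(T_INFO, ED_INFO)
stem_t = combined_estimate(T_INFO, RARE_WORD_INFO / 4)
print(f"required share: {required_share:.3f}")
print(f"affix /t/: {affix_t:.3f}, stem /t/: {stem_t:.3f}")
```

With an even split the stem-final /t/ receives a quarter of the word's information, well short of the required share, so its combined estimate stays below that of the affixal /t/ for any positive morpheme weight.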

Under the assumptions that the information of a linguistic element (a word in this

case) is a weighted average of the information of its parts, and that the information of

morphemes is distributed among the segments that comprise the morpheme, MULE

predicts that an English stem-final /t/ would delete more than a /t/ in the -ed suffix.

The question that remains is therefore whether MULE has different predictions for

the Spanish plural -s suffix. In the following section I will attempt to answer this

question by measuring how much information the Spanish plural -s morpheme holds.

3.8.3 The information of Spanish plural -s morpheme

Introduction Paralleling the previous study, this study aims to establish the amount

of information contained in Spanish plural /-s/ and compare it to the amount of in-

formation contained in individual words.

Methods and materials The method used in this study is similar to the one used

in the previous study, with the following differences. Parts of speech were assumed to

be the ones used in CALLHOME Spanish Lexicon (Garrett et al., 1996). Data was

collected separately for nouns and adjectives. Predictability in context was established

using the Google Web 1T 5-gram, 10 European Languages corpus (Brants and Franz,

2009).

Results The information estimate of the -s morpheme using a two word context

was 1.856 for the plural morpheme of adjectives, and 2.652 for the plural morpheme

of nouns. By comparison, the information estimate of /s/ in Spanish is 1.997.19 Had

either of the Spanish plural -s adjectival and nominal morphemes been words, they

would have been the most frequent words in Spanish.

19 This is 30% higher than the information estimate of /t/ in English, but still the third lowest in Spanish.


Discussion The information estimate of the nominal plural morpheme is higher

than the information estimate of the adjectival plural morpheme. Everything else

being equal, the estimates predict the likelihood to preserve the plural -s marker as

reported by Poplack (1980).

Additionally, the information estimates of all plural -s in Spanish (< 3 bits of information) are significantly lower than the information estimates of the English -ed verbal

suffix (≈ 10 bits of information). Since the information estimates of the different plu-

ral -s are significantly lower than every word in Spanish, there are many more ways

to distribute the information of the entire word across the segments that comprise it

and still have higher information estimate than that of the plural -s markers. The

median information estimate in the CALLHOME Spanish Lexicon corpus is 20.137.

The ratio between the information estimate of nominal plural morpheme (2.652) and

the median information estimate for words in Spanish is 2.652/20.137 = 0.132. There-

fore, if the stem-final /s/ shares more than 14% of the information estimate of the

stem it is part of, it can resist weakening and deletion better than the nominal plural

morpheme -s.

Though the data does not entail that Spanish would have the reversal of the

English case, which would reflect lower likelihood of stem s-deletion than morpheme

s-deletion, it does suggest that such a reversal is likely. Moreover, the correct order

of deletion likelihoods between nouns and adjectives is predicted.

3.8.4 MULE’s predictions are measurable

Labov (1994) argues convincingly that functional explanations for variably applied

rules such as word-final /t/-deletion in American English should hold cross-linguistically.

He argues that if a linguistic asymmetry such as the asymmetry between the deletion

of morphemes vs. stems is reversed across languages, the asymmetry cannot be used

as an argument for variable application of a linguistic rule. It is important for any

account that is based on functional forces to withstand such criticism, and in this

section I showed that MULE successfully does so.

In this section I showed that (a) MULE does not make the same predictions for

cases of word-final /t/-deletion in American English and word-final /s/-deletion in

Spanish, (b) MULE suggests that the deletion of the past tense /-t/ morpheme

in English erases more information than the deletion of plural /s/ in Spanish, and

(c) MULE correctly predicts that plural markers on Spanish nouns would delete

less frequently than plural markers on Spanish adjectives. Importantly, I sketched

a principled method for testing the predictions made by MULE in other cases of

variably applied rules cross-linguistically.

3.9 Conclusion

In this chapter I presented a new theoretical approach, MULE. MULE integrates

two well-known ideas in linguistic theory. The first claims that speakers’ linguistic

behavior is influenced by the amount of information linguistic elements contain,

and the second that speakers attempt to reduce their articulatory effort while still

providing their auditors perceptual cues to keep linguistic elements distinct. In MULE

the information utility of linguistic elements motivates keeping linguistic elements

distinct from one another and justifies the expenditure of effort to achieve that goal.

MULE has several implications, but this chapter focused on one – predicting

the distribution of weakening processes in language, addressing Weinreich et al.’s

(1968) actuation problem with respect to weakening processes. MULE predicts that

when the information utility of linguistic elements is not high enough to justify the

expenditure of effort that is required by their perceptually distinct pronunciation, such

elements will be under pressure to weaken. MULE does not predict at what time

a pressure to weaken would evolve into an actual weakening process, nor what output

will be chosen for the weakening process (e.g. spirantization or tapping), but it does

predict which weakening processes are licensed in a given language at a given time.

As information utility promotes the preservation of linguistic elements, it lends

itself to be used as an approximation of faithfulness in OT. Likewise, effort can be used

to approximate markedness. I used OT implementations of MULE to account for the

language-specific distribution of weakening processes in American English and various

dialects of Arabic. The prediction of language-specific weakening processes pulls the

actuation of weakening processes from the extra-linguistic “wastebasket” (pragmatics

is just one of those) and reopens such processes for investigation in linguistic theory.

I presented several data-oriented studies that support MULE. I used a regression


study that showed that word-final obstruent deletion in American English is sensitive

to the informativity of the final obstruent, more so than to phonological factors such

as place of articulation, and information-theoretic factors such as predictability. I also

showed experimentally that a functional load approach to the same problem yields

wrong predictions. Similarly, I demonstrated that word-final deletion in American

English necessitates the use of informativity rather than predictability, as unpre-

dictable /k/ and /p/ deleted less word finally than unpredictable /t/. Together those

studies show that balancing effort and information estimates is more predictive than

other accounts.

While this chapter tackles only weakening processes, MULE is by no means limited

to predicting weakening. MULE’s key insights, that information is always useful, and

that balancing information with effort is crucial for the correct prediction of linguistic

phenomena, can be used to account for other linguistic phenomena in phonology,

morphology, syntax and psycholinguistics.

In phonology, MULE has several predictions that have not been discussed in this

chapter. First, MULE predicts that the linguistic elements in prominent positions

(prevocalic positions, stressed syllables) will provide more information than those that

are located in less prominent positions (codas, unstressed syllables). Like language-

specific weakening, such phenomena would be difficult to explain in other frameworks.

Second, phonological processes that result in added effort for perceptually motivated

reasons such as epenthesis and fortition are also expected to follow from an attempt to

preserve information (and add as little superfluous information as possible). Finally,

MULE provides a framework that can explain how a language that has a number of

coda-deletion processes may eventually lose all codas (Blevins, 2004), as any language

that severely restricts the inventory of segments that can appear in coda positions

decreases the information content in codas, which may eventually lead to the elision

of all codas.

All of MULE’s predictions apply to morphology and syntax in much the same way

as they apply to phonology. Affixation and cliticization are expected to follow from

low information content, and affixes and clitics with high information content are expected

to require better perceptual cues and allow a greater expenditure of effort than those

that provide less information, extending Haspelmath (2008). However, MULE has


an additional prediction that can be tested empirically. Information preservation in

MULE focuses on the amount of information speakers estimate linguistic elements

contain, based on exposure to such elements. Therefore MULE assigns information

only to observed linguistic elements. As such, unobserved linguistic elements such

as zero affixes and other inaudible elements would not be assigned information util-

ity. Therefore, MULE predicts an asymmetry between observed (marked) linguistic

elements and unobserved linguistic elements. A word with no plural agreement in En-

glish will be regarded as lacking plural marking, not as being zero-marked as singular.

Evidence for the predicted asymmetry may come from psycholinguistic experiments

or from environments in which an element cannot be marked for some property (mass

nouns and prototype readings for noun plurality, missing subject for verb agreement

etc.).

For psycholinguistics, MULE makes several predictions that differ from those made

by state-of-the-art information-theoretic accounts. First, MULE focuses on the

speaker, not on speaker-listener interaction. It therefore expects speakers to follow

its prediction (putting more effort into elements with high information utility) even

in the absence of a listener. Second, MULE assumes that speakers manipulate their

articulatory effort in response to the information utility of linguistic elements, while

several other information-theoretic accounts (Aylett and Turk, 2004; Levy and Jaeger,

2007; Jaeger, 2010) assume that speakers manipulate articulation time. The difference

is expected to emerge in linguistic elements that require longer articulation time to

begin with, and in slowed-down language production such as typing. Cohen Priva

(2010) showed that frequency effects do appear in typing, and it would be interesting

to see whether predictability and informativity effects emerge in typing as well.

Chapter 4

Lexicon, usage and information

4.1 Introduction

Chapter 3 showed that the balance between effort and information utility can lead to

the actuation of weakening processes. Segments whose information utility is not high

enough to justify the expenditure of effort that is required by their perceptually dis-

tinct pronunciation will be under a pressure to weaken. Chapter 3 focused on weaken-

ing processes and treated articulatory effort and perceptual distinctness as one factor.

But the three-way interaction between information utility, perceptual distinctness and

articulatory effort can take other forms. Perceptually prominent positions make ideal

locations for the placement of segments whose information utility is high. By placing

highly informative segments in perceptually prominent positions, a language can make

it easier to transmit the information carried across a communication channel. Any

positive correlation between perceptual prominence and information utility would in-

dicate that information utility affects not only performance-related phenomena such

as segment duration and deletion (chapter 2), or change and competence-related phe-

nomena such as the actuation of weakening processes (chapter 3), but also the lexicon

and usage patterns of language.

This chapter focuses on stressed syllables, a perceptually prominent position

(Beckman, 1998; Steriade, 1997; Smith, 2002; Giavazzi, 2010, among others). Stressed

syllables (stress domains in Giavazzi 2010) benefit from several phonetic differentia-

tions from unstressed syllables. They are louder, longer and their vowels have greater

sonority. In Beckman (1998) and Smith (2002) phonetic prominence is taken to be an

inherent property of the stressed syllable. One consequence of phonetic prominence is

that stressed syllables potentially exhibit more contrasts and resist phonetic neutral-

ization better than less prominent positions (Beckman, 1998). Phonetic prominence

also leads to phonological prominence effects, such as high nucleus sonority and low

edge sonority (Smith, 2002). For Giavazzi (2010) the goal of the phonetic differ-

entiation from unstressed syllables is to make stressed syllables more phonetically

prominent than unstressed syllables. Speakers put in this additional effort in order

to preserve the metrical structure of the language (Hayes, 1995).

For all current phonological accounts the phonetic prominence of stressed sylla-

bles is not conditioned on the syllable’s content nor on the information utility of its

components. However, since stressed syllable vowels allow more contrasts and block

neutralization processes, it is not surprising that stressed syllables tend to be longer

and provide more information about a word’s identity (Piantadosi et al., 2009). The

question posed in this chapter is a different one: While stressed syllable nuclei tend

to have more contrasts than unstressed syllables (opposite cases exist), it is rarely the

case that stressed syllable onsets allow more contrasts. Beckman (1998) mentions a

single case of neutralization in the onsets of stressed syllables. Giavazzi (2010, §3.4.2)

argues that such processes are more common, but the two cases she provides (from

Italian and Finnish) block neutralization following a stressed syllable. Therefore,

there are few phonologically grounded reasons to believe that the onsets of stressed

syllables hold more information than onsets of unstressed syllables. However, if the

onsets of stressed syllables are more perceptually prominent for metrical reasons, lan-

guage could make use of that fact and preferentially place highly informative segments

in those perceptually prominent positions.¹ Adams et al. (2009) show that this prediction

holds for German – highly informative segments are more likely to be found

in the onsets of stressed syllables. The goal of this chapter is to ascertain whether

the prediction holds that highly informative segments would be preferentially placed

in the onsets of stressed syllables rather than in the onsets of unstressed syllables.

¹ Chapter 2 shows that onsets of stressed syllables in American English tend to have longer duration and are less likely to delete.


I test the relationship between stressed syllables’ phonetic prominence and infor-

mation content in three languages, after controlling for possible phonological expla-

nations. I show that in American English, Egyptian Arabic and Spanish, information

utility is positively correlated with the likelihood to appear in stressed syllables. This

correlation shows that languages do indeed make use of stressed syllables’ prominence

to improve the transmission of information, and that information utility does have an

effect on the lexicon and the usage patterns of language. As Giavazzi (2010) predicts,

phonological factors such as place of articulation are not consistently correlated with

stressed syllables – even if they do increase the likelihood to appear in a stressed

onset in some languages, that tendency disappears or is even reversed in other lan-

guages. In fact, with the exception of stop / fricative contrast, no phonological factor

is consistently correlated with positional prominence.

The finding that highly informative segments are preferentially placed in the on-

sets of stressed syllables has consequences outside of phonology by showing that in-

formation theoretic factors shape the lexicon and usage patterns of language. Like

similar studies (Piantadosi et al., 2009, 2011), the fact that the lexicon of languages

is adapted to improve the transmission of information cannot be regarded as a by-

product of linguistic performance. I discuss this consequence at length in §4.4.

4.2 Methodology and sources of data

4.2.1 The choice of test cases

In the three studies described below I investigate whether American English, Egyptian

Arabic and Spanish preferentially place highly informative segments in the onsets of

stressed syllables as German does. The reason that any one language does not suffice

is that any one language may prefer to place segments of some kind in stressed

syllables. It is only if several languages behave the same way that the argument can

be convincing. If a property promotes prominence in one language but not in others,

it should be regarded as language-specific.

The choice of stressed syllable onsets rather than other perceptually prominent

environments (Beckman, 1998) attempts to sidestep the possible reversal of the causal

relationship between prominence and information. Some prominent positions tend


to have high information content due to the organization of language. Languages

tend to have more roots than affixes, and therefore everything else being equal roots

(or at least root-initial positions) are less frequent, less predictable and hence more

informative than affixes. Similarly, stressed syllables tend to allow more vowels than

unstressed syllables, and on similar grounds will tend to have more information than

unstressed syllables. For the onsets of stressed syllables no reversed-causality exists.

Additionally, in order to have relatively uniform conditions for the corpus studies I

focus on intervocalic segments, which may be followed by a stressed vowel (prominent)

or unstressed vowel (less prominent).

Languages in which a stressed / unstressed contrast can be tested must fulfill several

conditions. First, they must have a stress system. In addition, there must exist pho-

netically annotated spoken corpora that include stress-related data and an estimate of

word frequencies. Finally, the languages chosen should have relatively different phono-

tactics and phonological constraints from one another. The languages I chose were

Spanish, for which phonetically annotated corpora were available, the CALLHOME

Spanish Lexicon (Garrett et al., 1996), and Egyptian Arabic, for which a similar

corpus exists, the LDC Egyptian Colloquial Arabic Lexicon (Kilany et al., 1997). I

generated similar data for English, using the CMU dictionary (Weide, 1998) for pho-

netic annotation and the Switchboard (Godfrey and Holliman, 1997) and Buckeye

(Pitt et al., 2007) corpora for word counts.

4.2.2 Statistical models

For all three models presented in this chapter, the following choices were made. All

models used logistic regressions in which the predicted variable was whether the

segment appeared in a prominent environment (the ratio between prominent and

non-prominent environments). Every occurrence of every segment in every word was

considered, provided that it appeared in intervocalic contexts. Since the model tests

for usage preferences, each data point was weighted by its frequency in the language.

Thus, a segment in a very frequent word was taken to be more important than a

segment in an infrequent word.
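
The setup described above can be sketched as a frequency-weighted logistic regression. This is a minimal illustration rather than the models reported in this chapter: the toy rows, the single predictor and the plain gradient-ascent fitting routine are all assumptions made here so that the example runs without external libraries.

```python
import math


def fit_weighted_logistic(rows, n_iter=5000, lr=0.5):
    """Fit a logistic regression by gradient ascent on the weighted
    log-likelihood.

    rows: list of (features, stressed, weight), where `stressed` is 1 if
    the segment precedes a stressed vowel and `weight` is word frequency.
    Returns the coefficients, intercept first.
    """
    n_feat = len(rows[0][0])
    beta = [0.0] * (n_feat + 1)
    total_w = sum(w for _, _, w in rows)
    for _ in range(n_iter):
        grad = [0.0] * (n_feat + 1)
        for x, y, w in rows:
            xs = [1.0] + list(x)
            z = sum(b * v for b, v in zip(beta, xs))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, v in enumerate(xs):
                grad[j] += w * (y - p) * v   # each row counts `w` times
        beta = [b + lr * g / total_w for b, g in zip(beta, grad)]
    return beta


# Hypothetical data: ([informativity in bits], stressed?, word frequency).
toy_rows = [
    ([1.0], 1, 10), ([1.0], 0, 30),   # low informativity: mostly unstressed
    ([3.0], 1, 30), ([3.0], 0, 10),   # high informativity: mostly stressed
]
intercept, slope = fit_weighted_logistic(toy_rows)
print(round(intercept, 3), round(slope, 3))
```

Because each row contributes in proportion to its weight, a segment in a frequent word moves the fit more than the same segment in a rare word, which is the usage-based weighting the text describes.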

First, a purely phonological model was evaluated to establish a baseline. Giavazzi

(2010) convincingly shows that the range of processes that apply to the edges of a


stressed domain (the consonants that precede and follow the stressed nucleus) is very

limited, and can apply solely to features that are affected by the increase in sub-glottal

pressure in stressed syllables (and therefore affect not only the onsets of stressed

syllables but also the onsets of the following syllables). Giavazzi lists the features

spread glottis, constricted glottis, voicing, continuant, strident and delayed release

as sensitive to stress. Additionally, phonetic features such as consonant duration

and VOT are typically affected by stress. Stress is not supposed to affect features

such as place of articulation or nasalization of consonants. Not all of these effects

affect the onset of stressed syllables. Assibilation of Finnish /t/ in verbal affixes is

prevented in the onsets of unstressed syllables following the stressed syllable (Anttila,

2006). I did not exclude from the model phonological features which are not predicted

to be affected by stress. Previous accounts have described prominent positions as

licensing position-specific markedness constraints, and could therefore predict that

other marked values such as non-coronal place of articulation might also be more

frequent in stressed syllables.² The pure phonological model might therefore be able

to distinguish between predictions made by an extension of the Beckman (1998)

account and the Giavazzi (2010) account.

The baseline phonology models were selected using the step() function (Hastie

and Pregibon, 1992; Venables and Ripley, 2002) in R (R Development Core Team,

2012) using forward / backward search. The search used the variables in table 4.1

and the distance of the segment from the beginning of the word and from the end of

the word, measured in segments, and logged.³ Distance from word-edge is supposed

to provide a rough approximation for a language’s preference to place stress word-

finally or word-initially, and reduce the correlation between the amount of information

a segment holds and its distance from the word’s edges. The emphasized factors are

the ones that are predicted by Giavazzi (2010) to be affected by stress.
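
The two distance variables can be illustrated with a short sketch. The 1-based, inclusive counting used here is inferred from the sample rows in table 4.2 rather than stated in the text, and the base of the logarithm (natural log) is an assumption; the function names and the transcription are illustrative.

```python
import math


def edge_distances(segments, index):
    """Distance of segments[index] from each word edge, in segments.

    Counting is 1-based and inclusive (an inferred convention): the first
    segment is at distance 1 from the start, the last at distance 1 from
    the end.
    """
    from_start = index + 1
    from_end = len(segments) - index
    return from_start, from_end


def logged(distances):
    """Natural log of each distance (the log base is an assumption)."""
    return tuple(math.log(d) for d in distances)


# Illustrative CMU-style transcription of "eternal", stress digits omitted.
eternal = ["IH", "T", "ER", "N", "AH", "L"]
t_dist = edge_distances(eternal, 1)   # the /t/
n_dist = edge_distances(eternal, 3)   # the /n/
print(t_dist, n_dist)                 # (2, 5) (4, 3)
print(logged(t_dist))
```

The raw distances for /t/ and /n/ agree with the start and end distances listed for eternal in table 4.2.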

The logistic regression model is of the same family of models as MaxEnt OT

(Goldwater and Johnson, 2003). It allows its constraints to “gang up”: a strong

constraint can be overcome by several weaker constraints. Therefore, it uses a weak

² Beckman (1998) listed stressed syllables as a prominent position, but did not consider the onsets of stressed syllables as a prominent environment in their own right. Subsequent accounts (Smith, 2002; Giavazzi, 2010) discussed onset-specific effects.

³ Note that, in order to reduce collinearity, the variables in table 4.1 are not purely phonological variables. Only oral stops are ‘stops’, only voiced obstruents are ‘voiced’.


Table 4.1: Segment phonological properties∗

Variable      Value     Segments                        Comment
Place         glottal   /ʔ,h/
              coronal   /t,d,ð,θ,s,z,t͡ʃ,d͡ʒ,l,r,ɾ,ɹ/
              labial    /p,b,f,v,w/
              dorsal    /k,g,x,ɲ,j,ʁ/
              radical   /q,ħ,ʕ/                         Standard Arabic /q/ is listed as radical following Kilany et al. (1997)
Strident      binary    /s,z,ʃ,t͡ʃ,d͡ʒ,sˠ/
Palatal       binary    /j,ʃ,t͡ʃ,d͡ʒ,ɲ/
Interdental   binary    /θ,ð,ðˠ/
Voicing       binary    /b,d,g,v,z,d͡ʒ,dˠ,ðˠ/
Glide         binary    /w,j/
Liquid        binary    /r,ɾ,ɹ,l/
Stop          binary    /p,t,k,b,d,dˠ,g,q,ʔ/            includes Spanish spirantized stops
Affricate     binary    /t͡ʃ,d͡ʒ/
Emphatic      binary    /sˠ,dˠ,tˠ,ðˠ/                   in Arabic

∗Emphasized factors are the ones Giavazzi (2010) expects to be affected by stress domains

form of conjunction of constraints: violating two lower-ranked constraints may be

worse than violating a higher-ranked constraint, as opposed to standard OT (Prince

and Smolensky, 1993). However, no specific interaction terms were introduced into

the model, and the model cannot fit the conjunction of constraints independently.

Preference for voiceless stridents will surface as a product of preferences for stridents

and voiceless obstruents, and not as an independent feature.
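
The "gang up" behavior can be seen in a small numerical sketch of a MaxEnt-style grammar. The constraint weights and the two candidates below are hypothetical and serve only to show how summed penalties from weaker constraints can exceed the penalty of a single stronger one.

```python
import math


def maxent_probs(candidates, weights):
    """Probability of each candidate under a MaxEnt grammar.

    candidates: {name: list of violation counts, one per constraint}
    weights: list of non-negative constraint weights
    A candidate's score is exp(-sum of weight * violations).
    """
    scores = {
        name: math.exp(-sum(w * v for w, v in zip(weights, viols)))
        for name, viols in candidates.items()
    }
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}


# One strong and two weaker constraints (hypothetical weights).
weights = [3.0, 2.0, 2.0]
candidates = {
    "violates_strong": [1, 0, 0],     # total penalty 3.0
    "violates_two_weak": [0, 1, 1],   # total penalty 2.0 + 2.0 = 4.0
}
probs = maxent_probs(candidates, weights)
print(probs)
```

Under strict ranking the candidate that violates the strong constraint would lose outright; here it is the more probable candidate, because the two weaker violations add up to a larger penalty.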

After a phonological model was established, I added two information theoretic

variables. First, I added the negative log predictability of the segment given all the

preceding segments from the beginning of the word, following van Son and van Santen

(2005) (see chapter 2). This measurement assesses how much information the listeners

gain by understanding what the segment is, given that they know what the preceding


segments in the same word are. Segments that are less predictable in the context they

appear in provide more information than segments that are predictable in context. In

the cherry-picked examples in (4.1), redundant (completely predictable from context;

they provide 0 bits of information) /p,t,k/ appear in the onset of stressed syllables. In

contrast, /p,t,k/ in (4.2) are very informative (unpredictable from context, Pr < 2⁻⁷;

they provide > 7 bits of information). The words in (4.1) are often reduced so that

they do not include the redundant stressed onsets (euro, rep, app) while such processes

do not affect the words in (4.2).

(4.1) Redundant /p,t,k/ (= 0 bits):

European (/p/), reputation (/t/), application (/k/)

(4.2) Informative /p,t,k/ (> 7 bits):

capacity (/p/), eternal (/t/), undercover (/k/)
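
The negative log predictability measure can be sketched as follows. The toy lexicon, its frequencies and the use of letters as stand-ins for segments are hypothetical, and no smoothing is applied; the estimates reported in this chapter were computed from corpus data and may differ in such details.

```python
import math
from collections import defaultdict


def prefix_surprisal(lexicon):
    """Negative log2 probability of each segment given the preceding
    segments of the same word.

    lexicon: {tuple of segments: word frequency}
    Returns {(word, position): bits of information}.
    """
    # Frequency mass of every word-initial prefix, including the empty one.
    prefix_counts = defaultdict(float)
    for word, freq in lexicon.items():
        for i in range(len(word) + 1):
            prefix_counts[word[:i]] += freq
    bits = {}
    for word in lexicon:
        for i in range(len(word)):
            p = prefix_counts[word[:i + 1]] / prefix_counts[word[:i]]
            bits[(word, i)] = -math.log2(p)
    return bits


# Hypothetical three-word lexicon with made-up frequencies.
toy_lexicon = {("r", "e", "p"): 3, ("r", "e", "d"): 1, ("r", "o", "t"): 4}
bits = prefix_surprisal(toy_lexicon)
print(bits[(("r", "o", "t"), 2)])   # "t" is certain after "ro"
print(bits[(("r", "e", "p"), 1)])   # "e" follows "r" half the time
```

A segment that is the only possible continuation of its prefix carries 0 bits, like the redundant onsets in (4.1), while a segment with many competitors after the same prefix carries several bits, like those in (4.2).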

Additionally, I added the average (expected) value of the segment’s predictability

across the entire language. This is the segment’s informativity (see chapter 2). The

two variables are collinear, and I therefore residualized the contextual predictability

using informativity (the average value of contextual predictability). Adams et al.

(2009) found that informativity significantly predicts the distribution of onsets in

German, but did not control for predictability. If informativity affects the distribution

of consonants in the onset of stressed syllables, it would mean that segments that are

less predictable across the board are more likely to be found in the onsets of stressed

syllables, regardless of the amount of information they provide in that particular

context.
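
The relation between the two information theoretic variables can be sketched as below. Informativity is computed as the frequency-weighted mean of the contextual values, and predictability is residualized with a weighted linear regression that includes an intercept; the text does not specify the exact residualization procedure, so that choice, like the toy observations, is an assumption.

```python
def informativity(observations):
    """Frequency-weighted mean of negative log predictability per segment.

    observations: list of (segment, bits, weight).
    """
    totals, weights = {}, {}
    for seg, bits, w in observations:
        totals[seg] = totals.get(seg, 0.0) + w * bits
        weights[seg] = weights.get(seg, 0.0) + w
    return {seg: totals[seg] / weights[seg] for seg in totals}


def residualize(y, x, w):
    """Residuals of a weighted least-squares regression of y on x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]


# Hypothetical observations: (segment, bits in context, word frequency).
obs = [("t", 0.0, 3), ("t", 8.0, 1), ("k", 5.0, 2)]
info = informativity(obs)                       # {"t": 2.0, "k": 5.0}
y = [bits for _, bits, _ in obs]
x = [info[seg] for seg, _, _ in obs]
w = [weight for _, _, weight in obs]
resid = residualize(y, x, w)
print(info, resid)
```

After residualization the predictability variable only records how much more or less informative a segment is in a particular context than that segment is on average, which removes the overlap between the two predictors.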

In contrast to change processes which seem to affect every segment of a certain

type (all word-final /t/s), splitting the informativity and local predictability in the

case of positional prominence has no pre-theoretic justification. If language is sensitive

to segment identity when placing highly informative segments in prominent positions

(it matters that the /p/ in the word apart is a /p/), segment informativity is a

better approximation of the amount of information a segment carries. However, if

language is not sensitive to the identity of the segment in question (the /p/ in apart

just happens to be a /p/, only the amount of information it carries matters), then

segment predictability is a better approximation for the amount of information a


segment carries. This question can be answered experimentally, and

across a number of languages.

4.3 Studies

4.3.1 American English

Method and materials The study follows the outline listed in §4.2.2, with the

following adjustments. Word counts were collected from the Buckeye (Pitt et al.,

2007) and Switchboard (Godfrey and Holliman, 1997) corpora. The CMU dictionary

(Weide, 1998) was used to provide each word with its phonology. Only intervocalic

coronals, labials and dorsals were used for the study. The preceding vowel could

have any stress, but the following vowel was limited to primary stress and unstressed

vowels, excluding secondary stressed vowels since they do not have an equivalent in

Spanish and Arabic. This yielded 17,021 segments in the relevant contexts. Each

segment was weighted by the number of times it was observed in that context (the

number of times that word was used). Table 4.2 provides a sample of the data that

was used in the study.

Results In the pure phonology model, distance from word start position significantly

lowered the odds of the segment to be in a stressed syllable onset (p < 10⁻¹⁵).

Dorsals and labials were significantly more likely to favor stressed onsets than coronals

(p < 10⁻¹⁵), as Dmitrieva and Anttila (2008) showed. Stridents, nasals, stops,

glides and liquids were more likely to be found in stressed syllables (p < 10⁻¹⁵), as

were affricates (p < 0.01). Dentals, palatals and voiced obstruents were less likely to

appear in stressed syllables (p < 10⁻¹⁰, p < 10⁻¹⁵ and p < 10⁻¹⁵ respectively). The

full model can be found in table 4.3.

After refitting the model to allow information theoretic measurements to account

for the variance, both high informativity and high negative log contextual predictability

promoted the chance of the segment to appear in stressed syllables (p < 10⁻¹⁵ and

p < 0.0001 respectively). But their inclusion changed the directionality of place of articulation.

Labials were now less likely to be found in stressed syllables than coronals

and dorsals (p < 0.05). Some other variables were no longer significant: affricates and


Table 4.2: Sample American English stress data

word (a)                  eternal   eternal   undercover   reputation   reputation   undercover
segment (a)               /t/       /n/       /k/          /t/          /ʃ/          /v/
weight (frequency) (b)    1         1         5            43           43           5
stressed (c)              true      false     true         true         false        false
place                     coronal   coronal   dorsal       coronal      coronal      labial
stop                      true      false     true         true         false        false
liquid                    false     false     false        false        false        false
nasal                     false     true      false        false        false        false
glide                     false     false     false        false        false        false
voiced                    false     false     false        false        false        true
affricate                 false     false     false        false        false        false
strident                  false     false     false        false        true         false
palatal                   false     false     false        false        true         false
dental                    false     false     false        false        false        false
start dist.               2         4         5            6            8            7
end dist.                 5         3         4            5            3            2
informativity             1.4638    1.5211    2.4174       1.4638       4.0980       1.6006
predictability (resid.)   9.8404    -2.0960   5.1352       -2.0753      -3.0294      -2.1248

(a) Word and segment were not used in the regression.
(b) Word frequency was used to weigh each data point.
(c) Primary stress or no stress is the predicted value.

voiced obstruents did not have a significant trend in any direction. The full model

can be found in table 4.4.
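
The size of the two information theoretic effects can be read off table 4.4 by converting the estimates into odds ratios. The sketch below assumes that the estimates are on the log-odds scale, as is standard for logistic regression output, and that both predictors are measured in bits; the function name is illustrative.

```python
import math

# Estimates copied from table 4.4 (assumed to be on the log-odds scale).
informativity_coef = 0.543886
predictability_coef = 0.009879


def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds of appearing in a stressed onset
    when the predictor increases by `delta`, other predictors held fixed."""
    return math.exp(coef * delta)


print(round(odds_ratio(informativity_coef), 2))    # about 1.72
print(round(odds_ratio(predictability_coef), 3))   # about 1.01
```

On this reading, one additional bit of informativity multiplies the odds of appearing in a stressed onset by roughly 1.7, whereas one additional bit of residualized contextual predictability raises the odds by about one percent.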

Discussion As Adams et al. (2009) observed for German, a high amount of information

increases the likelihood that a segment appears in a stressed syllable (a prominent

position). Several phonological factors did not improve the model following the inclu-

sion of information theoretic variables. This suggests that information, rather than

any of these variables, is indeed a driving force in deciding which segment will appear

in which context.

The significance of the distance from the beginning of the word follows from

American English stress often falling on the first syllable. Of the factors predicted by

Giavazzi (2010), stridents and stops were indeed more likely to be found in stressed


Table 4.3: American English pure phonology model

                                 Estimate   Std. error   z value    Pr(>|z|)
intercept                        -0.07747   0.02373      -3.265     0.00110    **
distance from word start (log)   -2.47209   0.01469      -168.227   <2e-16     ***
poa is dorsal                    1.94869    0.01456      133.846    <2e-16     ***
poa is labial                    1.00560    0.01226      82.054     <2e-16     ***
dental                           -0.25121   0.03709      -6.772     1.27e-11   ***
palatal                          -1.78351   0.04603      -38.750    <2e-16     ***
strident                         2.04828    0.02184      93.807     <2e-16     ***
stop                             1.18192    0.01573      75.120     <2e-16     ***
nasal                            1.12059    0.01900      58.993     <2e-16     ***
glide                            1.57705    0.03121      50.523     <2e-16     ***
liquid                           0.93158    0.02194      42.452     <2e-16     ***
voiced                           -0.15663   0.01096      -14.290    <2e-16     ***
affricate                        0.17152    0.05791      2.962      0.00306    **

syllables. That marked places of articulation (labial and dorsal) were not more or less

likely to be found in stressed syllables than the less marked coronals also supports

Giavazzi’s account, as she predicts that place of articulation will not be affected by

stressed positions. That nasals, liquids and glides also prefer stressed syllables has

no theoretic motivation.

4.3.2 Spanish

Method and materials The study follows the outline listed in §4.2.2, with the

following adjustments. Word counts and dictionary representations were collected

from the CALLHOME Spanish Lexicon (Garrett et al., 1996). Only conversational

spoken Spanish (excluding news broadcasts) was used for calculating word counts and

predictability. The dictionary provided 45,147 intervocalic segments. Each segment

was weighted by the number of times it was observed in that context (the number of

times that word was used). Table 4.5 provides an example of the data used in the

study.


Table 4.4: American English information model

                                 Estimate    Std. error   z value    Pr(>|z|)
intercept                        -0.533199   0.025331     -21.050    <2e-16     ***
distance from word start (log)   -2.403947   0.015785     -152.291   <2e-16     ***
poa is dorsal                    1.427377    0.016559     86.198     <2e-16     ***
poa is labial                    -0.045890   0.019453     -2.359     0.0183     *
dental                           -1.738853   0.041945     -41.456    <2e-16     ***
palatal                          -2.859653   0.035122     -81.420    <2e-16     ***
strident                         1.287754    0.023963     53.740     <2e-16     ***
stop                             0.604764    0.017107     35.352     <2e-16     ***
nasal                            0.817213    0.019038     42.924     <2e-16     ***
glide                            1.011187    0.031595     32.005     <2e-16     ***
liquid                           0.153127    0.023750     6.448      1.14e-10   ***
voiced                           -0.017785   0.011437     -1.555     0.1199
informativity                    0.543886    0.007705     70.588     <2e-16     ***
predictability                   0.009879    0.002286     4.321      1.55e-05   ***

Results In the pure phonology model, distance from word end position significantly

increased the odds of the segment to be in a stressed syllable (p < 10⁻¹⁵). Distance

from the beginning of the word decreased the odds of appearing in a stressed syllable,

even though stress in Spanish tends to appear in one of the final two syllables, so that

segments further from the beginning of the word should be more likely to be stressed.

Dorsals and labials were significantly more likely to appear in stressed

syllables than coronals (p < 10⁻¹⁵). Stridents, liquids, voiced obstruents, nasals,

stops, glides and palatals were more likely to appear in stressed syllables (all at

p < 10⁻¹⁵). Only affricates were less likely to appear in stressed syllables (p < 10⁻¹⁵).

The full model can be found in table 4.6.



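After refitting the model to allow information theoretic measurements to account

for the variance, both high informativity and high negative log contextual predictability

promoted the chance of the segment to appear in stressed syllables (at p < 10⁻⁷

and p < 10⁻¹⁵ respectively). Their inclusion removed the significance of distance from

the beginning of the word, possibly because of its collinearity with predictability. The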


Table 4.5: Sample Spanish stress data

word                      amarillo   amarillo   amarillo   comedia    palabras
gloss                     ‘yellow’   ‘yellow’   ‘yellow’   ‘comedy’   ‘words’
phone                     /m/        /r/        /j/        /m/        /l/
weight                    24         24         24         5          156
prominent                 false      true       false      true       true
place                     labial     coronal    dorsal     labial     coronal
stop                      false      false      false      false      false
liquid                    false      true       false      false      true
nasal                     true       false      false      true       false
glide                     false      false      true       false      false
voiced                    false      false      false      false      false
affricate                 false      false      false      false      false
strident                  false      false      false      false      false
palatal                   false      false      true       false      false
start dist.               2          4          6          3          3
end dist.                 6          4          2          5          6
informativity             2.9494     1.3772     2.7809     2.9494     2.8573
predictability (resid.)   2.7935     0.5244     -2.3598    -0.8619    2.4028

full model can be found in table 4.7.

Discussion As is the case in German and English, a high amount of information

increases the likelihood of a segment to appear in a stressed syllable. No variable lost

its significance, but the incorrect influence for distance from beginning of the word was

corrected by the inclusion of information theoretic variables. It is interesting to note

that even in the pure phonological model, obstruent voicing had the opposite influence

than its equivalent in the pure phonological model for English stressed syllables.

Voiced obstruents were less likely to appear in the onset of stressed American English

onsets, but more likely to appear in the onset of Spanish onsets. The inclusion of

obstruent voicing in either model is likely arbitrary, and signifies language-specific

trends, rather than linguistic, cognitive, articulatory or perceptual biases.


Table 4.6: Spanish pure phonology model

                                 Estimate    Std. error   z value    Pr(>|z|)
intercept                        -3.533203   0.034280     -103.07    <2e-16 ***
distance from word end (log)      1.339307   0.007038      190.30    <2e-16 ***
poa is dorsal                     0.577772   0.013278       43.51    <2e-16 ***
poa is labial                     0.250386   0.009618       26.03    <2e-16 ***
strident                          1.422171   0.031724       44.83    <2e-16 ***
affricate                        -1.211360   0.039524      -30.65    <2e-16 ***
liquid                            1.177949   0.031318       37.61    <2e-16 ***
distance from word start (log)   -0.098767   0.007710      -12.81    <2e-16 ***
voiced                            0.142255   0.011786       12.07    <2e-16 ***
nasal                             0.983487   0.030220       32.54    <2e-16 ***
stop                              0.898821   0.031334       28.68    <2e-16 ***
glide                             0.653140   0.041628       15.69    <2e-16 ***
palatal                           0.360191   0.025861       13.93    <2e-16 ***
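The estimates in table 4.6 can be read as changes in the odds of appearing in a stressed syllable. The following is a minimal sketch, assuming the estimates are log-odds coefficients of a logistic regression (as the references to odds and the z statistics suggest); the function name `odds_ratio` and the selection of rows are illustrative choices, not part of the original analysis.

```python
import math

# Log-odds estimates copied from Table 4.6 (Spanish pure phonology model).
ESTIMATES = {
    "distance from word end (log)": 1.339307,
    "poa is dorsal": 0.577772,
    "affricate": -1.211360,
    "distance from word start (log)": -0.098767,
}


def odds_ratio(log_odds: float) -> float:
    """Convert a logistic-regression coefficient (log-odds) to an odds ratio."""
    return math.exp(log_odds)


if __name__ == "__main__":
    for name, beta in ESTIMATES.items():
        direction = "raises" if beta > 0 else "lowers"
        print(f"{name}: odds ratio {odds_ratio(beta):.3f} ({direction} the odds of stress)")
```

An odds ratio above 1 corresponds to a positive estimate (the variable favors stressed syllables), and an odds ratio below 1 to a negative one. For the two distance variables the ratio applies per unit of log distance.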

4.3.3 Egyptian Arabic

Method and materials The study follows the outline listed in §4.2.2, with the
following adjustments. Word counts and dictionary representations were collected from
the LDC Egyptian Colloquial Arabic Lexicon (Kilany et al., 1997). Geminates were
considered to be a single segment followed by a gemination symbol. Pharyngeals,
glottals and segments which are not part of the native Egyptian Arabic inventory
(/v, q, d͡ʒ/) were not included. Only conversational spoken Arabic was used for
calculating word counts and predictability. This procedure yielded 12,485 intervocalic
segments. Each segment was weighted by the number of times it was observed in that
context (the number of times that word was used).

Table 4.7: Spanish information model

                                 Estimate    Std. error   z value    Pr(>|z|)
intercept                        -3.747105   0.042432     -88.309    <2e-16 ***
distance from word end (log)      1.306119   0.007197     181.476    <2e-16 ***
poa is dorsal                     0.579003   0.014450      40.069    <2e-16 ***
poa is labial                     0.218332   0.012108      18.032    <2e-16 ***
strident                          1.468424   0.035287      41.613    <2e-16 ***
affricate                        -1.193195   0.041861     -28.504    <2e-16 ***
liquid                            1.211737   0.033614      36.049    <2e-16 ***
distance from word start (log)    0.013644   0.009149       1.491    0.136
voiced                            0.163919   0.012085      13.564    <2e-16 ***
nasal                             1.047424   0.033670      31.109    <2e-16 ***
stop                              0.899327   0.033996      26.454    <2e-16 ***
glide                             0.706310   0.044466      15.884    <2e-16 ***
palatal                           0.281682   0.026142      10.775    <2e-16 ***
predictability                    0.049276   0.002172      22.691    <2e-16 ***
informativity                     0.035828   0.006269       5.715    1.1e-08 ***

Results In the pure phonology model, distance from the end of the word significantly
increased the likelihood that the segment appears in a stressed syllable, as stress in
Arabic tends to fall on the final or penultimate syllable (in some cases on the
antepenultimate syllable), and the model has no other way to express that the onsets
of word-final light syllables tend not to be stressed. As in Spanish, distance
from the beginning of the word significantly decreased the likelihood that the segment
appears in a stressed syllable. Dorsals were less likely than coronals to appear in
stressed syllables and labials more likely than coronals (both at p < 10^-15). Nasals,
glides and voiced obstruents were less likely to appear in stressed syllables (all at
p < 10^-15). Palatals, stops and emphatics were more likely to appear in stressed
syllables (at p < 10^-15, p < 10^-9 and p < 0.001 respectively). The full
model can be found in table 4.9.



In the information model, informativity and predictability promoted the chance of the
segment to appear in stressed syllables (both at p < 10^-15). The effect of distance
from the beginning of the word was reversed and promoted the likelihood that a segment
would appear in stressed syllables (p < 10^-15), though it is not clear why. Finally,
emphatic segments were no longer different from non-emphatics. The full model can be
found in table 4.10.

Table 4.8: Sample Egyptian Arabic stress data

word                     /abadan/   /abadan/   /ʕaʃara/   /ʕaʃara/   /gamila/     /gamila/
gloss                    ‘never’    ‘never’    ‘ten’      ‘ten’      ‘beautiful’  ‘beautiful’
phone                    /b/        /d/        /ʃ/        /r/        /m/          /l/
weight                   55         55         114        114        18           18
prominent                false      false      false      false      true         false
place                    labial     coronal    coronal    coronal    labial       coronal
stop                     true       true       false      false      false        false
liquid                   false      false      false      true       false        true
nasal                    false      false      false      false      true         false
glide                    false      false      false      false      false        false
voiced                   true       true       false      false      false        false
affricate                false      false      false      false      false        false
strident                 false      false      true       false      false        false
palatal                  false      false      true       false      false        false
emphatic                 false      false      false      false      false        false
start dist.              2          4          3          5          3            5
end dist.                6          4          5          3          5            3
informativity            2.7182     2.3157     2.5958     2.0695     2.4511       1.5643
predictability (resid.)  2.1582     -2.2409    -1.2295    -2.1346    -0.5913      -1.5808

Discussion As is the case with German, English and Spanish, a high amount of
information increases the likelihood that a segment appears in a stressed syllable.
The distinct effect of emphatics disappeared with the inclusion of the information
theoretic variables, which suggests that there is no phonetic justification for
including it.

Even in the pure phonological model, place of articulation had a different effect
than in Spanish and English, with dorsals being less likely to appear in stressed
syllables. This undermines any attempt to rely on place of articulation as a reason
for the distribution of segments in stressed syllables, in agreement with Giavazzi
(2010). Additionally, as in English and Spanish, Arabic stops were preferentially
placed in the onsets of stressed syllables. However, Arabic stridents did not have a
significant influence on a segment’s likelihood to appear in stressed syllables, unlike
in Spanish and English. Similarly, glides were significantly less likely to appear in
stressed syllables, unlike Spanish and English glides, and liquids did not significantly
affect segments’ likelihood to appear in stressed syllables, making their inclusion in
the English and Spanish models language-specific, rather than cross-linguistic.

Table 4.9: Egyptian Arabic pure phonology model

                                 Estimate    Std. error   z value    Pr(>|z|)
intercept                        -2.81070    0.06082      -46.215    <2e-16 ***
distance from word end (log)      1.81330    0.02890       62.750    <2e-16 ***
poa is dorsal                    -0.32763    0.03174      -10.323    <2e-16 ***
poa is labial                     1.00953    0.02771       36.430    <2e-16 ***
nasal                            -0.71157    0.02819      -25.244    <2e-16 ***
voiced                           -0.80284    0.03540      -22.676    <2e-16 ***
glide                            -0.91989    0.04091      -22.488    <2e-16 ***
palatal                           0.92086    0.03965       23.227    <2e-16 ***
distance from word start (log)   -0.30309    0.02522      -12.017    <2e-16 ***
stop                              0.18223    0.02837        6.424    1.33e-10 ***
emphatic                          0.18107    0.05126        3.532    0.000412 ***

4.4 General discussion

All three languages, Egyptian Arabic, American English and Spanish, have shown

evidence for a preference to place highly informative segments in stressed syllables.

The three languages have different phonologies and phonotactics, yet every one of

them systematically promoted highly informative segments to perceptually prominent

positions. This preference emerged even though, from the phonological point of view,
there are very few constraints on the onsets of stressed and unstressed syllables, and
even though only one of the phonological variables, stops, was consistently favored in
stressed syllables. It is unlikely that a high amount of information was randomly
associated with prominent syllables in all three languages when only one of the
phonological variables showed similar persistence.


Table 4.10: Egyptian Arabic information model

                                 Estimate    Std. error   z value    Pr(>|z|)
intercept                        -3.261618   0.073370     -44.454    <2e-16 ***
distance from word end (log)      1.393080   0.030419      45.797    <2e-16 ***
poa is dorsal                    -0.547353   0.040644     -13.467    <2e-16 ***
poa is labial                     0.851916   0.033111      25.729    <2e-16 ***
nasal                            -0.673206   0.030772     -21.877    <2e-16 ***
voiced                           -0.943729   0.036552     -25.819    <2e-16 ***
glide                            -0.637876   0.042465     -15.021    <2e-16 ***
palatal                           0.809559   0.040657      19.912    <2e-16 ***
distance from word start (log)    0.255801   0.028254       9.054    <2e-16 ***
stop                              0.257621   0.028802       8.945    <2e-16 ***
predictability                    0.256013   0.005794      44.189    <2e-16 ***
informativity                     0.202002   0.018399      10.979    <2e-16 ***

What made each of the three languages correlate information with perceptual

prominence? The lexicon and the frequencies of its usage are formed by phonological

processes and usage choices for words, roots and affixes. Giavazzi (2010) claims that

stress affects very few features among stressed syllables, and the studies presented in

this chapter support her claim. For argument’s sake I will assume that a wide variety

of such processes may exist. In this case the correlation between prominence and

information could have emerged in three ways:

1. Phonological processes that preferentially reduce informative segments in non-

prominent positions. Duration and deletion studies such as the ones presented

in chapter 2 have not found such effects.

2. Phonological processes that preferentially reduce uninformative segments in

prominent positions. Again, duration and deletion studies found no such ef-

fects.

3. Word, root and affix-selecting processes prefer forms that fulfill the requirement

that highly informative segments would appear in stressed syllables.


All three alternatives require that linguistic (phonology) or psycholinguistic (lex-

ical access) processes be sensitive to the amount of information linguistic elements

hold. Similarly, all three accounts involve sensitivity to perceptual prominence, in

this case syllable stress. Both the sensitivity to information and the sensitivity to
perceptual prominence contribute to an ongoing discussion on the role information
theoretic factors play in language production.

In recent years there has been mounting evidence that predictability and fre-

quency affect the duration of linguistic elements in language production. Jurafsky

et al. (2001) showed that frequent words tend to have shorter duration when pre-

dictable in context, and Bell et al. (2009) have shown that frequent content words

tend to have shorter duration than infrequent content words. Similar effects have

been demonstrated below the level of the word, for syllables (Aylett and Turk, 2004),

morphemes (Pluymaekers et al., 2005) and intervocalic segments (van Son and van

Santen, 2005). Information theoretic effects are not limited to duration and were also

shown to affect the omission of linguistic elements at the level of syntactic planning

such as the case of that-omission (Levy and Jaeger, 2007; Jaeger, 2010). The studies

presented in chapter 2 extend this line of research.

However, the source of the information-theoretic effects is under dispute. Some

studies such as Aylett and Turk (2004), van Son and van Santen (2005) and Levy

and Jaeger (2007) (among others) support a view in which the amount of information

an element holds is correlated with its duration in order to improve communication.

Elements that hold little information require less time to transmit and can therefore

be reduced to improve the information rate. On the other hand, elements that hold a

lot of information may require longer to process and should therefore be provided with

longer duration in order to guarantee transmission. In essence, these accounts make

two claims. Linguistic performance is affected by the amount of information linguistic

elements hold, and there is a communicative goal to such effects (communication is

improved). Other accounts propose a different view.

Bell et al. (2009) argue that the reduced duration of frequent words emerges

from the faster access times these words have. They compare the longer duration

of infrequent words to the elongation of word duration when the following context

(or word) is not available (Fox Tree and Clark, 1997; Ferreira and Dell, 2000). Since


infrequent words take longer to access, the articulatory planning that will lead to

their articulation is slowed down and the words end up having longer duration. Bell

et al.’s account differs significantly from the communication-efficiency accounts as it

opposes both the claim that it is the amount of information these words hold that

affects their duration and the claim that the longer duration’s purpose is to improve

communication. Similarly, Bybee and Hopper (2001) view the reduction of duration

of frequent linguistic elements as a byproduct of practice, and therefore implicitly

reject the view of frequency as information and of shortening as means to improve

communication.

Some accounts may accept only one of the two premises. Jaeger (2010, pp. 50–

51) proposes that even if one rejects the idea that speakers attempt to transmit

information efficiently, the flow of information in their mind is still subject to the

same principles, and requires more time to allow more information to be transmitted

in the speakers’ mind. More informative linguistic elements therefore take longer

to process and as a byproduct they take longer to articulate. On the other hand,
availability accounts such as Fox Tree and Clark (1997) assume that speakers attempt
to remain fluent (that is, communicative) without referring to information theoretic
effects. Therefore, Fox Tree and Clark assume communication-based biases, but not
necessarily information-based biases.

The results presented in this study require that language be sensitive to the

amount of information linguistic elements hold and to perceptual prominence (in

this case stress). Practice (Bybee and Hopper, 2001) or availability-based accounts

for the duration of words (Bell et al., 2009) do not predict sensitivity to the amount of

information linguistic elements hold. Similarly, the advantage of placing highly infor-

mative elements in perceptually prominent positions is not something that speaker-

internal mechanisms would benefit from. Therefore, the preference for placing highly

informative linguistic elements in stressed syllables shows that at least for some phe-

nomena, it is not possible to do without sensitivity to information theoretic factors,

nor without appealing to communication. This is not to say that in other domains

information theoretic-like effects cannot emerge through mechanisms that have little

to do with the amount of information linguistic elements hold or without appealing to

communication between speakers. However, excluding communication or information


theoretic effects from the range of possible solutions cannot be justified on the basis

of parsimony, as such motivations are necessary.

4.5 Conclusion

In this study I showed that in three languages, the lexicon and usage patterns of

each language make use of the perceptual prominence of stressed syllables to ease

the transmission of highly informative segments across the communication channel.

Only one phonological factor was as consistent in promoting appearance in stressed

syllables as the information theoretic factors. The limited interaction between stress

and the onsets of stressed syllables has been observed in previous research (Smith,

2002; Giavazzi, 2010), but it is not clear why even among the few factors that were

assumed to be influenced by the position of stress only one was persistently affected

by the location of stress. Information is therefore shown to affect phonology in ways

that conventional phonological factors cannot explain.

The findings also demonstrate that information theoretic factors and their ef-

fect on communication are necessary to explain linguistic phenomena, and that such

effects cannot be reduced to other psycholinguistic, cognitive and articulatory fac-

tors. Previous chapters have shown that information-theoretic considerations affect

performance-based phenomena (Aylett and Turk, 2004; van Son and van Santen,

2005) and license phonological processes (chapter 3). This chapter shows that the

amount of information linguistic elements hold affects the phonotactics of the lan-

guage through the lexicon and usage patterns of the lexicon, complementing previous

research (Piantadosi et al., 2009, 2011).

Chapter 5

Predicting segment distribution universals

5.1 Introduction

Zipf (1935, III: phonemes) shows that across a wide variety of languages (with a

few exceptions) complex segments are less frequent than their simple counterparts:

voiced stops are less frequent than voiceless stops and aspirated stops are less frequent

than unaspirated stops. Zipf argues for a negative correlation between frequency and

complexity. The more complex a segment is, the less frequent it would be in a human

language. The observed correlation is well-motivated by the principle of least effort

(Zipf, 1949) – if segment complexity corresponds to its articulatory effort, simpler

segments should be more frequent than complex ones.

However, taken to the extreme, the prediction that simple segments should be more
frequent than complex ones predicts languages in which no complex segments exist.
This prediction is wrong, as none of the languages in Zipf’s survey has lost its complex
segments or reduced them to infinitesimal frequencies. Zipf (1935, III.3.f) therefore

stipulates a set of constraints – upper thresholds of toleration for each segment. A

segment that becomes more frequent than its upper threshold of toleration would be-

gin to weaken. This set of stipulated constraints is necessary to describe the observed

data, but is not motivated except by the sensible requirement that languages main-

tain a minimal number of distinctions. The requirement for maintaining a minimal


number of distinctions can be met with simpler (though empirically inadequate) stip-

ulations such as a single upper limit on the frequency of any segment. It is therefore

necessary to understand what motivates the multiple upper thresholds in Zipf (1935).

An unconstrained principle of least effort predicts a language with no complex
segments. Information theory (Shannon, 1948) makes the opposite, and equally wrong,
prediction. If language efficiency is measured by the ability to transmit a given
amount of information using as few segments as possible (that is, to have a maximal
information rate), the most efficient language encoding would have a uniform
distribution of segments. In the most efficient language every segment would be as
frequent as any other segment. As Zipf (1935) shows, this is not the case, as languages
have skewed distributions of segments. From the information theoretic point of view it
is necessary to understand what keeps human languages from having a uniform
distribution of segments.

In this chapter I propose that languages maximize the ratio between the expected

amount of information per segment (the entropy of the language) and the expected

amount of markedness (effort) per segment.¹ This prediction is a simple corollary

of MULE, as described in chapter 3: languages attempt to maximize the amount

of information while minimizing the amount of markedness. This proposal correctly

predicts that languages would not lose their complex segments and will not have a

uniform distribution of segments either.

The proposal that languages maximize the ratio between the expected amount of
information per segment and the expected amount of effort correctly predicts the
skewed distributions of segment frequency in Zipf (1935) without stipulating upper
thresholds on the frequency of segments. More importantly, it provides a powerful
tool to address the inverse question: what is the relative markedness of similar
segments? In chapter 3 I stipulated that segments that exist in fewer languages are

more marked or effortful than segments that exist in more languages (an aggregate of

binary distinctions), a method which is not unlike the one used in Ohala (1981) who

matches articulatory and perceptual phonetic observations with the frequency of the

absence of segments in the world’s languages (Sherman, 1975). If my prediction that

languages maximize the ratio between information and markedness holds, it would

¹ I use the word expected in the statistical sense: an average of alternative outcomes, weighted by their probability. If f(x_1) = 8, f(x_2) = 4 and Pr(x_1) = 0.25, Pr(x_2) = 0.75, then the expected value of all f(x_i), E[f(x_i)], is 0.25 · 8 + 0.75 · 4 = 2 + 3 = 5.


be possible to assess the relative markedness of segments by observing their relative

frequencies in languages in which they do exist (an aggregate of proportions).

One prediction that follows from the proposed account is that the relative markedness
of /t/ is lower than that of /k/, which is lower than that of /p/. I show that the
frequency ranking of the three segments, P(t) > P(k) > P(p), holds cross-linguistically,
in agreement with the number of languages in which each segment exists (Sherman, 1975;
Maddieson, 1984). I further refine my prediction by showing that markedness should
factor in not only articulatory effort, but also perceptual confusability.

5.2 Consistent asymmetries between complex and simple segments

As Zipf (1935, III) points out, it is not trivial to measure the articulatory effort that
the pronunciation of segments requires. However, for some pairs of segments, it can
be argued that one segment requires roughly the same articulatory gesture as the
other segment plus an additional gesture. In this view both /p/ and /b/ require the
speaker to stop the flow of air using the lips, let pressure build and then release it.
/b/ also requires the speaker to cause the vocal folds to vibrate, which /p/ does not.
Zipf calls the segment that requires the additional gesture complex and the other
one simple. He identifies two contrasts of this kind among stops: voiced vs. voiceless
and aspirated vs. unaspirated. Voiced stops require voicing: /b,d,g/ are the same as
/p,t,k/ except for the vibration of the vocal folds. Aspirated stops require aspiration
to follow the stop: /pʰ,tʰ,kʰ/ are the same as /p,t,k/ except that they require the
speaker to delay the voice onset time (VOT) of the following vowel. Zipf’s distinction
between complex and simple segments is phonemic. He is aware that the exact phonetic
correlate of complex and simple phonemes can vary across environments.

Zipf (1935) goes on to show that across several languages, and with few excep-

tions, complex segments are less frequent than their simple counterparts. He demon-

strates that for all segments that allow contrast between aspirated and unaspirated

obstruents in Mandarin Chinese spoken in Beijing (Peipingese Chinese in Zipf 1935),

Danish and Cantonese Chinese, unaspirated (simple) obstruents are more frequent


than their aspirated (complex) counterparts. Burmese has a three-way contrast
between aspirated fortes, unaspirated fortes and lenes stops, and Zipf shows a similar
pattern between its unaspirated and aspirated fortes stops, with a single exception:
P(kʰ) > P(k). Zipf’s observations are not independent in the statistical sense: he
took multiple samples from each language, and many of the languages in his sample
are related (though they are not mutually intelligible). Overall, Zipf tests 17 cases of
relative frequencies, and in all but one he correctly predicts that aspirated obstruents
would be less frequent than their unaspirated counterparts.

Zipf’s test of the contrast between voiced and voiceless stops uses eleven
Indo-European languages as well as Hungarian. Using phonemically transcribed,
phonetically transcribed and alphabetic data, he shows that in Czech, Dutch, French,
Italian, English, Hungarian, Bulgarian, Russian, Spanish, Greek, Latin and Vedic
Sanskrit, the voiceless stops /p,t,k/ are more frequent than their voiced counterparts
/b,d,g/, with two exceptions: in Spanish P(d) > P(t) and in Hungarian P(b) > P(p).²
In this case too the observations were not statistically independent from one another.
Most of the languages in the sample were related to one another, and multiple samples
were taken from each language. Zipf’s prediction held in 34 out of 36 comparisons.

The observed relative frequencies are attributed to a negative correlation between

frequency and complexity: the more complex segments are, the less frequent they will

be. This pattern would seem to emerge from the principle of least effort (Zipf, 1949):

if more complex is treated as more effortful, effort-avoidance would lead to a desire

to avoid effortful sounds. However, Zipf is well aware that without counter-balancing

effort avoidance, very different languages would be predicted – languages in which no

complex segment exists. In order to avoid that outcome, Zipf (1935, III.3.f) stipulates

that languages must have for each segment an upper threshold of toleration that puts

a limit on the maximal frequency that segment may have, so that a sufficient number

of segments continue to exist in the language. A segment that passes that threshold

would begin to weaken.³ Zipf stipulates that each segment’s different upper threshold

matches the negative correlation between complexity and frequency.

² In the data collected for chapter 3 using the Spanish CALLHOME Lexicon (Garrett et al., 1996), /t/ was more frequent than /d/, following Zipf’s prediction.

³ This is the Zipf (1935) version of MULE, as presented in chapter 3. His version attempts to predict both weakening and fortition. In contrast, the Zipf (1929) version of weakening follows from frequent use and does not have similar predictions. It only predicts the weakening of frequent segments.


Multiple different upper thresholds are justified by the need to ensure that a
sufficient number of contrasts continues to exist in every language. But in order to
maintain the number of contrasts, simpler stipulations would have sufficed. A single
uniform upper threshold beyond which the relative frequency of any single segment
cannot rise would have forced each language to maintain distinctions (the lower the
threshold, the greater the number of distinctions). However, a single upper threshold
would have yielded languages in which the frequency of all simple segments is close
to the stipulated upper threshold. This is not the case. In every language that Zipf
tested, /t/ was more frequent than /k/, even though both are simple.⁴ Individual upper
thresholds are therefore required to correctly describe the data that Zipf provides,
but they are not motivated.

Unlike Zipf’s prediction of skewed distributions of segments, information theory

(Shannon, 1948) predicts that languages would be most efficient if every segment were

equally probable. This prediction follows from the definition of entropy. A uniform

distribution of segments would yield for a given number of segments a language with

maximal entropy. The entropy of a language is the expected (average) amount of

information every sign in the language contains. In a language that has lower entropy,

every sign provides on average less information, and more signs are needed to transmit

a given amount of information than in a language that has higher entropy. Since

deviating from uniform distribution of segments yields lower entropy, languages in

which the distribution of segments is skewed (is not uniform) would be less efficient.
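The relation between skew, entropy and message length can be checked numerically. The following is a minimal sketch under stated assumptions: the two four-segment inventories are hypothetical and chosen only for illustration, and the message length is approximated as the number of bits divided by the entropy per segment.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_message_length(bits, probs):
    """Approximate number of segments needed to carry `bits` of information."""
    return bits / entropy_bits(probs)


# Two hypothetical four-segment inventories (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.125, 0.125]

if __name__ == "__main__":
    for label, dist in (("uniform", uniform), ("skewed", skewed)):
        h = entropy_bits(dist)
        n = expected_message_length(100, dist)
        print(f"{label}: {h:.2f} bits/segment, {n:.1f} segments per 100 bits")
```

With these values the uniform inventory carries 2 bits per segment and the skewed one 1.75 bits, so the skewed inventory needs more segments to transmit the same amount of information.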

But every language in Zipf (1935, III) does have a skewed distribution of phonemes.
If efficiency plays a role in shaping human language, it is necessary to explain why
the distribution of segments in languages is not uniform. Any functional pressure
on language would have caused a gradual convergence to a more efficient form. Even

if languages start off having non-uniform distributions, they are expected to slowly

move in the direction of uniform distribution of segments (using the definition of ef-

ficient communication used in Piantadosi et al. 2011). However, even though related

languages may undergo separate change processes, languages maintain the asymme-

tries Zipf predicts despite undergoing many processes of phonological and phonetic

change. It seems that maximizing entropy is not a functional pressure on languages.

⁴ In Zipf’s data Cantonese has more /k/ than /t/ than /p/. However, Zipf relies on a dictionary in which /kʷ/ is regarded as a /k/ followed by a /w/, rather than a labialized dorsal.


If languages do yield to functional pressures over time, and if the functional
pressures can be shown to be motivated neither by the minimization of complexity nor
by the maximization of language entropy, the question is what languages optimize. I
will attempt to answer this question in the following section.

5.3 Solution: maximizing information per effort

In chapter 3, I introduced MULE, and showed that the balance between effort and in-

formation utility predicts the actuation of weakening processes in language. A simple

corollary of MULE provides a solution to the question above. I propose that lan-

guages maximize the ratio between their entropy and their expected articulatory and

perceptual effort (their markedness). This proposition follows from two observations.

First, the information theoretic prediction that all segments would be equally proba-

ble relies on the assumption that transmitting every segment is equally effortful. This

assumption does not hold in human language. Some segments require more effort to

produce, and some segments are more difficult to tell apart from other segments (Ste-

riade, 2008; Flemming, 2004, among others). Second, the principle of least effort has

to be grounded in the communicative goal of human language – the transmission of

information. Given that language is used to communicate and transmit information,

the principle of least effort has to be interpreted as the least amount of effort to

transmit a message of an arbitrary amount of information in the language.

The proposal that languages maximize the ratio between the information speakers

transmit and the markedness of their transmission (the effort required to transmit

information) predicts the skewed distributions of complex and simple segments in

Zipf (1935) without appealing to upper thresholds. Consider a language of linguistic

elements e, such that every linguistic element is assigned some markedness value which

is always greater than zero, markedness(e) > 0. I will consider three alternatives:

maximizing the entropy of the language, minimizing the expected markedness of the

language, and maximizing the ratio between the entropy of the language and the

expected markedness of the language, as proposed in this chapter. The languages can

optimize these different goals by changing the probability of each element Pr(e).
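One way to see why the ratio is the natural quantity to optimize is the following sketch. It rests on a simplifying assumption that the text does not spell out, namely that a long message is built from elements drawn independently with probabilities Pr(e). The symbols H, M, N and I are labels introduced here for illustration.

```latex
\begin{align*}
H &= -\sum_{e} \Pr(e)\,\log_2 \Pr(e), \qquad
M = \sum_{e} \Pr(e)\,\mathrm{markedness}(e), \\
N(I) &\approx \frac{I}{H}
  && \text{(elements needed to carry $I$ bits)}, \\
\mathrm{Effort}(I) &\approx N(I)\cdot M \;=\; I \cdot \frac{M}{H}.
\end{align*}
```

For a fixed amount of information I, minimizing the total effort is therefore equivalent to maximizing H/M, the ratio between entropy and expected markedness.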


Languages that maximize the entropy of the language (5.1) have a uniform dis-

tribution of segments. They do require fewer elements to transmit a given amount of

information on average, but they overuse marked or effortful elements. The overall

amount of markedness that is required to transmit a given amount of information is

too high.

(5.1)  $-\sum_{e} \Pr(e)\,\log_2 \Pr(e)$

Languages that minimize the frequency of marked elements (5.2) do not use highly
marked elements at all. As a result they have fewer contrasts, and significantly lower
entropy. These result in significantly longer messages.⁵ Even though such languages
use fewer marked or effortful elements, the product of the greater number of
elements and the lower amount of markedness per element is still high.

(5.2)  $\sum_{e} \Pr(e)\,\mathrm{markedness}(e)$

Languages that maximize the ratio between the entropy of the language and the

expected amount of markedness per element (5.3) do not have a uniform distribution

of elements, and therefore their entropy is lower than that of the languages in (5.1).

They use unmarked segments more than marked ones, but the increase in message

length pays off in the reduced amount of markedness per segment. In such languages

the correlation between markedness and information is positive. The more marked

an element is, the more information it carries. In frequency terms, more information

means lower frequency, as Zipf predicts.

(5.3) $\dfrac{-\sum_{e} \Pr(e) \log_2 \Pr(e)}{\sum_{e} \Pr(e)\,\mathrm{markedness}(e)}$
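The trade-off among the three objectives can be illustrated numerically. The following is a minimal sketch, not the author's implementation: the four markedness values are hypothetical, and the closed form Pr(e) = 2^(-lam * markedness(e)) is the standard solution for maximizing entropy per unit cost, used here as an assumption about how (5.3) could be optimized.

```python
import math

def entropy(probs):
    """Entropy in bits of a probability distribution, as in (5.1)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_markedness(probs, markedness):
    """Expected markedness per element, as in (5.2)."""
    return sum(p * m for p, m in zip(probs, markedness))

def ratio(probs, markedness):
    """Bits transmitted per unit of markedness, as in (5.3)."""
    return entropy(probs) / expected_markedness(probs, markedness)

def ratio_maximizing_distribution(markedness, tol=1e-12):
    """Return Pr(e) = 2 ** (-lam * markedness(e)), with lam chosen by
    bisection so that the probabilities sum to 1 (assumes at least two
    elements, all with markedness > 0)."""
    lo, hi = 0.0, 1.0
    while sum(2 ** (-hi * m) for m in markedness) > 1:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(2 ** (-mid * m) for m in markedness) > 1:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return [2 ** (-lam * m) for m in markedness]

# Hypothetical markedness values for four elements (illustrative only).
markedness = [1.0, 1.5, 2.0, 4.0]
uniform = [0.25] * 4                 # maximizes entropy, as in (5.1)
skewed = [0.97, 0.01, 0.01, 0.01]    # nearly avoids marked elements
best = ratio_maximizing_distribution(markedness)

for name, probs in [("uniform", uniform), ("skewed", skewed), ("best", best)]:
    print(name, round(ratio(probs, markedness), 3))
```

Under these made-up values, the ratio-maximizing distribution gives lower probability to more marked elements, and its ratio is higher than that of both the uniform and the markedness-avoiding distributions.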

There are two crucial differences between this proposal and the one in Zipf (1935).

The first is that Zipf (1935) relies on complexity, rather than articulatory and percep-

tual phonetic properties of each sound. Zipf (1935) attempts to avoid the question of

5 Half the entropy results in messages that are twice as long for a given amount of information.


effort, as he could not measure the articulatory and perceptual effort associated with

using each segment. While it is still true that the articulatory effort and perceptual

confusability of segments cannot be measured, phonetic theory has evolved and does

provide some insight into what makes some segments more difficult to pronounce or

perceive. For instance, Ohala (1981) shows that the articulatory and perceptual prop-

erties of different stops match the number of times they are absent from phonological

systems (Sherman, 1975). It is more difficult to maintain voicing in dorsal positions,

and indeed /g/ is absent from more systems than /d/ and /b/. Similarly, /p/ has

lower amplitude than other voiceless stops, and is predictably absent from more sys-

tems than /t/ and /k/. Articulatory and perceptual difficulties have therefore been

related to the distribution of sounds in the world’s languages.6

The second difference between the current proposal and the account in Zipf (1935)

is that the current proposal does not stipulate upper thresholds on the frequencies of

segments. Those should emerge directly from the articulatory and perceptual phonetic

properties of each sound and its relationship to other sounds (its confusability with

other sounds, for instance).

It is important to note that, unlike the previous chapters, this chapter measures

the amount of information provided by each segment using its frequency in the

language (its negative log probability or uni-phone score) rather than its informativity.

This decision follows from necessity. First, the data in Zipf (1935) contains segment

frequencies, not more elaborate information-theoretic measurements. Second, calcu-

lating segment informativity requires the use of corpora, preferably spoken corpora,

whereas for some languages only a limited amount of written data is available.

In order to explore the differences between the two approaches, I conducted two

language surveys using existing corpora. The first survey attempts to tease apart

complexity and effort by replicating Zipf’s survey of voiced and voiceless oral stops.

The second survey focuses on voiceless stops – all simple in Zipf (1935) – to show

how markedness hierarchies emerge from the in-language frequencies of segments in

the several languages.

6 On the other hand, complexity does seem to correspond to the formal definition of markedness in the form of constraints such as ∗Voiced. Most formal systems of markedness agree with Zipf’s definition of complexity by not having constraints of the form ∗/g/, ∗/p/ (Prince and Smolensky, 1993; de Lacy, 2002). In formal markedness systems the markedness of specific segments ought to follow from the conjunction of simpler markedness terms such as ∗Voiced:∗Dorsal.


5.4 Survey 1: effort or complexity?

Introduction What do speakers minimize, effort or complexity? Ohala (1981)

presents asymmetries among oral stops: labial /p/ is less audible than other

voiceless stops, and the voicing of dorsal /g/ is more difficult to maintain than the

voicing of other voiced stops. Languages such as Classical Arabic in which both

/p/ and /g/ are absent from the oral stop inventory are not rare. Sherman (1975)

shows that among voiceless stops /p/ is absent from more languages than any other

voiceless stop, and similarly /g/ is absent from more languages than any other voiced

stop. Though cross-linguistically more languages have voiceless stops than voiced

stops (Maddieson, 1984), it is not trivial that Zipf’s characterization of complexity

would be sufficient to describe the data. At the very least the contrast in frequencies

between /b/ and /p/ is expected to be smaller than the difference between other

voiced and voiceless stops.

Additionally, Zipf (1935) tested his prediction that complex segments would be less

frequent than simple segments using several related languages (mostly Indo-European

languages). The only exception, Hungarian, had the relative ratio between /p/ and

/b/ reversed, with P(b)>P(p). The exclusion of unrelated languages makes it difficult

to claim that the similarity between the languages is due to functional pressures

rather than retained similarity that is due to a common origin and shared vocabulary.

Another goal of this survey is therefore to replicate Zipf’s survey using languages that

do not share an origin or as much vocabulary.

Methods and materials I test the difference in ratios between /p,t,k/ and /b,d,g/

separately across several languages, which share very few lexical items, and have quite

different grammars.

1. Japanese, using the CALLHOME Japanese Lexicon (Kobayashi et al., 1996).

Gemination was treated as segment lengthening (for consonants). Palatalized

stops were counted with their non-palatalized counterparts. Voiceless affricates

were counted as allophones of /t/.

2. Spanish, using the CALLHOME Spanish Lexicon (Garrett et al., 1996).

3. Hungarian, using data from Zipf (1935).


4. Biblical Hebrew, using character frequency in the book of Genesis. Gemination

was treated as segment lengthening, rather than as two identical segments. The

distinctions between stop and spirantized variants of /p,t,k/ and /b,d,g/ were

collapsed together.

5. Indonesian, using character frequency in the book of Genesis.7 Digraphs were

treated as single characters.

6. Haitian Creole, using character frequency in the book of Genesis.8 Digraphs

and trigraphs were treated as single characters.

Results The segment probabilities are in Table 5.1. In all six languages, /t/ was

more frequent than /d/ and /k/ was more frequent than /g/, as predicted by Zipf

(1935). If we assume no prior knowledge (each direction is equally probable), the

probability that six (mostly independent) languages would all have more /t/ than /d/

is 0.0156. Having all six languages have more /k/ than /g/ is equally unlikely.

However, Japanese, Hungarian, Biblical Hebrew and Indonesian all had more /b/

than /p/.9 Zipf (1935) correctly predicts that voiceless /t/ and /k/ would be more

frequent than /d/ and /g/ respectively, but fails to predict that /p/ would not be

more frequent than /b/. Another pattern emerges as well. The ratio between the

frequencies of /k/ and /g/ is greater in every language except Biblical Hebrew than

the ratio between the frequencies of /t/ and /d/ (p > 0.1).
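The two probabilities in this paragraph can be checked with a short calculation. This is a sketch under an assumption: the source does not name the test behind p > 0.1, so a one-sided sign test with each direction equally probable is assumed here.

```python
from math import comb

def sign_test_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# All six languages agree in direction (/t/ > /d/, and likewise /k/ > /g/).
all_six = sign_test_upper_tail(6, 6)

# Five of six languages have a greater /k/:/g/ ratio than /t/:/d/ ratio.
five_of_six = sign_test_upper_tail(5, 6)

print(round(all_six, 4))      # 0.0156
print(round(five_of_six, 4))  # 0.1094
```

Under that assumption, five agreeing languages out of six fall just short of the 0.1 level, consistent with the reported p > 0.1.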

Discussion Zipf (1935) correctly predicts that languages would have consistently

skewed distributions. In all the languages in his survey and in this replication of

his survey, consistent hierarchies between /t/ and /d/ and between /k/ and /g/

were found. However, Zipf’s characterization of segments’ skewed distributions is

apparently imprecise. Had the additional gesture of obstruent voicing been the cause

for preference of voiceless /t/ and /k/ over voiced /d/ and /g/, a similar pattern

would have emerged for /p/ and /b/, but this is not the case. /b/ is more frequent

than /p/ in four out of the six languages used in the survey.

7 http://bibledatabase.net/, Indonesian new translation, retrieved August 2012.

8 http://bibledatabase.net/, retrieved August 2012.

9 In Biblical Hebrew the same hierarchy of frequencies held regardless of whether spirantized stops were included with their non-continuant counterparts.


Table 5.1: Voiceless and voiced segment probabilities

Japanese segment probabilities

          Voiceless   Voiced   Ratio

Labial    0.0009      0.0026   0.3

Coronal   0.0591      0.0248   2.4

Dorsal    0.0578      0.0099   5.9

Spanish segment probabilities


Hungarian segment probabilities


Biblical Hebrew segment probabilities


Indonesian segment probabilities


Haitian Creole segment probabilities


The data does support a view of a negative correlation between segment prob-

ability and effort. Ohala (1981) characterizes /p/ as a less preferred voiceless stop

due to perceptual reasons. This dispreference is evident in the absence of /p/ from

sound systems in the world’s languages (Sherman, 1975). The results of this survey

show that even in sound systems in which all six segments exist, the ratio between

the relative frequency of /p/ and /b/ would be smaller than the ratio between the

frequencies of /t/ and /d/ and between /k/ and /g/. Phonetic effort also predicts

that /g/ would be dispreferred, since it is difficult to maintain voicing in dorsal

positions. Indeed, in five out of six languages the ratio between /k/ and /g/ was

greater than the ratio between /t/ and /d/, though this trend is not significant (p >

0.1).


5.5 Survey 2: voiceless stops

Introduction A striking fact about the results of the previous survey is that

with the exception of Indonesian, all languages had more /t/ than /k/ than /p/:

P(t)>P(k)>P(p). Indonesian had a different ranking, P(k)>P(t)>P(p). In the Zipf

(1935) survey Mandarin Chinese, Danish, Burmese (lenes), Czech, Dutch, French,

Italian, English, Bulgarian, Russian, Greek and Latin all have a P(t)>P(k)>P(p) ranking.

The only exception is Vedic Sanskrit, with P(t)>P(p)>P(k). The frequent pattern

P(t)>P(k)>P(p) matches the number of languages in which one of the voiceless stops

is missing (Sherman, 1975). However, it is not predicted by the account presented in

Zipf (1935). Zipf provides no reason to treat any of the voiceless stops as more com-

plex than the other stops. Complexity therefore cannot cause the P(t)>P(k)>P(p)

order to be more frequent than the other orders.

On the other hand, MULE does predict that /t/ would be more frequent than /k/

and /p/ cross-linguistically, since /k/ and /p/ are considered more marked in phonol-

ogy, and MULE predicts a negative correlation between frequency and markedness.

Furthermore, since MULE bases markedness on phonetic grounds, /k/ is predicted to

be more frequent than /p/ on phonetic grounds. The goal of this survey is therefore

to test whether the prediction that MULE makes holds cross-linguistically.

Methods and materials I use the six languages from the previous survey, as well

as Mandarin Chinese, using the data from Zipf (1935), verified using the CALLHOME

Mandarin Chinese Lexicon (Huang et al., 1996). Mandarin Chinese has a

P(t)>P(k)>P(p) order in both data sources.

There are six possible permutations for ordering /p/, /t/ and /k/. The rankings

are themselves ordered by similarity. The frequent permutation in all the languages mentioned in the

introduction to this survey, P(t)>P(k)>P(p), is more closely related to the exception

rankings P(k)>P(t)>P(p) and P(t)>P(p)>P(k) than to the three other, unattested

rankings, as the exceptions require a single reranking of the relative orders, and not two or

three. This relationship is demonstrated in Figure 5.1.
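The notion of distance between rankings used here can be sketched as a count of segment pairs whose relative order differs from the reference order. The function name and data layout below are illustrative choices, not taken from the source.

```python
from itertools import combinations, permutations

def wrong_orders(ranking, reference):
    """Count pairs of segments ordered differently in ranking and reference."""
    pos = {seg: i for i, seg in enumerate(ranking)}
    ref = {seg: i for i, seg in enumerate(reference)}
    return sum((pos[a] < pos[b]) != (ref[a] < ref[b])
               for a, b in combinations(reference, 2))

reference = ("t", "k", "p")  # the frequent order P(t)>P(k)>P(p)
distances = {order: wrong_orders(order, reference)
             for order in permutations(reference)}

for order, d in sorted(distances.items(), key=lambda item: item[1]):
    print(" > ".join(f"/{seg}/" for seg in order), d)
```

The two attested exception rankings each differ from the reference in exactly one pair, while the three unattested rankings differ in two or three pairs.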

Is it a coincidence that the observed rankings emerge? Can they be randomly

generated in a system in which each of the three voiceless stops can be more or less

frequent than any other voiceless stop? I used Kendall’s W (Kendall’s coefficient of


Figure 5.1: Distance of total orders from /t/>/k/>/p/

/t/ > /k/ > /p/: no wrong orders (widely attested)

/k/ > /t/ > /p/, /t/ > /p/ > /k/: one wrong order (attested)

/p/ > /t/ > /k/, /k/ > /p/ > /t/: two wrong orders (not attested)

/p/ > /k/ > /t/: three wrong orders (not attested)

concordance), which provides a score between 0 and 1 to a set of rankings such that 0

means that the rankings are completely independent from one another, and 1 means

that all the rankings were exactly the same. To test the significance of Kendall’s W,

I created 10 million random samples of seven rankings each, and scored each sample using

Kendall’s W. I then calculated Kendall’s W for the seven languages in the survey –

six languages whose ranking was P(t)>P(k)>P(p), and one language whose ranking

was P(k)>P(t)>P(p) (Indonesian). I calculated the p-value as the proportion of

random samples whose Kendall’s W was greater than or equal to the observed one. I used R (R Development Core

Team, 2012) and R’s package concord (Lemon and Fellows, 2007).

Results Kendall’s W score for the seven languages was 0.878. Out of the 10 million

random samples of seven orderings, 0.031% had equal or greater Kendall’s W, which

places the p-value of getting the observed rankings by chance at p < 0.001 with a

sample of only seven languages.

Had one of the P(t)>P(k)>P(p) languages been replaced by a P(t)>P(p)>P(k)

language (e.g., by replacing Spanish with Vedic Sanskrit), Kendall’s W would have

been 0.735, with only 0.36% of the 10 million samples having an equal or greater

Kendall’s W. The p-value in this sample of languages would have been p < 0.01.
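The procedure can be sketched in a few lines. Note two departures from the method described above, both assumptions of this sketch: it is written in Python rather than R's concord package, and instead of drawing 10 million random samples it enumerates all 6^7 equally likely samples of seven rankings, which is feasible at this size. Because W is a monotone function of S for a fixed number of rankings and items, the tail is counted on S.

```python
from itertools import permutations, product

SEGMENTS = ("p", "t", "k")

def to_ranks(order):
    """Map an order such as ('t', 'k', 'p') to ranks for (p, t, k); 1 = most frequent."""
    return tuple(order.index(seg) + 1 for seg in SEGMENTS)

def squared_deviation(rank_rows):
    """S: summed squared deviation of the per-segment rank sums from their mean."""
    mean = len(rank_rows) * (len(SEGMENTS) + 1) / 2
    return sum((sum(col) - mean) ** 2 for col in zip(*rank_rows))

def kendalls_w(orders):
    """Kendall's W = 12 S / (m^2 (n^3 - n)) for m untied rankings of n items."""
    m, n = len(orders), len(SEGMENTS)
    s = squared_deviation([to_ranks(o) for o in orders])
    return 12 * s / (m ** 2 * (n ** 3 - n))

def exact_tail(orders):
    """Count equally likely samples whose W is at least the observed W."""
    observed = squared_deviation([to_ranks(o) for o in orders])
    all_ranks = [to_ranks(o) for o in permutations(SEGMENTS)]
    hits = sum(squared_deviation(sample) >= observed
               for sample in product(all_ranks, repeat=len(orders)))
    return hits, len(all_ranks) ** len(orders)

# Six languages with P(t)>P(k)>P(p) and one with P(k)>P(t)>P(p) (Indonesian).
survey = [("t", "k", "p")] * 6 + [("k", "t", "p")]
# The alternative sample: one P(t)>P(k)>P(p) language replaced by P(t)>P(p)>P(k).
with_sanskrit = [("t", "k", "p")] * 5 + [("k", "t", "p"), ("t", "p", "k")]

hits, total = exact_tail(survey)
print(f"W = {kendalls_w(survey):.3f}, p = {hits}/{total} = {hits / total:.5f}")
```

Exhaustive enumeration gives 90 of 279,936 samples (about 0.032%) for the observed survey and 1,014 (about 0.36%) for the alternative sample, in line with the sampled estimates reported above.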

Discussion The orders of the frequencies between labial, coronal and dorsal voice-

less stops cannot be random. Even with small and diverse samples of languages, a


strong preference for particular orderings appears.

The recurrent orders of frequencies cannot be explained by the minimization of

complexity (Zipf, 1935). Without stipulating arbitrary segment-specific upper thresh-

olds of toleration, this theory would not predict that some simple segments would be

consistently more frequent than other simple segments. However, a negative correlation

between frequency and effort does predict the observed patterns.

5.6 Markedness as effort

5.6.1 Missing pieces in the puzzle

Survey 1 showed that phonetic grounding of markedness predicts the negative correlation

between markedness and frequency better than the notion of complexity. Survey

2 showed that some voiceless stops tend to be less frequent than other voiceless stops,

in agreement with their absence from segment inventories in the world’s languages.

Two questions remain. First, MULE is based on ranking phonological markedness us-

ing articulatory and perceptual effort. Therefore, when irregular patterns emerge, it

would be beneficial if the irregularity could be explained on phonetic grounds. Second,

both Zipf’s approach and this study rely on languages’ ability to optimize segment

inventories over time. Piantadosi et al. (2011) showed that languages shape their

usage patterns and lexicons to correlate informativity with word length. I showed in

chapter 4 that languages are able to optimize their lexicons and usage patterns in

order to place highly informative segments in perceptually prominent positions. Is

there evidence that such changes are possible in response to the relationship between

segment frequency and phonetic effort?

In the following sections I focus on two of the irregular orders found in the

previous section. I attempt to answer what causes languages to deviate from a

P(t)>P(k)>P(p) order, and examine whether languages can optimize their inven-

tories to have particular orders. Finally, I try to answer what forms of phonetic effort

determine the observed patterns.


5.6.2 Sanskrit stop frequencies

In Zipf’s data, Sanskrit has a P(t)>P(p)>P(k) ranking. Is that order a stable one?

What causes Sanskrit to deviate from the frequent P(t)>P(k)>P(p) order? The

Digital Corpus of Sanskrit (DCS) provides counts for the different types of syllables

of Sanskrit and their distribution over time.10 I processed the data to yield the

different distributions of labial, dental, retroflex, palatal and guttural (velar) voiceless

unaspirated stops over time in (5.4). The unusual P(t)>P(p)>P(k) order only holds

in the early and epic periods. In classical, medieval and late Sanskrit the more

frequent P(t)>P(k)>P(p) order is found.11

(5.4) Sanskrit voiceless unaspirated stop frequencies

            p        t̪        ʈ       c        k

early       14870    37738    1694    6773     9972

epic        209583   566986   29491   134040   194392

classical   131327   334920   21212   75075    135034

medieval    83316    231220   18176   47541    106276

late        51709    137705   10715   29688    63852

The change from a P(t)>P(p)>P(k) order to the more common P(t)>P(k)>P(p)

order shows that languages can and do change segment frequencies over time. But

what caused the less frequent order to surface in the first place?
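The change of order can be read directly off the counts in (5.4). The sketch below copies the labial, dental and velar counts from the table (with "t" standing for the dental stop, as in the cross-linguistic comparison) and sorts them within each period; the variable and function names are illustrative.

```python
# Counts from (5.4); "t" is the dental stop used for the comparison.
counts = {
    "early":     {"p": 14870,  "t": 37738,  "k": 9972},
    "epic":      {"p": 209583, "t": 566986, "k": 194392},
    "classical": {"p": 131327, "t": 334920, "k": 135034},
    "medieval":  {"p": 83316,  "t": 231220, "k": 106276},
    "late":      {"p": 51709,  "t": 137705, "k": 63852},
}

def frequency_order(period_counts):
    """Return the segments sorted from most to least frequent."""
    return tuple(sorted(period_counts, key=period_counts.get, reverse=True))

orders = {period: frequency_order(c) for period, c in counts.items()}

for period, order in orders.items():
    k_to_p = counts[period]["k"] / counts[period]["p"]
    print(f"{period:10s}", " > ".join(f"P({seg})" for seg in order),
          f"k/p = {k_to_p:.2f}")
```

The k/p ratio printed for each period shows /k/ overtaking /p/ between the epic and classical periods.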

Whitney (1879, II.42–44) claims that the palatal series, which is pronounced

today as alveolo-palatal affricates, was originally derived from the velar series and

was pronounced as true palatals.12 If so, the early stages of Sanskrit may represent a

transient period in which the frequency of /k/ still suffered from the phonemic split

between /k/ and /c/, which was later corrected by the language.

10 http://kjc-fs-cluster.kjc.uni-heidelberg.de/dcs/data/syllables/syllables.htm, retrieved January 2012.

11 This analysis uses the dental /t̪/ rather than the retroflex /ʈ/ for the cross-linguistic comparison since it is more frequent than the retroflex /ʈ/. Additionally, Zipf’s comparison of the frequencies of /t/-sounds was always based on dental and alveolar /t/s.

12 One important support for this claim is that in the database used above, the palatal nasal /ñ/ is followed by non-palatals exactly three times (across all periods included in the database). In contrast, /ñ/ is followed by a palatal stop 18,063 times. Nasal assimilation to following alveolo-palatal affricates would be to the alveolar stop part of the affricate, a /t/, rather than to the palatal release.


Another explanation is also possible. At its early stages, Sanskrit had two dorsal

series – palatal and velar, but only one labial series. Effort in MULE is related both to

articulatory effort and perceptual confusability. If listeners had to tell apart different

dorsals, it would increase the effort that is associated with faithfully transmitting

dorsals. Labials in such a language would be less effortful relative to dorsals. This

explanation predicts that in Hindi, in which there are two series of coronal stops

and one series of dorsal stops, the effort of /t/ might be relatively higher than it

usually is, possibly higher than that of /k/. This prediction is borne out. In Hindi a

P(k)>P(t)>P(p) order is found (5.5), as /t̪/ is less frequent than /k/.13

(5.5) Hindi character probabilities

/p/ 0.0266

/t̪/ 0.0289

/ʈ/ 0.0057

/t͡ʃ/ 0.0116

/k/ 0.0714

If we take Hindi to be a later stage of Sanskrit, the various stages of the change of

ranking from P(t)>P(p)>P(k) through P(t)>P(k)>P(p) and finally to P(k)>P(t)>P(p)

demonstrate two aspects of the correspondence between effort and segment frequency.

First, languages can and do change the frequencies of segments in response to the effort

associated with the articulation and confusability of segments. Second, contrary to Zipf’s

assumptions, perceptual difficulty, and not only articulatory complexity (or effort),

motivates the change in segment frequencies.

5.6.3 Indonesian stop frequencies

The data from Sanskrit in the previous section showed how segment frequencies may

change in response to phonetic change. Similarly, Indonesian has a P(k)>P(t)>P(p)

ranking, which deviates from the frequent P(t)>P(k)>P(p) ranking. Is it possible to

account for this ranking?

13 Using character counts from http://www.sttmedia.com/characterfrequency-hindi, retrieved January 2012.


One way to analyze this deviation is to adopt the approach used in the previous

section for Sanskrit. Indonesian has only one series of stops in each place of articu-

lation – labial, coronal and dorsal. If Indonesian originally had an additional series

of coronal stops, the need to avoid confusion between the two series of coronals may

have caused the frequency of /t/ to drop. While there is evidence that may support

this hypothesis, I will not pursue it here, as that hypothetical series does not exist

today.14 Instead, I will try to understand whether the over-representation of /k/ is

stable, or whether the language is currently undergoing processes that may eventually

restore a P(t)>P(k)>P(p) ranking.

Chapter 3 suggests that phonological processes may follow from segments requiring

too much effort with respect to the information they provide. If so, Indonesian may

provide evidence for /k/-weakening processes. Such a process does exist. O’Brien

(2012) shows, based on data from Lapoliwa (1981), that Indonesian has a /k/→[ʔ] process in

coda positions, one that does not apply to other stops in the same position. The

existence of this process can be predicted by MULE if the P(t)>P(k)>P(p) order is

the expected order in languages in which there is a single series of labials, coronals and

dorsals.

5.7 Conclusion

In this chapter I argued that the frequencies of segments in human language are

negatively correlated with their effort. The more effort their articulation requires and

the more confusable they are with other segments, the less frequent they will be.

This correlation is not predicted from effort avoidance alone, nor from information-theoretic

constraints alone. Instead, I argue that what languages minimize is the

amount of effort that is required to transmit a message of an arbitrary length; equivalently, they maximize the

ratio between the entropy of the language and its markedness.

This proposal has significant predictive power. It predicts the relationship between

the perceptual confusability of segments and their frequency as in the case of Sanskrit

14 Indonesian has three series of stops: labial, alveolar and velar. In addition, it has a fourth retroflex nasal /ɳ/, as well as voiceless and voiced alveolo-palatal affricates. If the affricates and the retroflex /ɳ/ were once a part of a retroflex series of stops, then like Hindi, that stage of Indonesian had two series of coronal stops.


and Hindi. Expanding on chapter 3, it predicts the propensity of segments to weaken

when they are too frequent as in the case of Indonesian.

The use of this proposal has implications for the study of markedness, as it provides

a tool that allows linguists to compare the markedness of segments without the use

of hundreds of languages, by using the frequencies of segments in the languages in

which they do appear. Given both cross-linguistic frequencies and a great number

of languages, this proposal may allow us to approximate not only the ranking of the

markedness of segments, but also the actual value of the markedness of a segment in

a system with a given set of phonemes.

Chapter 6

Conclusions

Finding the dividing line between what keeps languages similar to one another and

different from one another is one of the key challenges of linguistics. It is some-

times the case that it is the differences between languages that reveal the similarities.

This is the case in this thesis. The challenge in chapters 2 and 3 was to understand

language-specific patterns – the reason languages deviate from cross-linguistic tenden-

cies. Chapter 2 attempted to understand why a language such as American English

would so carefully preserve segments that are less likely to be found in the sound

inventories of the world’s languages, such as /p/ and /g/, while reducing sounds that

are almost never absent from sound systems, such as /t/. Chapter 3 investigated

the reason behind parallel weakening processes of /t/ in English and of /q/ in Ara-

bic. The answer in both cases was that speakers are willing to put in more effort to

guarantee the transmission of higher amounts of information.

MULE assumes speakers have two goals: transmit information, and do so using as

little effort as possible. This observation allows MULE to move from language-specific

patterns to predicting cross-linguistic patterns. Chapter 4 showed that languages

distribute perceptual resources as they distribute effort – highly informative sounds

are more likely to appear in the onsets of stressed syllables. In chapter 5 I showed how

MULE predicts the observed cross-linguistic similarity in the frequency of different

segments without any additional stipulations. Thus, the need to explain language-

specific patterns predicts cross-linguistic universals. Both language-specific patterns

and cross-linguistic universals stem from a single linguistic principle.

In MULE I propose a joint treatment for phonetic and information theoretic con-

straints in linguistics. As such, it is easy to extend and build on MULE to provide

explanations for phenomena that require the integration of both factors. Such expla-

nations need not be limited to phonology, as the definition of effort and information

can and should be adapted to other domains.

Bibliography

Adams, Matthew E., Schweitzer, Katrin, and Cohen Priva, Uriel, 2009. Crosslinguis-

tic evidence for phone informativity: a corpus study of German. Talk delivered

at the 83rd Annual Meeting of the Linguistic Society of America, San Francisco,

January 10.

Akaike, Hirotugu. 1974. A new look at the statistical model identification. Institute

of Electrical and Electronics Engineers. Transactions on Automatic Control, 19(6):

716–723.

Al-Nassir, Abdulmunim A. 1993. Sibawayh the phonologist: a critical study of the

phonetic and phonological theory of Sibawayh as presented in his treatise al-Kitab.

Kegan Paul Internat., London [u.a.].

Anttila, Arto. 1997. Deriving variation from grammar: a study of Finnish genitives. In

Hinskens, Frans, Hout, Roeland van, and Wetzels, Leo, editors, Variation, change

and phonological theory, pages 35–68. John Benjamins, Amsterdam.

Anttila, Arto. 2006. Variation and opacity. Natural Language and Linguistic Theory,

24(4):893–944.

Aylett, Matthew and Turk, Alice. 2004. The smooth signal redundancy hypothesis: a

functional explanation for relationships between redundancy, prosodic prominence,

and duration in spontaneous speech. Language and Speech, 47(1):31–56.

Aylett, Matthew and Turk, Alice. 2006. Language redundancy predicts syllabic du-

ration and the spectral characteristics of vocalic syllable nuclei. Acoustical Society

of America Journal, 119:3048–3058.

Baayen, R. H. 2011. languageR: Data sets and functions with "Analyzing Linguistic

Data: A practical introduction to statistics". URL http://cran.r-project.org/package=languageR.

R package version 1.4.

Baayen, R. H., Piepenbrock, R., and Gulikers, L., 1995. The CELEX lexical database

[Release 2].

Baayen, R.H., Davidson, D.J., and Bates, D.M. 2008. Mixed-effects modeling with

crossed random effects for subjects and items. Journal of Memory and Language,

59(4):390–412.

Bates, Douglas, Maechler, Martin, and Bolker, Ben. 2011. lme4: Linear mixed-effects

models using S4 classes. URL http://cran.r-project.org/package=lme4. R

package version 0.999375-42.

Beckman, Jill N. 1998. Positional Faithfulness. PhD thesis, University of Mas-

sachusetts Amherst. ROA 234.

Bell, Alan, Brenier, Jason, Gregory, Michelle, Girand, Cynthia, and Jurafsky, Daniel.

2009. Predictability effects on durations of content and function words in conver-

sational English. Journal of Memory and Language, 60(1):92–111.

Blevins, Juliette. 2004. The Mystery of Austronesian Final Consonant Loss. Oceanic

Linguistics, 43(1):208–213. URL http://www.jstor.org/stable/3623380.

Boersma, Paul. 1998. Functional Phonology. PhD thesis, University of Amsterdam.

Boersma, Paul. 2003. The odds of eternal optimization in Optimality Theory. In Holt,

Eric D., editor, Optimality Theory and Language Change, pages 31–66. Kluwer,

Dordrecht.

Boersma, Paul and Hayes, Bruce. 2001. Empirical tests of the Gradual Learning

Algorithm. Linguistic Inquiry, 32(1):45–86.

Brants, Thorsten and Franz, Alex, 2006. Web 1T 5-gram Corpus [Version 1.1].

Google, Inc.

Brants, Thorsten and Franz, Alex, 2009. Web 1T 5-gram, 10 European Languages

Version 1. Linguistic Data Consortium, Philadelphia.

Bresnan, Joan and Nikitina, Tatiana. 2009. The gradience of the dative alternation.

In Uyechi, Linda and Hee Wee, Lian, editors, Reality exploration and discovery:

pattern interaction in language and life. Festschrift for K.P. Mohanan. CSLI Pub-

lications, Stanford.

Buck, Carl Darling. 1904. A grammar of Oscan and Umbrian: with a collection

of inscriptions and a glossary. Ginn & Company.

http://www.archive.org/details/agrammaroscanan00goog.

Bybee, Joan and Hopper, Paul. 2001. Introduction to frequency and emergence of

linguistic structure. In Bybee, Joan and Hopper, Paul, editors, Frequency and

the Emergence of Linguistic Structure, pages 1–24. John Benjamins Publishing

Company, Amsterdam.

Carter, M. G. 2004. Sibawayhi. I.B. Tauris, London; New York.

Cieri, Christopher, Miller, David, and Walker, Kevin. 2004. The Fisher Corpus: a

Resource for the Next Generations of Speech-to-Text. In Proceedings of the 4th

International Conference on Language Resources and Evaluation, pages 69–71.

Cieri, Christopher, Graff, David, Kimball, Owen, Miller, Dave, and Walker, Kevin,

2005. Fisher English Training Part 2, Transcripts. Linguistic Data Consortium,

Philadelphia.

Cohen Priva, Uriel. 2008. Using information content to predict phone deletion. In

Abner, Natasha and Bishop, Jason, editors, Proceedings of the 27th West Coast

Conference on Formal Linguistics, pages 90–98, Somerville, MA. Cascadilla Pro-

ceedings Project.

Cohen Priva, Uriel. 2010. Constructing typing-time corpora: A new way to answer

old questions. In Ohlsson, S. and Catrambone, R., editors, Proceedings of the 32nd

Annual Conference of the Cognitive Science Society, pages 43–48, Austin, TX.

Cognitive Science Society.

Cohen Priva, Uriel and Jurafsky, Dan, 2008. Phone Information Content Influences

Phone Duration. A poster presented at Prosody08, Cornell University.

http://www.stanford.edu/~urielc/files/Prosody08Poster.pdf.

Dmitrieva, Olga and Anttila, Arto, 2008. The gradient phonotactics of English CVC

syllables. Poster presented at LabPhon11, Wellington, New Zealand, June 30.

Ferreira, Victor S. and Dell, Gary S. 2000. Effect of Ambiguity and Lexical Avail-

ability on Syntactic and Lexical Production. Cognitive Psychology, 40(4):296–340.

Flemming, Edward. 2004. Contrast and perceptual distinctiveness. In Hayes, B.,

Kirchner, R., and Steriade, D., editors, Phonetically-Based Phonology, pages 232–

276. Cambridge University Press. An online version is available at

http://web.mit.edu/flemming/www/paper/CandP13.pdf.

Fox Tree, J. E. and Clark, H. H. 1997. Pronouncing the as thee to signal problems

in speaking. Cognition, 62:151–167.

Garrett, Susan, Morton, Tom, and McLemore, Cynthia, 1996. CALLHOME Spanish

Lexicon. Linguistic Data Consortium, Philadelphia.

Gesenius, Heinrich Friedrich Wilhelm. 1910. Gesenius’ Hebrew Grammar. The

Clarendon Press.

Giavazzi, Maria. 2010. The Phonetics of Metrical Prominence and its Consequences

for Segmental Phonology. PhD thesis, Massachusetts Institute of Technology.

Godfrey, John J. and Holliman, Edward, 1997. Switchboard-1 Release 2. Linguistic

Data Consortium, Philadelphia.

Goldwater, Sharon and Johnson, Mark. 2003. Learning OT constraint rankings using

a maximum entropy model. In Proceedings of the Stockholm workshop on variation

within Optimality Theory, pages 111–120.

Gurevich, Naomi. 2004. Lenition and Contrast: The Functional Consequences of

Certain Phonetically Conditioned Sound Changes. Routledge, New York.

Guy, Gregory. 1991. Explanation in variable phonology: an exponential model of

morphological constraints. Language Variation and Change, 3(1):1–22.

Hahn, Reinhard F. and Ibrahim, Ablahat. 1991. Spoken Uyghur. University of

Washington Press.

Harris, James W. 1969. Spanish Phonology. MIT Press, Cambridge, Mass.

Haspelmath, Martin. 2006. Against markedness (and what to replace it with). Journal

of Linguistics, 42(01):25–70. doi: 10.1017/S0022226705003683.

Haspelmath, Martin. 2008. Creating economical morphosyntactic patterns in lan-

guage change. In Good, Jeff, editor, Linguistic Universals and Language Change,

pages 185–214. Oxford University Press, Oxford.

Hastie, T. J. and Pregibon, D. 1992. Generalized linear models. Wadsworth and

Brooks / Cole.

Hayes, Bruce. 1995. Metrical Stress Theory: Principles and Case Studies. University

of Chicago Press, Chicago.

Hickey, Raymond. 2009. Weak segments in Irish English. In Minkova, Donka, editor,

Phonological weakness in English: from Old to present-day English, pages 116–129.

Palgrave Macmillan, Basingstoke, England; New York.

Hochberg, Judith G. 1986. Functional compensation for /s/ deletion in Puerto Rican

Spanish. Language, 62:609–621.

Hockett, Charles Francis. 1955. A manual of phonology (International Journal of

American Linguistics, 21: 4, Part 1, Memoir 11). Waverly Press, Baltimore.

Horn, Laurence R. 1984. Toward a new taxonomy for pragmatic inference: Q-

based and R-based implicature. In Schiffrin, Deborah, editor, Meaning, form, and

use in context: linguistic applications, pages 11–42. Georgetown University Press,

Washington, D.C.

Huang, Shudong, Bian, Xuejun, Wu, Grace, and McLemore, Cynthia, 1996. CALL-

HOME Mandarin Chinese Lexicon. Linguistic Data Consortium, Philadelphia.

Hume, Elizabeth. 2004. Deconstructing markedness: A predictability-based approach.

In Proceedings of the Berkeley Linguistic Society, volume 30, pages 182–198.

Hume, Elizabeth. 2008. Markedness and the language user. Phonological Studies, 11.

Ito, Junko and Mester, Armin. 2003. On the sources of opacity in OT: coda processes

in German. In Féry, Caroline and van de Vijver, Ruben, editors, The Syllable in

Optimality Theory, pages 271–303. Cambridge University Press.

Jaeger, T. Florian. 2010. Redundancy and reduction: Speakers manage syntac-

tic information density. Cognitive Psychology, 61(1):23–62. ISSN 0010-0285.

doi: 10.1016/j.cogpsych.2010.02.002. URL

http://www.sciencedirect.com/science/article/B6WCR-4YYVCTH-1/2/5ec8d0317cdf485174bb2a87031dd506.

Jurafsky, Daniel, Bell, Alan, Gregory, Michelle L., and Raymond, William D. 2001.

Probabilistic relations between words: Evidence from reduction in lexical produc-

tion. In Bybee, Joan L. and Hopper, Paul, editors, Frequency and the Emergence

of Linguistic Structure, pages 229–254. Benjamins, Amsterdam.

Kahn, Daniel. 1976. Syllable-based Generalizations in English Phonology. PhD thesis,

Massachusetts Institute of Technology.

Kaplan, Abby. 2010. Phonology Shaped by Phonetics: The Case of Intervocalic

Lenition. PhD thesis, University of California Santa Cruz. ROA 1077.

Kaye, Alan S. and Rosenhouse, Judith. 1997. Arabic dialects and Maltese. In Hetzron,

Robert, editor, The Semitic languages, pages 263–311. Routledge, New York.

Kilany, Hanaa, Gadalla, H., Arram, H., Yacoub, A., El-Habashi, A., Shalaby, A.,

Karins, K., Rowson, E., MacIntyre, R., Kingsbury, P., and McLemore, C., 1997.

LDC Egyptian Colloquial Arabic Lexicon. Linguistic Data Consortium, University

of Pennsylvania.

Kiparsky, Paul, 1993. An OT perspective on phonological variation. Handout from

Rutgers Optimality Workshop 1993, also presented at NWAVE 1994, Stanford

University. Available online at

http://www.stanford.edu/~kiparsky/Papers/nwave94.pdf.

Kiparsky, Paul, 1994. Remarks on markedness. Handout of talk presented at TREND-

2. Available online at http://www.stanford.edu/~kiparsky/Papers/trend.pdf.

Kiparsky, Paul. 1995. The phonological basis of sound change. In Goldsmith, John A.,

editor, The Handbook of Phonological Theory, pages 640–670. Blackwell Publishers,

Cambridge, MA.

Kirchner, Robert Martin. 1998. An Effort-Based Approach to Consonant Lenition.

PhD thesis, University of California Los Angeles. ROA 276

http://roa.rutgers.edu/view.php3?roa=276.

Kisseberth, Charles W. 1970. On the functional unity of phonological rules. Linguistic

Inquiry, 1(3):291–306.

Kobayashi, Megumi, Crist, Sean, Kaneko, Masayo, and McLemore, Cynthia, 1996.

CALLHOME Japanese Lexicon. Linguistic Data Consortium, Philadelphia.

Labov, W. 1994. Principles of Linguistic Change: Internal Factors. Wiley-Blackwell.

de Lacy, Paul. 2002. The Formal Expression of Markedness. PhD thesis, University

of Massachusetts Amherst.

Lapoliwa, H. 1981. A Generative Approach to the Phonology of Bahasa Indonesia.

Department of Linguistics, Research School of Pacific Studies, Australian National

University.

Lemon, Jim and Fellows, Ian. 2007. concord: Concordance and reliability. R package

version 1.4-9.

Levenshtein, Vladimir Iosifovich. 1966. Binary codes capable of correcting deletions,

insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.

Levy, Roger and Jaeger, T. Florian. 2007. Speakers optimize information density

through syntactic reduction. In Scholkopf, Bernhard, Platt, John, and Hofmann,

Thomas, editors, Advances in Neural Information Processing Systems (NIPS), vol-

ume 19, pages 849–856, Cambridge, MA. MIT Press.

Lombardi, Linda, 1995. Why place and voice are different. Rutgers Optimality

Archive (ROA) 105.

Maddieson, Ian. 1984. Patterns of sounds. Cambridge studies in speech science and

communication. Cambridge University Press, Cambridge.

Mathisen, Anne Grethe. 1999. Sandwell, West Midlands: Ambiguous perspectives

on gender patterns and models of language change. In Foulkes, Paul and Docherty,

Gerard J., editors, Urban Voices: Accent Studies in the British Isles, pages 107–123.

Arnold Publishers.

McCarthy, John J. 1994. The phonology and phonetics of Semitic pharyngeals.

Phonological structure and phonetic form: papers from Laboratory Phonology III,

pages 191–234.

McCarthy, John J. and Prince, Alan. 1995. Faithfulness and reduplicative identity.

University of Massachusetts Occasional Papers in Linguistics, 18:249–384.

Miller, George A. 1995. WordNet: a lexical database for English. Commun. ACM,

38(11):39–41. ISSN 0001-0782. doi: 10.1145/219717.219748.

O’Brien, Jeremy. 2012. An Experimental Approach to Debuccalization and Supplementary

Gestures. PhD thesis, University of California Santa Cruz.

Ohala, John J. 1981. The origin of sound patterns in vocal tract constraints. In

Macneilage, P., editor, The Production of Speech. Springer Verlag, New York.

Ohala, John J. 2003. Phonetics and historical phonology. In Joseph, Brian D. and

Janda, Richard D., editors, The Handbook of Historical Linguistics, pages 669–686.

Blackwell.

Piantadosi, Steven T., Tily, Harry J., and Gibson, Edward. 2009. The communicative

lexicon hypothesis. In The 31st annual meeting of the Cognitive Science Society

(CogSci09), pages 2582–2587. URL

http://web.mit.edu/piantado/www/papers/piantadosiTilyGibson2009.pdf.

Piantadosi, Steven T., Tily, Harry J., and Gibson, Edward. 2011. Word lengths are

optimized for efficient communication. Proceedings of the National Academy of

Sciences.

Pierrehumbert, Janet. 2001. Exemplar dynamics: Word frequency, lenition and

contrast. In Bybee, Joan and Hopper, Paul, editors, Frequency and the Emergence

of Linguistic Structure, pages 137–157. John Benjamins Publishing Company.

Pitt, M.A., Dilley, L., Johnson, K., Kiesling, S., Raymond, W., Hume, E., and Fosler-

Lussier, E., 2007. Buckeye Corpus of Conversational Speech (2nd release). Depart-

ment of Psychology, Ohio State University.

Pluymaekers, Mark, Ernestus, Mirjam, and Baayen, R. Harald. 2005. Articulatory

planning is continuous and sensitive to informational redundancy. Phonetica, 62:

146–159.

Poplack, Shana, 1980. The Notion of the Plural in Puerto Rican Spanish: Competing

Constraints on (s) Deletion.

Pouplier, Marianne. 2003. The dynamics of error. In Proceedings of the 15th Inter-

national Congress of Phonetic Sciences.

Prince, Alan S. and Smolensky, Paul. 1993. Optimality Theory: Constraint Interac-

tion in Generative Grammar. Blackwell, Malden, MA.

R Development Core Team. 2011. R: A Language and Environment for Statistical

Computing. R Foundation for Statistical Computing, Vienna, Austria. URL

http://www.r-project.org.

R Development Core Team. 2012. R: A Language and Environment for Statistical

Computing. R Foundation for Statistical Computing, Vienna, Austria. URL

http://www.r-project.org.

Hickey, Raymond. 2004. Irish English: Phonology, volume 1 of Varieties of English,

pages 68–97. Mouton de Gruyter, Berlin; New York.

Raymond, William D., Dautricourt, Robin, and Hume, Elizabeth. 2006. Word-medial

/t,d/ deletion in spontaneous speech: Modeling the effects of extra-linguistic, lexi-

cal, and phonological factors. Language Variation and Change, 18.

Shannon, Claude Elwood. 1948. A mathematical theory of communication. The Bell

System Technical Journal, 27:379–423.

Shannon, Claude Elwood. 1951. Prediction and entropy of printed English. Bell

System Technical Journal, 30:50–64.

Sherman, D. 1975. Stop and fricative systems: a discussion of paradigmatic gaps and

the question of language sampling. Working Papers on Language Universals, 17:

1–33.

Smith, Jennifer. 2002. Phonological Augmentation in Prominent Positions. PhD

thesis, UMass Amherst.

Smolensky, Paul. 1993. Harmony, markedness, and phonological activity. Paper

presented at Rutgers Optimality Workshop 1.

Smolensky, Paul, 1995. On the internal structure of constraint component con of

UG. Talk given at UCLA, April 7. ROA 86.

Steriade, Donca. 1997. Phonetics in phonology: the case of laryngeal neutralization.

Ms., UCLA.

Steriade, Donca. 2008. The phonology of perceptibility effects: the P-map and its

consequences for constraint organization. In Hanson, Kristin and Inkelas, Sharon,

editors, The Nature of the Word: Studies in Honor of Paul Kiparsky, pages 151–

180. MIT, Cambridge, Mass.; London.

Surendran, Dinoj and Niyogi, Partha. 2006. Quantifying the functional load of

phonemic oppositions, distinctive features, and suprasegmentals. In Nedergaard

Thomsen, Ole, editor, Competing Models of Linguistic Change: Evolution and Be-

yond. In commemoration of Eugenio Coseriu (1921-2002). Benjamins, Amsterdam

and Philadelphia.

Clark, Urszula. 2004. The English West Midlands: Phonology, volume 1 of Varieties

of English, pages 134–162. Mouton de Gruyter, Berlin; New York.

Son, R. J. J. H. van and Pols, L. C. W. 2003. How efficient is speech? Proceedings

of the Institute of Phonetic Sciences, 25:171–184.

Son, R. J. J. H. van and Santen, J. P. H. van. 2005. Duration and spectral balance of

intervocalic consonants: a case for efficient communication. Speech Communication,

47:100–123.

Venables, W. N. and Ripley, B. D. 2002. Modern Applied Statistics with S. Springer,

4th edition.

Weber, David. 1989. A grammar of Huallaga (Huanuco) Quechua, volume 112 of

University of California publications in linguistics. University of California Press,

Berkeley.

Weide, R., 1998. The CMU Pronunciation Dictionary, release 0.6. Carnegie Mellon

University.

Weinreich, Uriel, Labov, William, and Herzog, Marvin I. 1968. Empirical foundations

for a theory of language change. In Lehmann, Winfred P. and Malkiel, Yakov, ed-

itors, Directions for Historical Linguistics, pages 95–18. University of Texas Press,

Austin.

Whitney, William Dwight. 1879. A Sanskrit grammar; including both the classical lan-

guage, and the older dialects, of Veda and Brahmana. Breitkopf and Hartel, Leipzig.

URL http://www.archive.org/details/sanskritgrammari00whituoft.

Zhao, Yuan and Jurafsky, Dan. 2009. The effect of lexical frequency and Lombard

reflex on tone hyperarticulation. Journal of Phonetics, 37(2):231–247. ISSN 0095-4470.

doi: 10.1016/j.wocn.2009.03.002. URL

http://www.sciencedirect.com/science/article/pii/S0095447009000175.

Zipf, George K. 1935. The Psycho-biology of Language: an Introduction to Dynamic

Philology. Houghton, Mifflin.

Zipf, George Kingsley. 1929. Relative frequency as a determinant of phonetic change.

Harvard Studies in Classical Philology, 15:1–95.

Zipf, George Kingsley. 1949. Human Behavior and the Principle of Least Effort: an

Introduction to Human Ecology. Hafner Publisher Company, New York.

Zue, Victor W. and Laferriere, Martha. 1979. Acoustic study of medial /t,d/ in

American English. The Journal of the Acoustical Society of America, 66(4):1039–

1050. doi: 10.1121/1.383323. URL http://link.aip.org/link/?JAS/66/1039/1.
