
The Modulation Spectrum – Its Role in Sentence and Consonant Identification

Steven Greenberg
Centre for Applied Hearing Research, Technical University of Denmark
Silicon Speech, Santa Venetia, CA USA
http://www.icsi.berkeley.edu/~steveng
[email protected]

Acknowledgements and Thanks

Research Funding
U.S. National Science Foundation
Otto Mønsted Foundation (Denmark)
Danish Research Council
Technical University of Denmark (Torsten Dau)

Research Collaborators
Takayuki Arai
Thomas Christiansen
Rosaria Silipo

The Crux of the Problem ….

Effects of Reverberation on the Speech Signal

Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions

Yet, THE INTELLIGIBILITY OF SPEECH IS REMARKABLY STABLE

Implying that intelligibility is based NOT on fine spectral detail, but rather on some more basic parameter(s) – what might these be?

This presentation examines the origins of word intelligibility in the low-frequency (< 30 Hz) modulation properties of the acoustic speech signal

These modulation patterns, which reflect articulatory movement, are differentially distributed across the acoustic frequency spectrum

Intelligibility Based on Modulation Patterns

The specific configuration of the modulation patterns across the frequency spectrum reflects the essential cues for understanding spoken language

Intelligibility Based on Modulation Patterns

The acoustic frequency spectrum serves as a DISTRIBUTION MEDIUM for the modulation patterns

However, much of the ACOUSTIC SPECTRUM is, in fact, DISPENSABLE (Harvey Fletcher, Jont Allen and others to the contrary)

Intelligibility Based on Modulation Patterns

Structure of the Presentation

This presentation will focus on the following issues

A sparse spectral representation of speech is ….

Sufficient for good intelligibility (though not entirely natural sounding, nor particularly robust in background noise)

Low-frequency modulations below 30 Hz appear to serve as the primary carriers of phonetic information in the speech signal

The role played by different parts of the modulation spectrum is at the outset unclear

The presentation will attempt to elucidate this question, both for sentences (first part) and for consonants (second part)

The distribution of modulation information across the audio-frequency (tonotopic) spectrum is also important (and will be addressed as well)

The perceptual data described may be useful for developing future-generation speech technology (e.g., automatic speech recognition and synthesis) germane to auditory prostheses (e.g., hearing aids and cochlear implants)

An Invariant Property of the Speech Signal?

Houtgast and Steeneken demonstrated that the modulation spectrum, a temporal property, is highly predictive of speech intelligibility

This is significant, as it is difficult to degrade intelligibility through normal spectral distortion (many have tried, few have succeeded …. )

In highly reverberant environments, the modulation spectrum's peak is strongly attenuated and shifts down to ca. 2 Hz, and the speech becomes increasingly difficult to comprehend

[Figure: the modulation spectrum under reverberation; based on an illustration by Hynek Hermansky]

Quantifying Modulation Patterns in Speech

The modulation spectrum provides a quantitative method for computing the amount of modulation in the speech signal

The technique is illustrated for a paradigmatic signal for clarity's sake

The computation is performed for each spectral channel separately
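As a rough illustration of this computation (my own sketch, not the authors' implementation), the example below band-pass filters one spectral channel, extracts its Hilbert envelope, and Fourier-analyzes that envelope; the channel edges and the 30-Hz limit are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def channel_modulation_spectrum(x, fs, band=(700.0, 1400.0), max_mod_hz=30.0):
    """Modulation spectrum of one spectral channel of signal x (sampled at fs Hz)."""
    # 1. Isolate the channel with a band-pass filter (band edges are illustrative)
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    sub_band = sosfiltfilt(sos, x)

    # 2. Extract the channel's temporal envelope
    envelope = np.abs(hilbert(sub_band))

    # 3. Fourier-analyze the (mean-removed) envelope
    spectrum = np.abs(np.fft.rfft(envelope - envelope.mean()))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)

    # 4. Keep only the low modulation frequencies of interest (< 30 Hz here)
    keep = freqs <= max_mod_hz
    return freqs[keep], spectrum[keep]
```

Repeating this for each channel of a filterbank yields the distribution of modulation energy across the acoustic spectrum discussed below.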

The Modulation Spectrum Reflects Syllables

Given the importance of the modulation spectrum for intelligibility, what does it reflect linguistically?

The distribution of syllable duration matches the modulation spectrum, suggesting that the integrity of the syllable is essential for understanding speech

Modulation spectrum of 15 minutes of spontaneous Japanese speech (OGI-TS corpus) compared with the syllable-duration distribution for the same material, with duration plotted as equivalent modulation frequency (Arai and Greenberg, 1997)

A comparable analysis has been performed for (American) English
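The comparison can be reproduced in outline with a few lines of code: syllable durations (e.g., from a hand-labeled corpus) are converted to equivalent modulation frequencies (1/duration) and histogrammed on the same axis as the modulation spectrum. A minimal sketch, with the inputs left hypothetical:

```python
import numpy as np

def durations_as_modulation_frequencies(durations_s, max_hz=30.0):
    """durations_s: syllable durations in seconds (hypothetical input).
    Returns a normalized histogram over equivalent modulation frequency (1/duration)."""
    equivalent_hz = 1.0 / np.asarray(durations_s)   # e.g., a 200-ms syllable -> 5 Hz
    bins = np.arange(1.0, max_hz + 1.0)             # 1-Hz bins up to max_hz
    hist, edges = np.histogram(equivalent_hz, bins=bins, density=True)
    return hist, edges
```

Because typical syllables last roughly 100–300 ms, the resulting distribution peaks in the same few-Hz region as the modulation spectrum.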

Intelligibility Derived from Modulation Patterns

Many perceptual studies emphasize the importance of low-frequency modulation patterns for understanding spoken language

Historically, this was first demonstrated by Homer Dudley in 1939 with what has become known as the VOCODER – modulations higher than 25 Hz can be filtered out without significant impact on intelligibility

As mentioned earlier, Houtgast and Steeneken demonstrated that the low-frequency modulation spectrum is a good predictor of intelligibility in a wide range of acoustic listening environments (1970s and 1980s)

In the mid-1990s, Rob Drullman demonstrated the impact of low-pass filtering the modulation spectrum on intelligibility and segment identification – modulations below 8 Hz appeared to be most important

However, …. all of these studies were performed on broadband speech

There was no attempt to examine the interaction between temporal and spectral factors for coding speech information

(Other studies, such as those by Shannon and associates, have examined spectral-temporal interactions, but not at a fine level of detail)

Intelligibility Studies Using Spectral Slits

The interaction between spectral and temporal information for coding speech information can be examined with some degree of precision using spectral slits

In what follows, the use of the term “spectral” refers to operations and processes in the acoustic frequency (i.e., tonotopic) domain

The term “temporal” or “modulation spectrum” refers to operations and processes that specifically involve low-frequency (< 30 Hz) modulations

First, we’ll examine the impact of extreme band-pass SPECTRAL filtering on intelligibility without consideration of the modulation spectrum

Intelligibility of Sparse Spectral Speech

The spectrum of spoken sentences (TIMIT corpus) can be partitioned into narrow (1/3-octave) channels ("slits") – a construction sketch follows below

In the example below, there are four, one-third-octave slits distributed across the frequency spectrum

The edge of a slit is separated from its nearest neighbor by an octave

No single slit, by itself, is particularly intelligible

The intelligibility associated with any single slit is only 2 to 9%
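A minimal sketch of this stimulus construction (an assumption about how such signals might be generated, not the original stimulus-preparation code); the center frequencies used here are the ones given later for the consonant experiment (330, 875, 2100 and 5400 Hz) and are merely illustrative:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def sparse_spectral_speech(x, fs, center_freqs=(330.0, 875.0, 2100.0, 5400.0)):
    """Sum of 1/3-octave band-pass 'slits' extracted from signal x (sampled at fs Hz)."""
    y = np.zeros(len(x), dtype=float)
    for fc in center_freqs:
        lo = fc / 2 ** (1 / 6)   # lower edge of a 1/3-octave band around fc
        hi = fc * 2 ** (1 / 6)   # upper edge
        sos = butter(6, (lo, hi), btype="bandpass", fs=fs, output="sos")
        y += sosfiltfilt(sos, x)
    return y
```

With centers spaced by a factor of roughly 2.5, the edge of each slit is separated from its nearest neighbor by about an octave, as in the stimuli described above.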

Word Intelligibility - Single Slits

The intelligibility associated with any single slit is only 2 to 9%

The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits

Intelligibility of Sparse Spectral Speech

Two slits, when combined, provide a higher degree of intelligibility, as shown on the following slides

Word Intelligibility - 2 Slits

Intelligibility of Sparse Spectral Speech

Clearly, the degree of intelligibility depends on precisely where the slits are situated in the frequency spectrum, as well as their relationship to each other

Spectrally contiguous slits may (or may not) be more intelligible than those far apart

Slits in the mid-frequency region, corresponding to the signal’s second formant, are the most intelligible of any two-slit combination


Intelligibility of Sparse Spectral Speech

There is a marked improvement in intelligibility when three slits are presented together

Particularly when the slits are spectrally contiguous

Word Intelligibility - 3 Slits

Intelligibility of Sparse Spectral Speech

Four slits combined yield nearly (but not quite) perfect intelligibility

Word Intelligibility - 4 Slits

Intelligibility of Sparse Spectral Speech

This was done intentionally in order that the contribution of each slit could be precisely delineated

Without having to worry about “ceiling” effects for highly intelligible conditions

Modulation Spectrum Across Frequency

The modulation spectrum varies in magnitude across frequency

Modulation Spectrum Across Frequency

The shape of the modulation spectrum is similar for the three lowest slits….

Modulation Spectrum Across Frequency

But the highest frequency slit differs from the rest in exhibiting a far greater amount of energy in the mid modulation frequencies

Raising the prospect that the mid-frequency modulation spectrum (10-30 Hz) may be important under certain conditions

Modulation Spectrum Across Frequency

The high amount of energy in the mid-frequency MODULATION spectrum is typical of material whose ACOUSTIC spectrum is higher than 3 kHz

And does not depend solely on the use of narrow spectral slices

As shown in this sample of OCTAVE-WIDE channels of broadband speech (or broader than an octave for the lowest sub-band)

[Figure: modulation spectra of octave-wide sub-bands; TIMIT corpus, 40 sentences]

Low-pass Modulation Filtering of Slits

The MODULATION SPECTRUM of the spectral slits shown in previous slides can be LOW-PASS FILTERED in order to ascertain the relation between modulation patterns and their spectral affiliation

For simplicity’s sake, either the lowest two (slits 1 + 2) or highest two (3 + 4) slits were low-pass modulation filtered in tandem

Modulation Spectrum Across Frequency

Each sentence presented contained four spectral slits

Baseline performance – 4 slits without modulation filtering – was 87% intelligibility

The modulation spectrum was systematically low-pass filtered between 24 Hz and 3 Hz, in 3-Hz steps for each of the two-slit combinations, without modulation filtering the other two slits in the stimulus
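One plausible way to realize this manipulation (a sketch that assumes the envelope and fine structure are separated with the Hilbert transform; the processing chain actually used may differ) is to low-pass filter a slit's envelope and re-impose it on the slit's fine structure:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def lowpass_modulation(slit, fs, cutoff_hz=12.0):
    """Restrict the modulation spectrum of one spectral slit to frequencies below cutoff_hz."""
    analytic = hilbert(slit)
    envelope = np.abs(analytic)              # temporal envelope of the slit
    carrier = np.cos(np.angle(analytic))     # fine structure with unit envelope

    # Low-pass filter the envelope at the desired modulation cutoff
    sos = butter(4, cutoff_hz, btype="lowpass", fs=fs, output="sos")
    smoothed = np.maximum(sosfiltfilt(sos, envelope), 0.0)

    # Recombine: the slit now carries only modulations below cutoff_hz
    return smoothed * carrier

# e.g., modulation-filter the two low-frequency slits in tandem, leave the others intact:
# stimulus = lowpass_modulation(slit1, fs, 12.0) + lowpass_modulation(slit2, fs, 12.0) + slit3 + slit4
```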

Modulation Spectrum Across Frequency

The general effect of low-pass modulation filtering is similar for both slit pairs

Low-pass filtering below 12 Hz has a significant impact on intelligibility, which is particularly pronounced when the modulation spectrum is restricted to frequencies lower than 6 Hz

However, there is a significant difference in the impact of low-pass modulation filtering depending on whether the slits are in the low or high portion of the ACOUSTIC frequency spectrum

Modulation Spectrum Across Frequency

When the low-pass modulation filtered slits are in the low spectral frequencies (< 1 kHz) there is a progressive decline of intelligibility

Moreover, low-pass modulation filtering above 15 Hz has no significant impact on intelligibility

In contrast, low-pass modulation filtering the high-frequency (>2 kHz) slits does impact intelligibility, even for a low-pass cutoff of 24 Hz

Modulation Spectrum Across Frequency

This result implies that modulation frequencies higher than 24 Hz contribute to intelligibility, but only for the acoustic spectrum above 2 kHz

According to some, only modulation frequencies below 8 Hz contribute to intelligibility

However, recall that these other studies used full bandwidth speech signals

Low-pass filtering the modulation spectrum of such broadband stimuli does not necessarily remove the upper portion of the modulation spectrum

Much of the higher modulation spectrum could have been re-introduced through cross-channel phase distortion (as suggested by Ghitza, 2001)

In addition, the inherent redundancy of the full bandwidth signal makes it difficult to ascertain the specific contribution of each spectral region and modulation frequency

Some other method is required to tease apart the spectro-temporal components of intelligibility


The Story So Far

The low-frequency modulation spectrum is crucial for understanding spoken language

However, it is unclear precisely WHICH parts of the modulation spectrum contribute most heavily, and how much their contribution depends on their acoustic spectral (i.e., tonotopic) affiliation

The details are important for technical exploitation of these ideas (for application in hearing aids, automatic speech recognition and synthesis)

The Next Chapter – Consonant Identification

Because word intelligibility depends ultimately on listeners' ability to decode phonetic information in the acoustic signal

A more fine-grained approach to the spectro-temporal foundations of speech processing may benefit from examining the ability to identify specific consonantal segments

By focusing on consonant identification it is possible to study certain aspects of auditory processing associated with speech understanding in greater detail (and with more precision) than is possible through intelligibility alone

Moreover, it is possible to decompose consonants into more elementary “building blocks” known as articulatory-acoustic (or phonetic) features

This phonetic decomposition into the fundamental phonetic dimensions of “voicing,” “manner of production,” and “place of articulation” provides some interesting insights into the auditory basis of speech processing

A Brief Introduction to Phonetic Features

Three principal articulatory dimensions are distinguished (among others) – VOICING, MANNER and PLACE of articulation

As illustrated for a sample word, “Nap” [n ae p]

In order to correctly identify a consonant, all three principal phonetic feature dimensions need to be decoded correctly (at least in principle)

[Figure: linguistic-tier annotation of the word "nap" (def: seminar activity) – Segments: [n] [ae] [p]; Voicing: voiced, voiced, unvoiced; Manner: nasal, vocalic, stop; Place: alveolar (medial), front, bilabial (front); Prosodic accent: lightly accented]

The Arc's Relation to Phonotactics & Manner

If we return to the basic question – WHY are syllables realized as rises and falls of energy ….

And we make the simple assumption that each manner of production – vowel, fricative, nasal, stop etc. – is associated with a relative energy level

Vowels being highest

Stops and fricatives lowest

With nasals, liquids and glides in between

Then we gain some insight as to why the segments occur in the order they do within the syllable

The Energy Arc's Relation to Syllable Phonotactics

In effect, the segments reflect various manners of production, which are associated with different energy levels

From the perspective of “command and control” the relation between syllable production and the energy arc is automatic and unconscious

Syllables are intrinsically arcs that are readily digested by the auditory system and the brain

This may account for why it is possible to articulate (and perceive) in terms of syllables, but not in terms of isolated phones (unless they are syllables themselves)

The Syllabic Control of Voicing – Significance

The most energetic components of the speech signal are usually voiced

Voicing helps to build up energy in the syllable

Voicing provides implicit structure for the syllable

This structure could be extremely important in decoding the speech signal, particularly in noisy environments

Recall the importance of fundamental-frequency information for separating concurrent talkers or distinguishing speech from a noisy background

Pitch-related cues could only play such an important role if the speech signal is largely voiced


The Relation Between Voicing and Manner

Thus, voicing appears to cut across segmental boundaries

It only APPEARS to be associated with individual segments

Voicing serves to bind the segments into a syllabic whole through its temporal continuity

It is probably not coincidental that 80% (or more) of the speech signal is voiced

And that relatively few manner classes (usually stops, affricates, fricatives) can be realized as unvoiced (except in whispered or exaggerated speech)

Voicing is indirectly related to the energy arc, in that it is associated with the most intense components of the syllable, and is most robust to noise and reverberation

Thus, it is extremely important for decoding speech in noisy environments

Place of Articulation – the Key Dimension

Articulatory place information is important for distinguishing among syllables and words (particularly for consonants)

The distinction among [b], [d] and [g], and [p], [t] and [k] is primarily one of “place,” in that the location of maximum articulatory constriction varies from front to back

[Figure: FRONT – MEDIAL – BACK loci of articulatory constriction]

Generally, there are only three distinct loci of constriction for any single manner class

Hence, the problem of determining articulatory place is greatly simplified if the manner of production is known

Manner-dependent place of articulation classifiers have been successfully applied in automatic phonetic transcription

(e.g., Chang, Wester & Greenberg, 2001, 2005)

Place of Articulation

The formant patterns associated with place of articulation cues vary broadly over frequency and time

When speech is described as “dynamic” it is usually such formant patterns that are meant (this is a little misleading, in that syllable cues are also highly dynamic, but this is a separate story ….)

In low signal-to-noise ratio conditions and among the hearing impaired, place-of-articulation cues are usually among the first to degrade

Place of Articulation

The reasons for this seeming vulnerability are controversial, but can be understood through analysis of data shown on the following slides

In this experiment, nonsense VC and CV (Am. English) syllables were presented to listeners, who were asked to identify the consonant

The syllables were spectrally filtered (in one-third octave bands), so that most of the spectrum was discarded

The proportion of consonants correctly recognized was scored as a function of the number of spectral slits presented and their frequency location, as shown on the next series of slides

The really interesting analysis comes afterwards (so please be patient) ….

Consonant Recognition - Single Slits

[Figure: 1/3-octave-wide slits; labeled frequencies include 330, 875, 2100 and 5400 Hz]

Consonant Recognition - 1 to 5 Slits

[Figure slides: consonant identification accuracy for one through five slits at various spectral locations]

Articulatory-Feature Analysis

The results, as scored in terms of raw consonant identification accuracy, are not particularly insightful (or interesting) in and of themselves

They show that the broader the spectral bandwidth of the slits, the more accurate is consonant recognition

Moreover, a more densely sampled spectrum results in higher recognition

However, we can perform a more detailed analysis by examining the pattern of errors made by listeners

From the confusion matrices we can ascertain precisely WHICH ARTICULATORY FEATURES are affected by the various manipulations imposed

And from this error analysis we can make certain deductions about the distribution of phonetic information across the tonotopic frequency axis – deductions potentially relevant to understanding why speech is most effectively communicated via a broad spectral carrier


The Bottom Line – So Far

The data can also be scored in terms of the proportion of phonetic features correctly decoded


Phonetic Feature Specification – English

In order to understand how this is done (and its significance) it is useful to examine a phonetic feature specification for the consonants involved

Consonant   Voicing   Manner      Place
p           –         Stop        Front
t           –         Stop        Medial
k           –         Stop        Back
b           +         Stop        Front
d           +         Stop        Medial
g           +         Stop        Back
s           –         Fricative   Medial
f           –         Fricative   Front
v           +         Fricative   Front
m           +         Nasal       Front
n           +         Nasal       Medial
y           +         Glide       Front
w*          +         Glide       Back      (w* = +[round])

Perceptual Confusion Matrix – Example

Rows = stimulus, columns = response (counts):

     p   t   k   b   d   g   s   f   v   m   n
p   19  13   1   0   0   0   0   3   0   0   0
t    1  32   2   0   1   0   0   0   0   0   0
k    1   8  27   0   0   0   0   0   0   0   0
b    0   0   0  25  10   0   0   0   0   1   0
d    0   0   0   2  34   0   0   0   0   0   0
g    0   0   1   5   7  22   0   0   1   0   0
s    0   0   0   0   0   0  32   4   0   0   0
f    0   1   0   0   1   0  23  10   1   0   0
v    0   0   0   0   0   1   0   0  27   7   1
m    0   0   0   0   0   0   0   0   1  24   4
n    0   0   0   0   0   0   0   0   8   3  32

The pattern of identification errors can be used to deduce which phonetic features are more robust and which are most vulnerable to distortion


Consonant errors are not random – perceptual confusions are more likely to occur with respect to place of articulation than to voicing or manner

As indicated by the largest off-diagonal entries in the matrix – confusions of place of articulation within the same manner and voicing class (e.g., p→t, b→d, f→s)
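As an illustration of this error analysis (my own sketch, not the analysis code used in the study), a consonant confusion matrix can be collapsed into per-dimension feature confusion matrices using the feature specification table given earlier:

```python
from collections import defaultdict

# Feature specification (from the table above): consonant -> (voicing, manner, place)
FEATURES = {
    "p": ("-", "Stop", "Front"),       "t": ("-", "Stop", "Medial"),     "k": ("-", "Stop", "Back"),
    "b": ("+", "Stop", "Front"),       "d": ("+", "Stop", "Medial"),     "g": ("+", "Stop", "Back"),
    "s": ("-", "Fricative", "Medial"), "f": ("-", "Fricative", "Front"), "v": ("+", "Fricative", "Front"),
    "m": ("+", "Nasal", "Front"),      "n": ("+", "Nasal", "Medial"),
}

def collapse(confusions, dim):
    """confusions: dict[(stimulus, response)] -> count; dim: 0 = voicing, 1 = manner, 2 = place."""
    table = defaultdict(int)
    for (stim, resp), count in confusions.items():
        table[(FEATURES[stim][dim], FEATURES[resp][dim])] += count
    return dict(table)

def percent_correct(table):
    """Proportion of responses whose feature value matches the stimulus feature value."""
    total = sum(table.values())
    correct = sum(n for (s, r), n in table.items() if s == r)
    return 100.0 * correct / total

# e.g., collapse({("p", "t"): 13, ("p", "p"): 19}, dim=2) pools these counts into the
# (Front, Medial) and (Front, Front) cells of a place-of-articulation confusion matrix.
```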

Phonetic-Feature/Consonant Identity – Correlation

Consonant recognition is almost perfectly correlated with place-of-articulation performance

This correlation suggests that PLACE features are based on cues DISTRIBUTED across the entire speech spectrum, in contrast to features such as voicing and rounding, which appear to be extracted from a narrower span of the spectrum

MANNER is also highly correlated with consonant recognition, implying that such features are extracted from a fairly broad portion of the spectrum as well

Let’s Go Danish (for Consonant Identification)

[Figure: Danish consonant identification (percent correct). Unfiltered: single bands (0.75, 1.5 or 3 kHz) 39–43%; two bands (0.75 + 3 kHz) 72%; three bands (0.75 + 1.5 + 3 kHz) 85%. With low-pass modulation filtering (slits marked in magenta were filtered): single-band scores fall to 30–37% at a 24-Hz cutoff and 26–33% at 12 Hz, two-band conditions to 59–68%, and three-band conditions to 80–82%.]

An analogous experiment was performed for 11 spoken Danish consonants

One, two or three spectral bands (each three-quarters of an octave wide) were used

Because the slit bandwidth was much wider than that used for the English consonants, consonant identification accuracy is considerably higher

This was done intentionally for reasons described shortly

Danish Consonant Identification


The objective was to reach nearly (but not quite) perfect consonant recognition with (only) three slits (for reasons described shortly)

While achieving much lower consonant identification with single slits

This was to obtain a reasonably large dynamic range between the single-slit and multiple-slit conditions

This was done intentionally for reasons disclosed in short order

Danish Consonant Identification

The specific reason for structuring the stimuli in this fashion was to parametrically manipulate the modulation spectrum of individual slits

The modulation spectrum of single slits was low-pass filtered between 24 Hz and 5 Hz in order to ascertain the combined effect of spectro-temporal filtering on consonant identification

Slits marked in magenta were low-pass modulation filtered, while those in black were not modulation filtered

The percent correct recognition scores are not of particular interest

Except in one respect …

[Figure: the same Danish consonant conditions with lower low-pass modulation cutoffs (8 Hz and below). Single-band scores fall to 18–24% at an 8-Hz cutoff and 14–18% at 5 Hz; two-band scores to 52–57%; three-band scores to 78% (8 Hz) and 73% (5 Hz). Conditions marked in magenta were low-pass modulation filtered.]

Phonetic Feature Specification – Danish

As with the English consonants, the Danish material can be decomposed into constituent phonetic features

In order to perform a phonetic feature confusion analysis

Consonant   Voicing   Manner      Place
p           –         Stop        Front
t           –         Stop        Medial
k           –         Stop        Back
b           +         Stop        Front
d           +         Stop        Medial
g           +         Stop        Back
s           –         Fricative   Medial
f           –         Fricative   Front
v           +         Fricative   Front
m           +         Nasal       Front
n           +         Nasal       Medial

[Figure: phonetic feature identification (percent correct) plotted against consonant identification (percent correct) for Place, Manner and Voicing. Place is the most strongly correlated with consonant identification (R² = 0.98); the remaining R² values are 0.94 and 0.86.]

Consonant Identification vs. Feature Decoding

Consonant errors are not random – perceptual confusions are more likely to occur for place of articulation than voicing or manner

Consonant recognition is, in fact, nearly perfectly correlated with the decoding of place of articulation information


Phonetic Feature Decoding Accuracy

Consonant errors are not random – perceptual confusions are more likely to occur with respect to place of articulation than to voicing or manner

The specific pattern of confusions can be used to deduce decoding strategies used by listeners

Feature confusion matrices (rows = stimulus, columns = response):

Voicing      Voiced   Unvoiced
Voiced          215        1
Unvoiced          3       77
(99% correct)

Manner       Stop   Fricative   Nasal
Stop          211        4        1
Fricative       3       97        8
Nasal           0        9       63
(94% correct)

Place        Front   Medial   Back
Front          125       53      2
Medial          11      131      2
Back             7       15     50
(77% correct)

Dimension    Percent Correct   IT (bits)
Voicing      99%               0.91
Manner       94%               1.09
Place        77%               0.60

Information transmitted: T(c) = −∑_{i,j} p_{ij} log[(p_i · p_j) / p_{ij}]
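A minimal sketch of the transmitted-information computation defined above, applied to a confusion matrix of counts (base-2 logarithms give bits); this is my own illustration rather than the analysis code used in the study:

```python
import numpy as np

def information_transmitted(counts):
    """counts[i, j]: number of times stimulus i elicited response j. Returns T in bits."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()               # joint probabilities p_ij
    p_i = p.sum(axis=1, keepdims=True)      # stimulus marginals p_i
    p_j = p.sum(axis=0, keepdims=True)      # response marginals p_j
    nonzero = p > 0                         # empty cells contribute nothing (0 log 0 = 0)
    return float(np.sum(p[nonzero] * np.log2(p[nonzero] / (p_i * p_j)[nonzero])))

# Example: the voicing confusion matrix shown above
voicing = [[215, 1], [3, 77]]
print(information_transmitted(voicing))
```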

[Figure: total information transmitted (bits) as a function of modulation low-pass cutoff (unfiltered, 24, 12, 8, 5 Hz) for each stimulus condition (0.75, 1.5 and 3 kHz single slits; 0.75 + 3 kHz two-slit conditions; 0.75 + 1.5 + 3 kHz). The three-slit condition falls from 2.77 to 2.29 bits, the two-slit conditions from ca. 2.4 to 1.6 bits, and the single-slit conditions from ca. 1.1–1.3 bits to ca. 0.4 bits. Slits marked in magenta were low-pass modulation filtered.]

Total Information Transmitted

When total information transmitted is examined (essentially consonant ID), the patterns are systematic

Low pass modulation filtering single slits results in a progressive decline, whereas multi-slit stimuli are much less affected by such filtering


Voicing Information Transmitted

[Figure: voicing information transmitted (bits) by modulation cutoff and stimulus condition. Two- and three-slit conditions remain near 0.85–0.91 bits except at the lowest cutoffs (ca. 0.69 bits); single-slit conditions fall from ca. 0.3–0.5 bits to below 0.1 bits. Slits marked in magenta were low-pass modulation filtered.]

Three slits do not add voicing information beyond what is achieved by two

The greatest decline in IT occurs below 5 Hz (3-kHz slit) and above 12 Hz (0.75-kHz slit)

Notice the relatively slight impact of modulation filtering on decoding of voicing information in two- and three-slit conditions


Manner Information Transmitted

[Figure: manner information transmitted (bits) by modulation cutoff and stimulus condition. Multi-slit conditions decline gradually from ca. 1.1–1.3 bits (unfiltered) to ca. 0.6–1.0 bits at the lowest cutoffs; single-slit conditions fall from ca. 0.5 bits to below 0.1 bits. Slits marked in magenta were low-pass modulation filtered.]

Manner information is integrated in roughly linear fashion across the acoustic frequency spectrum

There is a progressive (but relatively slight) decline in manner decoding with low-pass modulation filtering


Place Information Transmitted

[Figure: place information transmitted (bits) by modulation cutoff and stimulus condition. The three-slit condition falls from 0.86 bits (unfiltered) to ca. 0.5 bits; two-slit conditions fall from ca. 0.6 bits to ca. 0.2 bits; single-slit conditions transmit little place information (≤ 0.18 bits) even without modulation filtering. Slits marked in magenta were low-pass modulation filtered.]

The integration of place information across the acoustic frequency spectrum is highly synergistic (and expansive)

Notice that low-pass modulation filtering has a significant impact on the decoding of place information for the two-slit conditions


Phonetic Feature Information – All Dimensions

[Figure: the Total, Voicing, Manner and Place information-transmitted panels shown together, as a function of modulation low-pass cutoff (unfiltered, 24, 12, 8, 5 Hz) and stimulus condition (0.75, 1.5 and 3 kHz single slits; 0.75 + 3 kHz; 0.75 + 1.5 + 3 kHz). Slits marked in magenta were low-pass modulation filtered.]

Place of articulation exhibits a pattern of cross-spectral integration distinct from voicing and manner – it requires a broader region of the audio and modulation spectrum


Phonetic Features & Modulation Spectrum

Phonetic features vary with respect to their modulation spectral properties

Place is associated with frequencies higher than 8 Hz

Manner is mostly associated with frequencies above 12 Hz and below 8 Hz

Voicing's association with the modulation spectrum is frequency-specific: below 8 Hz for high audio frequencies and above 12 Hz for low audio frequencies

[Figure: observed IT / predicted IT (linear summation) by phonetic dimension (Total, Voicing, Manner, Place), stimulus condition (0.75 + 3 kHz and 0.75 + 1.5 + 3 kHz) and modulation cutoff. Voicing and manner ratios lie near 1 (roughly 0.7–1.4), whereas place ratios are mostly well above 1 (up to ca. 2.4).]

Values larger than 1 indicate greater than linear summation

Cross-channel Synergy (or not)

The degree of cross-channel integration depends on the phonetic dimension

Voicing and manner are quasi-linear with respect to cross-channel integration

Place is highly synergistic in that the amount of information associated with two and three slits is far more than predicted on the basis of linear integration (see the sketch below)

The auditory system performs a spectro-temporal analysis in order to extract phonetic information from the acoustic speech signal
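A trivial sketch of the comparison behind these ratios, under the assumption that "linear summation" means the predicted IT for a multi-slit stimulus equals the sum of the single-slit ITs (my reading of the figure, not a documented definition):

```python
def synergy_ratio(it_multi_slit, it_single_slits):
    """Observed IT for a multi-slit condition divided by the linear-summation prediction."""
    predicted = sum(it_single_slits)
    return it_multi_slit / predicted

# Hypothetical numbers (not necessarily the measured values): a two-slit place IT of
# 0.60 bits, with single-slit place ITs of 0.18 and 0.10 bits, gives a ratio above 2 -
# far more information than the slits transmit separately.
print(synergy_ratio(0.60, [0.18, 0.10]))
```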

Summary

The auditory system performs a spectro-temporal analysis in order to extract phonetic information from the acoustic speech signal

A detailed analysis of the audio (tonotopic) spectrum is not required to understand spoken language

However, a comprehensive sampling of the modulation characteristics of the speech signal across the audio spectrum is required

This is particularly true for place of articulation information, which is crucial for decoding consonant identity

Place of articulation is the only phonetic feature whose information transmission increases expansively across the audio frequency spectrum

Moreover, place is the only dimension to be intensively based on the portion of the modulation spectrum above 8 Hz

The other phonetic dimensions, voicing and manner, are less tied to the modulation spectrum than place

Voicing is associated with low modulation frequencies (and high to a degree) in an audio-frequency-selective manner

Manner is associated with modulation frequencies above 12 Hz

This modulation spectral “division of labor” is consistent with an auditory analysis based on modulation maps (but is also consistent with other interpretations, such as a range of integration time constants)


For Additional Information

Consult the web site:

www.icsi.berkeley.edu/~steveng

Many Thanks for Your Time and Attention

Language – A Syllable-Centric Perspective

An empirically grounded perspective of spoken language focuses on the SYLLABLE and PROSODIC ACCENT as the interface between "sound" and "meaning" (or at least lexical form)

Modes of Analysis

[Figure: modes of analysis for the word "seven" across linguistic tiers – energy, time–frequency, prosodic accent, phonetic interpretation, manner, segmentation and word]

Language – A Syllable-Centric Perspective

A more empirically grounded perspective of spoken language focuses on the SYLLABLE as the interface between "sound," "vision" and "meaning"

Important linguistic information is embedded in the TEMPORAL DYNAMICS of the speech signal (irrespective of the modality)

The Energy Arc

Syllables are characterized by rises and falls in energy (see below, left)

The “energy arc” reflects both production and perception

From production’s perspective, the arc reflects the articulatory cycle from closure to maximally open aperture and back again (in crude terms)

From the ear’s perspective, the energy arc reflects the packaging of information within the temporal limits that the auditory system (and other sensory organs) has evolved to process

This temporal dimension is reflected in the modulation spectrum of spoken language (below, right)

[Figures: spectro-temporal profile (left) and modulation spectrum (right)]

PLACE of articulation is, ironically, the most information-laden articulatory feature dimension in speech, and is inherently TRANS-SEGMENTAL, binding vocalic nuclei with preceding and/or following consonants

It is also the most stable phonetic dimension LINGUISTICALLY, although paradoxically, it is extremely vulnerable to acoustic interference when presented exclusively in the acoustic modality (i.e., without visual cues)

Place of Articulation

[Figure: FRONT – MEDIAL – BACK loci of articulatory constriction]

The Energy Arc and Voicing

Within the traditional framework, voicing is considered a segmental property

A segment is either voiced or not

However, we know that this segmental perspective on voicing is only a crude caricature of the acoustic properties of speech

Many theoretically voiced segments are at least partially unvoiced

For example, in Am. English it is common for [z] to be unvoiced – particularly in syllable-final position in unaccented syllables

The so-called voiced obstruents ([b], [d], [g]) are usually realized as partially unvoiced (this is what voice-onset-time refers to), with various languages differing with respect to the specific values of VOT

This sort of behavior implies that voicing is NOT a segmental feature, but rather one that is under SYLLABIC control and actually reflects prosodic factors (which is WHY languages vary with respect to VOT)

How can this be so?

The Syllabic Control of Voicing

Recall that the core of the syllable – the nucleus – is almost always voiced

The nucleus is usually a vowel and contains the peak energy in the syllable

Voicing spreads from the nucleus forward in time to the coda, as well as backward to the onset

Voicing is continuous in time, and is associated with the higher-energy parts of the syllable

The lower-energy components of the syllable may or may not be voiced

But where the signal is unvoiced, the associated constituents reside in the “tails” of the syllable – the onset and/or coda

It is probably not a coincidence that the most linguistically informative components in speech are NOT associated with voicing
