The Modulation Spectrum – Its Role in Sentence and Consonant
Identification
Steven Greenberg
Centre for Applied Hearing Research, Technical University of Denmark
Silicon Speech, Santa Venetia, CA USA
http://www.icsi.berkeley.edu/
[email protected]
Acknowledgements and Thanks
Research Funding
U.S. National Science Foundation
Otto Mønsted Foundation (Denmark)
Danish Research Council
Technical University of Denmark (Torsten Dau)
Research Collaborators
Takayuki Arai
Thomas Christiansen
Rosaria Silipo
Effects of Reverberation on the Speech Signal
Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions
Yet, THE INTELLIGIBILITY OF SPEECH IS REMARKABLY STABLE
Implying that intelligibility is based NOT on fine spectral detail, but rather on some more basic parameter(s) – what might these be?
This presentation examines the origins of word intelligibility in the low-frequency (< 30 Hz) modulation properties of the acoustic speech signal
These modulation patterns, which reflect articulatory movement, are differentially distributed across the acoustic frequency spectrum
Intelligibility Based on Modulation Patterns
The specific configuration of the modulation patterns across the frequency spectrum reflects the essential cues for understanding spoken language
The acoustic frequency spectrum serves as a DISTRIBUTION MEDIUM for the modulation patterns
However, much of the ACOUSTIC SPECTRUM is, in fact, DISPENSABLE (Harvey Fletcher, Jont Allen and others to the contrary)
Structure of the Presentation
This presentation will focus on the following issues
A sparse spectral representation of speech is sufficient for good intelligibility (though not entirely natural sounding, nor particularly robust in background noise)
Low-frequency modulations below 30 Hz appear to serve as the primary carriers of phonetic information in the speech signal
The role played by different parts of the modulation spectrum is at the outset unclear
The presentation will attempt to elucidate this question, both for sentences (first part) and for consonants (second part)
The distribution of modulation information across the audio-frequency (tonotopic) spectrum is also important (and will be addressed as well)
The perceptual data described may be useful for developing future-generation speech technology (e.g., automatic speech recognition and synthesis) germane to auditory prostheses (e.g., hearing aids and cochlear implants)
An Invariant Property of the Speech Signal?
Houtgast and Steeneken demonstrated that the modulation spectrum, a temporal property, is highly predictive of speech intelligibility
This is significant, as it is difficult to degrade intelligibility through normal spectral distortion (many have tried, few have succeeded …. )
In highly reverberant environments, the peak of the modulation spectrum is strongly attenuated and shifts down to ca. 2 Hz, and the speech becomes increasingly difficult to comprehend
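This attenuation can be made quantitative. For an idealized, exponentially decaying reverberant tail, the Houtgast-Steeneken modulation transfer function describes how modulation depth shrinks with modulation frequency and reverberation time. A minimal sketch (the function name and example values are illustrative, not from the original slides):

```python
import numpy as np

def reverb_mtf(mod_freq_hz, t60_s):
    """Modulation transfer function of an ideal exponentially decaying
    reverberant field (Houtgast & Steeneken):
        m(F) = 1 / sqrt(1 + (2*pi*F*T60 / 13.8)**2)
    Higher modulation frequencies and longer reverberation times are
    attenuated more strongly."""
    return 1.0 / np.sqrt(1.0 + (2 * np.pi * mod_freq_hz * t60_s / 13.8) ** 2)

# Modulations near the ~4-Hz peak of the speech modulation spectrum survive
# mild reverberation but are heavily attenuated once T60 reaches several seconds
for t60 in (0.5, 2.0, 5.0):
    print(t60, reverb_mtf(4.0, t60))
```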
[Figure: modulation spectrum under increasing reverberation, based on an illustration by Hynek Hermansky]
Quantifying Modulation Patterns in Speech
The modulation spectrum provides a quantitative method for computing the amount of modulation in the speech signal
The technique is illustrated for a paradigmatic signal for clarity’s sake
The computation is performed for each spectral channel separately
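A minimal sketch of this per-channel computation (band-pass filter, envelope extraction, spectrum of the envelope); the filter order and normalization here are illustrative assumptions, not the exact analysis used in the studies cited:

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def channel_modulation_spectrum(x, fs, f_lo, f_hi, max_mod_hz=30.0):
    """Modulation spectrum of a single spectral channel: band-pass the
    signal, extract its temporal envelope, then take the spectrum of
    the envelope."""
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=fs, output="sos")
    band = sosfiltfilt(sos, x)
    env = np.abs(hilbert(band))            # temporal (Hilbert) envelope
    env = env - env.mean()                 # discard the DC component
    spectrum = np.abs(np.fft.rfft(env)) / len(env)
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    keep = freqs <= max_mod_hz             # region of interest (< 30 Hz)
    return freqs[keep], spectrum[keep]
```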
The Modulation Spectrum Reflects Syllables
Given the importance of the modulation spectrum for intelligibility, what does it reflect linguistically?
The distribution of syllable duration matches the modulation spectrum, suggesting that the integrity of the syllable is essential for understanding speech
Modulation spectrum of 15 minutes of spontaneous Japanese speech (OGI-TS corpus) compared with the syllable duration distribution for the same material (Arai and Greenberg, 1997)
[Figure: modulation spectrum overlaid with the syllable-duration distribution, duration expressed as modulation frequency]
Comparable analyses have been performed for (American) English
Intelligibility Derived from Modulation Patterns
Many perceptual studies emphasize the importance of low-frequency modulation patterns for understanding spoken language
Historically, this was first demonstrated by Homer Dudley in 1939 with what has become known as the VOCODER – modulations higher than 25 Hz can be filtered out without significant impact on intelligibility
As mentioned earlier, Houtgast and Steeneken demonstrated that the low-frequency modulation spectrum is a good predictor of intelligibility in a wide range of acoustic listening environments (1970s and 1980s)
In the mid-1990s, Rob Drullman demonstrated the impact of low-pass filtering the modulation spectrum on intelligibility and segment identification – modulations below 8 Hz appeared to be most important
However, …. all of these studies were performed on broadband speech
There was no attempt to examine the interaction between temporal and spectral factors for coding speech information
(Other studies, such as those by Shannon and associates, have examined spectral-temporal interactions, but not at a fine level of detail)
Intelligibility Studies Using Spectral Slits
The interaction between spectral and temporal information for coding speech information can be examined with some degree of precision using spectral slits
In what follows, the use of the term “spectral” refers to operations and processes in the acoustic frequency (i.e., tonotopic) domain
The term “temporal” or “modulation spectrum” refers to operations and processes that specifically involve low-frequency (< 30 Hz) modulations
First, we’ll examine the impact of extreme band-pass SPECTRAL filtering on intelligibility without consideration of the modulation spectrum
Intelligibility of Sparse Spectral Speech
The spectrum of spoken sentences (TIMIT corpus) can be partitioned into narrow (1/3-octave) channels (“slits”)
In the example below, there are four one-third-octave slits distributed across the frequency spectrum
The edge of a slit is separated from its nearest neighbor by an octave
No single slit, by itself, is particularly intelligible
The intelligibility associated with any single slit is only 2 to 9%
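A sketch of how such slit stimuli can be constructed. The Butterworth filter design and the example center frequencies are illustrative assumptions; the published stimuli used their own filter specifications:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def third_octave_slit(x, fs, center_hz, order=6):
    """One 1/3-octave-wide spectral 'slit': edges lie 1/6 octave on either
    side of the center frequency. A Butterworth band-pass only approximates
    the steep filters used in the published experiments."""
    edges = [center_hz * 2 ** (-1 / 6), center_hz * 2 ** (1 / 6)]
    sos = butter(order, edges, btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def sparse_spectral_speech(x, fs, centers_hz):
    """Sum of several slits; with four slits, neighboring slit edges can
    be placed roughly an octave apart. Center frequencies are illustrative."""
    return sum(third_octave_slit(x, fs, fc) for fc in centers_hz)

# e.g. y = sparse_spectral_speech(x, fs, centers_hz=(330, 850, 2100, 5300))
```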
Word Intelligibility - Single Slits
The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits
Intelligibility of Sparse Spectral Speech
Two slits, when combined, provide a higher degree of intelligibility, as shown on the following slides
Intelligibility of Sparse Spectral Speech
Clearly, the degree of intelligibility depends on precisely where the slits are situated in the frequency spectrum, as well as their relationship to each other
Spectrally contiguous slits may (or may not) be more intelligible than those far apart
Slits in the mid-frequency region, corresponding to the signal’s second formant, are the most intelligible of any two-slit combination
Intelligibility of Sparse Spectral Speech
There is a marked improvement in intelligibility when three slits are presented together
Particularly when the slits are spectrally contiguous
Intelligibility of Sparse Spectral Speech
Four slits combined yield nearly (but not quite) perfect intelligibility
Intelligibility of Sparse Spectral Speech
This was done intentionally so that the contribution of each slit could be precisely delineated
Without having to worry about “ceiling” effects for highly intelligible conditions
Modulation Spectrum Across Frequency
The shape of the modulation spectrum is similar for the three lowest slits ….
Modulation Spectrum Across Frequency
But the highest-frequency slit differs from the rest in exhibiting a far greater amount of energy in the mid modulation frequencies
Raising the prospect that the mid-frequency modulation spectrum (10-30 Hz) may be important under certain conditions
Modulation Spectrum Across Frequency
The high amount of energy in the mid-frequency MODULATION spectrum is typical of material whose ACOUSTIC spectrum is higher than 3 kHz
And does not depend solely on the use of narrow spectral slices
As shown in this sample of OCTAVE-WIDE channels of broadband speech (or broader than an octave for the lowest sub-band)
[Figure: modulation spectra of octave-wide sub-bands; TIMIT corpus, 40 sentences]
Low-pass Modulation Filtering of Slits
The MODULATION SPECTRUM of the spectral slits shown in previous slides can be LOW-PASS FILTERED in order to ascertain the relation between modulation patterns and their spectral affiliation
For simplicity’s sake, either the lowest two (slits 1 + 2) or highest two (slits 3 + 4) slits were low-pass modulation filtered in tandem
Modulation Spectrum Across Frequency
Each sentence presented contained four spectral slits
Baseline performance – 4 slits without modulation filtering – was 87% intelligibility
The modulation spectrum was systematically low-pass filtered between 24 Hz and 3 Hz, in 3-Hz steps, for each of the two-slit combinations, without modulation filtering the other two slits in the stimulus
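A sketch of low-pass modulation filtering applied to a single slit, in the spirit of the Drullman-style envelope filtering described above. The Hilbert-envelope decomposition and filter order are illustrative assumptions; the published studies used their own filter designs:

```python
import numpy as np
from scipy.signal import butter, hilbert, sosfiltfilt

def lowpass_modulation_filter(slit, fs, cutoff_hz):
    """Low-pass filter the modulation spectrum of one spectral slit:
    decompose it into envelope and fine structure, smooth the envelope,
    and re-impose it on the fine structure."""
    analytic = hilbert(slit)
    env = np.abs(analytic)                    # temporal envelope
    fine = np.cos(np.angle(analytic))         # carrier fine structure
    sos = butter(4, cutoff_hz, btype="lowpass", fs=fs, output="sos")
    env_lp = np.maximum(sosfiltfilt(sos, env), 0.0)  # envelopes are non-negative
    return env_lp * fine

# e.g. process slits 1 + 2 with cutoff_hz between 3 and 24 Hz, leaving 3 + 4 intact
```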
Modulation Spectrum Across Frequency
The general effect of low-pass modulation filtering is similar
Low-pass filtering below 12 Hz has a significant impact on intelligibility, which is particularly pronounced when the modulation spectrum is restricted to frequencies lower than 6 Hz
However, there is a significant difference in the impact of low-pass modulation filtering depending on whether the slits are in the low or high portion of the ACOUSTIC frequency spectrum
Modulation Spectrum Across Frequency
When the low-pass modulation filtered slits are in the low spectral frequencies (< 1 kHz), there is a progressive decline of intelligibility
Moreover, low-pass modulation filtering above 15 Hz has no significant impact on intelligibility
In contrast, low-pass modulation filtering the high-frequency (>2 kHz) slits does impact intelligibility, even for a low-pass cutoff of 24 Hz
Modulation Spectrum Across Frequency
This result implies that modulation frequencies higher than 24 Hz contribute to intelligibility, but only for the acoustic spectrum above 2 kHz
According to some, only modulation frequencies below 8 Hz contribute to intelligibility
However, recall that these other studies used full bandwidth speech signals
Low-pass filtering the modulation spectrum of such broadband stimuli does not necessarily remove the upper portion of the modulation spectrum
Much of the higher modulation spectrum could have been re-introduced through cross-channel phase distortion (as suggested by Ghitza, 2001)
In addition, the inherent redundancy of the full bandwidth signal makes it difficult to ascertain the specific contribution of each spectral region and modulation frequency
Some other method is required to tease apart the spectro-temporal components of intelligibility
The Story So Far
The low-frequency modulation spectrum is crucial for understanding spoken language
However, it is unclear precisely WHICH parts of the modulation spectrum contribute most heavily, and how much their contribution depends on their acoustic spectral (i.e., tonotopic) affiliation
The details are important for technical exploitation of these ideas (for application in hearing aids, automatic speech recognition and synthesis)
The Next Chapter – Consonant Identification
Because word intelligibility ultimately depends on listeners’ ability to decode phonetic information in the acoustic signal, a more fine-grained approach to the spectro-temporal foundations of speech processing may benefit from examining the ability to identify specific consonantal segments
By focusing on consonant identification it is possible to study certain aspects of auditory processing associated with speech understanding in greater detail (and with more precision) than is possible through intelligibility alone
Moreover, it is possible to decompose consonants into more elementary “building blocks” known as articulatory-acoustic (or phonetic) features
This phonetic decomposition into the fundamental phonetic dimensions of “voicing,” “manner of production,” and “place of articulation” provides some interesting insights into the auditory basis of speech processing
A Brief Introduction to Phonetic Features
Three principal articulatory dimensions are distinguished (among others) – VOICING, MANNER and PLACE of articulation
As illustrated for a sample word, “Nap” [n ae p]
In order to correctly identify a consonant, all three principal phonetic feature dimensions need to be decoded correctly (at least in principle)
[Figure: feature decomposition of the word “nap” (def: seminar activity)]

Segment:          [n]                 [ae]      [p]
Voicing:          Voiced              Voiced    Unvoiced
Manner:           Nasal               Vocalic   Stop
Place:            Alveolar (Medial)   Front     Bilabial (Front)
Prosodic Accent:  Lightly Accented
The Arc’s Relation to Phonotactics & Manner
If we return to the basic question – WHY are syllables realized as rises and falls of energy ….
And we make the simple assumption that each manner of production – vowel, fricative, nasal, stop, etc. – is associated with a relative energy level
Vowels being highest
Stops and fricatives lowest
With nasals, liquids and glides in between
Then we gain some insight as to why the segments occur in the order they do within the syllable
The Energy Arc’s Relation to Syllable Phonotactics
In effect, the segments reflect various manners of production, which are associated with different energy levels
From the perspective of “command and control” the relation between syllable production and the energy arc is automatic and unconscious
Syllables are intrinsically arcs that are readily digested by the auditory system and the brain
This may account for why it is possible to articulate (and perceive) in terms of syllables, but not in terms of isolated phones (unless they are syllables themselves)
The Syllabic Control of Voicing – Significance
The most energetic components of the speech signal are usually voiced
Voicing helps to build up energy in the syllable
Voicing provides implicit structure for the syllable
This structure could be extremely important in decoding the speech signal, particularly in noisy environments
Recall the importance of fundamental-frequency information for separating concurrent talkers or distinguishing speech from a noisy background
Pitch-related cues could only play such an important role if the speech signal is largely voiced
The Relation Between Voicing and Manner
Thus, voicing appears to cut across segmental boundaries
It only APPEARS to be associated with individual segments
Voicing serves to bind the segments into a syllabic whole through its temporal continuity
It is probably not coincidental that 80% (or more) of the speech signal is voiced
And that relatively few manner classes (usually stops, affricates, fricatives) can be realized as unvoiced (except in whispered or exaggerated speech)
Voicing is indirectly related to the energy arc, in that it is associated with the most intense components of the syllable, and is most robust to noise and reverberation
Thus, it is extremely important for decoding speech in noisy environments
Place of Articulation – the Key Dimension
Articulatory place information is important for distinguishing among syllables and words (particularly for consonants)
The distinction among [b], [d] and [g], and [p], [t] and [k] is primarily one of “place,” in that the location of maximum articulatory constriction varies from front to back
[Figure: vocal tract loci – FRONT, MEDIAL, BACK]
Generally, there are only three distinct loci of constriction for any single manner class
Hence, the problem of determining articulatory place is greatly simplified if the manner of production is known
Manner-dependent place of articulation classifiers have been successfully applied in automatic phonetic transcription
(e.g., Chang, Wester & Greenberg, 2001, 2005)
Place of Articulation
The formant patterns associated with place of articulation cues vary broadly over frequency and time
When speech is described as “dynamic” it is usually such formant patterns that are meant (this is a little misleading, in that syllable cues are also highly dynamic, but this is a separate story ….)
In low signal-to-noise ratio conditions and among the hearing impaired, place-of-articulation cues are usually among the first to degrade
Place of Articulation
The reasons for this seeming vulnerability are controversial, but can be understood through analysis of data shown on the following slides
In this experiment, nonsense VC and CV (Am. English) syllables were presented to listeners, who were asked to identify the consonant
The syllables were spectrally filtered (in one-third octave bands), so that most of the spectrum was discarded
The proportion of consonants correctly recognized was scored as a function of the number of spectral slits presented and their frequency location, as shown on the next series of slides
The really interesting analysis comes afterwards (so please be patient) ….
Articulatory-Feature Analysis
The results, as scored in terms of raw consonant identification accuracy, are not particularly insightful (or interesting) in and of themselves
They show that the broader the spectral bandwidth of the slits, the more accurate is consonant recognition
Moreover, a more densely sampled spectrum results in higher recognition
However, we can perform a more detailed analysis by examining the pattern of errors made by listeners
From the confusion matrices we can ascertain precisely WHICH ARTICULATORY FEATURES are affected by the various manipulations imposed
And from this error analysis we can make certain deductions about the distribution of phonetic information across the tonotopic frequency axis potentially relevant to understanding why speech is most effectively communicated via a broad spectral carrier
The Bottom Line – So Far
The data can also be scored in terms of the proportion of phonetic features correctly decoded
Phonetic Feature Specification – English
In order to understand how this is done (and its significance) it is useful to examine a phonetic feature specification for the consonants involved

Consonant   Voicing   Manner      Place
p           –         Stop        Front
t           –         Stop        Medial
k           –         Stop        Back
b           +         Stop        Front
d           +         Stop        Medial
g           +         Stop        Back
s           –         Fricative   Medial
f           –         Fricative   Front
v           +         Fricative   Front
m           +         Nasal       Front
n           +         Nasal       Medial
y           +         Glide       Front
w*          +         Glide       Back      (w* = +[round])
Perceptual Confusion Matrix – Example
(rows: stimulus; columns: response)

      p   t   k   b   d   g   s   f   v   m   n
 p   19  13   1   0   0   0   0   3   0   0   0
 t    1  32   2   0   1   0   0   0   0   0   0
 k    1   8  27   0   0   0   0   0   0   0   0
 b    0   0   0  25  10   0   0   0   0   1   0
 d    0   0   0   2  34   0   0   0   0   0   0
 g    0   0   1   5   7  22   0   0   1   0   0
 s    0   0   0   0   0   0  32   4   0   0   0
 f    0   1   0   0   1   0  23  10   1   0   0
 v    0   0   0   0   0   1   0   0  27   7   1
 m    0   0   0   0   0   0   0   0   1  24   4
 n    0   0   0   0   0   0   0   0   8   3  32
The pattern of identification errors can be used to deduce which phonetic features are more robust and which are most vulnerable to distortion
Consonant errors are not random – perceptual confusions are more likely to occur with respect to place of articulation than to voicing or manner
As indicated by the yellow rectangles for common place of articulation confusions (within the same manner and voicing class)
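From such a matrix, feature-level scoring is mechanical: a response is credited on a given dimension whenever stimulus and response share that feature value. A minimal sketch using the feature specification shown above (function and variable names are hypothetical):

```python
import numpy as np

# Feature specification for the 11 consonants, from the table above:
# (voicing, manner, place)
FEATURES = {
    "p": ("-", "Stop", "Front"),       "t": ("-", "Stop", "Medial"),
    "k": ("-", "Stop", "Back"),        "b": ("+", "Stop", "Front"),
    "d": ("+", "Stop", "Medial"),      "g": ("+", "Stop", "Back"),
    "s": ("-", "Fricative", "Medial"), "f": ("-", "Fricative", "Front"),
    "v": ("+", "Fricative", "Front"),  "m": ("+", "Nasal", "Front"),
    "n": ("+", "Nasal", "Medial"),
}
LABELS = list("ptkbdgsfvmn")

def feature_accuracy(confusion, dim):
    """Proportion of trials on which one feature dimension was decoded
    correctly (dim: 0 = voicing, 1 = manner, 2 = place), regardless of
    whether the consonant itself was identified."""
    c = np.asarray(confusion, dtype=float)
    hits = sum(c[i, j]
               for i, s in enumerate(LABELS)
               for j, r in enumerate(LABELS)
               if FEATURES[s][dim] == FEATURES[r][dim])
    return hits / c.sum()
```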
Phonetic-Feature/Consonant Identity – Correlation
Consonant recognition is almost perfectly correlated with place-of-articulation performance
This correlation suggests that PLACE features are based on cues DISTRIBUTED across the entire speech spectrum, in contrast to features such as voicing and rounding, which appear to be extracted from a narrower span of the spectrum
MANNER is also highly correlated with consonant recognition, implying that such features are extracted from a fairly broad portion of the spectrum as well
Let’s Go Danish (for Consonant Identification)
[Figure: Danish consonant identification accuracy by stimulus condition and modulation low-pass filtering; slits marked in magenta in the original were modulation filtered]
Unfiltered: single bands – 39% (0.75 kHz), 43% (1.5 kHz), 42% (3 kHz); two bands (0.75 + 3 kHz) – 72%; three bands (0.75 + 1.5 + 3 kHz) – 85%
< 24 Hz cutoff: single bands – 30%, 37%, 34%; two bands – 67–68% (depending on which slit was filtered); three bands – 82%
< 12 Hz cutoff: single bands – 26%, 33%, 28%; two bands – 59–61%; three bands – 80%
An analogous experiment was performed for 11 spoken Danish consonants
One, two or three spectral bands (each three-quarters of an octave wide) were used
Because the slit bandwidth was much wider than that used for the English consonants, consonant identification accuracy is considerably higher
This was done intentionally for reasons described shortly
Danish Consonant Identification
The objective was to reach nearly (but not quite) perfect consonant recognition with (only) three slits (for reasons described shortly)
While achieving much lower consonant identification with single slits
This was to obtain a reasonably large dynamic range between the single-slit and multiple-slit conditions
Danish Consonant Identification
The specific reason for structuring the stimuli in this fashion was to allow parametric manipulation of the modulation spectrum of individual slits
The modulation spectrum of single slits was low-pass filtered between 24 Hz and 5 Hz in order to ascertain the combined effect of spectro-temporal filtering on consonant identification
Slits marked in magenta were low-pass modulation filtered, while those in black were not modulation filtered
The percent correct recognition scores are not of particular interest
Except in one respect …
[Figure: Danish consonant identification at more severe modulation cutoffs; conditions marked in magenta in the original were low-pass modulation filtered]
< 8 Hz cutoff: single bands – 18% (0.75 kHz), 24% (1.5 kHz), 20% (3 kHz); two bands (0.75 + 3 kHz) – 56–57%; three bands – 78%
< 5 Hz cutoff: single bands – 14%, 18%, 16%; two bands – 52%; three bands – 73%
Phonetic Feature Specification – Danish
As with the English consonants, the Danish material can be decomposed into constituent phonetic features
In order to perform a phonetic feature confusion analysis
Consonant   Voicing   Manner      Place
p           –         Stop        Front
t           –         Stop        Medial
k           –         Stop        Back
b           +         Stop        Front
d           +         Stop        Medial
g           +         Stop        Back
s           –         Fricative   Medial
f           –         Fricative   Front
v           +         Fricative   Front
m           +         Nasal       Front
n           +         Nasal       Medial
[Figure: phonetic feature identification vs. consonant identification (percent correct), with a regression fit per feature dimension – Place: R² = 0.98; Manner: R² = 0.94; Voicing: R² = 0.86]
Consonant Identification vs. Feature Decoding
Consonant errors are not random – perceptual confusions are more likely to occur for place of articulation than voicing or manner
Consonant recognition is, in fact, nearly perfectly correlated with the decoding of place of articulation information
Phonetic Feature Decoding Accuracy
The specific pattern of confusions can be used to deduce the decoding strategies used by listeners
Phonetic feature confusion matrices (rows: stimulus; columns: response):

Voicing – 99% correct
             Voiced   Unvoiced
 Voiced        215        1
 Unvoiced        3       77

Manner – 94% correct
             Stop   Fricative   Nasal
 Stop         211        4        1
 Fricative      3       97        8
 Nasal          0        9       63

Place – 77% correct
             Front   Medial   Back
 Front         125      53      2
 Medial         11     131      2
 Back            7      15     50

Information transmitted (IT) per dimension: Voicing 0.91 bits, Manner 1.09 bits, Place 0.60 bits

Information transmitted: $T(c) = -\sum_{i,j} p_{ij} \log_2 \frac{p_i\, p_j}{p_{ij}}$
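A minimal sketch of the transmitted-information computation from a confusion matrix of raw counts (log base 2, so the result is in bits; the function name is hypothetical, and the exact values on the slides depend on which conditions were pooled):

```python
import numpy as np

def transmitted_information(confusion):
    """T(c) = -sum_ij p_ij * log2(p_i * p_j / p_ij), estimated from a
    stimulus-by-response confusion matrix of raw counts (Miller & Nicely
    style analysis); cells with zero counts contribute nothing."""
    c = np.asarray(confusion, dtype=float)
    p = c / c.sum()                          # joint probabilities p_ij
    pp = p.sum(axis=1, keepdims=True) * p.sum(axis=0, keepdims=True)
    nz = p > 0                               # 0 * log(0) -> 0 by convention
    return float(np.sum(p[nz] * np.log2(p[nz] / pp[nz])))

# e.g. IT for the voicing confusion matrix above, in bits:
# transmitted_information([[215, 1], [3, 77]])
```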
[Figure: total information transmitted (bits) as a function of stimulus condition (0.75, 1.5 and 3 kHz single slits; 0.75 + 3 kHz two-slit pairs; 0.75 + 1.5 + 3 kHz three-slit) and modulation low-pass cutoff (unfiltered, < 24, < 12, < 8, < 5 Hz). Totals fall from ca. 2.8 bits (three slits, unfiltered) to ca. 0.4 bits (single slits, < 5 Hz). Slits marked in magenta were low-pass modulation filtered]
Total Information Transmitted
When total information transmitted is examined (essentially consonant ID), the patterns are systematic
Low-pass modulation filtering single slits results in a progressive decline, whereas multi-slit stimuli are much less affected by such filtering
Voicing Information Transmitted
[Figure: voicing information transmitted (bits) by stimulus condition and modulation cutoff. Two- and three-slit conditions stay between ca. 0.69 and 0.91 bits across cutoffs, whereas single-slit conditions fall from ca. 0.3–0.5 bits (unfiltered) to below 0.1 bits at the most severe cutoffs. Slits marked in magenta were low-pass modulation filtered]
Three slits do not add information beyond what is achieved by two
The greatest decline in IT occurs below 5 Hz (3 kHz slit) and > 12 Hz (0.75 kHz slit)
Notice the relatively slight impact of modulation filtering on decoding of voicing information in two- and three-slit conditions
Manner Information Transmitted
[Figure: manner information transmitted (bits) by stimulus condition and modulation cutoff. Multi-slit conditions decline gradually from ca. 1.3 to 0.6–0.7 bits; single-slit conditions fall from ca. 0.5 bits to below 0.1 bits. Slits marked in magenta were low-pass modulation filtered]
Manner information is integrated in roughly linear fashion across the acoustic frequency spectrum
There is a progressive (but relatively slight) decline in manner decoding with low-pass modulation filtering
Place Information Transmitted
[Figure: place information transmitted (bits) by stimulus condition and modulation cutoff. The three-slit condition declines from 0.86 to ca. 0.5 bits, the two-slit conditions from ca. 0.6 to ca. 0.2 bits, while single-slit conditions never exceed ca. 0.2 bits and approach zero at severe cutoffs. Slits marked in magenta were low-pass modulation filtered]
The integration of place information across the acoustic frequency spectrum is highly synergistic (and expansive)
Notice that low-pass modulation filtering has a significant impact on decoding of place information for the two-slit conditions
Phonetic Feature Information – All Dimensions
[Figure: composite of the four preceding panels – Total, Voicing, Manner and Place information transmitted (bits) across the same stimulus conditions and modulation cutoffs. Slits marked in magenta were low-pass modulation filtered]
Place of articulation exhibits a pattern of cross-spectral integration distinct from voicing and manner – it requires a broader region of the audio and modulation spectrum
Phonetic Features & Modulation Spectrum
Phonetic features vary with respect to their modulation spectral properties
Place is associated with frequencies higher than 8 Hz
Manner is mostly associated with frequencies above 12 Hz and below 8 Hz
Voicing’s association with the modulation spectrum is frequency-specific: below 8 Hz for high audio frequencies and above 12 Hz for low audio frequencies
[Figure: observed IT divided by the IT predicted by linear summation of single-slit ITs, for the 0.75 + 3 kHz and 0.75 + 1.5 + 3 kHz conditions across modulation cutoffs, plotted for Total, Voicing, Manner and Place. Voicing and manner ratios lie near 1 (ca. 0.7–1.4), whereas place ratios range from ca. 1.4 to 2.4. Values larger than 1 indicate greater-than-linear summation]
Cross-channel Synergy (or not)
The degree of cross-channel integration depends on the phonetic dimension
Voicing and manner are quasi-linear with respect to cross-channel integration
Place is highly synergistic in that the amount of information associated with two and three slits is far more than predicted on the basis of linear integration
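The synergy claim can be made concrete as a ratio of observed to predicted IT, where the prediction is the linear sum of the single-slit ITs. A sketch with hypothetical values of the same order as the place data:

```python
def synergy_ratio(it_observed, it_single_slits):
    """Observed IT for a multi-slit stimulus divided by the linear
    prediction, i.e. the sum of the component single-slit ITs.
    Ratios near 1 indicate quasi-linear integration; ratios well
    above 1 indicate super-additive (synergistic) integration."""
    return it_observed / sum(it_single_slits)

# Hypothetical illustration: two slits each carrying ~0.1 bit of place
# information alone, but ~0.43 bits together -> ratio ~ 2.15
# synergy_ratio(0.43, [0.10, 0.10])
```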
Summary
The auditory system performs a spectro-temporal analysis in order to extract phonetic information from the acoustic speech signal
A detailed analysis of the audio (tonotopic) spectrum is not required to understand spoken language
However, a comprehensive sampling of the modulation characteristics of the speech signal across the audio spectrum is essential
This is particularly true for place of articulation information, which is crucial for decoding consonant identity
Place of articulation is the only phonetic feature whose information transmission increases expansively across the audio frequency spectrum
Moreover, place is the only dimension based intensively on the portion of the modulation spectrum above 8 Hz
The other phonetic dimensions, voicing and manner, are less tied to the modulation spectrum than place
Voicing is associated with low modulation frequencies (and, to a degree, high ones) in an audio-frequency-selective manner
Manner is associated with modulation frequencies above 12 Hz
This modulation spectral “division of labor” is consistent with an auditory analysis based on modulation maps (but is also consistent with other interpretations, such as a range of integration time constants)
Language – A Syllable-Centric Perspective
An empirically grounded perspective of spoken language focuses on the SYLLABLE and PROSODIC ACCENT as the interface between “sound” and “meaning” (or at least lexical form)
Modes of Analysis
[Figure: linguistic tiers for the word “seven” – energy, time–frequency, prosodic accent, phonetic interpretation, manner segmentation (fricative, vocalic, nasal), word]
Language - A Syllable-Centric Perspective
A more empirically grounded perspective of spoken language focuses on the SYLLABLE as the interface between “sound,” “vision” and “meaning”
Important linguistic information is embedded in the TEMPORAL DYNAMICS of the speech signal (irrespective of the modality)
The Energy Arc
Syllables are characterized by rises and falls in energy (see below, left)
The “energy arc” reflects both production and perception
From production’s perspective, the arc reflects the articulatory cycle from closure to maximally open aperture and back again (in crude terms)
From the ear’s perspective, the energy arc reflects the packaging of information within the temporal limits that the auditory system (and other sensory organs) has evolved to process
This temporal dimension is reflected in the modulation spectrum of spoken language (below, right)
[Figure: spectro-temporal profile (left) and modulation spectrum (right)]
PLACE of articulation is, ironically, the most information-laden articulatory feature dimension in speech, and is inherently TRANS-SEGMENTAL, binding vocalic nuclei with preceding and/or following consonants
It is also the most stable phonetic dimension LINGUISTICALLY, although paradoxically, it is extremely vulnerable to acoustic interference when presented exclusively in the acoustic modality (i.e., without visual cues)
Place of Articulation
[Figure: vocal tract loci – FRONT, MEDIAL, BACK]
The Energy Arc and Voicing
Within the traditional framework, voicing is considered a segmental property
A segment is either voiced or not
However, we know that this segmental perspective on voicing is only a crude caricature of the acoustic properties of speech
Many theoretically voiced segments are at least partially unvoiced
For example, in Am. English it is common for [z] to be unvoiced – particularly in syllable-final position in unaccented syllables
The so-called voiced obstruents ([b], [d], [g]) are usually realized as partially unvoiced (this is what voice-onset-time refers to), with various languages differing with respect to the specific values of VOT
This sort of behavior implies that voicing is NOT a segmental feature, but rather one that is under SYLLABIC control and actually reflects prosodic factors (which is WHY languages vary with respect to VOT)
How can this be so?
The Syllabic Control of Voicing
Recall that the core of the syllable – the nucleus – is almost always voiced
The nucleus is usually a vowel and contains the peak energy in the syllable
Voicing spreads from the nucleus forward in time to the coda, as well as backward to the onset
Voicing is continuous in time, and is associated with the higher-energy parts of the syllable
The lower-energy components of the syllable may or may not be voiced
But where the signal is unvoiced, the associated constituents reside in the “tails” of the syllable – the onset and/or coda
It is probably not a coincidence that the most linguistically informative components in speech are NOT associated with voicing