Linguistic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems
Steven Greenberg and Shawn Chang
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
ISCA Workshop on Automatic Speech Recognition: Challenges for the New Millennium, Paris, September 18-20, 2000
Acknowledgements and Thanks
• EVALUATION DESIGN SUPPORT
– George Doddington and Jack Godfrey
• ANALYSIS SUPPORT
– Leah Hitchcock, Joy Hollenback and Rosaria Silipo
• SC-LITE SUPPORT
– Jon Fiscus
• FUNDING SUPPORT
– U.S. Department of Defense
• PROSODIC LABELING
– Jeff Good and Leah Hitchcock
• PHONETIC LABELING AND SEGMENTATION
– Candace Cardinal, Rachel Coulston and Colleen Richey
• DATA SUBMISSION
– AT&T
– BBN
– DRAGON SYSTEMS
– CAMBRIDGE UNIVERSITY
– JOHNS HOPKINS UNIVERSITY
– MISSISSIPPI STATE UNIVERSITY
– SRI INTERNATIONAL
– UNIVERSITY OF WASHINGTON
• SWITCHBOARD RECOGNITION SYSTEMS FROM EIGHT SEPARATE SITES WERE EVALUATED WITH RESPECT TO PHONE- AND WORD-LEVEL CLASSIFICATION ON NON-COMPETITIVE DIAGNOSTIC MATERIAL
• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
– Decision-tree analyses support this hypothesis
– Additional analyses are also consistent with this conclusion
• SYLLABLE STRUCTURE AND PROSODIC STRESS ARE ALSO IMPORTANT FACTORS FOR ACCURATE RECOGNITION
– The pattern of errors differs across the syllable (onset, nucleus, coda)
– Stress affects primarily the number of word-deletion errors
• SPEAKING RATE CAN BE USED TO PREDICT RECOGNITION ERROR
– Syllables per second is a far more accurate metric than MRATE (an acoustic measure based on the modulation spectrum)
• ASR SYSTEMS CAN POTENTIALLY BE IMPROVED BY FOCUSING MORE ATTENTION ON PHONETIC CLASSIFICATION, SYLLABLE STRUCTURE AND PROSODIC STRESS
Take Home Messages
• THE EIGHT ASR SYSTEMS WERE EVALUATED USING A 1-HOUR SUBSET OF THE SWITCHBOARD TRANSCRIPTION CORPUS
– This corpus had been hand-labeled at the phone, syllable, word and prosodic stress levels and hand-segmented at the syllabic and word levels. 25% of the material was hand-segmented at the phone level and the remainder quasi-automatically segmented into phonetic segments (and verified)
– The phonetic segments of each site were mapped to a common reference set, enabling a detailed analysis of the phone and word errors for each site that would otherwise be difficult to perform
• THIS EVALUATION REQUIRED THE CONVERSION OF THE ORIGINAL SUBMISSION MATERIAL TO A COMMON REFERENCE FORMAT
– The common format was required for scoring using SC-Lite and to perform certain types of statistical analyses
– Key to the conversion was the use of TIME-MEDIATED parsing which provides the capability of assigning different outputs to the same reference unit (be it word, phone or other)
• THE RECOGNITION MATERIAL WAS TAGGED AT THE PHONE AND WORD LEVELS WITH ca. 40 SEPARATE LINGUISTIC PARAMETERS
– This information pertains to the acoustic, phonetic, lexical, utterance and speaker characteristics of the material and is formatted into “BIG LISTS”
Overview - 1
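The idea behind time-mediated parsing can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SC-Lite algorithm: each hypothesized unit is assigned to the reference unit with which it shares the most time, so that several outputs can land on the same reference unit. The function names `overlap` and `assign_by_time` are hypothetical; the times are taken from the CTM example elsewhere in these slides.

```python
from typing import Dict, List, Tuple

Unit = Tuple[str, float, float]  # (label, start time in s, duration in s)


def overlap(a_start: float, a_dur: float, b_start: float, b_dur: float) -> float:
    """Length (in seconds) of the time span shared by two intervals."""
    return max(0.0, min(a_start + a_dur, b_start + b_dur) - max(a_start, b_start))


def assign_by_time(reference: List[Unit], hypothesis: List[Unit]) -> Dict[int, List[str]]:
    """Assign each hypothesis unit to the reference unit it overlaps most.

    Assumption for illustration: maximal temporal overlap decides the pairing.
    Hypothesis units with no overlap at all are left unassigned.
    """
    assigned: Dict[int, List[str]] = {i: [] for i in range(len(reference))}
    for label, start, dur in hypothesis:
        best_index, best_overlap = None, 0.0
        for i, (_, ref_start, ref_dur) in enumerate(reference):
            shared = overlap(start, dur, ref_start, ref_dur)
            if shared > best_overlap:
                best_index, best_overlap = i, shared
        if best_index is not None:
            assigned[best_index].append(label)
    return assigned


# Reference and hypothesis words from the CTM example (utterance 2001-B-0016).
reference = [("WHEN", 0.85, 0.07), ("EVERYBODY_IS", 0.92, 0.44), ("ON", 1.36, 0.15)]
hypothesis = [("WHEN", 0.82, 0.10), ("EVERYBODY", 0.92, 0.33),
              ("IS", 1.25, 0.10), ("ON", 1.35, 0.15)]
assignments = assign_by_time(reference, hypothesis)
```

Here both EVERYBODY and IS fall on the single reference unit EVERYBODY_IS, which is the kind of many-to-one assignment the slide describes.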
• MOST OF THE RECOGNITION MATERIAL AND CONVERTED FILES, AS WELL AS THE SUMMARY (“BIG”) LISTS, ARE AVAILABLE ON THE WORLD WIDE WEB:
http://www.icsi.berkeley.edu/real/phoneval
• THE ANALYSES SUGGEST THE FOLLOWING:
PHONETIC CLASSIFICATION APPEARS TO BE AN IMPORTANT FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
– Decision-tree analyses of the big lists support this hypothesis
– Additional statistical analyses are also consistent with this conclusion
SYLLABLE STRUCTURE AND PROSODIC STRESS ARE ALSO IMPORTANT FACTORS FOR ACCURATE RECOGNITION
– The pattern of errors differs across the syllable (onset, nucleus, coda)
– Stress affects primarily the rate of word-deletion errors
FAST/SLOW SPEAKING RATE IS CORRELATED WITH WORD ERROR
– Syllables per second is a far more accurate metric than MRATE (an acoustic measure based on the modulation spectrum)
Overview - 2
• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
– Switchboard contains informal telephone dialogues
– Nearly one hour’s material that had previously been phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
– All of this material was hand-segmented at either the phonetic-segment or syllabic level by the transcribers
– The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified.
• THE PHONETIC SYMBOL SET and STP TRANSCRIPTIONS USED IN THE CURRENT PROJECT ARE AVAILABLE ON THE PHONEVAL WEB SITE:
http://www.icsi.berkeley.edu/real/phoneval
• THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL ARE AVAILABLE AT:
http://www.icsi.berkeley.edu/real/stp
Evaluation Materials
• ALL 674 FILES IN THE DIAGNOSTIC EVALUATION MATERIAL WERE PROSODICALLY LABELED
• THE LABELERS WERE TWO UC-BERKELEY LINGUISTICS STUDENTS
• ALL SYLLABLES WERE MARKED WITH RESPECT TO:
– Primary Stress
– Complete Lack of Stress (no explicit label)
– Intermediate Stress
• INTERLABELER AGREEMENT WAS HIGH
– 95% Agreement with Respect to Stress (78% for Primary Stress)
– 85% Agreement for Unstressed Syllables
• THE PROSODIC TRANSCRIPTION MATERIAL IS AVAILABLE AT:
http://www.icsi.berkeley.edu/~steveng/prosody
Prosodic Material
Evaluation Material Characteristics
[Figure, two panels; vertical axis: Number of Utterances]
By Subjective Difficulty: V_Easy, Easy, Medium, Hard, V_Hard
By Dialect Region: S_Mid, N_Mid, N_East, West, South, NYC, (Other)
• AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
• BROAD DISTRIBUTION OF UTTERANCE DURATIONS
– 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10%
• COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD
• A WIDE RANGE OF DISCUSSION TOPICS
• VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)
• EIGHT SITES PARTICIPATED IN THE EVALUATION
– All eight provided material for the unconstrained-recognition phase
– Six sites also provided sufficient forced-alignment-recognition material (i.e., phone/word labels and segmentation given the word transcript for each utterance) for a detailed analysis
• AT&T (forced-alignment recognition incomplete, not analyzed)
• Bolt, Beranek and Newman
• Cambridge University
• Dragon (forced-alignment recognition incomplete, not analyzed)
• Johns Hopkins University
• Mississippi State University
• SRI International
• University of Washington
Evaluation Sites
Parameter Key
START - Begin time (in seconds) of phone
DUR - Duration (in sec) of phone
PHN - Hypothesized phone ID
WORD - Hypothesized Word ID
Format is for all 674 files in the evaluation set
(Example courtesy of MSU)
Initial Recognition File - Example
UTT-ID CH START DUR PHN WORD
2001_0016 B 0 0.1 sil !SENT_START
2001_0016 B 0.1 0.06 l
2001_0016 B 0.16 0.05 ay
2001_0016 B 0.21 0.07 k LIKE
2001_0016 B 0.28 0.04 ih
2001_0016 B 0.32 0.05 n IN
2001_0016 B 0.37 0.21 ao
2001_0016 B 0.58 0.08 g
2001_0016 B 0.66 0.08 ax
2001_0016 B 0.74 0.03 s
2001_0016 B 0.77 0.04 t
2001_0016 B 0.81 0.01 sp AUGUST
2001_0016 B 0.82 0.03 w
2001_0016 B 0.85 0.03 eh
2001_0016 B 0.88 0.04 n WHEN
2001_0016 B 0.92 0.05 eh
2001_0016 B 0.97 0.03 v
2001_0016 B 1 0.03 r
2001_0016 B 1.03 0.05 iy
2001_0016 B 1.08 0.06 b
2001_0016 B 1.14 0.05 aa
2001_0016 B 1.19 0.03 d
2001_0016 B 1.22 0.03 iy EVERYBODY
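A file in this layout could be regrouped into words along the following lines. This is a minimal sketch, assuming (as the example suggests, but the slides do not state) that the WORD column is filled only on the final phone of each word; the function name `group_phones_into_words` is hypothetical.

```python
from typing import List, Tuple

Word = Tuple[str, float, float, List[str]]  # (word, start s, end s, phones)


def group_phones_into_words(lines: List[str]) -> List[Word]:
    """Collect consecutive phone rows into words.

    Each row is: UTT-ID CH START DUR PHN [WORD]. A row carrying a WORD
    label is assumed to close the word that the preceding rows belong to.
    """
    words: List[Word] = []
    phones: List[str] = []
    word_start = None
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        start, dur, phone = float(fields[2]), float(fields[3]), fields[4]
        if word_start is None:
            word_start = start
        phones.append(phone)
        if len(fields) >= 6:  # WORD label present: the word ends here
            words.append((fields[5], word_start, round(start + dur, 3), phones))
            phones, word_start = [], None
    return words


sample = [
    "2001_0016 B 0 0.1 sil !SENT_START",
    "2001_0016 B 0.1 0.06 l",
    "2001_0016 B 0.16 0.05 ay",
    "2001_0016 B 0.21 0.07 k LIKE",
    "2001_0016 B 0.28 0.04 ih",
    "2001_0016 B 0.32 0.05 n IN",
]
words = group_phones_into_words(sample)
```

With the sample rows, LIKE spans 0.1 to 0.28 s with phones l, ay, k, and IN spans 0.28 to 0.37 s with phones ih, n.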
• EACH SUBMISSION SITE USED A (QUASI) CUSTOM PHONE SET
– Most of the phone sets are available on the PHONEVAL web site
• THE SITES’ PHONE SETS WERE MAPPED TO A COMMON “REFERENCE” PHONE SET
– The reference phone set is based on the ICSI Switchboard Transcription material (STP), but is adapted to match the less granular symbol sets used by the submission sites
– The set of mapping conventions relating the STP (and reference) sets is also available on the PHONEVAL web site
• THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE SUBMISSION SITE PHONE SETS
– This reverse mapping was done in order to ensure that variants of a phone were given due “credit” in the scoring procedure
– For example, [em] (syllabic nasal) is mapped to [ix] + [m], and the vowel [ix] maps in certain instances to both [ih] and [ax], depending on the specifics of the phone set
Phone Mapping Procedure
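The two mapping examples given above can be sketched as small lookup tables. This is an illustration only: the tables contain just the two cases named on the slide, the real mapping files are on the PHONEVAL site, and the names `REFERENCE_EXPANSION`, `ACCEPTABLE_VARIANTS`, `expand_reference` and `is_credited` are hypothetical.

```python
from typing import Dict, List, Set

# Reference phones that expand to a sequence in a site's phone set.
# Only the [em] -> [ix] + [m] case comes from the slide.
REFERENCE_EXPANSION: Dict[str, List[str]] = {"em": ["ix", "m"]}

# Site phones that receive credit for a given reference phone.
# Only the [ix] -> {[ih], [ax]} case comes from the slide.
ACCEPTABLE_VARIANTS: Dict[str, Set[str]] = {"ix": {"ih", "ax"}}


def expand_reference(phones: List[str]) -> List[str]:
    """Rewrite a reference phone sequence, expanding one-to-many mappings."""
    expanded: List[str] = []
    for phone in phones:
        expanded.extend(REFERENCE_EXPANSION.get(phone, [phone]))
    return expanded


def is_credited(ref_phone: str, hyp_phone: str) -> bool:
    """True if the hypothesized phone matches the reference or a listed variant."""
    return hyp_phone == ref_phone or hyp_phone in ACCEPTABLE_VARIANTS.get(ref_phone, set())


expanded = expand_reference(["b", "aa", "dx", "em"])
```

Keeping the expansion and the variant credit in separate tables mirrors the two directions of mapping described on the slide.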
Generation of Evaluation Data - 1
• EACH SITE’S MATERIAL WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE)
CTM File Format for Word Scoring
SOURCE UTID SIDE START DUR WORD ERTYP
REFERENCE 2001-B-0016 B 0 0.11 ? N
HYPOTHESIS 2001-B-0016 B *** *** *** N
R 2001-B-0016 B 0.11 0.18 LIKE C
H 2001-B-0016 B 0.1 0.18 LIKE C
R 2001-B-0016 B 0.29 0.08 IN C
H 2001-B-0016 B 0.28 0.09 IN C
R 2001-B-0016 B 0.37 0.48 AUGUST C
H 2001-B-0016 B 0.37 0.45 AUGUST C
R 2001-B-0016 B 0.85 0.07 WHEN C
H 2001-B-0016 B 0.82 0.1 WHEN C
R 2001-B-0016 B 0.92 0.44 EVERYBODY_IS S
H 2001-B-0016 B 0.92 0.33 EVERYBODY S
R 2001-B-0016 B *** *** *** I
H 2001-B-0016 B 1.25 0.1 IS I
R 2001-B-0016 B 1.36 0.15 ON C
H 2001-B-0016 B 1.35 0.15 ON C
… … … … … … …
ERROR KEY
C = CORRECT
I = INSERTION
N = NULL ERROR
S = SUBSTITUTION
Generation of Evaluation Data - 2
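Given one error-type tag per aligned word pair, a word-error score can be computed with the standard definition: substitutions plus deletions plus insertions, divided by the number of reference words. The sketch below assumes that null (N) entries are simply ignored and that deletions are tagged D (as in the parameter lists later in these slides); the function name `word_error_rate` is hypothetical and this is not SC-Lite's own code.

```python
from collections import Counter
from typing import Iterable


def word_error_rate(tags: Iterable[str]) -> float:
    """Standard WER from per-word tags: C, S, D, I (N entries are ignored)."""
    counts = Counter(tags)
    # Reference words are those that were correct, substituted or deleted.
    n_reference = counts["C"] + counts["S"] + counts["D"]
    if n_reference == 0:
        raise ValueError("no reference words to score")
    return (counts["S"] + counts["D"] + counts["I"]) / n_reference


# Tags of the eight word pairs shown in the CTM example above.
example_tags = ["N", "C", "C", "C", "C", "S", "I", "C"]
example_wer = word_error_rate(example_tags)
```

For the eight pairs shown, there are six reference words, one substitution and one insertion, giving 2/6 for this fragment.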
Summary of Corpus Statistical Analyses
• LEXICAL PROPERTIES
– Lexical Identity
– Unigram Frequency
– Number of Syllables in Word
– Number of Phones in Word
– Word Duration
– Speaking Rate
– Prosodic Prominence
– Energy Level
– Lexical Compounds
– Non-Words
– Word Position in Utterance
• SYLLABLE PROPERTIES
– Syllable Structure
– Syllable Duration
– Syllable Energy
– Prosodic Prominence
– Prosodic Context
• PHONE PROPERTIES
– Phonetic Identity
– Phone Frequency
– Position within the Word
– Position within the Syllable
– Phone Duration
– Speaking Rate
– Phonetic Context
– Contiguous Phones Correct
– Contiguous Phones Wrong
– Phone Segmentation
– Articulatory Features
– Articulatory Feature Distance
– Phone Confusion Matrices
• OTHER PROPERTIES
– Speaker (Dialect, Gender)
– Utterance Difficulty
– Utterance Energy
– Utterance Duration
Word- and Phone-Centric “Big Lists”
ERR REFWORD HYPWORD UTID WORDPOS WORDFREQ WRDENG MRATE SYLRATE ETC.
N ? *** 2001-B-0016 0 -6.02 0.92 5.05 6.56 …
C LIKE LIKE 2001-B-0016 0.06 -2.1522 1.04 5.05 6.56 …
C IN IN 2001-B-0016 0.11 -1.9295 0.97 5.05 6.56 …
C AUGUST AUGUST 2001-B-0016 0.17 -4.6678 1.1 5.05 6.56 …
C WHEN WHEN 2001-B-0016 0.22 -2.5432 0.97 5.05 6.56 …
C EVERYBODY'S EVERYBODY'S 2001-B-0016 0.28 -4.3253 1.02 5.05 6.56 …
C ON ON 2001-B-0016 0.33 -2.3138 0.97 5.05 6.56 …
C VACATION VACATION 2001-B-0016 0.39 -3.9967 0.95 5.05 6.56 …
C OR OR 2001-B-0016 0.44 -2.3202 0.84 5.05 6.56 …
C SOMETHING SOMETHING 2001-B-0016 0.5 -2.7438 0.81 5.05 6.56 …
C WE WE 2001-B-0016 0.56 -2.1082 0.88 5.05 6.56 …
C CAN CAN 2001-B-0016 0.61 -2.611 0.75 5.05 6.56 …
C DRESS DRESS 2001-B-0016 0.67 -4.0399 0.9 5.05 6.56 …
C A A 2001-B-0016 0.72 -1.6723 0.85 5.05 6.56 …
C LITTLE LITTLE 2001-B-0016 0.78 -2.7814 0.91 5.05 6.56 …
C MORE MORE 2001-B-0016 0.83 -2.7027 0.85 5.05 6.56 …
C CASUAL CASUAL 2001-B-0016 0.89 -4.6678 0.94 5.05 6.56 …
I *** !SILENCE 2001-B-0016 0.94 -6.02 0.6 5.05 6.56 …
N H# *** 2005-B-0077 0 -6.02 0.6 4.44 7 …
N ? *** 2005-B-0077 0.06 -6.02 0.92 4.44 7 …
C YEAH YEAH 2005-B-0077 0.12 -1.9361 0.99 4.44 7 …
C JUST JUST 2005-B-0077 0.18 -2.1809 0.94 4.44 7 …
C BECAUSE BECAUSE 2005-B-0077 0.24 -2.4782 1.09 4.44 7 …
… … … … … … … … … …
• THE “BIG LISTS” CONTAIN SUMMARY INFORMATION ON 55-65 SEPARATE PARAMETERS ASSOCIATED WITH PHONES, SYLLABLES, WORDS, UTTERANCES AND SPEAKERS, SYNCHRONIZED TO EITHER THE WORD (THIS SLIDE) OR THE PHONE
Generation of Evaluation Data - 3
The Switchboard Evaluation Web Site
RECOGNITION FILES
• Converted Submissions: ATT, BBN, JHU, MSU, SRI, WASH
• Word Level Recognition Errors: ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone Error (Free Recognition): ATT, BBN, JHU, MSU, WASH
• Word Recognition Phone Mapping: ATT, BBN, JHU, MSU, WASH
BIG LISTS
• Word-Centric: ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone-Centric: ATT, BBN, JHU, MSU, WASH
• Phonetic Confusion Matrices: ATT, BBN, JHU, MSU, WASH
FORCED ALIGNMENT FILES
• Forced Alignment Files: BBN, JHU, MSU, WASH
• Word-Level Alignment Errors: BBN, CU, JHU, MSU, SRI, WASH
• Phone Error (Forced Alignment): CU, BBN, JHU, MSU, SRI, WASH
• Alignment Word-Phone Mapping: BBN, JHU, MSU, WASH
BIG LISTS
• Word-Centric: BBN, CU, JHU, MSU, SRI, WASH
• Phone-Centric: BBN, JHU, MSU, WASH
• Phonetic Confusion Matrices: BBN, JHU, MSU, WASH
• Description of the STP Phone Set
• STP Transcription Material
Phone-Word Reference
Syllable-Word Reference
• Phone Mapping for Each Site: ATT, BBN, JHU, MSU, WASH
STP-to-Reference Map
STP Phone-to-Articulatory-Feature Map
http://www.icsi.berkeley.edu/real/phoneval
Phone Error - Unconstrained Recognition
[Figure: error rate (0 to 0.6) by error type (TOTAL, SUB, DEL, INS) for each site: ATT, BBN, CU, DRAGON, JHU, MSU, SRI, WASH]
• PHONE ERROR RATES VARY BETWEEN 39% AND 55%
– Substitutions are the major source of phone errors
Phone Error - Forced Alignment
[Figure: error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, CU, JHU, MSU, SRI, WASH]
AT&T and Dragon did not provide a complete set of forced alignments
• PHONE ERROR RATES VARY BETWEEN 35% AND 49%
– Insertions as well as substitutions are a major source of errors
Word Recognition Error
[Figure: error rate (0 to 0.6) by error type (TOTAL, SUB, DEL, INS) for each site: ATT, BBN, CU, DRAGON, JHU, MSU, SRI, WASH]
• WORD ERROR RATES VARY BETWEEN 27% AND 43%
– Substitutions are the major source of word errors
Are Word and Phone Errors Related?
• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
– The correlation between the two parameters is 0.78
[Figure: phone error and word error rates (0 to 0.6) by submission site (JHU, ATT, CU, SRI, DRAG, MSU, UW, BBN); r = 0.78]
Pronunciation Models?
The differential error rate is probably related to the use of either pronunciation or language models (or both)
Decision Tree Analysis Example
• DELETIONS VS. EVERYTHING ELSE - WORD LEVEL (MSU)
N = 5243
PROPORTIONS
(root): 0.8873 0.1127
AVGAFDIST < 4.835: 0.9069 0.09306
| PHNCOR < 1.5: 0.7811 0.2189
| | PHNINS < 0.5: 0.6763 0.3237
| | | PHNSUB < 1.5: 0.5519 0.4481
| | | | POSTWORDERR in S,D, : 0.3223 0.6777
| | | | | PHNCOR < 0.5: 0.1296 0.8704
| | | | | PHNCOR >= 0.5: 0.4485 0.5515
| | | | | | WORDENGY < 0.995: 0.3448 0.6552
| | | | | | WORDENGY >= 0.995: 0.6939 0.3061
| | | | POSTWORDERR in C,I,N, : 0.7204 0.2796
| | | | | AVGAFDIST < 3.165: 0.7994 0.2006
| | | | | | CANO-PHNCNT < 2.5: 0.8462 0.1538
| | | | | | CANO-PHNCNT >= 2.5: 0.3214 0.6786
| | | | | AVGAFDIST >= 3.165: 0.2931 0.7069
| | | | | | REFDUR < 0.125: 0.08108 0.9189
| | | | | | REFDUR >= 0.125: 0.6667 0.3333
| | | PHNSUB >= 1.5: 0.8394 0.1606
| | PHNINS >= 0.5: 0.9283 0.07169
| PHNCOR >= 1.5: 0.9842 0.01578
AVGAFDIST >= 4.835: 0.1016 0.8984
| PHNINS < 0.5: 0.05785 0.9421
| PHNINS >= 0.5: 0.8571 0.1429
INSTANCES
(root): +4652+591
AVGAFDIST < 4.835: +4639+476
| PHNCOR < 1.5: +1520+426
| | PHNINS < 0.5: +769+368
| | | PHNSUB < 1.5: +356+289
| | | | POSTWORDERR in S,D, : +88+185
| | | | | PHNCOR < 0.5: +14+94
| | | | | PHNCOR >= 0.5: +74+91
| | | | | | WORDENGY < 0.995: +40+76
| | | | | | WORDENGY >= 0.995: +34+15
| | | | POSTWORDERR in C,I,N, : +268+104
| | | | | AVGAFDIST < 3.165: +251+63
| | | | | | CANO-PHNCNT < 2.5: +242+44
| | | | | | CANO-PHNCNT >= 2.5: +9+19
| | | | | AVGAFDIST >= 3.165: +17+41
| | | | | | REFDUR < 0.125: +3+34
| | | | | | REFDUR >= 0.125: +14+7
| | | PHNSUB >= 1.5: +413+79
| | PHNINS >= 0.5: +751+58
| PHNCOR >= 1.5: +3119+50
AVGAFDIST >= 4.835: +13+115
| PHNINS < 0.5: +7+114
| PHNINS >= 0.5: +6+1
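The proportions at each node of the decision tree follow directly from the two instance counts at that node. As a small worked check (a sketch; the function name `class_proportions` is hypothetical, and printing four significant digits is an assumption inferred from values such as 0.09306):

```python
from typing import Tuple


def class_proportions(n_first: int, n_second: int) -> Tuple[float, float]:
    """Share of each class at a tree node, kept to four significant digits."""
    total = n_first + n_second
    return (float(f"{n_first / total:.4g}"), float(f"{n_second / total:.4g}"))


# Instance counts taken from the MSU tree (N = 5243).
root = class_proportions(4652, 591)
low_af_distance = class_proportions(4639, 476)   # AVGAFDIST < 4.835
high_af_distance = class_proportions(13, 115)    # AVGAFDIST >= 4.835
```

The root counts sum to 5243, and the two AVGAFDIST branches (5115 and 128 instances) partition that total.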
Decision Tree Analysis Example
Number of Instances at a Given Level
FEATURE      TOTAL  LEVEL 1  LEVEL 2  LEVEL 3  LEVEL 4  LEVEL 5  LEVEL 6  LEVEL 7
AVGAFDIST     5615     5243        0        0        0        0      372        0
PHNCOR        5388        0     5115        0        0        0      273        0
PHNINS        2074        0      128     1946        0        0        0        0
PHNSUB        1137        0        0        0     1137        0        0        0
POSTWORDERR    645        0        0        0        0      645        0        0
CANO-PHNCNT    314        0        0        0        0        0        0      314
WORDENGY       165        0        0        0        0        0        0      165
REFDUR          58        0        0        0        0        0        0       58
Number of Nodes at a Given Level
FEATURE      TOTAL  LEVEL 1  LEVEL 2  LEVEL 3  LEVEL 4  LEVEL 5  LEVEL 6  LEVEL 7
AVGAFDIST        4        2        0        0        0        0        2        0
PHNCOR           4        0        2        0        0        0        2        0
PHNINS           4        0        2        2        0        0        0        0
CANO-PHNCNT      2        0        0        0        0        0        0        2
POSTWORDERR      2        0        0        0        0        2        0        0
WORDENGY         2        0        0        0        0        0        0        2
PHNSUB           2        0        0        0        2        0        0        0
REFDUR           2        0        0        0        0        0        0        2
• PHONE-BASED PARAMETERS DOMINATE THE TREES ….
• WORD SUBSTITUTIONS VERSUS EVERYTHING ELSE
ATT - phnsub, wordfreq, avgAFdist, beginoff, endoff, postworderr
BBN - postworderr, preworderr, avgAFdist, phnsub, wordfreq, hypdur
CU - preworderr, phnsub, wordfreq
Dragon - phnsub, preworderr, postworderr
MSU - phnsub, avgAFdist, hypdur, postworderr, beginoff
JHU - phnsub, wordfreq, cano-sylcnt
SRI - postworderr, phnsub, avgAFdist, wordfreq, hypdur
WASH - phnsub, wordfreq, avgAFdist, postworderr, avgphnfreq
• WORD DELETIONS VERSUS EVERYTHING ELSE
ATT - phncor, avgAFdist, postworderr
BBN - avgAFdist, refdur, wordengy, preworderr
CU - avgAFdist, phnins, phncor
Dragon - phncor, preworderr, phnsub
MSU - avgAFdist, phncor, phnins, phnsub
JHU - avgAFdist, preworderr, refdur
SRI - phncor, phnsub, phnins, wordfreq, hypdur
WASH - avgAFdist, refdur, preworderr
Decision Tree Analysis of Errors - 1
• DURATION IS IMPORTANT IN DISTINGUISHING AMONG ERROR TYPES IN THE TREES ….
• WORD SUBSTITUTIONS VERSUS DELETIONS
ATT - refdur, phnsub, wordengy, postworderr, phncor
BBN - phnsub, phncor, phnins
CU - hypdur, phnsub, avgAFdist, phncor
Dragon - refdur, phnsub, avgAFdist, phnins
MSU - refdur, phnsub, avgAFdist, phncor, phnins
JHU - refdur, phnsub, phncor, phnins, postworderr
SRI - refdur, wordengy, phnsub, wordfreq, phnins, phncor
WASH - refdur, phnsub, phnins, phncor
• WORD SUBSTITUTIONS VERSUS INSERTIONS
ATT - hypdur, avgAFdist, preworderr
BBN - hypdur, phnsub, avgphnfreq, refdur, preworderr
CU - hypdur, avgphnfreq, postworderr
Dragon - hypdur
JHU - hypdur, phnsub
MSU - avgphnfreq, hypdur, preworderr, phnsub
SRI - hypdur, phnsub
WASH - hypdur, phnsub, avgphnfreq, phndel, preworderr, phncor
Decision Tree Analysis of Errors - 2
Phone Error and Word Length
• For CORRECT words, only one phone (on average) is misclassified
• For INCORRECT words, phone errors increase linearly with word length
Data are averaged across all eight sites
Articulatory Features & Word Error
AFs include MANNER (e.g., stop, fricative, nasal, vowel), PLACE (e.g., labial, alveolar, velar), VOICING and LIP-ROUNDING
• Incorrect words exhibit nearly 3 times the AF errors as correct words
Data are averaged across all eight sites
Consonantal Onsets and AF Errors
• Syllable onsets are intolerant of AF errors in CORRECT words
• Place and manner AF errors are particularly high in INCORRECT onsets
Data are averaged across all eight sites
Consonantal Codas and AF Errors
• Syllable codas exhibit a slightly higher tolerance for error than onsets
Data are averaged across all eight sites
Vocalic Nuclei and AF Errors
• Nuclei exhibit a much higher tolerance for error than onsets & codas
• There are many more errors than among syllabic onsets & codas
Data are averaged across all eight sites
Syllable Structure & Word Error Rate
• Vowel-initial forms show the greatest error
• Polysyllabic forms exhibit the lowest error
Data are averaged across all eight sites
• VOWEL-INITIAL forms exhibit the HIGHEST error• POLYSYLLABLES have the LOWEST error rate
Syllable Structure & Word Error Rate
• The effect of stress is most concentrated among word-deletion errors
Prosodic Stress & Word Error Rate
Data represent averages across all eight ASR systems
Unstressed Fully Stressed Intermediate Stress
• All 8 ASR systems show the effect of prosodic stress on word deletion rate
Prosodic Stress and Deletion Rate
0 = unstressed, 0.5 = intermediate stress, 1 = fully stressed
Prosodic Stress and Word Error Rate
• The effect of stress on overall word error is less pronounced than on deletions
0 = unstressed, 0.5 = intermediate stress, 1 = fully stressed
Different Measures of Speaking Rate
Statistic N-Words N-Syls N-Phns Duration MRate SylRate logEng
Minimum 2.00 5.00 11.00 1.63 2.15 2.40 9.63
25% 12.25 16.00 33.00 3.28 3.51 4.35 12.45
50% 16.00 20.00 44.00 4.08 3.86 4.87 13.33
Mean 18.53 23.25 50.49 4.76 3.87 4.88 13.31
75% 23.00 29.00 63.00 5.90 4.22 5.38 14.15
Maximum 64.00 81.00 186.00 17.43 5.81 7.81 17.01
Std. Dev. 9.10 11.35 25.73 2.15 0.55 0.80 1.20
• MRATE IS AN ACOUSTIC MEASURE BASED ON THE MODULATION PROPERTIES OF THE SIGNAL’S ENERGY ACROSS THE SPECTRUM
• SYLLABLES/SEC IS A LINGUISTIC MEASURE OF SPEAKING RATE
• THE CORRELATION BETWEEN THE TWO METRICS (R) = 0.56
• MRATE GENERALLY UNDERESTIMATES THE SYLLABLE RATE
– Non-speech, filled pauses, etc. are contained in MRATE but not in syllable rate
[Figure: distribution of utterances over MRATE (Hz), 2.1 to 5.9; vertical axis 0.00 to 0.20]
MRATE Distribution
Word Error and MRATE
• MRATE (acoustic metric) is not predictive of word-error rate
Slowest and fastest speaking rates should exhibit the highest word error, but don’t (in terms of MRATE)
[Figure: distribution of utterances over syllables per second, 2.25 to 7.75; vertical axis 0.00 to 0.30]
Syllable Rate Distribution
• ONLY A SMALL PROPORTION (10%) OF UTTERANCES ARE FASTER THAN 6 SYLLABLES/SEC OR SLOWER THAN 3 SYLLABLES/SEC
• Syllables per second is a useful metric for predicting word-error rate
Word Error and Syllable Rate
Slow and fast speaking rates exhibit the highest word error (in terms of syllables/sec)
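The linguistic rate measure is simply the syllable count divided by the utterance duration, and the 3 and 6 syllables/sec limits quoted above can be used to flag the slow and fast utterances where word error is highest. A minimal sketch follows; the function names `syllable_rate` and `rate_band` are hypothetical, and the example utterances are made-up illustrations, not corpus data.

```python
def syllable_rate(n_syllables: int, duration_sec: float) -> float:
    """Speaking rate in syllables per second."""
    if duration_sec <= 0:
        raise ValueError("duration must be positive")
    return n_syllables / duration_sec


def rate_band(rate: float, slow_below: float = 3.0, fast_above: float = 6.0) -> str:
    """Label a rate using the 3 and 6 syllables/sec limits from the slide."""
    if rate < slow_below:
        return "slow"
    if rate > fast_above:
        return "fast"
    return "typical"


# Hypothetical utterances, each 4 seconds long (illustration only).
bands = [rate_band(syllable_rate(n, 4.0)) for n in (10, 20, 30)]
```

Utterances labelled "slow" or "fast" would be the ones expected, on the evidence of these slides, to show elevated word error.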
• THE DIAGNOSTIC MATERIAL MAY NOT BE TRULY REPRESENTATIVE OF THE SWITCHBOARD RECOGNITION TASK
– The competitive evaluation is based on entire conversations, whereas the current diagnostic material contains only relatively small amounts of material from any single speaker
– This strategy was intended to provide a broad coverage of different speaker qualities (gender, dialect, age, voice quality, topic, etc.), but …
– Was also designed to foil recognition based largely on speaker adaptation algorithms
• THE TIME-MEDIATED SCORING TECHNIQUE IS NOT “PERFECT” AND MAY HAVE INTRODUCED CERTAIN ERRORS NOT PRESENT IN THE COMPETITIVE EVALUATION
• THE STP TRANSCRIPTION (REFERENCE) MATERIAL IS ALSO NOT “PERFECT” AND THEREFORE THE ANALYSES COULD UNDERESTIMATE A SITE’S PERFORMANCE ON BOTH FREE AND FORCED-ALIGNMENT-BASED RECOGNITION
Caveats
• SWITCHBOARD RECOGNITION SYSTEMS FROM EIGHT SEPARATE SITES WERE EVALUATED WITH RESPECT TO PHONE- AND WORD-LEVEL CLASSIFICATION ON NON-COMPETITIVE DIAGNOSTIC MATERIAL
• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
– Decision-tree analyses support this hypothesis
– Additional analyses are also consistent with this conclusion
• SYLLABLE STRUCTURE AND PROSODIC STRESS ARE ALSO IMPORTANT FACTORS FOR ACCURATE RECOGNITION
– The pattern of errors differs across the syllable (onset, nucleus, coda)
– Stress affects primarily the number of word-deletion errors
• SPEAKING RATE CAN BE USED TO PREDICT RECOGNITION ERROR
– Syllables per second is a far more accurate metric than MRATE (an acoustic measure based on the modulation spectrum)
• ASR SYSTEMS CAN POTENTIALLY BE IMPROVED BY FOCUSING MORE ATTENTION ON PHONETIC CLASSIFICATION, SYLLABLE STRUCTURE AND PROSODIC STRESS
Summary and Conclusions
• STRUCTURED QUERY LANGUAGE (SQL) DATABASE VERSION (11/2000)
– Will provide quick and ready access to the entire set of recognition and forced-alignment material over the web
– Will enable accurate selection of specific subsets of the material for detailed, intensive analysis and graphing without much scripting
– Will accelerate analysis of the evaluation material, which is …
• POSTED ON THE PHONEVAL WEB SITE FOR WIDE DISSEMINATION
• DEVELOPMENT OF A HIGH-FIDELITY AUTOMATIC PHONETIC TRANSCRIPTION SYSTEM TO LABEL AND SEGMENT (IN PROGRESS)
– This automatic system will enable accurate labeling and segmentation of the remainder of the Switchboard corpus, thus enabling …
• PHONETIC AND LEXICAL DISSECTION OF THE COMPETITIVE EVALUATION SUBMISSIONS IN THE SPRING OF 2001
– Hopefully providing further insight into ways in which ASR systems can be improved
Into the Future …
That’s All, Folks
Many Thanks for Your Time and Attention
Additional Slides for Discussion
• HOW ACCURATE IS THE PHONETIC SEGMENTATION PROVIDED BY FORCED-ALIGNMENT-BASED RECOGNITION?
– The average disparity between the phone duration of the reference (STP) corpus and the duration of the forced alignment phones is substantial (ca. 40% of the mean duration of a phone in the corpus)
• AUTOMATIC ALIGNERS ARE NOT RELIABLE PHONE SEGMENTERS
Precision of Forced Alignment Segmentation
SITE             MEAN DISPARITY (ms)  RELATIVE TO MEAN PHONE DURATION
BBN              31                   0.39
CAMBRIDGE        30                   0.38
JOHNS HOPKINS    37                   0.47
MISSISSIPPI      31                   0.39
SRI              32                   0.40
WASHINGTON       31                   0.39
MEAN of 6 SITES  32                   0.40
Mean phone duration in corpus = 79.3 ms
There is virtually no skew in disparity between beginning and ending portions of the phones (i.e., no bias in segmentation)
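The relative figures in the table can be reproduced by dividing each site's mean disparity by the mean phone duration of 79.3 ms. The short check below uses only numbers from the table; the variable names are hypothetical.

```python
MEAN_PHONE_DURATION_MS = 79.3  # mean phone duration in the corpus, from the slide

# Mean disparity per site in milliseconds, as listed in the table.
disparity_ms = {
    "BBN": 31,
    "CAMBRIDGE": 30,
    "JOHNS HOPKINS": 37,
    "MISSISSIPPI": 31,
    "SRI": 32,
    "WASHINGTON": 31,
}

# Disparity expressed as a fraction of the mean phone duration.
relative = {site: round(ms / MEAN_PHONE_DURATION_MS, 2) for site, ms in disparity_ms.items()}

# Mean over the six sites, in milliseconds and as a fraction.
mean_ms = sum(disparity_ms.values()) / len(disparity_ms)
mean_relative = round(mean_ms / MEAN_PHONE_DURATION_MS, 2)
```

The six-site mean of 32 ms corresponds to about 40% of a mean phone duration, which is the figure quoted in the bullet above.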
• RELATION OF THE NUMBER OF PHONES IN A WORD TO WORD ERROR
– Done by George Doddington of NIST using both the free and forced-alignment recognition results (from the “Big Lists”)
– Reveals an interesting relationship between the number of phones correctly (or incorrectly) classified and the probability of a word being correctly (or incorrectly) labeled
– Also shows the extent to which decoders are tolerant of phone classification errors
– George’s analysis is consistent with the D-Tree analyses suggesting that phone classification is the controlling variable for word error
– George will discuss this material directly following this presentation
• ANALYSIS OF PHONETIC CONFUSIONS IN THE CORPUS MATERIAL
– Performed by Joe Kupin and Hollis Fitch of the Institute for Defense Analyses
– The output of their scripts is available on the PHONEVAL web site
– Hollis will discuss some of their results directly after George’s presentation
Analyses Performed By Others
• UTTERANCE LEVEL
– Utterance ID
– Number of Words in Utterance
– Utterance Duration
– Utterance Energy (Abnormally Low or High Amplitude)
– Utterance Difficulty (Very Easy, Easy, Medium, Hard, Very Hard)
– Speaking Rate - Syllables per Second
– Speaking Rate - Acoustic Measure (MRATE)
Speech Parameters Analyzed - 1
• LEXICAL LEVEL
– Lexical Identity
– Word Error Type - Substitution, Deletion, Insertion, Null
– Word Error Type Context (Preceding/Following)
– Unigram Frequency (in Switchboard Corpus)
– Number of Syllables in Word (Canonical)
– Number of Phones in Word (Canonical)
– Number of Phones Incorrect at Word Level (and Error Types)
– Phonetic Feature Distance Between Hypothesized/Reference Word
– Position of the Word in the Utterance
– Lexical Compound Status (Part of a Compound or Not)
– Word Duration
– Word Energy
– Prosodic Prominence (Maximum and Average Stress)
– Prosodic Context - Maximum/Average Stress (Preceding/Following)
– Temporal Alignment Between Reference and Hypothesized Word
Speech Parameters Analyzed - 2
• PHONE LEVEL
– Phone ID (Reference and Hypothesized)
– Phone Duration (Reference and Hypothesized)
– Phone Position within the Word
– Phone Frequency (Switchboard Transcription Corpus)
– Phone Error Type (Substitution, Deletion, Insertion, Null)
– Phone Error Context (Preceding/Following Phone)
– Temporal Alignment Between Reference and Hypothesized Phone
– Phonetic Feature Distance Between Reference/Hypothesized Phone
– Phonetic Feature Analysis Between Reference/Hypothesized Phone
+ Manner of Articulation
+ Place of Articulation
+ Voicing
+ Lip Rounding
Speech Parameters Analyzed - 3
• SPEAKER CHARACTERISTICS
– Dialect Region
– Gender
– Recognition Difficulty (Very Easy, Easy, Medium, Hard, Very Hard)
– Speaking Rate - Syllables per Second and Acoustic (MRATE)
Speech Parameters Analyzed - 4
Phone-Centric “Big List”
PhERR WDERR REFPHN HYPPHN REFWORD HYPWORD PHNPOS RDUR HDUR RVOI HVOI AFDIST MSTRESS
S N ? H# ? *** 0 0.11 0.08 2 2 0 0
C C L L LIKE LIKE 0 0.05 0.08 0 0 0 0.5
C C AY AY LIKE LIKE 0.333 0.07 0.05 0 0 0 0.5
C C K K LIKE LIKE 0.667 0.06 0.07 1 1 0 0.5
C C IH IH IN IN 0 0.04 0.04 0 0 0 0
C C N N IN IN 0.5 0.04 0.08 0 0 0 0
C C AO AO AUGUST AUGUST 0 0.24 0.18 0 0 0 1
C C G G AUGUST AUGUST 0.167 0.05 0.09 0 0 0 1
S C IX AH AUGUST AUGUST 0.333 0.08 0.05 0 0 2 1
I C *** S AUGUST AUGUST 0.5 0 0.04 2 1 5 1
S C S T AUGUST AUGUST 0.667 0.07 0.06 1 1 1 1
C C W W AUGUST AUGUST 0.833 0.04 0.03 0 0 0 1
C C W W WHEN WHEN 0 0.04 0.03 0 0 0 0
S C EH AX WHEN WHEN 0.333 0.04 0.04 0 0 1 0
C C N N WHEN WHEN 0.667 0.04 0.03 0 0 0 0
C C EH EH EVERYBODY'S EVERYBODY'S 0 0.04 0.05 0 0 0 0.5
S C R V EVERYBODY'S EVERYBODY'S 0.111 0.04 0.03 0 0 2 0.5
C C R R EVERYBODY'S EVERYBODY'S 0.222 0.03 0.03 0 0 0 0.5
C C IY IY EVERYBODY'S EVERYBODY'S 0.333 0.04 0.05 0 0 0 0.5
C C B B EVERYBODY'S EVERYBODY'S 0.444 0.07 0.05 0 0 0 0.5
S C AA AH EVERYBODY'S EVERYBODY'S 0.556 0.06 0.05 0 0 1 0.5
C C D D EVERYBODY'S EVERYBODY'S 0.667 0.03 0.07 0 0 0 0.5
C C IY IY EVERYBODY'S EVERYBODY'S 0.778 0.04 0.03 0 0 0 0.5
C C Z Z EVERYBODY'S EVERYBODY'S 0.889 0.07 0.08 1 0 1 0.5
C C AA AA ON ON 0 0.09 0.07 0 0 0 0
C C N N ON ON 0.5 0.07 0.07 0 0 0 0
C C V V VACATION VACATION 0 0.04 0.03 0 0 0 1
C C EY EY VACATION VACATION 0.143 0.09 0.11 0 0 0 1
C C K K VACATION VACATION 0.286 0.1 0.1 1 1 0 1
C C EY EY VACATION VACATION 0.429 0.13 0.12 0 0 0 1
C C SH SH VACATION VACATION 0.571 0.08 0.1 1 1 0 1
S C IH AX VACATION VACATION 0.714 0.04 0.04 1 0 3 1
C C N N VACATION VACATION 0.857 0.06 0.04 0 0 0 1
C C ER ER OR OR 0 0.06 0.07 0 0 0 0
C C S S SOMETHING SOMETHING 0 0.12 0.11 1 1 0 0.5
C C AH AH SOMETHING SOMETHING 0.143 0.06 0.04 0 0 0 0.5
C C M M SOMETHING SOMETHING 0.286 0.04 0.07 0 0 0 0.5
C C TH TH SOMETHING SOMETHING 0.429 0.04 0.04 1 1 0 0.5
C C IH IH SOMETHING SOMETHING 0.571 0.05 0.05 0 0 0 0.5
D C NG *** SOMETHING SOMETHING 0.714 0.01 0 0 2 5 0.5
S C K NG SOMETHING SOMETHING 0.857 0.04 0.06 1 0 3 0.5
• THE PHONE “BIG LISTS” CONTAIN INFORMATION PERTAINING TO THE PHONETIC-FEATURE DISTANCE BETWEEN THE HYPOTHESIZED AND REFERENCE (STP) PHONE SEQUENCES, AS WELL AS MANY OTHER PARAMETERS
Syllable-Centric Feature Analysis
• Place of articulation deviates most in nucleus position
• Manner of articulation deviates most in onset and coda position
• Voicing deviates most in coda position
Phonetic deviation along a SINGLE feature
Place deviates very little from canonical form in the onset and coda. It is a STABLE AF in these positions
Place is VERY unstable in nucleus position
Articulatory PLACE Feature Analysis
• Place of articulation is a “dominant” feature in nucleus position only
• Drives the feature deviation in the nucleus for manner and rounding
Phonetic deviation across SEVERAL features
Place “carries” manner and rounding in the nucleus
Articulatory MANNER Feature Analysis
• Manner of articulation is a “dominant” feature in onset and coda position
• Drives the feature deviation in onsets and codas for place and voicing
Manner is less stable in the coda than in the onset
Manner drives place and voicing deviations in the onset and coda
Phonetic deviation across SEVERAL features
Articulatory VOICING Feature Analysis
• Voicing is a subordinate feature in all syllable positions
• Its deviation pattern is controlled by manner in onset and coda positions
Place is unstable in coda position and is dominated by manner
Phonetic deviation across SEVERAL features
LIP-ROUNDING Feature Analysis
• Lip-rounding is a subordinate feature
• Its deviation pattern is driven by the place feature in nucleus position
Rounding is stable everywhere except in the nucleus where it is driven by place
Phonetic deviation across SEVERAL features
Syllable-Centric Pronunciation
“Cat” [k ae t]: [k] = onset, [ae] = nucleus, [t] = coda
Onsets are pronounced canonically far more often than nuclei or codas
Codas tend to be pronounced canonically more frequently in formal speech than in spontaneous dialogues
[Figure: percent canonically pronounced, STP (spontaneous speech) vs. TIMIT (read sentences)]
Syllable-Centric Pronunciation
Complex onsets are pronounced more canonically than simple onsets despite the greater potential for deviation
[Figure: percent canonically pronounced (70 to 100) by syllable onset type, Simple (C) vs. Complex (CC(C)), STP (spontaneous speech) vs. TIMIT (read sentences)]
Onsets (but not Codas) Affect Nuclei
The presence of a syllable onset has a substantial impact on the realization of the nucleus
[Figure: percent canonically pronounced (50 to 70) for All Nuclei, With Onset, Without Onset, With Coda and Without Coda, STP vs. TIMIT]
Syllable-Centric Pronunciation
Codas are much more likely to be realized canonically in formal than in spontaneous speech
[Figure: percent canonically pronounced by syllable coda type]