
Phonetic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems

Steven Greenberg and Shuangyu Chang

International Computer Science Institute

1947 Center Street, Berkeley, CA 94704

{steveng, shawnc}@icsi.berkeley.edu

http://www.icsi.berkeley.edu/~steveng

Large Vocabulary Continuous Speech Recognition Workshop Maritime Institute of Technology, Linthicum Heights, MD, May 4, 2001

Take Home Messages

• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
– Many different analyses (to follow) support this conclusion
– Consonants appear to be more important than vowels

• SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR FOR ACCURATE RECOGNITION
– The pattern of errors differs across the syllable (onset, nucleus, coda) and exhibits consistent patterns difficult to discern with other units of analysis

• STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE, PARTICULARLY FOR UNDERSTANDING THE NATURE OF WORD-DELETION ERRORS
– Relation among stress-accent, syllable structure, vocalic identity and length

• THE NATURE OF PRONUNCIATION MODELS AND THEIR RELATION TO LEXICAL REPRESENTATIONS IS A POTENTIALLY KEY FACTOR
– The unit of lexical representation (phones, articulatory features, etc.) is probably of the utmost importance for optimizing ASR performance

• FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS LIKELY TO DEPEND ON DEEP INSIGHT INTO THE NATURE OF SPOKEN LANGUAGE

Structure of the Presentation

• DESCRIPTION OF THE CORPUS MATERIALS FOR THE 2000 AND 2001 EVALUATIONS
– 2000 – Brief (2-17 s) utterances spoken by hundreds of different speakers. No relation to competitive evaluation
– 2001 – A subset of the competitive evaluation

• BRIEF OVERVIEW OF THE ANALYSIS REGIME COMMON TO THE 2000 AND 2001 PHONETIC EVALUATIONS
– File formats, time-mediated alignment, statistical analysis of the corpora, etc.
– Details are contained in “Linguistic Dissection …..” (in workshop notebook) and in “An Introduction ….” (NIST Speech Transcription Workshop, 2000)

• ANALYSES AND PATTERNS COMMON TO BOTH 2000 AND 2001 EVALUATIONS
– Syllable structure, phonetic segments, articulatory-acoustic features. Details pertaining to the 2000 evaluation are in the papers cited above

• PHONETIC CONFUSION MATRICES FOR THE 2001 EVALUATION

• FUTURE ANALYSIS PLANNED FOR THIS SPRING WHEN REMAINING 2001 SUBMISSIONS ARRIVE
– Relationship between phonetic classification, pronunciation and language models

Evaluation Material - 2000

• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
– Switchboard contains informal telephone dialogues
– 54 minutes of material that was previously phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
– All of this material was hand-segmented at either the phonetic-segment or syllabic level by the transcribers
– The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified.

• THE PHONETIC SYMBOL SET AND STP TRANSCRIPTIONS USED IN THE CURRENT PROJECT ARE AVAILABLE ON THE PHONEVAL WEB SITE:
http://www.icsi.berkeley.edu/real/phoneval

• THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL ARE AVAILABLE AT:
http://www.icsi.berkeley.edu/real/stp

Evaluation Material Details - 2000

[Figure: By Subjective Difficulty. Number of utterances (0 to 300) for each subjective difficulty level: V_Easy, Easy, Medium, Hard, V_Hard]

[Figure: By Dialect Region. Number of utterances (0 to 180) for each dialect region: S_Mid, N_Mid, N_East, West, South, NYC, (Other)]

• 581 DIFFERENT SPEAKERS

• AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS

• BROAD DISTRIBUTION OF UTTERANCE DURATIONS
– 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10%

• COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD

• A WIDE RANGE OF DISCUSSION TOPICS

• VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)

Evaluation Material - 2001

• SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
– Seventy-four minutes of material phonetically labeled by five highly trained phonetics students from UC-Berkeley plus S. Greenberg
– The material was hand-segmented at the syllabic level by the transcribers
– The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained originally on 72 minutes of hand-segmented Switchboard material (similar to the process performed the previous year)

• THE PHONETIC SYMBOL SET AND STP TRANSCRIPTIONS USED ARE AVAILABLE ON THE PHONEVAL WEB SITE

Evaluation Material Details - 2001

• A SUBSET OF THE HUB-5 COMPETITIVE EVALUATION CORPUS

– A representative selection from the evaluation set, including an even distribution of data from the three main recording conditions (cellular and 2 land-line conditions)

• 21 SEPARATE CONVERSATIONS (2 speakers per conversation)

• 42 DIFFERENT SPEAKERS

• A TOTAL OF 74 MINUTES OF SPOKEN LANGUAGE MATERIAL – (including FILLED PAUSES, JUNCTURES, etc.)

• AVERAGE LENGTH OF SPEECH PER SPEAKER – 106 seconds

• RANGE OF LENGTH PER SPEAKER – 48 s (least) to 226 s (most)

• STANDARD DEVIATION – 38 s

• APPROXIMATELY ONE-THIRD OF THE MATERIAL FROM CELL PHONES

Evaluation Sites - 2000

• EIGHT SITES PARTICIPATED IN THE EVALUATION
– All eight provided material for the unconstrained-recognition phase
– Six sites also provided sufficient forced-alignment-recognition material (i.e., phone/word labels and segmentation given the word transcript for each utterance) for a detailed analysis

• AT&T (forced-alignment recognition incomplete, not analyzed)

• Bolt, Beranek and Newman

• Cambridge University

• Dragon (forced-alignment recognition incomplete, not analyzed)

• Johns Hopkins University

• Mississippi State University

• SRI International

• University of Washington

Evaluation Sites - 2001

• SEVEN SITES ARE PARTICIPATING IN THE EVALUATION
– Unconstrained-recognition phase – 6 Sites
– Forced-alignment – 7 Sites
– Phone classification confidence scores – 5 Sites
– Variable condition recognition – 2 Sites
– Phone strings to words – 1 Site

• AT&T

• Bolt, Beranek and Newman

• IBM

• Johns Hopkins University

• Mississippi State University

• Philips

• SRI International

Evaluation Data Status - 2001

• However … NOT ALL OF THE MATERIAL REQUIRED TO PERFORM THE ANALYSES HAS MATERIALIZED
– The tables below summarize the commitments and currently usable data (certain data arrived in not-quite-ready-for-prime-time form)

Commitments

Current (usable data)

SITE RECOGNITION FORCED-ALIGNMENT PHONE CONFIDENCE VARIABLE RECOGNITION PHONES-TO-WORDS

AT&T

BBN X X

IBM X

JHU X X X

MSU X X

Philips X X X

SRI X X

Initial Recognition File - Example

Parameter Key

START - Begin time (in seconds) of phone

DUR - Duration (in seconds) of phone

PHN - Hypothesized phone ID

WORD - Hypothesized word ID

Format is for all 674 files in the evaluation set

(Example courtesy of MSU)

UTT-ID CH Start DUR PHN WORD

2001_0016 B 0 0.1 sil !SENT_START

2001_0016 B 0.1 0.06 l

2001_0016 B 0.16 0.05 ay

2001_0016 B 0.21 0.07 k LIKE

2001_0016 B 0.28 0.04 ih

2001_0016 B 0.32 0.05 n IN

2001_0016 B 0.37 0.21 ao

2001_0016 B 0.58 0.08 g

2001_0016 B 0.66 0.08 ax

2001_0016 B 0.74 0.03 s

2001_0016 B 0.77 0.04 t

2001_0016 B 0.81 0.01 sp AUGUST

2001_0016 B 0.82 0.03 w

2001_0016 B 0.85 0.03 eh

2001_0016 B 0.88 0.04 n WHEN

2001_0016 B 0.92 0.05 eh

2001_0016 B 0.97 0.03 v

2001_0016 B 1 0.03 r

2001_0016 B 1.03 0.05 iy

2001_0016 B 1.08 0.06 b

2001_0016 B 1.14 0.05 aa

2001_0016 B 1.19 0.03 d

2001_0016 B 1.22 0.03 iy EVERYBODY
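A minimal sketch of how a file in this layout might be read, written in Python. It assumes whitespace-separated fields and assumes, from the example alone, that the WORD label sits on the last phone row of each word. The names `PhoneRow`, `parse_rows` and `group_into_words` are hypothetical and are not part of any evaluation tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PhoneRow:
    utt_id: str
    channel: str
    start: float          # begin time of the phone, in seconds
    dur: float            # duration of the phone, in seconds
    phone: str            # hypothesized phone ID
    word: Optional[str]   # hypothesized word ID (assumed: only on a word's last phone)


def parse_rows(text: str) -> List[PhoneRow]:
    """Parse whitespace-separated rows, skipping the header and short lines."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "UTT-ID":
            continue
        word = fields[5] if len(fields) > 5 else None
        rows.append(PhoneRow(fields[0], fields[1], float(fields[2]),
                             float(fields[3]), fields[4], word))
    return rows


def group_into_words(rows: List[PhoneRow]) -> List[Tuple[str, List[str], float, float]]:
    """Collect phones until a row carries a word label, then emit the word."""
    words, pending = [], []
    for row in rows:
        pending.append(row)
        if row.word is not None:
            start = pending[0].start
            end = row.start + row.dur
            words.append((row.word, [r.phone for r in pending],
                          start, round(end - start, 2)))
            pending = []
    return words


SAMPLE = """\
UTT-ID CH Start DUR PHN WORD
2001_0016 B 0 0.1 sil !SENT_START
2001_0016 B 0.1 0.06 l
2001_0016 B 0.16 0.05 ay
2001_0016 B 0.21 0.07 k LIKE
2001_0016 B 0.28 0.04 ih
2001_0016 B 0.32 0.05 n IN
"""

words = group_into_words(parse_rows(SAMPLE))
for word, phones, start, dur in words:
    print(word, phones, start, dur)
```

Grouping phones under their word yields word-level start times and durations that can be compared with a word-level alignment.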

Phone Mapping Procedure

• EACH SUBMISSION SITE USED A (QUASI) CUSTOM PHONE SET
– Most of the phone sets are available on the PHONEVAL web site

• THE SITES’ PHONE SETS WERE MAPPED TO A COMMON “REFERENCE” PHONE SET
– The reference phone set is based on the ICSI Switchboard transcription material (STP), but is adapted to match the less granular symbol sets used by the submission sites
– The set of mapping conventions relating to the STP (and reference) sets is also available on the PHONEVAL web site

• THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE SUBMISSION SITE PHONE SETS
– This reverse mapping was done in order to ensure that variants of a phone were given due “credit” in the scoring procedure
– For example, [em] (syllabic nasal) is mapped to [ix] + [m], and the vowel [ix] maps in certain instances to both [ih] and [ax], depending on the specifics of the phone set
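The reverse mapping can be pictured with a small Python sketch. The table holds only the two examples named on the slide ([em] to [ix] + [m], and [ix] to [ih] or [ax]); the table name, the function name and the recursive expansion are assumptions made for illustration, not the actual PHONEVAL mapping files.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical fragment of a reference-to-site mapping table.
# Each reference phone lists the alternative site-level sequences it may map to.
REF_TO_SITE: Dict[str, List[Tuple[str, ...]]] = {
    "em": [("ix", "m")],
    "ix": [("ih",), ("ax",)],
}


def variants(phone: str, table: Dict[str, List[Tuple[str, ...]]]) -> List[Tuple[str, ...]]:
    """Return every site-level sequence a reference phone may be credited as.

    Assumes the table contains no cycles; a phone absent from the table
    maps to itself.
    """
    if phone not in table:
        return [(phone,)]
    results = []
    for alternative in table[phone]:
        # Expand each phone of the alternative, then combine the expansions.
        expanded = [variants(p, table) for p in alternative]
        for combo in product(*expanded):
            results.append(tuple(ph for seq in combo for ph in seq))
    return results


print(variants("em", REF_TO_SITE))
print(variants("k", REF_TO_SITE))
```

Under these assumptions a site that lacks [ix] still receives credit for [em] when it outputs either [ih] + [m] or [ax] + [m].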

Phone Scoring Procedures - 2001

• TWO METHODS WERE USED FOR THE 2001 EVALUATION
– The “UNCOMPENSATED” form is the same as last year’s scoring method. Only common phone ambiguities (such as [ix], [ih], [ah], [ax], etc.) are allowed
– The “TRANSCRIPTION-COMPENSATED” form allows for certain phones commonly confused among human transcribers to be scored as “correct,” even though they would otherwise be scored as “wrong”
– The compensated form of transcription lowers the phone “error” by ca. 10-20%

• TIME-MEDIATED SCORING WAS OF TWO VARIETIES
– A “STRICT” form is identical to that used in last year’s evaluation. There is a severe penalty for deviations from time boundaries for words and phones
– A “LENIENT” form allows for a much looser fit between time markers associated with words and phones. A weighting of 0.15 (relative to the STRICT form) was used (by modifying the penalty algorithm in SC-Lite). The 0.15 weight reduced the number of phone “errors” by ca. 20% without a significant decline in false-positive responses

Visualization of a 3-D Confusion Matrix

• When the matrix is sparsely coded, it is more efficient to view the pattern as if squashed against a brick wall

• The diagonal is plotted in a linear plane

[Figure: proportion concordance (0 to 1) for each phonetic segment: B D G P T K DX JH CH S SH Z ZH F TH V DH M N NX NG L R W Y HH OTH]

Interlabeler Agreement (74%) - 3 Transcribers

• Highest for consonants (especially the stops)

• Lowest for vowels (particularly the lax monophthongs)

Numbers refer to the concordance diagonal in the confusion matrices

[Figure: proportion concordance by phonetic segment; panels: Consonants, Vowels]
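As a rough illustration of what the “concordance diagonal” summarizes, the Python sketch below divides each diagonal cell of a confusion matrix by its row total. It assumes rows hold the reference labels and columns the compared labels; the counts are invented placeholders, not evaluation data, and `concordance_diagonal` is a hypothetical name.

```python
from typing import Dict, List


def concordance_diagonal(confusion: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Proportion of each row's tokens that fall on the diagonal (agreements)."""
    result = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        result[label] = confusion[i][i] / row_total if row_total else 0.0
    return result


# Placeholder counts for three stop consonants (illustration only).
labels = ["b", "d", "g"]
confusion = [
    [8, 1, 1],   # reference b
    [2, 6, 2],   # reference d
    [0, 0, 5],   # reference g
]

diagonal = concordance_diagonal(confusion, labels)
print(diagonal)
```

Plotting only these per-segment proportions gives the flattened, one-bar-per-segment view used on the slides.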

Interlabeler Disagreement Patterns - 2001

• INTERLABELER DISAGREEMENT PATTERNS WERE DERIVED FROM THE 2000 EVALUATION MATERIAL
– Several minutes of material transcribed in common by 3 transcribers were analyzed (2 from the 1996-1997 STP, 1 from the 2001 STP)

• THE FOLLOWING PATTERNS WERE OBSERVED IN THE INTERLABELER DISAGREEMENT ANALYSIS

• Consonants
– Stop and nasal consonants exhibit a small amount of disagreement
– Fricatives exhibit slightly higher amounts of disagreement
– Liquids show a moderate amount of disagreement

• Vowels
– Lax monophthongs exhibit a high amount of disagreement
– Diphthongs show a relatively small amount of disagreement
– Tense, low monophthongs show relatively little disagreement (except for [ao], probably a dialect issue)

• Overall Transcriber Agreement was 70%

Interlabeler Disagreement Patterns - 2001

• FROM SUCH PATTERNS THE FOLLOWING FORMS OF TOLERANCES WERE ALLOWED IN “TRANSCRIPTION COMPENSATED” SCORING:

Segment   UNcompensated     Compensated
[d]       [d]               [d] [dx]
[k]       [k]               [k]
[s]       [s]               [s] [z]
[n]       [n]               [n] [nx] [ng] [en]
[r]       [r]               [r] [axr] [er]
[iy]      [iy]              [iy] [ix] [ih]
[ao]      [ao]              [ao] [aa] [ow]
[ax]      [ax]              [ax] [ah] [aa] [ix]
[ix]      [ix] [ih] [ax]    [ix] [ih] [iy] [ax]
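The tolerance table can be read as a lookup from a transcribed segment to the set of labels accepted as “correct”. The Python sketch below encodes the rows of the table that way and scores a list of already-aligned (reference, hypothesis) pairs. The direction of the lookup, the function names and the sample pairs are assumptions for illustration; insertions, deletions and time mediation are ignored.

```python
from typing import Dict, List, Set, Tuple

# Tolerances copied from the table; a segment missing from a dict accepts only itself.
UNCOMPENSATED: Dict[str, Set[str]] = {
    "ix": {"ix", "ih", "ax"},
}
COMPENSATED: Dict[str, Set[str]] = {
    "d": {"d", "dx"},
    "k": {"k"},
    "s": {"s", "z"},
    "n": {"n", "nx", "ng", "en"},
    "r": {"r", "axr", "er"},
    "iy": {"iy", "ix", "ih"},
    "ao": {"ao", "aa", "ow"},
    "ax": {"ax", "ah", "aa", "ix"},
    "ix": {"ix", "ih", "iy", "ax"},
}


def is_correct(ref: str, hyp: str, tolerances: Dict[str, Set[str]]) -> bool:
    """A hypothesis counts as correct if it lies in the reference's tolerance set."""
    return hyp in tolerances.get(ref, {ref})


def substitution_error_rate(pairs: List[Tuple[str, str]],
                            tolerances: Dict[str, Set[str]]) -> float:
    """Fraction of aligned pairs scored as wrong under the given tolerances."""
    wrong = sum(1 for ref, hyp in pairs if not is_correct(ref, hyp, tolerances))
    return wrong / len(pairs)


# Invented aligned pairs (reference, hypothesis), for illustration only.
pairs = [("d", "dx"), ("s", "z"), ("k", "k"), ("ix", "ih"), ("ao", "iy")]

uncompensated_rate = substitution_error_rate(pairs, UNCOMPENSATED)
compensated_rate = substitution_error_rate(pairs, COMPENSATED)
print(uncompensated_rate, compensated_rate)
```

In this toy case the pairs [d]/[dx] and [s]/[z] move from “wrong” to “correct” once the compensated tolerances are applied, which is the kind of effect the compensated scoring is meant to capture.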

Transcription Compensation Affects Phone Error

• COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS LOWERS THE PHONE “ERROR” APPRECIABLY FOR MOST SITES

[Figure: phone error rate (0 to 0.5) by site (SRI, JHU, BBN, IBM, MSU), transcription uncompensated vs. transcription compensated; STRICT time mediation]

Transcription Compensation Affects Phone Error

• COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS LOWERS THE PHONE “ERROR” APPRECIABLY FOR MOST SITES

[Figure: phone error rate (0 to 0.5) by site (SRI, JHU, BBN, IBM, MSU), transcription uncompensated vs. transcription compensated; LENIENT time mediation]

Generation of Evaluation Data - 1

• EACH SITE’S MATERIAL WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE)

CTM File Format for Word Scoring

SOURCE      UTID         SIDE  START  DUR   WORD          ERTYP
REFERENCE   2001-B-0016  B     0      0.11  ?             N
HYPOTHESIS  2001-B-0016  B     ***    ***   ***           N
R           2001-B-0016  B     0.11   0.18  LIKE          C
H           2001-B-0016  B     0.1    0.18  LIKE          C
R           2001-B-0016  B     0.29   0.08  IN            C
H           2001-B-0016  B     0.28   0.09  IN            C
R           2001-B-0016  B     0.37   0.48  AUGUST        C
H           2001-B-0016  B     0.37   0.45  AUGUST        C
R           2001-B-0016  B     0.85   0.07  WHEN          C
H           2001-B-0016  B     0.82   0.1   WHEN          C
R           2001-B-0016  B     0.92   0.44  EVERYBODY_IS  S
H           2001-B-0016  B     0.92   0.33  EVERYBODY     S
R           2001-B-0016  B     ***    ***   ***           I
H           2001-B-0016  B     1.25   0.1   IS            I
R           2001-B-0016  B     1.36   0.15  ON            C
H           2001-B-0016  B     1.35   0.15  ON            C
…           …            …     …      …     …             …

ERROR KEY

C = CORRECT
I = INSERTION
N = NULL ERROR
S = SUBSTITUTION
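A short Python sketch of how the error-type column might be tallied into a word-error figure. It assumes one error code per reference/hypothesis pair, uses the conventional ratio (substitutions + deletions + insertions) over the number of reference words, treats N (null) entries as not counting toward either total, and assumes a code "D" for deletions, which does not appear in the key above. The function name is hypothetical; this is not SC-Lite's implementation.

```python
from collections import Counter
from typing import Iterable


def word_error_rate(error_codes: Iterable[str]) -> float:
    """Compute (S + D + I) / (C + S + D) from per-pair error codes.

    Counter returns 0 for codes that never occur, so a missing "D" is harmless.
    """
    counts = Counter(error_codes)
    ref_words = counts["C"] + counts["S"] + counts["D"]
    errors = counts["S"] + counts["D"] + counts["I"]
    return errors / ref_words if ref_words else 0.0


# Error codes of the eight reference/hypothesis pairs shown in the example.
codes = ["N", "C", "C", "C", "C", "S", "I", "C"]

wer = word_error_rate(codes)
print(round(wer, 3))
```

For the eight pairs shown this gives one substitution plus one insertion over six reference words.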

Summary of Corpus Acoustic Properties

• LEXICAL PROPERTIES
– Lexical Identity
– Unigram Frequency
– Number of Syllables in Word
– Number of Phones in Word
– Word Duration
– Speaking Rate
– Prosodic Prominence
– Energy Level
– Lexical Compounds
– Non-Words
– Word Position in Utterance

• SYLLABLE PROPERTIES
– Syllable Structure
– Syllable Duration
– Syllable Energy
– Prosodic Prominence
– Prosodic Context

• PHONE PROPERTIES
– Phonetic Identity
– Phone Frequency
– Position within the Word
– Position within the Syllable
– Phone Duration
– Speaking Rate
– Phonetic Context
– Contiguous Phones Correct
– Contiguous Phones Wrong
– Phone Segmentation
– Articulatory Features
– Articulatory Feature Distance
– Phone Confusion Matrices

• OTHER PROPERTIES
– Speaker (Dialect, Gender)
– Utterance Difficulty
– Utterance Energy
– Utterance Duration

Word- and Phone-Centric “Big Lists”

• THE “BIG LISTS” CONTAIN SUMMARY INFORMATION ON 55-65 SEPARATE PARAMETERS ASSOCIATED WITH PHONES, SYLLABLES, WORDS, UTTERANCES AND SPEAKERS SYNCHRONIZED TO EITHER THE WORD (THIS SLIDE) OR THE PHONE

ERR  REFWORD      HYPWORD      UTID         WORDPOS  WORDFREQ  WRDENG  MRATE  SYLRATE  ETC.
N    ?            ***          2001-B-0016  0        -6.02     0.92    5.05   6.56     …
C    LIKE         LIKE         2001-B-0016  0.06     -2.1522   1.04    5.05   6.56     …
C    IN           IN           2001-B-0016  0.11     -1.9295   0.97    5.05   6.56     …
C    AUGUST       AUGUST       2001-B-0016  0.17     -4.6678   1.1     5.05   6.56     …
C    WHEN         WHEN         2001-B-0016  0.22     -2.5432   0.97    5.05   6.56     …
C    EVERYBODY'S  EVERYBODY'S  2001-B-0016  0.28     -4.3253   1.02    5.05   6.56     …
C    ON           ON           2001-B-0016  0.33     -2.3138   0.97    5.05   6.56     …
C    VACATION     VACATION     2001-B-0016  0.39     -3.9967   0.95    5.05   6.56     …
C    OR           OR           2001-B-0016  0.44     -2.3202   0.84    5.05   6.56     …
C    SOMETHING    SOMETHING    2001-B-0016  0.5      -2.7438   0.81    5.05   6.56     …
C    WE           WE           2001-B-0016  0.56     -2.1082   0.88    5.05   6.56     …
C    CAN          CAN          2001-B-0016  0.61     -2.611    0.75    5.05   6.56     …
C    DRESS        DRESS        2001-B-0016  0.67     -4.0399   0.9     5.05   6.56     …
C    A            A            2001-B-0016  0.72     -1.6723   0.85    5.05   6.56     …
C    LITTLE       LITTLE       2001-B-0016  0.78     -2.7814   0.91    5.05   6.56     …
C    MORE         MORE         2001-B-0016  0.83     -2.7027   0.85    5.05   6.56     …
C    CASUAL       CASUAL       2001-B-0016  0.89     -4.6678   0.94    5.05   6.56     …
I    ***          !SILENCE     2001-B-0016  0.94     -6.02     0.6     5.05   6.56     …
N    H#           ***          2005-B-0077  0        -6.02     0.6     4.44   7        …
N    ?            ***          2005-B-0077  0.06     -6.02     0.92    4.44   7        …
C    YEAH         YEAH         2005-B-0077  0.12     -1.9361   0.99    4.44   7        …
C    JUST         JUST         2005-B-0077  0.18     -2.1809   0.94    4.44   7        …
C    BECAUSE      BECAUSE      2005-B-0077  0.24     -2.4782   1.09    4.44   7        …
…    …            …            …            …        …         …       …      …        …

Phoneval-2000 Web Site

RECOGNITION FILES
• Converted Submissions: ATT, BBN, JHU, MSU, SRI, WASH
• Word Level Recognition Errors: ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone Error (Free Recognition): ATT, BBN, JHU, MSU, WASH
• Word Recognition Phone Mapping: ATT, BBN, JHU, MSU, WASH

BIG LISTS
• Word-Centric: ATT, CU, BBN, JHU, MSU, SRI, WASH
• Phone-Centric: ATT, BBN, JHU, MSU, WASH
• Phonetic Confusion Matrices: ATT, BBN, JHU, MSU, WASH

FORCED ALIGNMENT FILES
• Forced Alignment Files: BBN, JHU, MSU, WASH
• Word-Level Alignment Errors: BBN, CU, JHU, MSU, SRI, WASH
• Phone Error (Forced Alignment): CU, BBN, JHU, MSU, SRI, WASH
• Alignment Word-Phone Mapping: BBN, JHU, MSU, WASH

BIG LISTS
• Word-Centric: BBN, CU, JHU, MSU, SRI, WASH
• Phone-Centric: BBN, JHU, MSU, WASH
• Phonetic Confusion Matrices: BBN, JHU, MSU, WASH

• Description of the STP Phone Set
• STP Transcription Material
– Phone-Word Reference
– Syllable-Word Reference
• Phone Mapping for Each Site: ATT, BBN, JHU, MSU, WASH
– STP-to-Reference Map
– STP Phone-to-Articulatory-Feature Map


A Syllable-Centric Perspective

In this presentation we will “drill down” from the lexical to the phonetic tiers by way of the syllable, the phone and articulatory-acoustic features

Words

Articulatory-Acoustic Features

Phonetic segment

Stress-accent

Coarse Word and Phone Recognition

• THE FOLLOWING SLIDES PROVIDE DETAILS ABOUT THE COARSE WORD AND PHONE SCORES FOR THE 2000 AND 2001 EVALUATIONS

• ALTHOUGH THE WORD AND PHONE SCORES ARE ROUGHLY COMPARABLE ACROSS YEARS (FOR ANALOGOUS CONDITIONS), THE 2001 EVALUATION HAS FOUR TIMES THE NUMBER OF SCORING CONDITIONS (FOR PHONES), BASED ON THE “LENIENT” vs. “STRICT” TIME-MEDIATION AND THE COMPENSATED vs. UNCOMPENSATED TRANSCRIPTION SCORING

Word Recognition Error (2000)

[Figure: error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: ATT, BBN, CU, DRAGON, JHU, MSU, SRI, WASH]

• WORD ERROR RATES VARY BETWEEN 27% AND 43%
– Substitutions are the major source of word errors

Prosodic Stress & Word Error Rate (2000)

• The effect of stress is most concentrated among word-deletion errors

[Figure: word error by stress level (Unstressed, Intermediate Stress, Fully Stressed). Data represent averages across all eight ASR systems]

Syllable Structure & Word Error Rate (2000)

• Vowel-initial forms show the greatest error
• Polysyllabic forms exhibit the lowest error

[Figure: word error by syllable structure (C = Consonant, V = Vowel). Data are averaged across all eight sites]

Syllable Structure & Word Error Rate (2000)

• VOWEL-INITIAL forms exhibit the HIGHEST error
• POLYSYLLABLES have the LOWEST error rate


[Figure: word recognition error for the 2001 sites. Error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, IBM, JHU, MSU, SRI. STRICT Time Mediation]

• WORD ERROR RATES VARY BETWEEN 33% AND 49%
– Substitutions are the major source of word errors

[Figure: word recognition error for the 2001 sites. Error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, IBM, JHU, MSU, SRI. LENIENT Time Mediation]

• WORD ERROR RATES VARY BETWEEN 31% AND 44%
– Substitutions are the major source of word errors

Prosodic Stress & Word Error Rate (2001)

• NOT YET

• PROSODIC LABELING OF THIS MATERIAL REQUIRED FIRST

• ANALYSIS SCHEDULED FOR JUNE, 2001

Syllable Structure & Word Error Rate (2001)

• Vowel-initial forms show the greatest error
• Polysyllabic forms exhibit the lowest error, except for CVCV forms (probably due to forms such as “gonna,” etc.)

Data are averaged across all five sites

Syllable Structure & Word Error Rate (2001)

• VOWEL-INITIAL forms exhibit the HIGHEST error
• POLYSYLLABLES have the LOWEST error rate

Are Word and Phone Errors Related? (2000)

• COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE
– The correlation between the two parameters is 0.78

[Figure: phone error and word error rates (0 to 0.6) by submission site: JHU, ATT, CU, SRI, DRAG, MSU, UW, BBN]

Pronunciation Models? The differential error rate is probably related to the use of either pronunciation or language models (or both)
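For readers who want to reproduce this kind of comparison, a minimal Python sketch of a Pearson correlation across sites follows. The slide does not say which correlation statistic was used, so Pearson is an assumption here, and the per-site rates below are invented placeholders rather than the values behind the reported 0.78.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Placeholder per-site error rates (illustration only, not evaluation results).
phone_error = [0.40, 0.45, 0.50, 0.55]
word_error = [0.30, 0.33, 0.41, 0.40]

r = pearson(phone_error, word_error)
print(round(r, 2))
```

With only a handful of sites, a coefficient of this kind is sensitive to any single system, which is one reason to read it as suggestive rather than conclusive.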


[Figure: phone error and word error rates (0 to 0.5) by site (SRI, JHU, BBN, IBM, MSU); transcription uncompensated, STRICT time mediation; annotated “Pronunciation Model?”]

[Figure: phone error and word error rates (0 to 0.5) by site (SRI, JHU, BBN, IBM, MSU); transcription uncompensated, LENIENT time mediation; annotated “Pronunciation Model?”]

Phonetic - Pronunciation Mismatch

• THERE ARE A FAR GREATER NUMBER OF PRONUNCIATIONS IN THE TRANSCRIPTION MATERIALS THAN IN THE ASR LEXICONS

• GIVEN THAT MOST WORDS ARE CORRECTLY RECOGNIZED, THIS RESULT IMPLIES THAT PHONETIC CLASSIFICATION IN ASR SYSTEMS IS, BY NECESSITY, HIGHLY AGRANULAR

• THUS, UNUSUAL PRONUNCIATIONS ARE UNLIKELY TO BE DECODED CORRECTLY

• THE COARSE NATURE OF THE PRONUNCIATION MODELS ALSO MAKES IT DIFFICULT TO FINE-TUNE THE RELATION BETWEEN THE PHONETIC CLASSIFIER AND PRONUNCIATION MODEL COMPONENTS

Pronunciation Variation in ASR Lexicons

• MOST WORDS IN THE ASR LEXICONS HAVE A SINGLE PRONUNCIATION

• EXCEPTIONS ARE HIGHLY FREQUENT WORDS (SUCH AS “THE” AND “AND”), WHICH HAVE 2 OR 3 PRONUNCIATION VARIANTS. NO WORD HAS MORE THAN 5 PRONUNCIATION VARIANTS (AT LEAST NOT IN THE PHONETIC OUTPUT PROVIDED TO ICSI FOR THE EVALUATION)

Pronunciation Variation in Switchboard (2001)

• THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR THE 100 MOST FREQUENT WORDS IN THE PHONETIC EVALUATION MATERIAL

WORD      INSTANCES  #PRON    WORD      INSTANCES  #PRON
I         588        79       THAT'S    84         39
AND       430        76       IF        82         23
THE       408        59       ON        82         23
YOU       317        54       THINK     82         19
A         285        54       WE        82         10
THAT      229        66       OR        77         24
TO        223        47       BE        73         12
KNOW      211        23       NOT       70         15
IN        209        41       WHAT      70         18
IT        208        54       MY        69         10
OF        198        56       I'M       67         18
LIKE      170        25       WELL      61         21
YEAH      165        22       WITH      57         27
HAVE      135        38       ARE       55         20
THEY      128        23       THERE     54         15
IT'S      122        39       MEAN      52         9
BUT       113        22       AT        51         23
DON'T     112        42       PEOPLE    49         20
SO        107        16       THEY'RE   49         19
UH        107        16       THIS      49         15
IS        97         21       UP        49         15
WAS       95         34       AS        48         21
FOR       91         20       GET       48         12
DO        90         26       REALLY    48         18
JUST      88         26       LOT       47         17

Pronunciation Variation in Switchboard (2001)

• THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR THE 100 MOST FREQUENT WORDS IN THE PHONETIC EVALUATION MATERIAL

WORD      INSTANCES  #PRON    WORD       INSTANCES  #PRON
WOULD     47         10       GUESS      30         5
ALL       46         11       THEM       30         10
ONE       46         8        TOO        30         6
TIME      44         11       GOT        29         11
OUT       41         16       I'VE       29         15
HE        40         12       ME         29         4
NO        39         8        OKAY       29         12
ABOUT     38         26       SOME       29         5
RIGHT     38         10       WHO        29         11
THEN      38         10       ANY        28         12
WORK      38         7        THERE'S    28         21
BECAUSE   37         32       WERE       28         9
KIND      37         13       HAS        27         13
WHEN      37         13       MORE       27         9
NOW       36         9        CAN        26         11
YOU'RE    36         18       GONNA      26         19
ACTUALLY  35         20       SOMETHING  26         18
FROM      34         12       PRETTY     25         11
HAD       34         7        YOUR       25         12
GOOD      33         7        COULD      24         7
HE'S      33         9        GO         24         5
WHERE     33         13       SHE        24         6
BEEN      31         11       EVEN       23         18
DID       31         11       OUR        23         10
HERE      31         9        THINGS     23         8
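Counts like INSTANCES and #PRON can be derived from a list of (word, phone sequence) tokens. The Python sketch below shows one way to do so; the function name and the sample tokens are invented for illustration, and it assumes that two tokens share a pronunciation only when their phone sequences are identical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def pronunciation_stats(tokens: Iterable[Tuple[str, List[str]]]) -> Dict[str, Tuple[int, int]]:
    """Map each word to (number of instances, number of distinct pronunciations)."""
    instances: Dict[str, int] = defaultdict(int)
    variants: Dict[str, set] = defaultdict(set)
    for word, phones in tokens:
        instances[word] += 1
        variants[word].add(tuple(phones))  # tuples are hashable, lists are not
    return {word: (instances[word], len(variants[word])) for word in instances}


# Invented tokens (word, transcribed phones), for illustration only.
tokens = [
    ("AND", ["ae", "n", "d"]),
    ("AND", ["ae", "n"]),
    ("AND", ["ix", "n"]),
    ("AND", ["ae", "n"]),
    ("THE", ["dh", "ax"]),
]

stats = pronunciation_stats(tokens)
print(stats)
```

Because every distinct phone string counts as its own variant, the number of pronunciations grows quickly with the granularity of the phone set used for transcription.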

Phone Error and Word Length (2000)

• For CORRECT words, only one phone (on average) is misclassified
– Implication – short words are highly tolerant of phone “errors”

• For INCORRECT words, phone errors increase linearly with word length

Data are averaged across all eight sites

Phone Error and Word Length (2001)

• For CORRECT words, only one phone (on average) is misclassified
– Implication – short words are highly tolerant of phone “errors”

• For INCORRECT words, phone errors increase linearly with word length

Phone Error - Forced Alignment (2000)

[Figure: error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site: BBN, CU, JHU, MSU, SRI, WASH. AT&T and Dragon did not provide a complete set of forced alignments]

• PHONE ERROR RATES VARY BETWEEN 35% AND 49%
– This, despite having the word transcript!!!

Phone Error - Forced Alignment (2001)

[Figures: error rate (0 to 0.5) by error type (TOTAL, SUB, DEL, INS) for each site (BBN, JHU, MSU, SRI), one chart per scoring condition below]

STRICT Time Mediation, Transcription UNcompensated

• PHONE ERROR RATES VARY BETWEEN 40% AND 50%
– Same picture for 2001. Suggests a potential mismatch between lexical and phonetic representations

LENIENT Time Mediation, Transcription UNcompensated

• PHONE ERROR RATES VARY BETWEEN 30% AND 44%
– Still a poor match between phonetic transcripts and lexical representations

STRICT Time Mediation, Transcription Compensated

• PHONE ERROR RATES VARY BETWEEN 32% AND 38%
– Still a lack of concordance with a tolerant scoring method

LENIENT Time Mediation, Transcription Compensated

• PHONE ERROR RATES VARY BETWEEN 23% AND 29%
– With the most tolerant scoring there is still some lack of concordance


Phonetic Confusion Matrix - CVC Syllables

• Onset consonants tend to be highly concordant with transcription
• Coda consonants are slightly less concordant, particularly some fricatives

Forced Alignment. Numbers refer to the concordance diagonal in the confusion matrices

[Figure: proportion concordance by phonetic segment for CVC syllables (two panels, both labeled CVC); segment classes: STOPS, NASALS, FRICATIVES, APPROXIMANTS]

Phonetic Confusions - CCVC, CVCC Syllables

• Certain fricatives are problematic in CVCC coda position
• Redo this figure and others - no wrong words, compare CVC, CVC etc.

[Figure: proportion concordance by phonetic segment; panels: CCVC, CVCC]

Phonetic Confusions - CV and CVC Nuclei

• Diphthongs and tense, low monophthongs tend to be concordant
• Lax monophthongs tend to be less concordant (cf. stress-accent paper)

[Figure: proportion concordance by phonetic segment; panels: CVC, CV]


Phone Error - Unconstrained Recognition (2000)

[Figure: error rate (0 to 0.6) by error type (TOTAL, SUB, DEL, INS) for each site: ATT, BBN, CU, DRAGON, JHU, MSU, SRI, WASH]

• PHONE ERROR RATES VARY BETWEEN 39% AND 55%
– Phone error is only slightly greater than for forced alignments

Phone Error - Unconstrained Recognition (2001)

[Figures: error rate (0 to 0.6) by error type (TOTAL, SUB, DEL, INS) for each site (BBN, IBM, JHU, MSU, SRI), one chart per scoring condition below]

STRICT Time Mediation, Transcription Uncompensated (condition most analogous to the 2000 evaluation)

• PHONE ERROR RATES VARY BETWEEN 44% AND 55%
– Results similar to 2000 evaluation

LENIENT Time Mediation, Transcription Uncompensated

• PHONE ERROR RATES VARY BETWEEN 38% AND 48%
– Relaxing time-mediation brings down the error slightly

STRICT Time Mediation, Transcription Compensated

• PHONE ERROR RATES VARY BETWEEN 25% AND 39%
– Transcription compensation also brings down the error

LENIENT Time Mediation, Transcription Compensated

• PHONE ERROR RATES VARY BETWEEN 27% AND 38%
– Phone errors decline somewhat more with lax scoring

Phonetic Confusion Matrix - CV Onsets

• ARROWS pinpoint problem segments
• AFFRICATES and FRICATIVES are problematic in CV onset position
• [d] is also problematic

Unconstrained Recognition. Numbers refer to the concordance diagonal in the confusion matrices

[Figure: proportion concordance by phonetic segment; panels: Correct Words, Wrong Words]

Phonetic Confusion Matrix - CVC Onsets

• Fricatives and affricates are problematic in CVC onset position

[Figure: proportion concordance by phonetic segment; panels: Correct Words, Wrong Words]

Phonetic Confusion Matrix - CCVC Onsets

• Certain fricatives are particularly problematic in CCVC onset position

[Figure: proportion concordance by phonetic segment; panels: Correct Words, Wrong Words]

Phonetic Confusion Matrix - CVC Codas

• Fricatives are particularly problematic in CVC coda position
• Certain stops are also problematic in CVC coda position

[Figure: proportion concordance by phonetic segment; panels: Correct Words, Wrong Words]

Phonetic Confusion Matrix - CVCC Codas

• Certain fricatives are problematic in CVCC coda position
• [d] is also problematic in CVCC coda position

[Figure: proportion concordance by phonetic segment; panels: Correct Words, Wrong Words]

Phonetic Confusion Matrix - CVC Nuclei

• Certain vowels are a problem in CVC nucleus position
• Note that the level of concordance is much lower for vowels than for consonants (in onset or coda position), even for correct words

[Figure: proportion concordance by phonetic segment; panels: Correct Words, Wrong Words]

Phonetic Confusion Matrix - CV Nuclei

• Diphthongs and low, tense vowels are more concordant with the transcription than the lax monophthongs (cf. stress-accent paper)

[Figure: proportion concordance by phonetic segment; panels: Correct Words, Wrong Words]


Consonantal Onsets and AF Errors (2000)

• Syllable onsets are intolerant of AF errors in CORRECT words
• Place and manner AF errors are particularly high in INCORRECT onsets

Consonantal Onsets and AF Errors (2001)

• Syllable onsets are intolerant of AF errors, particularly place, in CORRECT words
• Place and manner AF errors are particularly high in INCORRECT onsets
• Syllable structure does not have the same effect as in the 2000 analysis

Consonantal Codas and AF Errors (2000)

• Syllable codas exhibit a slightly higher tolerance for error than onsets
• There is a high degree of AF error for wrong words

Consonantal Codas and AF Errors (2001)

• Syllable codas exhibit a slightly higher tolerance for error than onsets
• There is a high degree of AF error for wrong words

Vocalic Nuclei and AF Errors (2000)

• Nuclei exhibit a much higher tolerance for error than onsets & codas
• There are many more errors than among syllabic onsets & codas

Vocalic Nuclei and AF Errors (2001)

• Nuclei exhibit a much higher tolerance for error than onsets & codas, particularly for height and front/back
• There are many more errors than among syllabic onsets & codas


Into the (Near) Future …

• WITH THE ARRIVAL OF THE REMAINING FORCED-ALIGNMENT AND UNCONSTRAINED RECOGNITION DATA
– It will be possible to investigate the relative contributions of the phonetic classification, pronunciation and language models to recognition performance
– In order to do this, it is necessary to obtain unconstrained recognition, forced alignment and phone-confidence material from each site (to the extent possible) [the phone confidence metric is problematic]

• CUSTOMIZED ANALYSES FOR INDIVIDUAL SITES
– SRI has different versions of their system (with & w/o adaptation, etc.)
– AT&T will use phone strings from ICSI transcription material
– Individual diagnostics for each site (are there significant differences for specific parameters?)

• MOST OF THE DATA FOR THE 2001 EVALUATION WILL BE POSTED ON THE PHONEVAL WEB SITE SHORTLY

• WEB-BASED ORACLE DATABASE APPLICATION IS NEAR COMPLETION
– Will enable searches of the Phoneval corpus over the web and graphing of the results (this is the tricky part, given the ugly nature of Oracle Web DB…)

• A PAPER DESCRIBING THE FULL SET OF ANALYSES WILL BE AVAILABLE AT THE END OF JUNE (2001)

Summary and Conclusions

• PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS
– Many different analyses support this conclusion
– Consonants appear to be more important than vowels

• SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR FOR ACCURATE RECOGNITION
– The pattern of errors differs across the syllable (onset, nucleus, coda) and exhibits consistent patterns difficult to discern with other units of analysis

• STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE, PARTICULARLY FOR UNDERSTANDING THE NATURE OF WORD-DELETION ERRORS
– Relation among stress-accent, syllable structure, vocalic identity and length

• THE NATURE OF PRONUNCIATION MODELS AND THEIR RELATION TO LEXICAL REPRESENTATIONS IS A POTENTIALLY KEY FACTOR
– The unit of lexical representation (phones, articulatory features, etc.) is probably of the utmost importance for optimizing ASR performance

• FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS LIKELY TO DEPEND ON DEEP INSIGHT INTO THE NATURE OF SPOKEN LANGUAGE
