Speech Recognition for Under-resourced Languages using Probabilistic Transcriptions
Preethi Jyothi Department of CSE, IIT Bombay
CS344 Guest Lecture
February 7, 2017
-
Introduction
Automatic speech recognition (ASR): Translate spoken words into text
Siri, Cortana
-
ASR isn’t to blame for this…
Image from http://takeitwithagrainofsalt.quora.com/Thanks-Siri?srid=pr0Y&share=1
-
Introduction
Automatic speech recognition (ASR): Translate spoken words into text
Siri, Cortana
Modern ASR systems are dominated by statistical methods pioneered by [Jelinek ’76]
-
Standard ASR Pipeline
Decoding: Given acoustics X, find
argmax_W p(W | X) = argmax_W p(W, X) = argmax_W Σ_Q p(X, Q, W)
where p(X, Q, W) factors into:
• ACOUSTIC MODEL: p(X | Q) ≈ p(X | Q, W)
• PRONUNCIATION MODEL: p(Q | W)
• LANGUAGE MODEL: p(W)
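To make the factorization concrete, here is a minimal toy sketch of the decoding rule; all tables and the acoustic scorer are invented stand-ins for trained models, not the actual pipeline:

```python
# Toy decoder: argmax over W of p(W) * sum_Q p(Q|W) * p(X|Q).
# All probabilities below are invented for illustration.

LM = {("the", "cat"): 0.6, ("the", "cap"): 0.4}        # language model p(W)
PRON = {                                               # pronunciation model p(Q|W)
    ("the", "cat"): {("dh-ah", "k-ae-t"): 1.0},
    ("the", "cap"): {("dh-ah", "k-ae-p"): 1.0},
}

def acoustic(x, q):
    """Stand-in for the acoustic model p(X|Q)."""
    return 0.9 if q[-1].endswith("t") else 0.2         # toy scorer

def decode(x):
    scores = {
        w: p_w * sum(p_q * acoustic(x, q) for q, p_q in PRON[w].items())
        for w, p_w in LM.items()
    }
    return max(scores, key=scores.get)

print(decode(x="<acoustic features>"))                 # -> ('the', 'cat')
```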
-
ASR over the years
[Figure: NIST STT Benchmark Test History (May 2009). Word error rate (%, log scale from 1 to 100) vs. year, 1988–2011, across tasks ranging from read speech (1k/5k/20k vocabularies, noisy and varied-microphone conditions) through air-travel planning, broadcast news (English, Arabic, Mandarin), conversational telephone speech (Switchboard, Switchboard II, CTS Fisher, CTS Arabic/Mandarin), and meeting speech; the range of human transcription error is marked for reference.]
Image reproduced from He & Deng, IEEE Signal Proc. Magazine, 2011
• Great progress in ASR performance
• Aided by algorithmic and computational advances
• Recently: Baidu’s Deep Speech 2 comparable to human performance
• Trained on about 10,000 hours of labeled speech
• But limited language diversity
-
Languages with ASR
E.g., Google Voice Search supports < 80 out of 7000 languages
[Chart: fraction of the world's languages supported by Google Voice Search (0–10%), broken down by region: Africa, Americas, Asia, Europe, Pacific.]
https://googleblog.blogspot.com/2012/08/voice-search-arrives-in-13-new-languages.html
-
ASR for all languages
• ASR systems in all languages?
• Speech is the primary means of human communication
• Develop natural interfaces for both literate & illiterate users
• Contribute to preservation of endangered languages
-
Lack of Transcribed Corpora
• Major challenge: Building ASR systems is very data-hungry
• Requires large amounts of labeled speech data: speech audio with matching transcriptions
• Transcription by native speakers is a laborious and expensive process
• Crowdsourcing might help alleviate the problem
• However, there is a significant mismatch between the native languages of crowd workers and the native-language populations of the world
-
Native Language Mismatch
• Very few (often zero) crowd workers speak minority languages
• Distributional mismatch between the language backgrounds of crowd workers and the language expertise required to complete transcription tasks
-
Mismatched Crowdsourcing
• A major bottleneck for ASR in new languages: Labeled speech
• Transcribers need to be native speakers
Use Non-native Speakers?
How can it possibly work?!¹
Mismatched Crowdsourcing
¹ [Jyothi & Hasegawa-Johnson, AAAI-15 & Interspeech-15]
-
Mismatched Crowdsourcing
• How can it possibly work?! [Best ’94, Flege ’95, etc.]
• We are typically bad at perceiving speech in foreign languages!
• Unfamiliar sounds, no vocabulary, no language model to go by, perception distorted by our native languages, …
-
Mismatched Crowdsourcing
• How can it possibly work?! [Best ’94, Flege ’95, etc.]
• We are typically bad at perceiving speech in foreign languages!
• Unfamiliar sounds, no vocabulary, no language model to go by, perception distorted by our native languages, …
• An extremely noisy channel
-
Mismatched Crowdsourcing
• An extremely noisy channel
मेरा नाम राम है मैं यहां पहली बार … ("My name is Ram; I am here for the first time …")
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
A Probabilistic Finite State model
Trained using the Expectation-Maximization (EM) algorithm
x = [tʃ] [iː] [k]
[Diagram: a probabilistic finite-state transducer from phones to letters, with arcs such as [tʃ]:c/0.7, [iː]:e/0.6, [k]:k/0.6, [k]:g/0.6, ε:h/1, ε:e/1]
p("cheek" | x) = 0.25
p("chieg" | x) = 0.11
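A minimal sketch of how such a transducer scores p(transcript | phones), summing over monotone alignments by dynamic programming. In the real model the arc probabilities are learned with EM; the values below (including the insertion inventory and step weights) are invented, so they will not reproduce the 0.25 and 0.11 above:

```python
# Memoryless phone-to-letter channel scored by a forward pass
# over the edit lattice (substitutions, insertions, deletions).

SUB = {("tʃ", "c"): 0.7, ("iː", "e"): 0.6, ("k", "k"): 0.6, ("k", "g"): 0.3}
INS = {"h": 0.5, "e": 0.5}   # p(letter | ε), hypothetical
P_INS, P_DEL = 0.2, 0.05     # step weights, hypothetical

def channel_prob(phones, letters):
    """p(letters | phones) summed over all monotone alignments."""
    n, m = len(phones), len(letters)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if f[i][j] == 0.0:
                continue
            if i < n and j < m:   # substitution arc: phone -> letter
                f[i+1][j+1] += f[i][j] * (1 - P_INS - P_DEL) * SUB.get((phones[i], letters[j]), 0.0)
            if j < m:             # insertion arc: ε -> letter
                f[i][j+1] += f[i][j] * P_INS * INS.get(letters[j], 0.0)
            if i < n:             # deletion arc: phone -> ε
                f[i+1][j] += f[i][j] * P_DEL
    return f[n][m]

x = ["tʃ", "iː", "k"]
print(channel_prob(x, list("cheek")), channel_prob(x, list("chieg")))
```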
-
Visualizing the Channel
[Figure: heatmap of the conditional probability (0.0–1.0) of English letters given Hindi phones (IPA: m, s, n, l, a, k, o, p, t, r, i, iː, e, h, j, æ).]
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
• But we also need an error-correcting code
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
But how can we encode speech?
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
• Use a repetition code!
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
• Use a repetition code!
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: the speaker's utterance X passes through k independent invocations of the CHANNEL (k different transcribers), producing transcripts Y(1), Y(2), …, Y(k), which are jointly decoded into X̃.]
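A sketch of the decoding rule the repetition code enables: with k independent channel uses, pick the source word x maximizing p(x) · Π_k p(y⁽ᵏ⁾ | x). The vocabulary (romanized without diacritics for simplicity), its prior, and the difflib-based stand-in for the trained channel model are all assumptions:

```python
import difflib
import math

# Repetition-code decoding sketch: combine k independent transcripts
# by maximizing log p(x) + sum_k log p(y_k | x) over a toy vocabulary.

VOCAB = {"pura": 0.5, "pada": 0.3, "prati": 0.2}   # p(x), hypothetical prior

def channel(y, x):
    """Toy stand-in for the trained channel model p(y | x)."""
    return max(difflib.SequenceMatcher(None, y, x).ratio(), 1e-6)

def decode(transcripts):
    def score(x):
        return math.log(VOCAB[x]) + sum(math.log(channel(y, x)) for y in transcripts)
    return max(VOCAB, key=score)

print(decode(["puda", "pooda", "purap", "poduck"]))  # agreement pulls toward "pura"
```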
-
Example
Crowd transcripts for one spoken word: puda, waatap, fotup, puddop, pooda, puda, purap, poduck, purap, foodap
Cumulative decoding as more transcripts are added: paḍegā, paḍegā, pratī, paḍā, then pūrā consistently thereafter
-
Labeling Error Rates
• Impressive accuracy (~5% error) on a medium-vocabulary isolated-word task!
[Chart: labeling error rate (%, 0–60) vs. number of repetitions (1, 2, 4, 8, 10, 12, 15), for 1-best and 2-best decoding; a conventional ASR error rate of 26.5% is marked for reference.]
[Jyothi & Hasegawa-Johnson AAAI-15]
-
Information-theoretic Analysis
• Conditional entropy, H(X | Y), of the spoken words (Hindi), X, given the crowd transcripts Y, captures the amount of information lost in transmission
• H(X | Y) can be naively upper-bounded using corpus cross-entropy
• However, errors in our channel model accumulate with increase in the number of repetitions, resulting in this upper-bound becoming less tight
-
Information-theoretic Analysis
Tighter bound on H(X | Y) using an auxiliary random variable Z ∈ {0, 1}
Consider W = ε when Z = 0 and W = X when Z = 1; we set Z = 1 when the model probability q(x|y) is sufficiently low
W represents side-channel information indicating when the model needs to be corrected
H(X | Y) ≤ p₀ · H(X | Y, Z=0) + H(Z) + (1 − p₀) · log |𝒳|
where p₀ = p(Z = 0) and 𝒳 is the input alphabet
H(X | Y, Z=0) is then upper-bounded using corpus cross-entropy
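The bound follows from the chain rule for entropy; a sketch of the derivation:

```latex
H(X \mid Y) \le H(X, Z \mid Y) = H(Z \mid Y) + H(X \mid Y, Z)
            \le H(Z) + p_0 \, H(X \mid Y, Z{=}0) + (1 - p_0) \, H(X \mid Y, Z{=}1),
\quad\text{and}\quad H(X \mid Y, Z{=}1) \le \log |\mathcal{X}|.
```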
-
Conditional Entropy Estimates
[Chart: conditional entropy (CE) estimates in bits (0–5) vs. number of Turkers (1, 2, 4, 8, 10, 12, 15), comparing Naive-CE and Tighter-CE.]
Our upper-bound estimates for CE are clearly tighter than the naive cross-entropy upper-bound estimate
-
Mismatched Crowdsourcing for Continuous Speech
February 7, 2017
-
Continuous Speech
[Diagram: X → k independent invocations of the CHANNEL → Y(1), …, Y(k) → DECODER → X̃]
Exact maximum-likelihood decoding of multiple strings is intractable for long utterances
Instead: decode each transcript individually using maximum-likelihood decoding and pick the output with the best score
Word error rate: 77%
-
Continuous Speech
[Diagram: as above]
Decode each transcript individually using maximum-likelihood decoding and pick the output with the best score → word error rate: 77%
Can we do better? An outlier with a good score shouldn't be chosen over what many similar-looking transcripts predict
-
Continuous Speech
[Diagram: as above]
Can we do better? An outlier with a good score shouldn't be chosen over what many similar-looking transcripts predict
Data Filtering: Use an edit-distance based similarity metric to discover a "cluster" of transcripts to retain
-
Data Filtering
[Diagram: five transcripts as nodes of a graph, with pairwise edit distances on the edges; the tightly connected cluster is retained]
1. a v i e m e k e
2. a b i a n - k e
3. a v e a m e g i
4. a v e a n - k a
5. a b i a n - k a
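A minimal sketch of this filtering step on the five example transcripts above (gaps written as '-' are kept as characters here; the medoid rule and the distance threshold are assumptions, not necessarily the paper's exact criterion):

```python
# Edit-distance based filtering: keep transcripts close to the medoid.

def edit_distance(a, b):
    """Levenshtein distance via a rolling-array dynamic program."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

transcripts = ["aviemeke", "abian-ke", "aveamegi", "avean-ka", "abian-ka"]

# Medoid: the transcript with the smallest total distance to the others.
totals = [sum(edit_distance(t, u) for u in transcripts) for t in transcripts]
medoid = transcripts[totals.index(min(totals))]

THRESHOLD = 3  # hypothetical cutoff
cluster = [t for t in transcripts if edit_distance(t, medoid) <= THRESHOLD]
print(medoid, cluster)
```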
-
Continuous Speech
[Diagram: as above]
Data Filtering: Use an edit-distance based similarity metric to discover a "cluster" of transcripts to retain
Word error rate: 68%
-
Continuous Speech
[Diagram: as above]
Data Filtering (word error rate: 68%)
Can we do even better?
-
Channel Merger
[Diagram: X → k independent invocations of the CHANNEL → Y(1), …, Y(k) → Merged Channel Decoder → X̃]
Data Filtering: discover "typical" transcripts
Alignment: exact joint alignment is NP-hard! Approximation via an incremental alignment algorithm
-
Alignment
Aligned transcripts ('-' marks a gap):
k e e - a - j a - g a -
- g i y a - j a y g a -
k e e - a - j a y g a h
- - c h a i j e - g a -
Spoken phrase: kiyā jāyegā
Raw transcripts: "k e e a j a g a", "g i y a j a y g a", "k e e a j a y g a h", "c h a i j e g a"
-
Alignment
Aligned transcripts after recoding frequent letter sequences as single symbols (e.g., E = "ee", Y = "ay"/"ai", C = "ch"):
k E - a j a g a -
g i y a j Y g a -
k E - a j Y g a h
C - - Y j e g a -
Spoken phrase: kiyā jāyegā
Raw transcripts: "k e e a j a g a", "g i y a j a y g a", "k e e a j a y g a h", "c h a i j e g a"
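Exact joint alignment of all k transcripts is NP-hard in k, so an incremental heuristic folds them in one at a time. A sketch of the pairwise building block, standard Needleman–Wunsch global alignment (the match/mismatch/gap scores here are arbitrary choices, not the paper's):

```python
# Global alignment of two strings, inserting '-' gaps.

GAP = -1

def align(a, b, match=1, mismatch=-1):
    """Needleman–Wunsch alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + GAP,
                              score[i][j - 1] + GAP)
    out_a, out_b, i, j = [], [], n, m        # traceback from the corner
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

print(align("keeajaga", "giyajayga"))
```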
-
Channel Merger
[Diagram: as above, with a Merge step before the Merged Channel Decoder]
Data Filtering: discover "typical" transcripts
Alignment: NP-hard! Approximation via an incremental alignment algorithm
Merge: merge into one probabilistic transcript
-
Merge Transcripts
k E - a j a g a -
g i y a j Y g a -
k E - a j Y g a h
C - - Y j e g a -
Spoken phrase: kiyā jāyegā
Merged (per-column symbol distributions): … a/0.75, Y/0.25 → j/1 → Y/0.5, e/0.25, a/0.25 → g/1 → …
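A sketch of the merge itself, using the four aligned rows above: each column becomes a distribution over symbols (gap '-' included):

```python
from collections import Counter

# Collapse aligned transcripts column-by-column into a
# probabilistic transcript.

aligned = [
    "kE-ajaga-",
    "giyajYga-",
    "kE-ajYgah",
    "C--Yjega-",
]

def merge(rows):
    """Per-column symbol distributions, including the gap symbol '-'."""
    pt = []
    for column in zip(*rows):
        counts = Counter(column)
        total = sum(counts.values())
        pt.append({sym: c / total for sym, c in counts.items()})
    return pt

for dist in merge(aligned):
    print(dist)
# e.g. the 4th column gives {'a': 0.75, 'Y': 0.25}, matching the slide.
```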
-
Channel Merger
[Diagram: as above]
Data Filtering: discover "typical" transcripts
Alignment: NP-hard! Approximation via an incremental alignment algorithm
Merge: merge into one probabilistic transcript, giving a model for the merged channel
-
Channel Merger
[Diagram: as above]
Data Filtering → Alignment → Merge, as before
Shortlist & Decode: list decoding, followed by exact decoding from the shortlist
-
Probabilistic Transcriptions
Tacapo piza
strucka po zapecham
trakapo trabiza
Straka pose ta peesome
straka po ta pisha
strah kah poh chah peesh um
chaka-pu shapisha
stakkappoo sabeesham
takapo chapiser
Strike a pose some pizza
¹ [Jyothi & Hasegawa-Johnson, Interspeech-15]
-
Probabilistic Transcriptions
Ten transcripts of the same utterance (as on the previous slide), e.g., "Straka pose ta peesome", "straka po ta pisha", "Strike a pose some pizza", …
Probabilistic phone-based transcriptions derived from alignments¹:
s t r a - k a p o z t a - p E - s o m
s t r a - k a p o - t a - p i - S a -
s t r a h k a p o - C a h p E - S u m
s t r Y - k a p o z s a m p E t s a -
Resulting probabilistic transcript (slashes separate alternatives): s tʰ r ɑ/ɑi kʰ ɑ pʰ ɔ tʃʰ/t/s ɑ pʰ i/iː ʃ/s ɑ
¹ [Jyothi & Hasegawa-Johnson, Interspeech-15]
-
Transcription Error Rates
[Chart: phone error rates (%, 0–80) for Individual, Merged, and Decoding-from-Shortlist schemes, with and without data filtering.]
[Jyothi & Hasegawa-Johnson Interspeech-15]
44% drop!
-
Adapting ASR Systems using Mismatched Transcriptions
February 7, 2017
-
Next Step?
• Respectable accuracy from mismatched transcriptions
• But can this be leveraged for building ASR systems?
• Plan: a baseline ASR system trained on other languages is adapted using mismatched transcriptions
• The baseline could use data-hungry technology like Deep Neural Networks (DNNs)
• Project at the 2015 Jelinek Summer Workshop [JSALT ’15]
• Several languages considered: Hungarian, Mandarin, Swahili, etc.
-
More than meets the eye
• Measuring additional information in probabilistic transcripts: how error rates fall when more "advice" is made available to the decoder
[Chart: phone error rates (%, 30–75) vs. advice per phone in bits (0–1) for Hungarian, Mandarin, and Swahili.]
• Mismatched transcripts are too noisy to be used in the traditional way for ASR training
• Use them as probabilistic transcripts instead
[Probabilistic transcript fragment: … a/0.75, Y/0.25 → j/1 → Y/0.5, e/0.25 → g/1 → …]
-
ASR Systems for Comparison
• Multilingual: Train on six languages (Arabic, Cantonese, Dutch, Hungarian, Mandarin, Urdu) and test on a new target language (e.g., Swahili).
• Semi-supervised DNN: Transcribe unlabeled audio from the target language using a DNN-based multilingual ASR system and use these automatic transcripts to re-train the DNN models (a high-level sketch follows).
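A high-level sketch of the semi-supervised baseline's data flow; train_dnn and decode are hypothetical placeholders standing in for real toolkit calls, so only the data flow is meant to be accurate:

```python
# Semi-supervised DNN baseline: self-label target audio, then re-train.

def train_dnn(labeled_data):
    """Placeholder: train a DNN acoustic model on (audio, transcript) pairs."""
    return {"trained_on": len(labeled_data)}

def decode(model, utterance):
    """Placeholder: decode one utterance with the current model."""
    return "<hypothesized transcript>"

def semi_supervised_baseline(multilingual_corpus, target_unlabeled_audio):
    # 1. Multilingual DNN trained on the six source languages.
    model = train_dnn(multilingual_corpus)
    # 2. Self-label the unlabeled target-language audio.
    pseudo = [(utt, decode(model, utt)) for utt in target_unlabeled_audio]
    # 3. Re-train with the self-labeled target data added.
    return train_dnn(multilingual_corpus + pseudo)
```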
-
Mismatched Transcriptions for ASR
[Liu*, Jyothi*, et al., ICASSP ’16]
25% drop!
[Chart: phone error rates (%, 0–80) for Swahili, Mandarin, and Hungarian, comparing the Multilingual baseline, the Semi-supervised DNN, and + Adaptation with probabilistic transcripts.]
-
Native Language Backgrounds of Mismatched Crowds
February 7, 2017
-
Channel Selection
• Can we do better by selecting the language background of the transcribers?
• How should we select the transcribers?
• Understanding when phones get misperceived
• How is this correlated with transcribers' language backgrounds?
-
Understanding the Mismatched Channel
• When are two phones confused with each other?
• If they are “phonologically close” to each other
• Phonological distance between two phones
• Use distinctive features (DF) from linguistic theory [Chomsky & Halle, ’68] [Phoible ’15]
37 DFs, e.g., nasal, tone, sonorant, labial, trill, front, back, …
• Phones as "code words" in the DF-space; Hamming distance measures phone contrast (a toy sketch follows).
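A toy sketch of phones as binary code words over DFs; the feature inventory and values below are illustrative only (a real system would use all 37 DFs, e.g., from Phoible):

```python
# Phones as binary vectors over distinctive features (DFs);
# Hamming distance counts the features on which two phones differ.

DFS = ["sonorant", "nasal", "labial", "front", "back", "trill"]

PHONES = {
    "m": [1, 1, 1, 0, 0, 0],
    "n": [1, 1, 0, 0, 0, 0],
    "p": [0, 0, 1, 0, 0, 0],
    "i": [1, 0, 0, 1, 0, 0],
    "u": [1, 0, 0, 0, 1, 0],
}

def hamming(p, q):
    """Number of DFs on which two phones differ."""
    return sum(a != b for a, b in zip(PHONES[p], PHONES[q]))

print(hamming("m", "n"))  # 1: differ only in [labial]
print(hamming("m", "i"))  # 3: a much more contrastive pair
```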
-
Distance Distribution of the Code
[Varshney, Jyothi & Hasegawa-Johnson ITA-16]
-
Distance Distribution of the Code
• Codes for different languages exhibit similar distributions
-
Understanding the Mismatched Channel
• Hypothesis: Phonological distance in the DF-space is correlated with phone confusion in the mismatched channel (phones → letters)
• Phone confusion quantified using the total variation distance between the channel's output distributions for a pair of phones
• We call this phone-pair distinction (a sketch follows)
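A sketch of the phone-pair distinction measure: total variation distance between the channel's letter distributions for two phones. The two distributions below are invented for illustration:

```python
# Phone-pair distinction = total variation distance between the
# channel's output (letter) distributions for two phones.

def tv_distance(p, q):
    """Total variation distance between two distributions over letters."""
    letters = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in letters)

p_given_i  = {"i": 0.5, "e": 0.3, "ee": 0.2}   # p(letter | phone [i]), toy
p_given_ii = {"ee": 0.5, "i": 0.3, "e": 0.2}   # p(letter | phone [iː]), toy
print(tv_distance(p_given_i, p_given_ii))      # low value: easily confused pair
```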
-
Phone Pair Distinction vs. DF-Distance
[Figure: 2D histogram of Hamming distance between phone pairs (0–30) vs. phone-pair distinction by English speakers (0.00–0.75); color indicates frequency.]
-
Phone Pair Distinction
[Figures: 2D histograms of Hamming distance between phone pairs (0–30) vs. phone-pair distinction by Mandarin speakers (0.00–1.00) and by English speakers (0.00–0.75); and a plot of phone-pair distinction by Mandarin speakers vs. by English speakers, with frequency on a log scale.]
-
Phone Pair Distinction
[Figures: as above, for Mandarin and English speakers.]
• DF-distance is clearly positively correlated with phone-pair distinction
• Differences across native language backgrounds
• Different DFs are prominent in different languages
-
Phone Pair Distinction
[Figures: as above, for Mandarin and English speakers.]
• DF-distance is clearly positively correlated with phone-pair distinction
• Differences across native language backgrounds
• Different DFs are prominent in different languages
• Ongoing work: a model that takes DF presence/prominence into account
-
Summary
• ASR for low-resource languages presents challenging research problems
• In this talk:
• Establish the possibility of acquiring speech transcriptions using mismatched crowds
• Demonstrate the impact of mismatched transcriptions on ASR performance
• Investigate the relation of transcribers' native languages to phone confusion
• Future research: Optimally select mismatched transcribers to further improve the impact of mismatched transcriptions
-
Summary¹
• ASR for low-resource languages presents challenging research problems
• In this talk:
• Establish the possibility of acquiring speech transcriptions using mismatched crowds
• Demonstrate the impact of mismatched transcriptions on ASR performance
• Investigate the relation of transcribers' native languages to phone confusion
¹ Based on joint work with Mark Hasegawa-Johnson, Lav Varshney, and participants at the 2015 Jelinek Summer Workshop.