Speech Recognition for Under-resourced Languages using Probabilistic Transcriptions
Preethi Jyothi Department of CSE, IIT Bombay
CS344 Guest Lecture
February 7, 2017
-
Introduction
Automatic speech recognition (ASR): Translate spoken words into text
Siri, Cortana
-
ASR isn’t to blame for this…
Image from http://takeitwithagrainofsalt.quora.com/Thanks-Siri?srid=pr0Y&share=1
-
Introduction
Automatic speech recognition (ASR): Translate spoken words into text
Siri, Cortana
Modern ASR systems are dominated by statistical methods pioneered by [Jelinek ’76]
-
Standard ASR Pipeline
Decoding: Given acoustics X, find
argmax_W p(W | X) = argmax_W p(W, X) = argmax_W Σ_Q p(X, Q, W)
where p(X, Q, W) factors into:
• ACOUSTIC MODEL: p(X | Q) ≈ p(X | Q, W)
• PRONUNCIATION MODEL: p(Q | W)
• LANGUAGE MODEL: p(W)
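To make the factorization concrete, here is a minimal toy sketch of the decoding rule; all tables and the acoustic scorer are invented stand-ins for trained models, not the actual pipeline:

```python
# Toy decoder: argmax over W of p(W) * sum_Q p(Q|W) * p(X|Q).
# All probabilities below are invented for illustration.

LM = {("the", "cat"): 0.6, ("the", "cap"): 0.4}        # language model p(W)
PRON = {                                               # pronunciation model p(Q|W)
    ("the", "cat"): {("dh-ah", "k-ae-t"): 1.0},
    ("the", "cap"): {("dh-ah", "k-ae-p"): 1.0},
}

def acoustic(x, q):
    """Stand-in for the acoustic model p(X|Q)."""
    return 0.9 if q[-1].endswith("t") else 0.2         # toy scorer

def decode(x):
    scores = {
        w: p_w * sum(p_q * acoustic(x, q) for q, p_q in PRON[w].items())
        for w, p_w in LM.items()
    }
    return max(scores, key=scores.get)

print(decode(x="<acoustic features>"))                 # -> ('the', 'cat')
```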
-
ASR over the years
[Figure: NIST STT Benchmark Test History (May 2009). Word error rate (%, log scale from 1 to 100) vs. year, 1988–2011, across tasks ranging from read speech (1k/5k/20k vocabularies, noisy and varied-microphone conditions) through air-travel planning, broadcast news (English, Arabic, Mandarin), conversational telephone speech (Switchboard, Switchboard II, CTS Fisher, CTS Arabic/Mandarin), and meeting speech; the range of human transcription error is marked for reference.]
Image reproduced from He & Deng, IEEE Signal Proc. Magazine, 2011
• Great progress in ASR performance
• Aided by algorithmic and computational advances
• Recently: Baidu’s Deep Speech 2 comparable to human performance
• Trained on about 10,000 hours of labeled speech
• But limited language diversity
-
Languages with ASR
E.g., Google Voice Search supports < 80 out of 7000 languages
[Chart: fraction of the world's languages supported by Google Voice Search (0–10%), broken down by region: Africa, Americas, Asia, Europe, Pacific.]
https://googleblog.blogspot.com/2012/08/voice-search-arrives-in-13-new-languages.html
-
ASR for all languages
• ASR systems in all languages?
• Speech is the primary means of human communication
• Develop natural interfaces for both literate & illiterate users
• Contribute to preservation of endangered languages
-
Lack of Transcribed Corpora
• Major challenge: Building ASR systems is very data-hungry
• Requires large amounts of labeled speech data: speech audio with matching transcriptions
• Transcription by native speakers is a laborious and expensive process
• Crowdsourcing might help alleviate the problem
• However, there is a significant mismatch between the native languages of crowd workers and the native-language populations of the world
-
Native Language Mismatch
• Very few (often zero) crowd workers speak minority languages
• Distributional mismatch between the language backgrounds of crowd workers and the language expertise required to complete transcription tasks
-
Mismatched Crowdsourcing
• A major bottleneck for ASR in new languages: Labeled speech
• Transcribers need to be native speakers
Use Non-native Speakers?
How can it possibly work?!¹
Mismatched Crowdsourcing
¹ [Jyothi & Hasegawa-Johnson, AAAI-15 & Interspeech-15]
-
Mismatched Crowdsourcing
• How can it possibly work?! [Best ’94, Flege ’95, etc.]
• We are typically bad at perceiving speech in foreign languages!
• Unfamiliar sounds, no vocabulary, no language model to go by, perception distorted by our native languages, …
-
Mismatched Crowdsourcing
• How can it possibly work?! [Best ’94, Flege ’95, etc.]
• We are typically bad at perceiving speech in foreign languages!
• Unfamiliar sounds, no vocabulary, no language model to go by, perception distorted by our native languages, …
• An extremely noisy channel
-
Mismatched Crowdsourcing
• An extremely noisy channel
मेरा नाम राम है मैं यहां पहली बार … ("My name is Ram; I am here for the first time …")
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
A Probabilistic Finite State model
Trained using the Expectation-Maximization (EM) algorithm
x = [tʃ] [iː] [k]
[Diagram: a probabilistic finite-state transducer from phones to letters, with arcs such as [tʃ]:c/0.7, [iː]:e/0.6, [k]:k/0.6, [k]:g/0.6, ε:h/1, ε:e/1]
p("cheek" | x) = 0.25
p("chieg" | x) = 0.11
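A minimal sketch of how such a transducer scores p(transcript | phones), summing over monotone alignments by dynamic programming. In the real model the arc probabilities are learned with EM; the values below (including the insertion inventory and step weights) are invented, so they will not reproduce the 0.25 and 0.11 above:

```python
# Memoryless phone-to-letter channel scored by a forward pass
# over the edit lattice (substitutions, insertions, deletions).

SUB = {("tʃ", "c"): 0.7, ("iː", "e"): 0.6, ("k", "k"): 0.6, ("k", "g"): 0.3}
INS = {"h": 0.5, "e": 0.5}   # p(letter | ε), hypothetical
P_INS, P_DEL = 0.2, 0.05     # step weights, hypothetical

def channel_prob(phones, letters):
    """p(letters | phones) summed over all monotone alignments."""
    n, m = len(phones), len(letters)
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if f[i][j] == 0.0:
                continue
            if i < n and j < m:   # substitution arc: phone -> letter
                f[i+1][j+1] += f[i][j] * (1 - P_INS - P_DEL) * SUB.get((phones[i], letters[j]), 0.0)
            if j < m:             # insertion arc: ε -> letter
                f[i][j+1] += f[i][j] * P_INS * INS.get(letters[j], 0.0)
            if i < n:             # deletion arc: phone -> ε
                f[i+1][j] += f[i][j] * P_DEL
    return f[n][m]

x = ["tʃ", "iː", "k"]
print(channel_prob(x, list("cheek")), channel_prob(x, list("chieg")))
```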
-
Visualizing the Channel
[Figure: heatmap of the conditional probability (0.0–1.0) of English letters given Hindi phones (IPA: m, s, n, l, a, k, o, p, t, r, i, iː, e, h, j, æ).]
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
• But we also need an error-correcting code
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
But how can we encode speech?
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
• Use a repetition code!
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: X → ENCODER (Speaker) → CHANNEL (Transcriber) → Y → DECODER → X̃]
-
Solution: Error Correction
• Learn channel characteristics of the foreign listener
• Use a repetition code!
मेरा नाम राम है मैं यहां पहली बार …
→ transcribed as: "mara number mail bar.."
[Diagram: the speaker's utterance X passes through k independent invocations of the CHANNEL (k different transcribers), producing transcripts Y(1), Y(2), …, Y(k), which are jointly decoded into X̃.]
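A sketch of the decoding rule the repetition code enables: with k independent channel uses, pick the source word x maximizing p(x) · Π_k p(y⁽ᵏ⁾ | x). The vocabulary (romanized without diacritics for simplicity), its prior, and the difflib-based stand-in for the trained channel model are all assumptions:

```python
import difflib
import math

# Repetition-code decoding sketch: combine k independent transcripts
# by maximizing log p(x) + sum_k log p(y_k | x) over a toy vocabulary.

VOCAB = {"pura": 0.5, "pada": 0.3, "prati": 0.2}   # p(x), hypothetical prior

def channel(y, x):
    """Toy stand-in for the trained channel model p(y | x)."""
    return max(difflib.SequenceMatcher(None, y, x).ratio(), 1e-6)

def decode(transcripts):
    def score(x):
        return math.log(VOCAB[x]) + sum(math.log(channel(y, x)) for y in transcripts)
    return max(VOCAB, key=score)

print(decode(["puda", "pooda", "purap", "poduck"]))  # agreement pulls toward "pura"
```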
-
Example
Crowd transcripts for one spoken word: puda, waatap, fotup, puddop, pooda, puda, purap, poduck, purap, foodap
Cumulative decoding as more transcripts are added: paḍegā, paḍegā, pratī, paḍā, then pūrā consistently thereafter
-
Labeling Error Rates
• Impressive accuracy (~5% error) on a medium-vocabulary isolated-word task!
[Chart: labeling error rate (%, 0–60) vs. number of repetitions (1, 2, 4, 8, 10, 12, 15), for 1-best and 2-best decoding; a conventional ASR error rate of 26.5% is marked for reference.]
[Jyothi & Hasegawa-Johnson AAAI-15]
-
Information-theoretic Analysis
• Conditional entropy, H(X | Y), of the spoken words (Hindi), X, given the crowd transcripts Y, captures the amount of information lost in transmission
• H(X | Y) can be naively upper-bounded using corpus cross-entropy
• However, errors in our channel model accumulate with increase in the number of repetitions, resulting in this upper-bound becoming less tight
-
Information-theoretic Analysis
Tighter bound on H(X | Y) using an auxiliary random variable Z ∈ {0, 1}
Consider W = ε when Z = 0 and W = X when Z = 1; we set Z = 1 when the model probability q(x|y) is sufficiently low
W represents side-channel information indicating when the model needs to be corrected
H(X | Y) ≤ p₀ · H(X | Y, Z=0) + H(Z) + (1 − p₀) · log |𝒳|
where p₀ = p(Z = 0) and 𝒳 is the input alphabet
H(X | Y, Z=0) is then upper-bounded using corpus cross-entropy
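The bound follows from the chain rule for entropy; a sketch of the derivation:

```latex
H(X \mid Y) \le H(X, Z \mid Y) = H(Z \mid Y) + H(X \mid Y, Z)
            \le H(Z) + p_0 \, H(X \mid Y, Z{=}0) + (1 - p_0) \, H(X \mid Y, Z{=}1),
\quad\text{and}\quad H(X \mid Y, Z{=}1) \le \log |\mathcal{X}|.
```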
-
Conditional Entropy Estimates
[Chart: conditional entropy (CE) estimates in bits (0–5) vs. number of Turkers (1, 2, 4, 8, 10, 12, 15), comparing Naive-CE and Tighter-CE.]
Our upper-bound estimates for CE are clearly tighter than the naive cross-entropy upper-bound estimate
-
Mismatched Crowdsourcing for Continuous Speech
February 7, 2017
-
Continuous Speech
[Diagram: X → k independent invocations of the CHANNEL → Y(1), …, Y(k) → DECODER → X̃]
Exact maximum-likelihood decoding of multiple strings is intractable for long utterances
Instead: decode each transcript individually using maximum-likelihood decoding and pick the output with the best score
Word error rate: 77%
-
Continuous Speech
[Diagram: as above]
Decode each transcript individually using maximum-likelihood decoding and pick the output with the best score → word error rate: 77%
Can we do better? An outlier with a good score shouldn't be chosen over what many similar-looking transcripts predict
-
Continuous Speech
[Diagram: as above]
Can we do better? An outlier with a good score shouldn't be chosen over what many similar-looking transcripts predict
Data Filtering: Use an edit-distance based similarity metric to discover a "cluster" of transcripts to retain
-
Data Filtering
[Diagram: five transcripts as nodes of a graph, with pairwise edit distances on the edges; the tightly connected cluster is retained]
1. a v i e m e k e
2. a b i a n - k e
3. a v e a m e g i
4. a v e a n - k a
5. a b i a n - k a
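A minimal sketch of this filtering step on the five example transcripts above (gaps written as '-' are kept as characters here; the medoid rule and the distance threshold are assumptions, not necessarily the paper's exact criterion):

```python
# Edit-distance based filtering: keep transcripts close to the medoid.

def edit_distance(a, b):
    """Levenshtein distance via a rolling-array dynamic program."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[-1]

transcripts = ["aviemeke", "abian-ke", "aveamegi", "avean-ka", "abian-ka"]

# Medoid: the transcript with the smallest total distance to the others.
totals = [sum(edit_distance(t, u) for u in transcripts) for t in transcripts]
medoid = transcripts[totals.index(min(totals))]

THRESHOLD = 3  # hypothetical cutoff
cluster = [t for t in transcripts if edit_distance(t, medoid) <= THRESHOLD]
print(medoid, cluster)
```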
-
Continuous Speech
[Diagram: as above]
Data Filtering: Use an edit-distance based similarity metric to discover a "cluster" of transcripts to retain
Word error rate: 68%
-
Continuous Speech
[Diagram: as above]
Data Filtering (word error rate: 68%)
Can we do even better?
-
Channel Merger
[Diagram: X → k independent invocations of the CHANNEL → Y(1), …, Y(k) → Merged Channel Decoder → X̃]
Data Filtering: discover "typical" transcripts
Alignment: exact joint alignment is NP-hard! Approximation via an incremental alignment algorithm
-
Alignment
Aligned transcripts ('-' marks a gap):
k e e - a - j a - g a -
- g i y a - j a y g a -
k e e - a - j a y g a h
- - c h a i j e - g a -
Spoken phrase: kiyā jāyegā
Raw transcripts: "k e e a j a g a", "g i y a j a y g a", "k e e a j a y g a h", "c h a i j e g a"
-
Alignment
Aligned transcripts after recoding frequent letter sequences as single symbols (e.g., E = "ee", Y = "ay"/"ai", C = "ch"):
k E - a j a g a -
g i y a j Y g a -
k E - a j Y g a h
C - - Y j e g a -
Spoken phrase: kiyā jāyegā
Raw transcripts: "k e e a j a g a", "g i y a j a y g a", "k e e a j a y g a h", "c h a i j e g a"
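Exact joint alignment of all k transcripts is NP-hard in k, so an incremental heuristic folds them in one at a time. A sketch of the pairwise building block, standard Needleman–Wunsch global alignment (the match/mismatch/gap scores here are arbitrary choices, not the paper's):

```python
# Global alignment of two strings, inserting '-' gaps.

GAP = -1

def align(a, b, match=1, mismatch=-1):
    """Needleman–Wunsch alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + GAP,
                              score[i][j - 1] + GAP)
    out_a, out_b, i, j = [], [], n, m        # traceback from the corner
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

print(align("keeajaga", "giyajayga"))
```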
-
Channel Merger
[Diagram: as above, with a Merge step before the Merged Channel Decoder]
Data Filtering: discover "typical" transcripts
Alignment: NP-hard! Approximation via an incremental alignment algorithm
Merge: merge into one probabilistic transcript
-
Merge Transcripts
k E - a j a g a -
g i y a j Y g a -
k E - a j Y g a h
C - - Y j e g a -
Spoken phrase: kiyā jāyegā
Merged (per-column symbol distributions): … a/0.75, Y/0.25 → j/1 → Y/0.5, e/0.25, a/0.25 → g/1 → …
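A sketch of the merge itself, using the four aligned rows above: each column becomes a distribution over symbols (gap '-' included):

```python
from collections import Counter

# Collapse aligned transcripts column-by-column into a
# probabilistic transcript.

aligned = [
    "kE-ajaga-",
    "giyajYga-",
    "kE-ajYgah",
    "C--Yjega-",
]

def merge(rows):
    """Per-column symbol distributions, including the gap symbol '-'."""
    pt = []
    for column in zip(*rows):
        counts = Counter(column)
        total = sum(counts.values())
        pt.append({sym: c / total for sym, c in counts.items()})
    return pt

for dist in merge(aligned):
    print(dist)
# e.g. the 4th column gives {'a': 0.75, 'Y': 0.25}, matching the slide.
```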
-
Channel Merger
[Diagram: as above]
Data Filtering: discover "typical" transcripts
Alignment: NP-hard! Approximation via an incremental alignment algorithm
Merge: merge into one probabilistic transcript, giving a model for the merged channel
-
Channel Merger
[Diagram: as above]
Data Filtering → Alignment → Merge, as before
Shortlist & Decode: list decoding, followed by exact decoding from the shortlist
-
Probabilistic Transcriptions
Tacapo piza
strucka po zapecham
trakapo trabiza
Straka pose ta peesome
straka po ta pisha
strah kah poh chah peesh um
chaka-pu shapisha
stakkappoo sabeesham
takapo chapiser
Strike a pose some pizza
¹ [Jyothi & Hasegawa-Johnson, Interspeech-15]
-
Probabilistic Transcriptions
Ten transcripts of the same utterance (as on the previous slide), e.g., "Straka pose ta peesome", "straka po ta pisha", "Strike a pose some pizza", …
Probabilistic phone-based transcriptions derived from alignments¹:
s t r a - k a p o z t a - p E - s o m
s t r a - k a p o - t a - p i - S a -
s t r a h k a p o - C a h p E - S u m
s t r Y - k a p o z s a m p E t s a -
Resulting probabilistic transcript (slashes separate alternatives): s tʰ r ɑ/ɑi kʰ ɑ pʰ ɔ tʃʰ/t/s ɑ pʰ i/iː ʃ/s ɑ
¹ [Jyothi & Hasegawa-Johnson, Interspeech-15]
-
Transcription Error Rates
[Chart: phone error rates (%, 0–80) for Individual, Merged, and Decoding-from-Shortlist schemes, with and without data filtering.]
[Jyothi & Hasegawa-Johnson Interspeech-15]
44% drop!
-
Adapting ASR Systems using Mismatched Transcriptions
February 7, 2017
-
Next Step?
• Respectable accuracy from mismatched transcriptions
• But can this be leveraged for building ASR systems?
• Plan: a baseline ASR system trained on other languages is adapted using mismatched transcriptions
• The baseline could use data-hungry technology like Deep Neural Networks (DNNs)
• Project at the 2015 Jelinek Summer Workshop [JSALT ’15]
• Several languages considered: Hungarian, Mandarin, Swahili, etc.
-
More than meets the eye
• Measuring additional information in probabilistic transcripts: how error rates fall when more "advice" is made available to the decoder
[Chart: phone error rates (%, 30–75) vs. advice per phone in bits (0–1) for Hungarian, Mandarin, and Swahili.]
• Mismatched transcripts are too noisy to be used in the traditional way for ASR training
• Use them as probabilistic transcripts instead
[Probabilistic transcript fragment: … a/0.75, Y/0.25 → j/1 → Y/0.5, e/0.25 → g/1 → …]
-
ASR Systems for Comparison
• Multilingual: Train on six languages (Arabic, Cantonese, Dutch, Hungarian, Mandarin, Urdu) and test on a new target language (e.g., Swahili).
• Semi-supervised DNN: Transcribe unlabeled audio from the target language using a DNN-based multilingual ASR system and use these automatic transcripts to re-train the DNN models (a high-level sketch follows).
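A high-level sketch of the semi-supervised baseline's data flow; train_dnn and decode are hypothetical placeholders standing in for real toolkit calls, so only the data flow is meant to be accurate:

```python
# Semi-supervised DNN baseline: self-label target audio, then re-train.

def train_dnn(labeled_data):
    """Placeholder: train a DNN acoustic model on (audio, transcript) pairs."""
    return {"trained_on": len(labeled_data)}

def decode(model, utterance):
    """Placeholder: decode one utterance with the current model."""
    return "<hypothesized transcript>"

def semi_supervised_baseline(multilingual_corpus, target_unlabeled_audio):
    # 1. Multilingual DNN trained on the six source languages.
    model = train_dnn(multilingual_corpus)
    # 2. Self-label the unlabeled target-language audio.
    pseudo = [(utt, decode(model, utt)) for utt in target_unlabeled_audio]
    # 3. Re-train with the self-labeled target data added.
    return train_dnn(multilingual_corpus + pseudo)
```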
-
Mismatched Transcriptions for ASR
[Liu*, Jyothi*, et al., ICASSP ’16]
25% drop!
[Chart: phone error rates (%, 0–80) for Swahili, Mandarin, and Hungarian, comparing the Multilingual baseline, the Semi-supervised DNN, and + Adaptation with probabilistic transcripts.]
-
Native Language Backgrounds of Mismatched Crowds
February 7, 2017
-
Channel Selection
• Can we do better by selecting the language background of the transcribers?
• How should we select the transcribers?
• Understanding when phones get misperceived
• How is this correlated with transcribers' language backgrounds?
-
Understanding the Mismatched Channel
• When are two phones confused with each other?
• If they are “phonologically close” to each other
• Phonological distance between two phones
• Use distinctive features (DF) from linguistic theory [Chomsky & Halle, ’68] [Phoible ’15]
37 DFs, e.g., nasal, tone, sonorant, labial, trill, front, back, …
• Phones as "code words" in the DF-space; Hamming distance measures phone contrast (a toy sketch follows).
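A toy sketch of phones as binary code words over DFs; the feature inventory and values below are illustrative only (a real system would use all 37 DFs, e.g., from Phoible):

```python
# Phones as binary vectors over distinctive features (DFs);
# Hamming distance counts the features on which two phones differ.

DFS = ["sonorant", "nasal", "labial", "front", "back", "trill"]

PHONES = {
    "m": [1, 1, 1, 0, 0, 0],
    "n": [1, 1, 0, 0, 0, 0],
    "p": [0, 0, 1, 0, 0, 0],
    "i": [1, 0, 0, 1, 0, 0],
    "u": [1, 0, 0, 0, 1, 0],
}

def hamming(p, q):
    """Number of DFs on which two phones differ."""
    return sum(a != b for a, b in zip(PHONES[p], PHONES[q]))

print(hamming("m", "n"))  # 1: differ only in [labial]
print(hamming("m", "i"))  # 3: a much more contrastive pair
```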
-
Distance Distribution of the Code
[Varshney, Jyothi & Hasegawa-Johnson ITA-16]
-
Distance Distribution of the Code
• Codes for different languages exhibit similar distributions
-
Understanding the Mismatched Channel
• Hypothesis: Phonological distance in the DF-space is correlated with phone confusion in the mismatched channel (phones → letters)
• Phone confusion quantified using the total variation distance between the channel's output distributions for a pair of phones
• We call this phone-pair distinction (a sketch follows)
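A sketch of the phone-pair distinction measure: total variation distance between the channel's letter distributions for two phones. The two distributions below are invented for illustration:

```python
# Phone-pair distinction = total variation distance between the
# channel's output (letter) distributions for two phones.

def tv_distance(p, q):
    """Total variation distance between two distributions over letters."""
    letters = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in letters)

p_given_i  = {"i": 0.5, "e": 0.3, "ee": 0.2}   # p(letter | phone [i]), toy
p_given_ii = {"ee": 0.5, "i": 0.3, "e": 0.2}   # p(letter | phone [iː]), toy
print(tv_distance(p_given_i, p_given_ii))      # low value: easily confused pair
```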
-
Phone Pair Distinction vs. DF-Distance
[Figure: 2D histogram of Hamming distance between phone pairs (0–30) vs. phone-pair distinction by English speakers (0.00–0.75); color indicates frequency.]
-
Phone Pair Distinction
[Figures: 2D histograms of Hamming distance between phone pairs (0–30) vs. phone-pair distinction by Mandarin speakers (0.00–1.00) and by English speakers (0.00–0.75); and a plot of phone-pair distinction by Mandarin speakers vs. by English speakers, with frequency on a log scale.]
-
Phone Pair Distinction
[Figures: as above, for Mandarin and English speakers.]
• DF-distance is clearly positively correlated with phone-pair distinction
• Differences across native language backgrounds
• Different DFs are prominent in different languages
-
Phone Pair Distinction
[Figures: as above, for Mandarin and English speakers.]
• DF-distance is clearly positively correlated with phone-pair distinction
• Differences across native language backgrounds
• Different DFs are prominent in different languages
• Ongoing work: a model that takes DF presence/prominence into account
-
Summary
• ASR for low-resource languages presents challenging research problems
• In this talk:
• Establish the possibility of acquiring speech transcriptions using mismatched crowds
• Demonstrate the impact of mismatched transcriptions on ASR performance
• Investigate the relation of transcribers' native languages to phone confusion
• Future research: Optimally select mismatched transcribers to further improve the impact of mismatched transcriptions
-
Summary¹
• ASR for low-resource languages presents challenging research problems
• In this talk:
• Establish the possibility of acquiring speech transcriptions using mismatched crowds
• Demonstrate the impact of mismatched transcriptions on ASR performance
• Investigate the relation of transcribers' native languages to phone confusion
¹ Based on joint work with Mark Hasegawa-Johnson, Lav Varshney, and participants at the 2015 Jelinek Summer Workshop.