THAI BROADCAST NEWS CORPUS CONSTRUCTION AND EVALUATION
Markpong Jongtaveesataporn †
Chai Wutiwiwatchai ‡
Koji Iwano †
Sadaoki Furui †
† Tokyo Institute of Technology, Japan
‡ NECTEC, Thailand
Background on Thai speech recognition research
[Figure: timeline of Thai speech recognition research; axes: year (1987, 1995, 1999, 2003, 2005, 2007) and difficulty]
• Isolated syllable recognition
• Isolated word recognition
• Connected sub-word recognition
• Small task continuous speech recognition
• LVCSR
• Broadcast news transcription system
• Thienlikit et al., 2004: newspaper read-speech recognition
Development of Thai Broadcast News Transcription System
• Research on broadcast news transcription systems for Thai lags behind other languages
  • English: 1995 (Stern, 1997)
  • Japanese: 1997 (Matsuoka et al., 1997)
  • Mandarin: 1998 (Guo et al., 1998)
  • Italian: 2000 (Federico et al., 2000)
• We need to speed up our research activities to catch up with others
Targets
1. Development of Thai broadcast news corpus
  • Speech corpus: training and testing data
  • Text corpus: language modeling
2. Development of a prototype system
Speech corpus
• Structure information of broadcast news was annotated
  • Section, speaker’s turn, segments
• Property tags were annotated for each speaker’s turn
  • Speaker’s name, if known
  • Speaker’s gender: male / female
  • Speaking mode: planned / spontaneous
  • Background noise: clean / music / noise
• Only speech from announcers speaking in the studio was transcribed
• Transcription and annotation were created by one transcriber and checked by another transcriber
Structure of broadcast news
• Episode: one broadcast news session
  • Section 1: one news topic
    • Speaker’s turn: speaker A
      • Segment: one sentence or clause
      • Segment: one sentence or clause
      • Segment: one sentence or clause
    • Speaker’s turn: speaker B
    • Speaker’s turn: speaker A
  • Section 2
  • Section 3
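
The episode / section / speaker's turn / segment hierarchy can be pictured as nested records. The following is only a minimal sketch: the class names, field names, and the `segment_count` helper are assumptions made for illustration and do not reflect the corpus's actual file format, which the slides do not describe.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    text: str  # one sentence or clause


@dataclass
class SpeakerTurn:
    speaker: Optional[str]  # speaker's name, or None if unknown
    gender: str             # "male" or "female"
    mode: str               # "planned" or "spontaneous"
    background: str         # "clean", "music", or "noise"
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Section:
    topic: str  # one news topic
    turns: List[SpeakerTurn] = field(default_factory=list)


@dataclass
class Episode:
    name: str  # one broadcast news session (hypothetical identifier)
    sections: List[Section] = field(default_factory=list)

    def segment_count(self) -> int:
        """Count the segments in every turn of every section."""
        return sum(len(t.segments) for s in self.sections for t in s.turns)


# Illustrative values only: one section with a single three-segment turn.
turn = SpeakerTurn(
    speaker="Mr. A", gender="male", mode="planned", background="clean",
    segments=[Segment("sentence A"), Segment("sentence B"), Segment("sentence C")],
)
episode = Episode(name="episode-001", sections=[Section(topic="Sports", turns=[turn])])
```

Keeping the property tags on the turn rather than on each segment mirrors the slides, where the tags are annotated per speaker's turn.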
Example of structure information
• Episode: one broadcast news session
  • Section 1: Sports
    • Speaker’s turn: Mr. A, male, planned speech, clean speech
      • Segment: sentence A
      • Segment: sentence B
      • Segment: sentence C
Text corpus
• No structure information was annotated
• Additional information
  • Speaking mode: planned / spontaneous
Problems of Thai transcription text
• No space between words
• Definition of a word is very ambiguous
• No good morphological analyzer
• Difficulties in the transcription and checking process
Manually word-segmented transcription was made
• Instructions were created for transcribers
Automatically segmented transcription: future target
Broadcast news collection
• News programs from one public TV station in Thailand were recorded
• Total of 105 news episodes
  • Speech corpus: 35 news episodes, 17 hours
  • Text corpus: 70 news episodes
Analysis of speech corpus
[Figure: breakdown of the speech corpus by gender (female / male), speaking mode (planned / spontaneous), and background (noise / clean / music); the values are not recoverable from the transcript]
Information of speech & text corpora
Attribute             Speech corpus      Text corpus
No. of sentences      13k                32k
No. of words          224k               573k
No. of unique words   10k                14k
No. of phonemes       899k               -
No. of speakers       8 female, 4 male   -
Data used in experiments
• Test set data
  • Randomly selected from the speech corpus
  • 3,000 utterances
• Acoustic model training data for the baseline system
  • Phonetically balanced sentence speech corpora
  • LOTUS (Kasuriya et al., 2003) and the corpus developed internally
  • Read speech corpora
  • 40.3 hours (68 male and 68 female)
• Acoustic model adaptation data
  • Selected from the speech corpus
  • No overlap between adaptation data and test set data
• Language model training data
  • Text corpus + transcript from speech corpus, excluding the test set
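
The no-overlap requirement between test and adaptation data can be met by drawing both sets from a single shuffle. This is a sketch under assumptions: the function name, the utterance IDs, and the fixed seed are hypothetical, and the slides do not say how the authors actually performed the selection.

```python
import random


def split_test_and_adaptation(utterance_ids, n_test, n_adapt, seed=0):
    """Randomly pick a test set, then pick adaptation data from the remainder."""
    rng = random.Random(seed)
    shuffled = list(utterance_ids)
    rng.shuffle(shuffled)
    if n_test + n_adapt > len(shuffled):
        raise ValueError("not enough utterances for both sets")
    test_set = shuffled[:n_test]
    # Slicing after the test set guarantees the two sets share no utterance.
    adaptation_set = shuffled[n_test:n_test + n_adapt]
    return test_set, adaptation_set


# Hypothetical IDs; the corpus size here only approximates the reported 13k sentences.
ids = [f"utt{i:05d}" for i in range(13000)]
test_set, adaptation_set = split_test_and_adaptation(ids, n_test=3000, n_adapt=200)
```

Language model training text would then be built from everything except `test_set`, in line with the last item on the slide.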
Experimental condition
• Acoustic model
  • Gender-dependent acoustic model
  • 12 MFCCs, delta, and delta energy
  • Triphones, 1000 tied-states, 8 Gaussian mixtures
• Language model
  • Tri-grams
• Dictionary size: about 18k words
• TITech WFST speech recognition system (Dixon et al., 2007) was used as a speech decoder
Acoustic model adaptation
• Supervised adaptation using MLLR
• F-condition adaptation
  • F0: clean, planned
  • F1: clean, spontaneous
  • F3: music noise
  • F4: other noise
  • Adaptation data: 200 utterances, regardless of speaker, randomly selected from the speech corpus
• Speaker adaptation
  • Adaptation data: 200 utterances, regardless of F-condition, randomly selected from the speech corpus
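
The F-conditions can be derived from the speaking mode and background tags of each speaker's turn. The sketch below assumes that the "music" tag corresponds to F3 and the "noise" tag to F4; that correspondence, the function names, and the tuple layout are illustrative assumptions, not something the slides state.

```python
import random
from collections import defaultdict


def f_condition(mode, background):
    """Map a turn's speaking mode and background tag to an F-condition label."""
    if background == "clean":
        if mode == "planned":
            return "F0"
        if mode == "spontaneous":
            return "F1"
        raise ValueError(f"unknown speaking mode: {mode!r}")
    if background == "music":
        return "F3"  # assumed: music background means "music noise"
    if background == "noise":
        return "F4"  # assumed: noise background means "other noise"
    raise ValueError(f"unknown background: {background!r}")


def select_adaptation(utterances, per_condition=200, seed=0):
    """Randomly choose up to per_condition utterance IDs for each F-condition."""
    rng = random.Random(seed)
    by_condition = defaultdict(list)
    for utt_id, mode, background in utterances:
        by_condition[f_condition(mode, background)].append(utt_id)
    return {
        cond: rng.sample(ids, min(per_condition, len(ids)))
        for cond, ids in by_condition.items()
    }


# Hypothetical (id, mode, background) triples.
utterances = [
    ("u1", "planned", "clean"),
    ("u2", "spontaneous", "clean"),
    ("u3", "planned", "music"),
    ("u4", "spontaneous", "noise"),
]
adaptation = select_adaptation(utterances)
```

Speaker adaptation data could be selected the same way by grouping on the speaker's name instead of the F-condition.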
![Page 19: Markpong Jongtaveesataporn † Chai Wutiwiwatchai ‡ Koji Iwano † Sadaoki Furui † † Tokyo Institute of Technology, Japan ‡ NECTEC, Thailand](https://reader030.vdocuments.us/reader030/viewer/2022032516/56649c7b5503460f9492f126/html5/thumbnails/19.jpg)
F0 F1 F3 F4 Overall20
24
28
32
36
40
44
48
52
56
60
26.0
43.6
56.4
41.5 38.4
22.3
35.9 38.4
34.2 30.8
21.8
36.9 38.0
31.9 29.1
No adaptation F-condition adapt. Speaker adapt.
WER
(%
)WER results
19
Speaker adaptation yielded
better WER
F-condition
Proportion
Time #words
F0 35.3% 17160
F1 1.0% 629
F3 14.0% 7882
F4 49.7% 27542
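
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between reference and hypothesis, divided by the reference length. The sketch below uses that standard definition; the slides do not name the scoring tool used, and both function names are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: (substitutions + deletions + insertions) / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(reference)


def relative_reduction(baseline_wer, adapted_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# One substitution and one deletion against four reference words.
example_wer = word_error_rate("a b c d".split(), "a x c".split())
# Overall WER without adaptation (38.4) versus with speaker adaptation (29.1).
overall_gain = relative_reduction(38.4, 29.1)
```

With the overall figures from the chart, the relative reduction from speaker adaptation works out to roughly 24%.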
Discussion
• High WER
  • Mismatched recording conditions: the speech corpus was only used as testing and adaptation data
  • Small text corpus: inefficient language model
Conclusion
• Construction of the first Thai broadcast news corpus and an overview of the corpus analysis were presented
• The speech corpus was annotated with structure information, which is useful for further research
• An LVCSR system was set up and tested with the corpus
Future work
• Applying our Thai language modeling technique (Jongtaveesataporn et al., 2007)
  • Compound pseudo-morpheme (CPM) unit
  • Pseudo-morpheme error rate (F0 condition)
    • Manually-segmented word unit system: 20.5%
    • CPM unit system: 19.9%
• Improving the language model by using newspaper text
• Collaboration with NECTEC: additional 50 hours of speech corpus
Thank you
Development of Thai Broadcast News LVCSR System
• Development of an LVCSR system requires speech and text corpora
• Existing speech corpora for Thai LVCSR research
  • NECTEC-ATR
  • LOTUS (NECTEC)
  • GlobalPhone (CMU)
[Annotation on the corpus list: Newspaper read-speech]
1. Development of Thai broadcast news corpus
  • Speech corpus: training and testing data
  • Text corpus: language modeling
2. Development of a prototype LVCSR system
Experiments & Developed corpora
• Speech corpus
  • The size of the speech corpus is still rather small
  • It was used in three ways
    • Test data
    • Adaptation data
    • A part of the transcription text was used for training the LM
• Text corpus
  • It was used for training the LM
Perplexity & OOV rates
F-condition   Perplexity (Male)   Perplexity (Female)   OOV rate (Male)   OOV rate (Female)
F0            107.5               106.9                 0.9               0.8
F1            126.4               100.1                 0.9               0.6
F3            145.2               100.0                 0.7               0.9
F4            141.6               157.6                 1.5               1.9
Overall       126.9               125.6                 1.2               1.3
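
As a reminder of what the two measures mean, the sketch below computes a token-level OOV rate and a perplexity from per-word log probabilities. It assumes the OOV figures are percentages of running words, which the slides do not state, and it takes the language model's log10 probabilities as given rather than modeling the trigram LM itself; both function names are illustrative.

```python
def oov_rate(test_words, vocabulary):
    """Percentage of test tokens that are missing from the recognizer vocabulary."""
    if not test_words:
        raise ValueError("test_words must not be empty")
    vocab = set(vocabulary)
    misses = sum(1 for word in test_words if word not in vocab)
    return 100.0 * misses / len(test_words)


def perplexity(log10_probs):
    """Perplexity from the log10 probability a language model assigns to each word."""
    if not log10_probs:
        raise ValueError("log10_probs must not be empty")
    average = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-average)


# One of four tokens is out of vocabulary.
example_oov = oov_rate(["a", "b", "c", "d"], {"a", "b", "c"})
# Every word has probability 0.01, so the model is as uncertain as a
# uniform choice among 100 words.
example_ppl = perplexity([-2.0, -2.0, -2.0])
```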
Transcription process
• Guideline
• Speech corpus
  • Transcribing: 4 persons
  • Checking: 2 persons
  • Lexical entries checking: 1 person
• Text corpus
  • Transcribing: 7 persons
  • Lexical entries checking: 1 person
Speech corpus
• Transcription and annotation of about 17 hours of TV broadcast news
• Tool: “Transcriber” (Barras et al., 2001)
• Additional information
  • Speaker information: name, gender
  • Speaking mode: planned/spontaneous speech
• Speech from announcers speaking in the studio
Transcription conventions
• Guideline for the transcription process
  • Segment segmentation
  • Word segmentation
  • Repeating word
  • Thai/English abbreviation
  • Number entity
  • Special tags
Introduction
• Thai speech processing research in TokyoTech
  • Dialogue system [Whittiwiwattchai, 2003]
  • LVCSR system
    • Dictation system [Tianlikid, 2005]
    • Broadcast news recognition system
Overview
• Introduction
• Corpus description
• Recording and transcription processes
• Corpus evaluation
• Conclusion
Thai language corpora
• Large language corpora are crucial to a state-of-the-art natural language processing system
• Thai speech resources for speech processing
  • NECTEC-ATR
  • LOTUS (NECTEC)
  • GlobalPhone (CMU)
  • TSynC-1 (NECTEC)
[Annotations on the resource list: Newspaper read-speech; Unit-selection speech synthesis]
WER Result
F-condition   Time proportion   WER (%) Male   WER (%) Female
F0            28.1%             44.4           40.8
F1            1.5%              62.4           60.2
F3            11.5%             82.2           72.4
F4            58.9%             54.9           57.5
Overall       100%              56.8           45.5
Text corpus
• Text transcribed from 35 hours of TV broadcast news
• Additional information
  • Speaking mode: planned/spontaneous
Transcription conventions (1)
• Sentence segmentation
  • No sentence marker in Thai language
  • Ambiguous
  • Grammatically, there are 3 types of sentence
    • Simple sentence
    • Compound sentence
    • Complex sentence
  • Compound and complex sentences are composed of several clauses or simple sentences
  • A sentence was defined as a simple sentence or clause, delimited with the help of breaths
Transcription conventions (2)
• Word segmentation
  • No word boundary marker in Thai language
  • Leads to difficulties in transcription and data checking processes
  • Too ambiguous to define all rules
    • A few rules of simple segmentation patterns were defined
    • Undefined patterns were left to the decision of transcribers
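
To make the word-boundary problem concrete, the sketch below shows greedy longest-match segmentation against a lexicon, a common dictionary-based baseline for unspaced text. It is not the procedure used for this corpus, which was segmented manually by transcribers; the function name and the Latin-script example, standing in for unspaced Thai text, are illustrative assumptions.

```python
def longest_match_segment(text, lexicon, max_len=None):
    """Greedy left-to-right longest-match segmentation against a lexicon."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon entry matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                break
        else:
            # No entry matched: emit the single character as an unknown token.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


lexicon = {"broad", "broadcast", "cast", "news"}
segmented = longest_match_segment("broadcastnews", lexicon)
```

The example also shows the ambiguity the slides describe: "broadcast" could equally be split into "broad" and "cast", and a greedy matcher simply commits to the longest entry.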
Transcription conventions (3)
• Repeating word
• Thai/English abbreviation
• Number entity
• Special tags
  • Disfluencies, filled-pauses, exclamations
  • Foreign words
  • Some other events: uncertainly transcribed part, etc.