Spoken Speech Corpora

Shu-Chuan Tseng

Academia Sinica

January 2002

Contents

• Speech Corpora and Annotation

• Related Research Issues

• Transcribing and Annotating Mandarin Spontaneous Dialogues

• Interface and Annotation Tags

State of the Art - Speech Corpora

• Air Travel Information System. Spoken Language Systems Pilot Corpus (ATIS, h-c, English, MADCOW 1992)
http://www.ldc.upenn.edu/Catalog/LDC93S4A.html
http://morph.ldc.upenn.edu/readme_files/atis/sspcrd/corpus.html

• SRI’s Amex Travel Agent Data (AMEX, h-h, English, Kowtko & Price 1989)
66 conversations
http://www.ai.sri.com/~communic/amex/amex.html

• Switchboard Corpus (SWBD, h-h, English, Godfrey, Holliman and McDaniel 1992)
2400 conversations
http://stripe.colorado.edu/~jurafsky/manual.august1.html

State of the Art - Speech Corpora

• HCRC Map Task Corpus (h-h, English, Anderson et al. 1991)
64 subjects
http://www.hcrc.ed.ac.uk/dialogue/maptask.html

• BAUFIX (h-h, h-c, German, Sagerer et al. 1994, Brindöpke et al. 1995)
h-h: 22 dialogues, h-c: 32 dialogues
http://www.sfb360.uni-bielefeld.de/transkript/

• TRAINS 93 (h-h, English, Heeman & Allen 1995)
20 different tasks, 34 different speakers, 6.5 hrs, 5900 turns and 55000 transcribed words
http://www.cs.rochester.edu/research/cisd/projects/trains/

• Pattern-Description Monologues (h, Dutch, Levelt 1983)
53 different patterns, 53 subjects

State of the Art - Speech Annotation

• Transliteration:
- audio data => written transcripts
- systems: verbatim, annotated, shortened/cleaned, conversation acts, suprasegmental
- contents: words, boundaries, tones, non-speech sequences, discourse- and syntax-related roles

• Labelling = Transcripts + Time-aligned Signals
- raw acoustical data => time-aligned, segmented acoustical data
- tiers: phone, phoneme, syllable, word, boundary & tone
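The idea of labelling as transcripts plus time-aligned signals, organised in tiers, can be sketched as a small data structure. This is only an illustration: the class names `Interval` and `Tier`, the time values, and the pinyin labels are assumptions made for the example and do not come from any corpus format named on this slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Interval:
    """One labelled stretch of the signal (times in seconds)."""
    start: float
    end: float
    label: str


@dataclass
class Tier:
    """A named annotation tier, e.g. 'word' or 'syllable' (hypothetical)."""
    name: str
    intervals: List[Interval] = field(default_factory=list)

    def add(self, start: float, end: float, label: str) -> None:
        """Append an interval, keeping the tier ordered and non-overlapping."""
        if end <= start:
            raise ValueError("interval must have positive duration")
        if self.intervals and start < self.intervals[-1].end:
            raise ValueError("intervals must not overlap or go backwards")
        self.intervals.append(Interval(start, end, label))

    def label_at(self, t: float) -> Optional[str]:
        """Return the label whose interval contains time t, if any."""
        for iv in self.intervals:
            if iv.start <= t < iv.end:
                return iv.label
        return None


# Invented example: two tiers aligned to the same (imaginary) recording.
word = Tier("word")
word.add(0.00, 0.42, "zai")
word.add(0.42, 0.80, "jia")

syllable = Tier("syllable")
syllable.add(0.00, 0.42, "zai")
syllable.add(0.42, 0.80, "jia")
```

Keeping each tier as its own ordered list of intervals means that phone, syllable, and word segmentations can be edited independently while still referring to the same time axis.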

Transcription: Turns (BAUFIX)

01K020 ach so, {ich}<spk: I, s/> dachte, die wären auf der anderen Seite.

<hum: lachen> <noise> war gerade am überlegen ist das </noise: rascheln> <par> <noise> nicht (ei)n bißchen kurz </noise: rascheln> </par: 2>

01I022 <par> also nicht über Kreuz </par: 2> <noise> sondern wirklich so gerade übereinander. </noise: rascheln>

01K021 {ja, ja}<noise: klappern>

01I023 {mhm}<noise: klappern>

01K022 <noise> <sil: 48> hm, irgendwie geht das nicht so toll fest. <hum: lachen> </noise: klappern>

01I024 <noise> <sil: 1> {hält das nicht?}<spk: K, ?> </noise: klappern>

01K023 <noise> <sil: 1> hm, nee dieses eine Rautenteil war zu klein. das ging <quest: da> nicht drüber. <hum: lachen> ja <-> und nun? </noise: rascheln>

01I025 <noise> <hum: atmen> ja und du hast also jetzt diese <hum: atmen> {äh ja}<hum: atmen> diese benzolförmigen Dinger sind jetzt sag(e) ich mal oben <-> und auch oben <-> auf diesen <-> beiden Platten ist jetzt dieser <--> Würfel. </noise: rascheln>

01K024 {mhm}<noise: rascheln>
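Turn lines of the kind shown above can be split into an identifier, the spoken text, and the annotation tags with a few regular expressions. The sketch below is not a BAUFIX tool; it assumes, from the examples alone, that an identifier is a two-digit dialogue number, a one-letter speaker code, and a three-digit turn number, and that every annotation sits inside angle brackets. The function name `parse_turn` is hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed ID layout: 2-digit dialogue, 1-letter speaker, 3-digit turn number.
TURN_RE = re.compile(r"^(\d{2})([A-Z])(\d{3})\s+(.*)$", re.DOTALL)
# Anything in angle brackets, e.g. <sil: 48>, </noise: rascheln>, <->.
TAG_RE = re.compile(r"<[^<>]*>")
# Tags that carry a value, e.g. <hum: lachen> or </noise: rascheln>.
VALUED_TAG_RE = re.compile(r"</?(\w+):\s*([^<>]*?)\s*>")


def parse_turn(line: str) -> Dict[str, object]:
    """Split one transcript line into ID parts, valued tags, and plain text."""
    m = TURN_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a turn line: {line!r}")
    dialogue, speaker, number, body = m.groups()
    tags: List[Tuple[str, str]] = VALUED_TAG_RE.findall(body)
    # Drop all tags and the curly braces that mark annotated stretches.
    text = TAG_RE.sub(" ", body).replace("{", "").replace("}", "")
    text = " ".join(text.split())
    return {
        "dialogue": dialogue,
        "speaker": speaker,
        "turn": int(number),
        "tags": tags,
        "text": text,
    }


example = (
    "01K022 <noise> <sil: 48> hm, irgendwie geht das nicht so toll fest. "
    "<hum: lachen> </noise: klappern>"
)
turn = parse_turn(example)
```

The meaning of the individual tags (`sil`, `hum`, `noise`, `par`, `spk`, `quest`) is left uninterpreted here, since the slide does not define them.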

Transcription: Conversation Acts (TRAINS, Traum/Hinkelman 1992)

[Figure: DU (Discourse Unit), UU (Utterance Unit), Sub UU]

Prosodic Annotation

• ToBI: Tones and Break Indices
Pan-Mandarin ToBI System (http://deall.ohio-state.edu/chan.9/MToBI.htm)

Related Issues: Lexical Distribution

[Figure: number of types (y-axis, 0 to 400) plotted against word frequency (x-axis, 90 down to 1)]
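A distribution of this kind, the number of word types at each word frequency, can be computed from a tokenised transcript in two counting passes. The following is a minimal sketch; the function name `frequency_spectrum` and the token list are invented for illustration and are not taken from the corpus.

```python
from collections import Counter
from typing import Dict, Iterable


def frequency_spectrum(tokens: Iterable[str]) -> Dict[int, int]:
    """Map each word frequency to the number of types with that frequency."""
    word_counts = Counter(tokens)                 # type -> token frequency
    spectrum = Counter(word_counts.values())      # frequency -> number of types
    return dict(sorted(spectrum.items()))


# Invented pinyin tokens, for illustration only.
tokens = "en na wo en dui en na ai".split()
spectrum = frequency_spectrum(tokens)
```

In this toy input three types occur once, one type occurs twice, and one occurs three times, which is the same kind of skew towards low-frequency types that such plots usually show.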

Related Issues: turn-initial words

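Turn-initial words and occurrence counts by speaker:

D1-A: 嗯 en 37, 他 ta 10, 哎 ai 9, 對 dui 7, 嗷 ou 6, 她 ta 5, 那 na 4, 我 wo 4, 有 you 4, 好像 haoxiang 4

D1-B: 嗯 en 45, 那 na 9, 哎 ai 8, 對 dui 7, 嗷 ou 7, 他 ta 6, 我 wo 5, 她 ta 5, 就 jiu 4, 哦 o 4

D2-A: 嗯 en 18, 嗷 ou 8, 嗨 hai 7, 這樣子 zheyangzi 7, 哦 o 6, 那 na 6, 對 dui 5, 嗯嗯 enen 5, 真的 zhende 5, 哎 ai 5

D2-B: 嗯 en 9, 對 dui 8, 是 shi 8, 哎 ai 8, 哦 o 7, 呵 he 7, 嗷 ou 5, 啊 a 5, 我 wo 5, 呃 e 5

D3-A: 嗨 hai 22, 那 na 18, 嗷 ou 7, 我 wo 7, 對啊 duia 6, 哎 ai 6, 對 dui 4, 嗯 en 4, 這樣 zheyang 4, 哦 o 4

D3-B: 嗯 en 18, 那 na 17, 啊 a 9, 嗷 ou 6, 哎 ai 5, 我 wo 4, 呃 e 4, 對啊 duia 3, 就 jiu 3, 哦 o 3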
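Per-speaker counts of turn-initial words like those listed here can be obtained by tallying the first word of each turn. The sketch below assumes turns are already word-segmented and paired with a speaker label; the function name `turn_initial_counts` and the sample turns are invented and do not reproduce corpus data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def turn_initial_counts(turns: List[Tuple[str, List[str]]]) -> Dict[str, Counter]:
    """Count the first word of every non-empty turn, separately per speaker."""
    counts: Dict[str, Counter] = {}
    for speaker, words in turns:
        if not words:
            continue  # skip empty turns, e.g. pure noise segments
        counts.setdefault(speaker, Counter())[words[0]] += 1
    return counts


# Invented segmented turns: (speaker label, list of pinyin words).
turns = [
    ("D1-A", ["en", "wo", "zhidao"]),
    ("D1-B", ["na", "ni", "ne"]),
    ("D1-A", ["en"]),
    ("D1-A", ["dui", "a"]),
    ("D1-B", []),
]
counts = turn_initial_counts(turns)
top = counts["D1-A"].most_common(1)
```

`Counter.most_common(10)` would then give a ranked list of the same shape as the one shown for each speaker.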

Related Issues: Repair Types

Type                       Occurrence
Repetition                 202
Substitution               46
Addition                   43
Deletion                   9
Addition/Substitution      1
Addition/Repetition        5
Deletion/Repetition        1
Repetition/Addition        10
Repetition/Addition/Sub    1
Repetition/Deletion        1
Repetition/Substitution    3
Substitution/Repetition    3

Related Issues: POS in Chinese Repairs 

POS                  Abbreviation   Occurrences
Verb                 V              258
Noun                 N              521
Preposition          P              29
Adverbial            D              322
Conjunction          C              23
Particle             T              22
Interjection         I              20
Non-predicate Adj.   A              1
Foreign Word         FW             2
Verb: be             SHI            27
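To compare these counts with other data they first have to be turned into proportions. The short sketch below does that for the repair counts listed above; the variable names are illustrative, and the overall-data proportions are not computed because the text does not give those counts.

```python
# Occurrences of each POS tag in repairs, copied from the table above.
repair_pos = {
    "V": 258, "N": 521, "P": 29, "D": 322, "C": 23,
    "T": 22, "I": 20, "A": 1, "FW": 2, "SHI": 27,
}

total = sum(repair_pos.values())
shares = {tag: count / total for tag, count in repair_pos.items()}

# Print tags from most to least frequent with their share of all repair words.
for tag, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tag:>3}  {share:.3f}")
```

On these figures nouns account for roughly 43% of the tagged words in repairs, followed by adverbials and verbs.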

Related Issues: POS (Repairs vs. Overall Data)

[Figure: percentage (y-axis, 0 to 0.45) of each POS tag (V, N, P, D, C, T, I, A, b, SHI, FW), repair vs. overall data]

Related Issues: Prosodic Signalling

• reset hypothesis; baseline declination; intonation units
• pitch contour; location of editing terms
• duration

在 家 的 在 裡面 的 哦
zai jia de zai limian de O
At home, inside O

Related Issues: Intonation Unit vs. Repair

Collecting Speech Data

• Types of Speech: read, spontaneous, monologues, dialogues
• Scenario Design: daily conversation, direction-giving, instruction, pattern-description, task-oriented, topic-oriented
• Selection of Subjects: age, linguistic and social background, gender, education
• Recording: digital audio tape, MD, video, eye-tracking
• Transcription: orthographic transcript, discourse-related function or annotation, intonation-units
• Labelling: phonemic, word, prosodic
• Documentation: subjects, recording device, annotation system

Building a Large Mandarin Dialogue Corpus

• Content, Size, Style, Topic, Subject (corpus setting)
• Dialogue Transcription (computer-aided)
• Transcription Programme (speaker, sound file, tags, time, content)
• Convention Systems (depending on research directions)
• Content Annotation (computer-aided, automatically merged into database)
• Database Construction (database in Access format)
• Speech Labelling (sound file index and time-alignment available in database)

Statistics of Corpus

• 30 conversational dialogues (37 female/23 male)
• Age 16-25: 20 (15f/5m), age 26-35: 19 (9f/10m), age 36-45: 21 (13f/8m)
• Total length: 26.5 hr.; each dialogue is about 50 min. long
• Topics: family, work, economics, politics, movie, education, exams, language learning, TV, internet, internet café, children, childhood, traveling, music, jobs, school, dialect, social problems, environmental problems, colleagues, personal experience, computer, traffic, marriage, China, Taipei MRT etc.
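The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the numbers given above; reading the result as two speakers per dialogue assumes that each speaker took part in exactly one dialogue, which the slide does not state.

```python
# (female, male) speakers per age group, copied from the slide.
age_groups = {"16-25": (15, 5), "26-35": (9, 10), "36-45": (13, 8)}

females = sum(f for f, _ in age_groups.values())
males = sum(m for _, m in age_groups.values())
speakers = females + males

dialogues = 30
total_hours = 26.5

# Average dialogue length in minutes and average speakers per dialogue.
minutes_per_dialogue = total_hours * 60 / dialogues
speakers_per_dialogue = speakers / dialogues
```

The age-group breakdown adds up to the stated 37 female and 23 male speakers, and the average length of 53 minutes agrees with "about 50 min." per dialogue.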

Computer-Aided Transcription

• transcribing conversation in Pinyin and in Chinese characters
• including transcriber and subject information
• documenting location of audio files
• marking start and end time of speech segment in corresponding audio files
• inserting flexible tags to annotate linguistic features
• outputting data in database format
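The functions listed above suggest one record per speech segment that is later exported for a database. The sketch below is not the transcription programme itself: every field name, the file name `d01.wav`, the time values, and the tag `REPAIR` are hypothetical, and CSV is used only as a stand-in for the unspecified database format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import List


@dataclass
class Segment:
    """One transcribed speech segment (all field names are hypothetical)."""
    speaker: str
    audio_file: str
    start_s: float
    end_s: float
    pinyin: str
    characters: str
    tags: str  # free-form annotation tags, e.g. "REPAIR"


def to_csv(segments: List[Segment]) -> str:
    """Serialise segments as CSV text that a database tool could import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(Segment)],
        lineterminator="\n",
    )
    writer.writeheader()
    for seg in segments:
        writer.writerow(asdict(seg))
    return buffer.getvalue()


# Invented example row; times and file name are not from the corpus.
rows = [
    Segment("A", "d01.wav", 12.30, 13.85,
            "zai jia de zai limian de o", "在家的在裡面的哦", "REPAIR"),
]
csv_text = to_csv(rows)
```

Storing start and end times next to the audio file name keeps every transcript line traceable to its stretch of the recording.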

Illustration - Interface

Illustration - Database

Extragrammatical Sequences in Human-Human Conversation

• disfluency
- prosodic, repair, syntactic, pragmatic

• socio-linguistic phenomena
- code switching, new words

• particular vocalisation
- lengthening, assimilation, syllable contraction

• unintelligible and non-speech sounds

Annotation Taglist - I

Annotation Taglist - II