The annotation system of the C-ORAL-BRASIL corpus (Raso & Mello 2009; 2010) comprises a three-level tagging arrangement: 1. automatic morphosyntactic annotation by the parser PALAVRAS, implemented to take utterances and information units as its domain of application (Bick 2000); 2. informational annotation (Cresti 2000); 3. illocutionary annotation (Cresti 2005).

C-ORAL-BRASIL

THE MORPHOSYNTACTIC ANNOTATION

The morphosyntactic annotation of the corpus was carried out by a robust Constraint Grammar (CG) parser for Portuguese, PALAVRAS (Bick 2000), which, as a rule-based system, can be systematically adapted to very different types of data. Like historical texts (Bick 2005), transcribed speech (Bick 1998) poses two main problems for automatic grammatical analysis: non-standard orthography, which affects lexical recall, and non-standard segmentation, which creates problems for contextual disambiguation. To overcome these problems, we introduced a two-level markup as a preprocessing stage: prosodic information, speaker overlap, repairs etc. were maintained at a meta-annotation level, while a layer of standardized, written-text-like token sequences was created for the parser to work on.

To support this process, our program chain had access to a lexicon extension as well as a number of systematic grammatical transformations (e.g. missing person-number inflexion, clitic subject forms, plural interjections etc.). The segmentation problem was addressed both at the token level, with new rules for contractions and focus markers, and at the syntactic level, by treating prosodic breaks as punctuation (e.g. // as full stop and / as comma).
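The two-level markup described above can be illustrated with a minimal Python sketch. It is not the PALAVRAS preprocessing chain: the function name `split_layers`, the whitespace tokenization and the sample line are assumptions made for illustration. Only the mapping of // to a full stop and / to a comma comes from the text.

```python
# Mapping taken from the text: "//" is treated as a full stop, "/" as a comma.
BREAK_TO_PUNCT = {"//": ".", "/": ","}


def split_layers(transcript):
    """Return (parser_tokens, meta) for one transcribed line.

    parser_tokens: written-text-like token sequence handed to the parser.
    meta: (token_index, original_break) pairs kept on a separate
          meta-annotation layer, so the prosodic marks are not lost.
    Assumption: prosodic breaks are whitespace-separated tokens.
    """
    parser_tokens = []
    meta = []
    for tok in transcript.split():
        if tok in BREAK_TO_PUNCT:
            # Remember where the break was, then give the parser punctuation.
            meta.append((len(parser_tokens), tok))
            parser_tokens.append(BREAK_TO_PUNCT[tok])
        else:
            parser_tokens.append(tok)
    return parser_tokens, meta


# Illustrative input line (hypothetical formatting).
tokens, meta = split_layers("cê vai embora que dia / Rena //")
print(tokens)
print(meta)
```

Keeping the original break symbols with their token positions means the parser output can later be realigned with the prosodic annotation.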

Though it provides deep structural information, such as syntactic function and dependency, our grammatical annotation is strictly word- (token-) based and allows easy integration into databases and tag-based corpus search interfaces such as CorpusEye <corp.hum.sdu.dk>.

REFERENCES

Bick, E. Tagging Speech Data - Constraint Grammar Analysis of Spoken Portuguese <http://beta.visl.sdu.dk/pdf/nordling98.ps.pdf>. In: Proceedings of the 17th Scandinavian Conference of Linguistics, Odense, 1998.
Bick, E. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Aarhus: Aarhus University Press, 2000.
Bick, E. & Módolo, M. Letters and Editorials: A grammatically annotated corpus of 19th century Brazilian Portuguese <http://www.corpora-romanica.net/publications_f.htm>. In: Claus Pusch & Johannes Kabatek & Wolfgang Raible (eds.) Romance Corpus Linguistics II: Corpora and Historical Linguistics (Proceedings of the 2nd Freiburg Workshop on Romance Corpus Linguistics, Sept. 2003). Tübingen: Gunter Narr Verlag, 2005, p. 271-280.
Cresti, E. Corpus di italiano parlato. Firenze: Accademia della Crusca, 2000.
Cresti, E. Per una nuova classificazione dell’illocuzione. In: Burr, E. (ed.). Tradizione & Innovazione. Linguistica e filologia italiana alle soglie di un nuovo millennio. Atti del VI° Convegno SILFI. Firenze: Franco Cesati, 3 volumi, 2005.
Cresti, E. & Moneglia, M. C-ORAL-ROM. Integrated Reference Corpora for Spoken Romance Languages. Amsterdam-Philadelphia: John Benjamins, 2005.
Moneglia, M. & Cresti, E. L’intonazione e i criteri di trascrizione del parlato adulto e infantile. In: Bortolini, U., Pizzuto, E. Il Progetto CHILDES Italia. Pisa: Del Cerro, 1997, p. 57-90.
Raso, T. & Mello, H. Parâmetros de compilação de um corpus oral: o caso do C-ORAL-BRASIL. In: Veredas, 2009, p. 20-35.
Raso, T. & Mello, H. The C-ORAL-BRASIL corpus. In: Moneglia, M.; Panunzi, A. (eds.) Bootstrapping Information from Corpora in a Cross Linguistic Perspective. Firenze University Press, 2010.

C-ORAL-BRASIL <www.c-oral-brasil.org>
CORPUSEYE <corp.hum.sdu.dk>
LABLITA <lablita.di.unifi.it>

THE INFORMATIONAL ANNOTATION

A 200 word minicorpus (more than 30,000 words) was tagged for informational structure on the basis of the Informational Patterning Theory (Cresti 2000). An interface will be implemented that allows morphosyntactic and informational configurations to be extracted in relation to each other. For example, it will be possible to extract all Topics that are configurationally NPs on the one hand, or all modal verbs within Topic units on the other.
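The two kinds of query mentioned above can be sketched in Python over a toy data model. Everything in this sketch is an assumption made for illustration: the `Token` and `Unit` classes, the tag names "TOP" and "COM", the POS labels, the `is_head` and `modal` flags, and the invented sample units. It does not describe the planned interface or the actual PALAVRAS tag set.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    pos: str               # hypothetical POS label, e.g. "N", "V", "DET"
    is_head: bool = False  # True for the syntactic head of its unit
    modal: bool = False    # True for modal verbs


@dataclass
class Unit:
    tag: str      # hypothetical informational tag, e.g. "TOP" or "COM"
    tokens: list  # list of Token


def topics_that_are_nps(units):
    """Topic units whose syntactic head is a noun (a rough stand-in for 'NP')."""
    return [
        u for u in units
        if u.tag == "TOP" and any(t.is_head and t.pos == "N" for t in u.tokens)
    ]


def modals_in_topics(units):
    """Forms of all modal verbs that occur inside Topic units."""
    return [t.form for u in units if u.tag == "TOP" for t in u.tokens if t.modal]


# Invented sample data, not taken from the corpus.
units = [
    Unit("TOP", [Token("o", "DET"), Token("João", "N", is_head=True)]),
    Unit("COM", [Token("saiu", "V", is_head=True)]),
    Unit("TOP", [Token("se", "CONJ"), Token("você", "PRON"),
                 Token("pode", "V", is_head=True, modal=True),
                 Token("vir", "V")]),
    Unit("COM", [Token("venha", "V", is_head=True)]),
]

np_topics = [" ".join(t.form for t in u.tokens) for u in topics_that_are_nps(units)]
print(np_topics)
print(modals_in_topics(units))
```

Because the morphosyntactic annotation is token-based, both queries reduce to filtering tokens by the informational unit that contains them.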

Annotation parameters for a spontaneous speech corpus

Eckhard Bick - University of Southern Denmark

Heliana Mello – Universidade Federal de Minas Gerais

Tommaso Raso – Universidade Federal de Minas Gerais

THE C-ORAL-BRASIL CORPUS

The corpus represents Brazilian Portuguese spontaneous speech events, following the same criteria as the C-ORAL-ROM corpora (Cresti & Moneglia 2005) for European Portuguese, French, Italian and Spanish. Diaphasic variation is the main goal of the corpus architecture:
• informal vs. formal;
• informal: private vs. public;
• for each corpus half: 1/3 dialogues, 1/3 conversations, 1/3 monologues.

Maximal diaphasic diversity: people grocery shopping and shoe shopping; a construction worker and an engineer at a construction site; a driving lesson; people playing pool, soccer and different table games; people cooking or cleaning the kitchen or the house; people working at the computer; a student helping another one with a recorder; a driver and a passenger talking in a car; waiters waiting at a party; drag queens putting on make-up before a show; a mother telling a story to her child; people recounting dramatic moments of their lives or explaining their jobs; jokes; recipes; and many other different situations.

Each recorded session is stored in WAV files (Windows PCM, 22,050 Hz, 16 bit). The C-ORAL-BRASIL corpus provides the acoustic source of each session together with the following main annotations:
• Session metadata, in CHAT and IMDI formats.
• Synchronization of each transcribed utterance to the acoustic source, in XML files, via the WinPitch software (© Pitch France, www.winpitch.com).
• The orthographic transcription, in CHAT format, enriched with the tagging of utterance-terminal (//) and utterance-internal, non-terminal (/) prosodic breaks, in TXT files (Moneglia & Cresti 1997).

An utterance is defined as the smallest unit of speech with pragmatic autonomy, i.e. a speech act.
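As a minimal sketch of how the two break types structure a transcription, the following Python function splits a line into utterances at // and each utterance into prosodic units at /. The function name `segment` and the sample line are assumptions made for illustration; real CHAT files carry speaker codes and other markup that this sketch ignores.

```python
def segment(transcript):
    """Split a transcription line into utterances (lists of prosodic units).

    "//" closes an utterance; "/" separates prosodic units inside it.
    Assumption: the line contains only words and break symbols.
    """
    utterances = []
    for utt in transcript.split("//"):
        units = [u.strip() for u in utt.split("/") if u.strip()]
        if units:  # skip the empty remainder after the last "//"
            utterances.append(units)
    return utterances


# Illustrative input (hypothetical formatting, second utterance invented).
result = segment("cê vai embora que dia / Rena // amanhã //")
print(result)
```

Splitting on the terminal break first ensures that non-terminal breaks are only ever interpreted inside a single utterance.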

LEEL: Laboratório de Estudos Empíricos e Experimentais da Linguagem

Example: Topic search in the WinPitch interface

THE ILLOCUTIONARY ANNOTATION

The last annotation step is the attribution of an illocutionary value to the Comment units (which, by definition, carry illocutionary force) within each utterance. Both the informational and the illocutionary annotations are based predominantly on prosodic parameters. The corpus portal interface will allow searches across a three-layer annotation system that relates morphosyntactic structures, informational units and utterance illocutionary values.

Example: Illocutionary annotation

*FLA: cê vai embora que dia /=COM= Rena /=ALL=
%ill: total question
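A line annotated in this way can be read mechanically. The Python sketch below assumes, on the basis of this single example only, that each informational unit is followed by a break and a tag of the form =TAG=, and that the illocutionary value sits on a dependent tier introduced by %. The function name `parse_annotated` and the returned dictionary layout are hypothetical.

```python
import re

# A unit is the text up to a prosodic break ("/" or "//") followed by =TAG=.
UNIT_RE = re.compile(r"(.+?)\s*/{1,2}=([A-Z]+)=")


def parse_annotated(line):
    """Parse a speaker line with informational tags and an optional % tier."""
    main, _, dependent = line.partition("%")
    speaker, _, text = main.partition(":")
    units = [(m.group(1).strip(), m.group(2)) for m in UNIT_RE.finditer(text)]
    # Text after the tier label (e.g. "ill:") is the illocutionary value.
    illocution = dependent.partition(":")[2].strip() or None
    return {
        "speaker": speaker.lstrip("*").strip(),
        "units": units,
        "illocution": illocution,
    }


example = "*FLA: cê vai embora que dia /=COM= Rena /=ALL=\n%ill: total question"
parsed = parse_annotated(example)
print(parsed)
```

Here the Comment unit (COM) is the one that receives the illocutionary value "total question".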

Example: Morphosyntactic annotation

Example: three utterance sequence