transcripts are stored in a relational database transcripts are divided up to their smallest...


Upload: sara-ferguson

Post on 04-Jan-2016


Page 1: Transcripts are stored in a relational database Transcripts are divided up to their smallest constituent (words), while the context is preserved, in a

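DynaSAND: technology

• Transcripts are stored in a relational database

• Transcripts are divided up into their smallest constituents (words), while the context is preserved, in a structure basically like this:

[Diagram: four linked tables, interview, speaker, sentence and word, with keys interview_id, speaker_id, sentence_id and word_id. The speaker, sentence and word tables each appear to carry the key of the table above them (interview_id, speaker_id, sentence_id). Other fields shown: locality, start time, end time.]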
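A minimal sketch of how such a structure could look in practice, using Python's built-in sqlite3 module. The table and key names follow the slide; the column types, the sample data, and the placement of locality (on interview) and of the start and end times (on sentence) are assumptions, not the actual DynaSAND schema.

```python
import sqlite3

# Hypothetical schema: names follow the slide, everything else is assumed.
SCHEMA = """
CREATE TABLE interview (interview_id INTEGER PRIMARY KEY, locality TEXT);
CREATE TABLE speaker   (speaker_id INTEGER PRIMARY KEY,
                        interview_id INTEGER REFERENCES interview(interview_id));
CREATE TABLE sentence  (sentence_id INTEGER PRIMARY KEY,
                        speaker_id INTEGER REFERENCES speaker(speaker_id),
                        start_time REAL, end_time REAL);
CREATE TABLE word      (word_id INTEGER PRIMARY KEY,
                        sentence_id INTEGER REFERENCES sentence(sentence_id),
                        word TEXT);
"""


def build_demo_db():
    """Create an in-memory database with one made-up three-word sentence."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO interview VALUES (1, 'Amsterdam')")
    conn.execute("INSERT INTO speaker VALUES (1, 1)")
    conn.execute("INSERT INTO sentence VALUES (1, 1, 0.0, 1.5)")
    conn.executemany("INSERT INTO word VALUES (?, ?, ?)",
                     [(1, 1, "ik"), (2, 1, "weet"), (3, 1, "het")])
    return conn


def word_context(conn, word_id):
    """Return (word, sentence_id, speaker_id, interview_id, locality) for one word."""
    return conn.execute(
        """
        SELECT w.word, s.sentence_id, sp.speaker_id, i.interview_id, i.locality
        FROM word w
        JOIN sentence  s  ON s.sentence_id   = w.sentence_id
        JOIN speaker   sp ON sp.speaker_id   = s.speaker_id
        JOIN interview i  ON i.interview_id  = sp.interview_id
        WHERE w.word_id = ?
        """,
        (word_id,),
    ).fetchone()
```

Calling `word_context(build_demo_db(), 2)` walks from a single word back up to its interview, which is what "the context is preserved" amounts to in this sketch.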


DynaSAND: technology

• This means that individual words can be addressed, e.g. for POS tagging

• The POS tags are themselves stored as separate categories, attributes and values, not as opaque strings:

[Diagram: three tables keyed on word_id: word (word_id), category (word_id, category_id) and attributes (word_id, attribute_id, value_id).]
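The benefit of storing tags this way can be sketched as follows: a query can filter on one attribute without parsing a tag string. The three table names and their key columns follow the slide; the text-valued IDs ("V", "number", "pl") and the sample words are invented for readability, and a real database would more likely use integer IDs pointing to lookup tables.

```python
import sqlite3


def build_pos_db():
    """In-memory demo database; tag values and words are made up."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE word       (word_id INTEGER PRIMARY KEY, word TEXT);
        CREATE TABLE category   (word_id INTEGER, category_id TEXT);
        CREATE TABLE attributes (word_id INTEGER, attribute_id TEXT, value_id TEXT);
    """)
    conn.executemany("INSERT INTO word VALUES (?, ?)",
                     [(1, "ik"), (2, "weet"), (3, "wij"), (4, "weten")])
    conn.executemany("INSERT INTO category VALUES (?, ?)",
                     [(1, "Pron"), (2, "V"), (3, "Pron"), (4, "V")])
    conn.executemany("INSERT INTO attributes VALUES (?, ?, ?)",
                     [(1, "number", "sg"), (2, "number", "sg"),
                      (3, "number", "pl"), (4, "number", "pl")])
    return conn


def find_words(conn, category, attribute, value):
    """Return all words with the given category and attribute=value pair."""
    rows = conn.execute(
        """
        SELECT w.word
        FROM word w
        JOIN category   c ON c.word_id = w.word_id
        JOIN attributes a ON a.word_id = w.word_id
        WHERE c.category_id = ? AND a.attribute_id = ? AND a.value_id = ?
        ORDER BY w.word_id
        """,
        (category, attribute, value),
    ).fetchall()
    return [row[0] for row in rows]
```

With an opaque string such as "V(pl)" the same search would need pattern matching on every tag; here `find_words(conn, "V", "number", "pl")` is a plain join.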


Generating other formats

• The fact that the data is stored in its smallest constituent parts makes it relatively easy to generate other formats

• Example: we realize that a binary format like a relational database is not appropriate for long-term archival, so we made the SAND transcriptions available as TEI XML by creating a template and filling it with data from the database using a script

• Another example: the IMDI metadata for another corpus (the Goeman-Taeldeman-Van Reenen Project, or GTRP corpus) was created in the same way
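As an illustration of the template-filling idea, the sketch below turns word rows into a TEI-style fragment with Python's standard xml.etree.ElementTree. It is not the script used for SAND: the row layout, the `#spk` prefix and the use of `n` for the sentence number are assumptions, and the output lacks the teiHeader that a valid TEI document requires. The `u` (utterance) and `w` (word) elements and the namespace are standard TEI.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def rows_to_tei(rows):
    """Build a TEI-style fragment from (speaker_id, sentence_id, word) rows.

    Rows are assumed to arrive ordered by sentence and word position,
    as an ORDER BY in the database query would deliver them.
    """
    ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    text = ET.SubElement(tei, f"{{{TEI_NS}}}text")
    body = ET.SubElement(text, f"{{{TEI_NS}}}body")

    current_key, utterance = None, None
    for speaker_id, sentence_id, word in rows:
        key = (speaker_id, sentence_id)
        if key != current_key:
            # A new sentence starts: open a new utterance element.
            utterance = ET.SubElement(
                body, f"{{{TEI_NS}}}u",
                {"who": f"#spk{speaker_id}", "n": str(sentence_id)},
            )
            current_key = key
        w = ET.SubElement(utterance, f"{{{TEI_NS}}}w")
        w.text = word
    return ET.tostring(tei, encoding="unicode")
```

Because every word is its own row, the same loop could emit a different format simply by changing the elements it creates.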


Generating metadata for CLARIN

• Previous experience with SAND and GTRP indicates that generating XML metadata for CLARIN from our databases should be feasible

• The TEI and IMDI for SAND and GTRP were created once and are static; we plan to make the process more dynamic for CLARIN metadata by creating the XML on the fly (and implementing a caching mechanism for performance reasons) so that the metadata is always up to date
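One possible shape for "on the fly plus caching" is sketched below. This is an assumption about the design, not the planned implementation: the class name, the `generate` callback and the time-based expiry are all hypothetical. Note the trade-off it makes visible: with time-based expiry the metadata can lag behind the database by up to `max_age` seconds, so a strictly up-to-date service would instead need to invalidate entries whenever the underlying records change.

```python
import time


class MetadataCache:
    """Regenerate XML only when the cached copy is older than max_age seconds."""

    def __init__(self, generate, max_age=3600.0, clock=time.monotonic):
        self._generate = generate   # callback: resource_id -> XML string
        self._max_age = max_age
        self._clock = clock         # injectable so the expiry can be tested
        self._store = {}            # resource_id -> (timestamp, xml)

    def get(self, resource_id):
        now = self._clock()
        cached = self._store.get(resource_id)
        if cached is not None and now - cached[0] < self._max_age:
            return cached[1]        # still fresh: skip the database
        xml = self._generate(resource_id)
        self._store[resource_id] = (now, xml)
        return xml

    def invalidate(self, resource_id):
        """Drop a cached entry, e.g. after the underlying data was edited."""
        self._store.pop(resource_id, None)
```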


Edisyn (European Dialect Syntax)

• One of the goals of Edisyn is the development of a search engine which uses one tag set to search different corpora, including the SAND, concurrently

• The central tag set is being developed by Franca Wesseling; we plan to make it compatible with ISOcat

• The search engine translates these tags to the native tag sets of the corpora

• Ideal case: corpora are hosted by their own organizations and accessible via a web service

• In practice: the Meertens has local copies of the corpora

• Participating corpora: SAND, CORDIAL-SIN (Portuguese), ASIS (Italian), EMK (Estonian); more to come
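The translation step can be pictured as a per-corpus lookup table. In the sketch below every tag and corpus name is invented for illustration (`corpus_a`, `corpus_b`, "verb", "VB" and so on); none of them is taken from the Edisyn tag set or from the native tag sets of the participating corpora.

```python
# Hypothetical mappings from central tags to each corpus's native tags.
TAG_MAP = {
    "corpus_a": {"verb": "V", "noun": "N"},
    "corpus_b": {"verb": "VB", "noun": "NN"},
}


def translate_query(central_tags, corpus):
    """Translate a list of central tags into one corpus's native tags."""
    mapping = TAG_MAP[corpus]
    missing = [tag for tag in central_tags if tag not in mapping]
    if missing:
        # Fail loudly rather than silently dropping part of the query.
        raise KeyError(f"{corpus} has no native tag for: {missing}")
    return [mapping[tag] for tag in central_tags]


def fan_out(central_tags, corpora):
    """Produce one native-tag query per corpus from a single central query."""
    return {corpus: translate_query(central_tags, corpus) for corpus in corpora}
```

A single query such as `fan_out(["verb", "noun"], ["corpus_a", "corpus_b"])` then yields one translated query per corpus, which could be run against local copies or sent to a web service.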


Other Meertens language resources

• PLAND (Plant Names in Dutch Dialects)

• NVD (Dutch Database of First Names)

• NFD (Dutch Database of Family Names)

• Corpus of free dialect speech (sound recordings)

• Dutch Database of Toponyms (in development)

• Dutch Song Database

• Dutch Folktale Database


Other Meertens language resources

• Apart from part of the sound recordings, all these are web-based and based on the same database technology

• We plan to make CLARIN metadata available for these resources in a stepwise manner: first metadata on the corpus level, later also metadata on the record level

• The technologies involved (OAI-PMH) are new to us, so we want to do this in close cooperation with a “harvesting” institution to make sure that our implementation is correct
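For orientation, this is roughly what a harvester sends to an OAI-PMH repository: an HTTP request whose `verb` parameter names one of the six protocol operations. The sketch only builds request URLs; the base URL and the set name `sand` are placeholders, not real Meertens endpoints. The verbs and the `oai_dc` metadata prefix come from the OAI-PMH 2.0 protocol.

```python
from urllib.parse import urlencode

# The six request types defined by OAI-PMH 2.0.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def oai_request_url(base_url, verb, **params):
    """Build the URL a harvester would request from an OAI-PMH repository."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **params}
    return f"{base_url}?{urlencode(query)}"


# Placeholder repository address, for illustration only.
BASE = "https://example.org/oai"

# Record level: ask for all records of one (hypothetical) set.
records_url = oai_request_url(BASE, "ListRecords",
                              metadataPrefix="oai_dc", set="sand")
```

In such a setup a set per corpus would be one conceivable way to group the record-level metadata, though whether sets map onto corpora here is an assumption.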


Further in the future

• The Meertens Institute wants to be part of CLARIN, and in the future we also hope to contribute to the development of tools for working with language resources


Thank you for your attention!