
Language data and XML: archiving and interoperability

Simon Musgrave

Linguistics Program

Monash University

([email protected])

DRH 2003 - Cheltenham 2/9/03

Language documentation

• Language documentation produces large quantities of text
  – Transcribed language events
  – Associated annotations
  – Lexica / dictionaries
  – Analyses
  – Ethnographic notes
  – …….

• There is no standard software tool used by linguists

• Use of proprietary software results in file formats with limited portability

Advantages of XML: Archiving

• Unicode compatibility assured
  – Besides script possibilities, access to the full International Phonetic Alphabet character set is important for linguists

• Explicit coding of data model

• Generic file format assures better portability and lifespan

Building an archive

• Addition of data to an XML archive should be automated

• This implies the existence of transformation scripts to move data between formats

• Creating these scripts is work which has to be done

• It can have a second benefit

Advantages of XML: Interoperability

• Members of a research team may use different software running on different platforms

• Problems can arise in sharing data

• An important use of XML is as an interchange format

• Transformation scripts created for archiving can also be used for sharing data

Data structures - 1

• Researchers may not agree on common data structures
  – They are used to working with one tool in one particular way
  – Their interests are different

• Even if they agree on a data structure for current work, heritage data may have to be imported to the archive

Data structures - 2

• Archive files must be able to hold all the information coded in all the possible input formats - there should be no loss of data

• We can think of this in terms of the logic of attribute-value matrices: all inputs must be able to unify with the general data structure

• Where possible, correspondences will be made between the information in different input files
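
The unification idea above can be sketched in a few lines. This is only an illustration, not the project's code: it assumes records are held as nested Python dictionaries, and the `unify` function and the sample records are invented for the example.

```python
def unify(a, b):
    """Unify two attribute-value structures held as nested dicts.

    Returns the merged structure; raises ValueError if two atomic
    values for the same attribute disagree.
    """
    result = dict(a)
    for key, value in b.items():
        if key not in result:
            # Attribute only present in b: simply add it.
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            # Both sides have a substructure: unify recursively.
            result[key] = unify(result[key], value)
        elif result[key] != value:
            raise ValueError(f"conflict on {key!r}: {result[key]!r} vs {value!r}")
    return result


# Invented records: each source fills in different attributes of one entry.
source1 = {"headword": "word1", "sense": {"gloss": "first"}}
source2 = {"headword": "word1", "sense": {"pos": "n"}, "language": "LanguageA"}

merged = unify(source1, source2)
print(merged)
```

No information from either input is lost in `merged`, and a genuine disagreement between sources is reported rather than silently overwritten.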

Example: Dictionary files

• The prototype implementation of the process uses a simple type of information: dictionary files

• Source 1 is a FileMaker Pro database of lexical material from the language Nusalaut

• Source 2 is a table in an Access database containing data from several languages

Source 1

Source 2

Process overview

Stage 1 – txt to xml

• Data exported from database as delimited text file

• A document type definition (DTD) is created for each source file
  – This replicates the existing data structure, possibly with additions

• A Perl script reads data from the txt file and adds tags based on the DTD
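
A minimal sketch of this stage follows. The project used Perl; Python is used here purely for illustration. The field names in `FIELDS`, the element names and the sample row are all hypothetical, standing in for whatever the source-specific DTD would define.

```python
import csv
import io
from xml.sax.saxutils import escape

# Hypothetical field names; a real script would follow the source's DTD.
FIELDS = ["headword", "pos", "gloss"]


def rows_to_xml(text, delimiter="\t"):
    """Wrap each delimited row in <entry>, with one child element per field."""
    lines = ["<dictionary>"]
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        lines.append("  <entry>")
        for name, value in zip(FIELDS, row):
            # escape() protects &, < and > so the output stays well-formed.
            lines.append(f"    <{name}>{escape(value)}</{name}>")
        lines.append("  </entry>")
    lines.append("</dictionary>")
    return "\n".join(lines)


# Invented one-row export, tab-delimited.
sample = "word1\tn\tgloss & note\n"
xml_text = rows_to_xml(sample)
print(xml_text)
```

At this stage only elements are produced, mirroring the flat structure of the exported table.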

Sample: specific XML

Stage 1 – Why?

• Newer versions of commercial software offer an export to XML facility

• Importing data from a normalized database often means having access to data from more than one table
  – XSLT takes a single input file
  – Perl (or an equivalent) does not have this limitation

• Type conversion can be done using Perl
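
To illustrate why a general-purpose scripting language helps here, the sketch below joins two exported tables on a key and converts one field from text to an integer. It is written in Python rather than Perl, and the table layouts, column names and values are invented for the example.

```python
import csv
import io

# Invented tab-delimited exports from two tables of a normalized database.
ENTRIES = "id\theadword\tlang_id\n1\tword1\t10\n2\tword2\t11\n"
LANGUAGES = "lang_id\tname\n10\tLanguageA\n11\tLanguageB\n"


def read_table(text):
    """Read a tab-delimited export with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_entries(entries_text, languages_text):
    """Attach the language name to each entry via its foreign key."""
    names = {row["lang_id"]: row["name"] for row in read_table(languages_text)}
    joined = []
    for row in read_table(entries_text):
        joined.append({
            "id": int(row["id"]),  # type conversion: text to integer
            "headword": row["headword"],
            "language": names[row["lang_id"]],
        })
    return joined


records = join_entries(ENTRIES, LANGUAGES)
print(records)
```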

Stage 2 – XML1 to XML2

• DTD for archive file has a place for all information in all input files

• More structure imposed at this level
  – Stage 1 used only elements
  – Stage 2 uses attributes, mainly for metadata
  – “Pseudo-normalization”: recurring data substructures treated as optionally recurring elements
  – The archive data structure is actually more general than ANY of the inputs

• Date stamping done at this stage
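
The following sketch shows one way the second stage could look, under stated assumptions: the element and attribute names (`archive`, `entry`, `gloss`, `source`, `date`) and the numbered `gloss1`/`gloss2` fields in the input are hypothetical, and Python's standard `xml.etree.ElementTree` stands in for whatever tooling the project actually used.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Invented source-specific XML of the kind Stage 1 produces (elements only).
SPECIFIC = """<dictionary>
  <entry><headword>word1</headword><gloss1>first</gloss1><gloss2>second</gloss2></entry>
</dictionary>"""


def to_general(specific_xml, source, stamp):
    """Build the general form: metadata as attributes, glosses as repeatable elements."""
    src_root = ET.fromstring(specific_xml)
    # Metadata (source name, date stamp) is carried in attributes.
    archive = ET.Element("archive", {"source": source, "date": stamp})
    for entry in src_root.iter("entry"):
        out = ET.SubElement(archive, "entry")
        ET.SubElement(out, "headword").text = entry.findtext("headword")
        # "Pseudo-normalization": gloss1, gloss2, ... become repeated <gloss> elements.
        for child in entry:
            if child.tag.startswith("gloss") and child.text:
                ET.SubElement(out, "gloss").text = child.text
    return archive


archive = to_general(SPECIFIC, source="source1", stamp=date.today().isoformat())
print(ET.tostring(archive, encoding="unicode"))
```

Because `<gloss>` may occur any number of times, an input with one gloss column and an input with five both fit the same archive structure.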

Sample: General XML 1

Sample: General XML 2

Exporting Data

• XSLT with <xsl:output method="text"/>

• The only complication is undoing “pseudo-normalization”
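
The export itself is done in XSLT; the sketch below only illustrates the logic of undoing "pseudo-normalization", using Python for consistency with the other examples. Repeated elements are flattened back into a fixed number of columns. The element names, the sample archive and the `gloss_columns` parameter are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Invented archive fragment with optionally recurring <gloss> elements.
ARCHIVE = """<archive source="source1">
  <entry><headword>word1</headword><gloss>first</gloss><gloss>second</gloss></entry>
  <entry><headword>word2</headword><gloss>only</gloss></entry>
</archive>"""


def export_rows(archive_xml, gloss_columns=2):
    """Return tab-delimited lines with a fixed number of gloss columns."""
    lines = []
    for entry in ET.fromstring(archive_xml).iter("entry"):
        glosses = [g.text or "" for g in entry.findall("gloss")]
        # Pad with empty strings, or truncate, so every row has the same width.
        # Truncation loses data, so gloss_columns should suit the target format.
        glosses = (glosses + [""] * gloss_columns)[:gloss_columns]
        lines.append("\t".join([entry.findtext("headword", "")] + glosses))
    return lines


rows = export_rows(ARCHIVE)
print("\n".join(rows))
```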

A more complex problem: aligned interlinear text

• Important way of presenting data for linguists

• Various lines of annotation, different levels have different alignment patterns

The Bird, Bow & Hughes Model

• Bird, Steven, Cathy Bow and Baden Hughes (2003). A generalised model of interlinear text. Proceedings of the EMELD Workshop.

• A general data model for representing this type of information

• Four levels:
  – Text
  – Phrase
  – Word
  – Morpheme
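
As a rough illustration of the four-level hierarchy, the levels can be written as nested record types. Only the level names come from the model as summarised above; the individual fields (`form`, `gloss`, `translation`, `title`) are assumptions for the sketch, not part of the published model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    form: str
    gloss: str = ""  # hypothetical annotation at morpheme level


@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)


@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    translation: str = ""  # hypothetical annotation at phrase level


@dataclass
class Text:
    title: str
    phrases: List[Phrase] = field(default_factory=list)


# Invented example: one phrase, one word, two morphemes.
text = Text(
    "sample",
    [Phrase([Word("ab", [Morpheme("a", "X"), Morpheme("b", "Y")])], "free translation")],
)
morpheme_glosses = [m.gloss for p in text.phrases for w in p.words for m in w.morphemes]
print(morpheme_glosses)
```

Each annotation attaches to exactly one level, which is what lets different lines of an interlinear display have different alignment patterns.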

XML model for aligned text

Aligned text: Problems

• Various types of input:
  – Text strings with space and/or tabs (Shoebox)
  – Formatted text (e.g. Word tables)
  – Structured data (e.g. Spinoza database)

• Type of processing varies
  – Text strings need a lot of parsing
  – Structured data needs access to multiple tables

• Ideally, time alignment to AV source should be included also
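
To give a sense of the parsing that space-aligned text strings require, here is a simplified sketch. It assumes alignment by spaces only (no tabs), that every annotation starts in the same column as its word, and that no annotation runs into the next column; real Shoebox output would need more careful handling. The `align` function and the sample lines are invented.

```python
import re


def align(word_line, gloss_line):
    """Pair each word with the gloss text found in the same column range."""
    # Column where each word starts on the word line.
    starts = [m.start() for m in re.finditer(r"\S+", word_line)]
    # Each column ends where the next begins; the last runs to the end of the longer line.
    ends = starts[1:] + [max(len(word_line), len(gloss_line))]
    pairs = []
    for start, end in zip(starts, ends):
        word = word_line[start:end].strip()
        gloss = gloss_line[start:end].strip()
        pairs.append((word, gloss))
    return pairs


# Invented lines, padded so that columns start at positions 0, 6 and 12.
words = "aa".ljust(6) + "bbb".ljust(6) + "c"
glosses = "one".ljust(6) + "two".ljust(6) + "three"

pairs = align(words, glosses)
print(pairs)
```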

What is gained

• Interoperability within the project
  – Data can be imported to the archive file from one format and exported to another format

• Interoperability outside the project
  – People who wish to share data with a group will define transformations from their data formats
  – A bottom-up approach to developing standards

• Improved data modeling
  – Encourages members of the project to revise their data formats
  – Gives us help in developing high-level models for linguistic data

Future work

• Processing aligned text formats

• Using schemas rather than DTDs: data validation

• Improved version control, especially checking for duplicate or conflicting records

Some details

• This work is part of the project Endangered Maluku Languages: Eastern Indonesia and the Dutch Diaspora

• Funding:
  – Hans Rausing Endangered Languages Project
  – Australian Research Council
  – Faculty of Arts, Monash University

• Contacts:
  – [email protected]
  – http://www.arts.monash.edu.au/ling/maluku