Language data and XML: archiving and interoperability

Simon Musgrave

Linguistics Program

Monash University

([email protected])

DRH 2003 - Cheltenham 2/9/03

Language documentation

• Language documentation produces large quantities of text
  – Transcribed language events
  – Associated annotations
  – Lexica / dictionaries
  – Analyses
  – Ethnographic notes
  – …

• There is no standard software tool used by linguists

• Use of proprietary software results in file formats with limited portability


Advantages of XML: Archiving

• Unicode compatibility assured
  – Besides script possibilities, access to the full International Phonetic Alphabet character set is important for linguists

• Explicit coding of data model

• Generic file format assures better portability and lifespan


Building an archive

• Addition of data to an XML archive should be automated

• This implies the existence of transformation scripts to move data between formats

• Writing these scripts is work that has to be done anyway

• The scripts can also bring a second benefit


Advantages of XML: Interoperability

• Members of a research team may use different software running on different platforms

• Problems can arise in sharing data

• An important use of XML is as an interchange format

• Transformation scripts created for archiving can also be used for sharing data


Data structures - 1

• Researchers may not agree on common data structures
  – They are used to working with one tool in one particular way
  – Their interests are different

• Even if they agree on a data structure for current work, heritage data may have to be imported to the archive


Data structures - 2

• Archive files must be able to hold all the information coded in all the possible input formats - there should be no loss of data

• We can think of this in terms of the logic of attribute-value matrices: all inputs must be able to unify with the general data structure

• Where possible, correspondences will be made between the information in different input files
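The unification requirement can be illustrated with a minimal sketch, assuming attribute-value matrices are represented as nested Python dictionaries and that `None` marks an unspecified value. The function name `unify`, the field names and the sample values are invented for illustration and are not taken from the project.

```python
def unify(general, record):
    """Unify two attribute-value matrices held as nested dicts.

    None stands for "unspecified". Conflicting atomic values raise
    ValueError, which signals that the general structure cannot hold
    the input without loss.
    """
    result = dict(general)
    for key, value in record.items():
        if value is None:
            continue  # nothing to add for this attribute
        if key not in result or result[key] is None:
            result[key] = value  # attribute was free: take the input value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = unify(result[key], value)  # recurse into sub-matrix
        elif result[key] != value:
            raise ValueError(f"conflict on {key!r}: {result[key]!r} vs {value!r}")
    return result


# Hypothetical archive template and one input record (invented values).
template = {"headword": None, "gloss": {"english": None, "indonesian": None}}
record = {"headword": "w1", "gloss": {"english": "house"}}
merged = unify(template, record)
```

An attribute that exists only in one input is simply added to the result, which is the sense in which the archive structure ends up at least as general as every input.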


Example: Dictionary files

• The prototype implementation of the process uses a simple type of information: dictionary files

• Source 1 is a FilemakerPro database of lexical material from the language Nusalaut

• Source 2 is a table in an Access database containing data from several languages


Source 1


Source 2


Process overview


Stage 1 – txt to xml

• Data exported from database as delimited text file

• A document type definition (DTD) is created for each source file
  – This replicates the existing data structure, possibly with additions

• A Perl script reads data from the txt file and adds tags based on the DTD
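The project used Perl for this step. The following is only a sketch of the same idea in Python, assuming a tab-delimited export whose first row holds field names that are also valid XML element names. The function name, the `dictionary` and `entry` tags and the sample row are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def delimited_to_xml(text, root_tag="dictionary", row_tag="entry", delimiter="\t"):
    """Wrap each row of a delimited export in elements named after its columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    root = ET.Element(root_tag)
    for row in reader:
        entry = ET.SubElement(root, row_tag)
        for field_name, value in row.items():
            # Skip empty cells and any surplus cells without a column name.
            if field_name and value:
                ET.SubElement(entry, field_name).text = value
    return root


# Invented sample export: a header row followed by one record.
sample = "headword\tgloss\nw1\thouse\n"
print(ET.tostring(delimited_to_xml(sample), encoding="unicode"))
```

Only elements are produced here, matching the statement that Stage 1 uses no attributes. Validation against the source-specific DTD is not shown.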


Sample: specific XML


Stage 1 – Why?

• Newer versions of commercial software offer an export to XML facility

• Importing data from a normalized database often means having access to data from more than one table
  – XSLT takes a single input file
  – Perl (or an equivalent) does not have this limitation

• Type conversion can be done using Perl
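As a sketch of why a general-purpose scripting language helps here, the following Python code (standing in for Perl) reads two exported tables, converts the key fields from text to integers, and joins them. The table layout, the column names `id`, `entry_id`, `headword` and `gloss`, and the sample rows are assumptions made for illustration.

```python
import csv
import io


def read_table(text, delimiter="\t"):
    """Read one delimited export into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def join_entries(entries_text, glosses_text):
    """Attach every gloss row to the entry it refers to."""
    glosses = {}
    for row in read_table(glosses_text):
        # Type conversion: the exported key is text, the join uses integers.
        glosses.setdefault(int(row["entry_id"]), []).append(row["gloss"])
    joined = []
    for row in read_table(entries_text):
        entry_id = int(row["id"])
        joined.append({
            "id": entry_id,
            "headword": row["headword"],
            "glosses": glosses.get(entry_id, []),
        })
    return joined


# Invented exports from two tables of a normalized database.
entries = "id\theadword\n1\tw1\n2\tw2\n"
gloss_rows = "entry_id\tgloss\n1\thouse\n1\thome\n"
combined = join_entries(entries, gloss_rows)
```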


Stage 2 – XML1 to XML2

• DTD for archive file has a place for all information in all input files

• More structure imposed at this level
  – Stage 1 used only elements
  – Stage 2 uses attributes, mainly for metadata
  – “Pseudo-normalization”: recurring data substructures treated as optionally recurring elements; the archive data structure is actually more general than ANY of the inputs

• Date stamping done at this stage
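A minimal sketch of such a Stage 2 transformation, assuming source-specific fields such as `gloss_en` and `gloss_id` are mapped onto one repeatable `gloss` element carrying a `lang` attribute, and that the entry records its source and date as attributes. All element and attribute names here are hypothetical, not those of the project DTDs.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical mapping: source element name -> (archive element, attributes).
FIELD_MAP = {
    "headword": ("form", {}),
    "gloss_en": ("gloss", {"lang": "en"}),
    "gloss_id": ("gloss", {"lang": "id"}),
}


def to_archive_entry(specific_entry, source, stamp=None):
    """Convert one source-specific entry into the general archive form."""
    stamp = stamp or datetime.date.today().isoformat()  # date stamping
    entry = ET.Element("entry", {"source": source, "added": stamp})
    for child in specific_entry:
        # Unmapped fields keep their own name so that no data is lost.
        tag, attrs = FIELD_MAP.get(child.tag, (child.tag, {}))
        ET.SubElement(entry, tag, dict(attrs)).text = child.text
    return entry


# Invented Stage 1 entry.
stage1 = ET.fromstring(
    "<entry><headword>w1</headword><gloss_en>house</gloss_en>"
    "<gloss_id>rumah</gloss_id><note>n</note></entry>"
)
archived = to_archive_entry(stage1, "source1", stamp="2003-09-02")
```

Because `gloss` may occur any number of times, an input with one gloss column and an input with several both fit the same archive structure.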


Sample: General XML 1


Sample: General XML 2


Exporting Data

• XSLT with <xsl:output method="text"/>

• The only complication is undoing “pseudo-normalization”
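The project does this step in XSLT. The sketch below shows the same logic in Python, assuming the archive stores glosses as a repeatable `gloss` element and the target format expects a fixed number of gloss columns. The element names, the `max_glosses` parameter and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def export_rows(archive_root, max_glosses=2, delimiter="\t"):
    """Flatten archive entries into delimited text with fixed gloss columns."""
    lines = []
    for entry in archive_root.findall("entry"):
        glosses = [g.text or "" for g in entry.findall("gloss")]
        # Undo pseudo-normalization: pad (or cut) to the fixed column count.
        glosses = (glosses + [""] * max_glosses)[:max_glosses]
        lines.append(delimiter.join([entry.findtext("form", default="")] + glosses))
    return "\n".join(lines)


# Invented archive fragment with one and two glosses per entry.
archive = ET.fromstring(
    "<dictionary>"
    "<entry><form>w1</form><gloss>house</gloss><gloss>home</gloss></entry>"
    "<entry><form>w2</form><gloss>tree</gloss></entry>"
    "</dictionary>"
)
exported = export_rows(archive)
```

Cutting surplus glosses loses information, so an export target with fewer columns than the archive has values would need a different policy, such as joining the extra values into one cell.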


A more complex problem: aligned interlinear text

• Important way of presenting data for linguists

• Various lines of annotation, different levels have different alignment patterns


The Bird, Bow & Hughes Model

• Bird, Steven, Cathy Bow and Baden Hughes (2003). A generalised model of interlinear text. Proceedings of the EMELD Workshop.

• A general data model for representing this type of information

• Four levels:
  – Text
  – Phrase
  – Word
  – Morpheme
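The four levels can be pictured as a nested data structure. This is only a sketch: the slide names the levels, while the attributes attached to each level below (`form`, `gloss`, `translation`, `title`) are assumptions and are not taken from the cited paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    form: str
    gloss: str = ""


@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)


@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    translation: str = ""


@dataclass
class Text:
    title: str
    phrases: List[Phrase] = field(default_factory=list)


def gloss_lines(text):
    """Build one hyphen-joined gloss string per word, grouped by phrase."""
    return [
        ["-".join(m.gloss for m in word.morphemes) for word in phrase.words]
        for phrase in text.phrases
    ]


# English stand-in example, not project data.
sample_text = Text("demo", [Phrase(
    words=[
        Word("dogs", [Morpheme("dog", "dog"), Morpheme("s", "PL")]),
        Word("barked", [Morpheme("bark", "bark"), Morpheme("ed", "PST")]),
    ],
    translation="Dogs barked.",
)])
```

Each annotation line of an interlinear display belongs to one level, which is why different lines align differently: a free translation aligns with the phrase, a gloss with the morpheme.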


XML model for aligned text


Aligned text: Problems

• Various types of input:
  – Text strings with space and/or tabs (Shoebox)
  – Formatted text (e.g. Word tables)
  – Structured data (e.g. Spinoza database)

• Type of processing varies
  – Text strings need a lot of parsing
  – Structured data needs access to multiple tables

• Ideally, time alignment to AV source should be included also
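To give a sense of the parsing that text-string input requires, here is a minimal sketch that pairs words with annotations by column position. It assumes two lines padded with spaces only (no tabs), any field markers already stripped, and every annotation starting in the same column as its word. Real files will often break these assumptions, and the function name is hypothetical.

```python
import re


def align_by_columns(base_line, annotation_line):
    """Pair each word of base_line with the annotation text beneath it."""
    starts = [m.start() for m in re.finditer(r"\S+", base_line)]
    line_end = max(len(base_line), len(annotation_line))
    pairs = []
    for i, start in enumerate(starts):
        # A column runs from this word's start to the next word's start.
        end = starts[i + 1] if i + 1 < len(starts) else line_end
        pairs.append((base_line[start:end].strip(), annotation_line[start:end].strip()))
    return pairs


# English stand-in example, not project data.
words = "dogs    barked"
glosses = "dog-PL  bark-PST"
aligned = align_by_columns(words, glosses)
```

Tab-separated or mixed input would need the columns to be reconstructed first, which is part of why this kind of source needs a lot of parsing.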


What is gained

• Interoperability within the project
  – Data can be imported to the archive file from one format and exported to another format

• Interoperability outside the project
  – People who wish to share data with a group will define transformations from their data formats
  – A bottom-up approach to developing standards

• Improved data modeling
  – Encourages members of the project to revise their data formats
  – Gives us help in developing high-level models for linguistic data


Future work

• Processing aligned text formats

• Using schemas rather than DTDs: data validation

• Improved version control, especially checking for duplicate or conflicting records


Some details

• This work is part of the project Endangered Maluku Languages: Eastern Indonesia and the Dutch Diaspora

• Funding:
  – Hans Rausing Endangered Languages Project
  – Australian Research Council
  – Faculty of Arts, Monash University

• Contacts:
  – [email protected]
  – http://www.arts.monash.edu.au/ling/maluku

