Language data and XML: archiving and interoperability
Simon Musgrave
Linguistics Program
Monash University
DRH 2003 - Cheltenham 2/9/03
Language documentation
• Language documentation produces large quantities of text
  – transcribed language events
  – associated annotations
  – lexica / dictionaries
  – analyses
  – ethnographic notes
  – …
• There is no standard software tool used by linguists
• Use of proprietary software results in file formats with limited portability
Advantages of XML: Archiving
• Unicode compatibility assured
  – Besides script possibilities, access to the full International Phonetic Alphabet character set is important for linguists
• Explicit coding of data model
• Generic file format assures better portability and lifespan
Building an archive
• Addition of data to an XML archive should be automated
• This implies the existence of transformation scripts to move data between formats
• Creating these scripts is work which has to be done
• It can have a second benefit
Advantages of XML: Interoperability
• Members of a research team may use different software running on different platforms
• Problems can arise in sharing data
• An important use of XML is as an interchange format
• Transformation scripts created for archiving can also be used for sharing data
Data structures - 1
• Researchers may not agree on common data structures
  – They are used to working with one tool in one particular way
  – Their interests are different
• Even if they agree on a data structure for current work, heritage data may have to be imported to the archive
Data structures - 2
• Archive files must be able to hold all the information coded in all the possible input formats - there should be no loss of data
• We can think of this in terms of the logic of attribute-value matrices: all inputs must be able to unify with the general data structure
• Where possible, correspondences will be made between the information in different input files
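The unification idea above can be illustrated with a small sketch. This is not the project's code: it assumes attribute-value matrices are represented as nested Python dictionaries, and the function name `unify` and the sample records are hypothetical.

```python
def unify(a, b):
    """Unify two attribute-value matrices held as nested dicts.

    Attributes present in only one input are copied over; nested matrices
    are unified recursively; conflicting atomic values raise ValueError.
    """
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val
        elif isinstance(result[key], dict) and isinstance(b_val, dict):
            result[key] = unify(result[key], b_val)
        elif result[key] != b_val:
            raise ValueError(f"conflict on {key!r}: {result[key]!r} vs {b_val!r}")
    return result


# Hypothetical records for the same headword from two different sources.
source1 = {"headword": "word1", "gloss": {"en": "first gloss"}}
source2 = {"headword": "word1", "gloss": {"id": "second gloss"}, "pos": "n"}
merged = unify(source1, source2)
```

The merged record keeps every attribute from both inputs, which is the "no loss of data" requirement; a conflict is reported rather than silently resolved.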
Example: Dictionary files
• The prototype implementation of the process uses a simple type of information: dictionary files
• Source 1 is a FileMaker Pro database of lexical material from the language Nusalaut
• Source 2 is a table in an Access database containing data from several languages
Source 1
Source 2
Process overview
Stage 1 – txt to xml
• Data exported from database as delimited text file
• A document type definition (DTD) is created for each source file
  – This replicates the existing data structure, possibly with additions
• A Perl script reads data from the txt file and adds tags based on the DTD
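A minimal sketch of this stage follows. The project used Perl; Python is substituted here purely for illustration. The tag names (`dictionary`, `entry`), the column names, and the sample rows are assumptions, and the sketch assumes the column headers are valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(text, delimiter="\t", root_tag="dictionary", row_tag="entry"):
    """Wrap each row of a delimited export in elements named after its columns."""
    root = ET.Element(root_tag)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        entry = ET.SubElement(root, row_tag)
        for field, value in row.items():
            if field and value:  # skip empty cells and unnamed overflow columns
                ET.SubElement(entry, field).text = value
    return root


# Hypothetical tab-delimited export with a header row.
sample = "headword\tgloss\tpos\nword1\tfirst gloss\tn\nword2\tsecond gloss\t\n"
xml_text = ET.tostring(rows_to_xml(sample), encoding="unicode")
```

The output mirrors the source table one-to-one, using elements only, which matches the description of the source-specific XML at this stage.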
Sample: specific XML
Stage 1 – Why?
• Newer versions of commercial software offer an export to XML facility
• Importing data from a normalized database often means having access to data from more than one table
  – XSLT takes a single input file
  – Perl (or an equivalent) does not have this limitation
• Type conversion can be done using Perl
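The multi-table point can be sketched as follows, again with Python standing in for Perl. The two-table layout (an entries table and a senses table linked by `entry_id`) and all names are hypothetical, not taken from the project's databases.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def read_table(text):
    """Read a tab-delimited export with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_to_xml(entries_text, senses_text):
    """Combine two exported tables into a single XML tree."""
    senses_by_entry = defaultdict(list)
    for sense in read_table(senses_text):
        senses_by_entry[sense["entry_id"]].append(sense)
    root = ET.Element("dictionary")
    for row in read_table(entries_text):
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "headword").text = row["headword"]
        # Look up the related rows from the second table by foreign key.
        for sense in senses_by_entry.get(row["id"], []):
            ET.SubElement(entry, "gloss").text = sense["gloss"]
    return root


entries = "id\theadword\n1\tword1\n2\tword2\n"
senses = "entry_id\tgloss\n1\tgloss a\n1\tgloss b\n"
joined = ET.tostring(join_to_xml(entries, senses), encoding="unicode")
```

A general-purpose scripting language can open as many export files as the join requires, which is the advantage claimed for this stage.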
Stage 2 – XML1 to XML2
• DTD for archive file has a place for all information in all input files
• More structure imposed at this level
  – Stage 1 used only elements
  – Stage 2 uses attributes, mainly for metadata
  – “Pseudo-normalization”: recurring data substructures treated as optionally recurring elements
  – The archive data structure is actually more general than ANY of the inputs
• Date stamping done at this stage
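One way to picture this stage is the sketch below. It is an illustration under assumptions, not the project's transformation: the element names (`archive`, `form`, `gloss`), the attribute names (`source`, `added`), and the numbered `gloss1`, `gloss2` columns in the input are all hypothetical.

```python
import datetime
import xml.etree.ElementTree as ET


def to_archive(specific_root, source_name, today=None):
    """Map source-specific XML onto a more general archive structure."""
    stamp = (today or datetime.date.today()).isoformat()
    archive = ET.Element("archive")
    for src in specific_root.iter("entry"):
        # Metadata goes into attributes, including the date stamp.
        entry = ET.SubElement(archive, "entry", {"source": source_name, "added": stamp})
        ET.SubElement(entry, "form").text = src.findtext("headword")
        # "Pseudo-normalization": gloss1, gloss2, ... become repeated <gloss> elements.
        for child in src:
            if child.tag.startswith("gloss") and child.text:
                ET.SubElement(entry, "gloss").text = child.text
    return archive


specific = ET.fromstring(
    "<dictionary><entry><headword>word1</headword>"
    "<gloss1>a</gloss1><gloss2>b</gloss2></entry></dictionary>"
)
archive = to_archive(specific, "source1", datetime.date(2003, 9, 2))
```

Because `<gloss>` may repeat any number of times, the archive entry can hold inputs with one gloss column, several, or none.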
Sample: General XML 1
Sample: General XML 2
Exporting Data
• XSLT with <xsl:output method="text"/>
• The only complication is undoing “pseudo-normalization”
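The export step, including the undoing of “pseudo-normalization”, might look like the sketch below. The project used XSLT for this; Python is used here only to keep the illustration self-contained. The element names and the fixed number of gloss columns in the target format are assumptions.

```python
import xml.etree.ElementTree as ET


def export_rows(archive_root, max_glosses=2, delimiter="\t"):
    """Flatten archive entries into delimited rows with a fixed column count."""
    lines = []
    for entry in archive_root.iter("entry"):
        glosses = [g.text or "" for g in entry.findall("gloss")]
        if len(glosses) > max_glosses:
            # Refuse to drop data silently when the target format is too narrow.
            raise ValueError("entry has more glosses than the target format has columns")
        # Pad so every row has the same number of gloss columns.
        glosses += [""] * (max_glosses - len(glosses))
        form = entry.findtext("form", default="")
        lines.append(delimiter.join([form] + glosses))
    return "\n".join(lines)


archive_doc = ET.fromstring(
    "<archive>"
    "<entry><form>word1</form><gloss>a</gloss><gloss>b</gloss></entry>"
    "<entry><form>word2</form><gloss>c</gloss></entry>"
    "</archive>"
)
exported = export_rows(archive_doc)
```

Repeated elements are spread back over fixed columns, and missing ones become empty cells, so the result can be imported into a flat table.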
A more complex problem: aligned interlinear text
• Important way of presenting data for linguists
• Various lines of annotation, different levels have different alignment patterns
The Bird, Bow & Hughes Model
• Bird, Steven, Cathy Bow and Baden Hughes (2003). A generalised model of interlinear text. Proceedings of the EMELD Workshop.
• A general data model for representing this type of information
• Four levels:
  – Text
  – Phrase
  – Word
  – Morpheme
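The four levels can be pictured as nested containers. The sketch below is a simplification for illustration only and should not be read as the published model: the class fields, the `gloss_line` helper, and the English sample phrase are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Morpheme:
    form: str
    gloss: str


@dataclass
class Word:
    form: str
    morphemes: List[Morpheme] = field(default_factory=list)


@dataclass
class Phrase:
    words: List[Word] = field(default_factory=list)
    translation: str = ""


@dataclass
class Text:
    title: str
    phrases: List[Phrase] = field(default_factory=list)


def gloss_line(phrase):
    """Render the morpheme glosses of a phrase as one interlinear line."""
    return " ".join("-".join(m.gloss for m in w.morphemes) for w in phrase.words)


phrase = Phrase(
    words=[
        Word("dogs", [Morpheme("dog", "dog"), Morpheme("s", "PL")]),
        Word("bark", [Morpheme("bark", "bark")]),
    ],
    translation="Dogs bark.",
)
text = Text(title="sample", phrases=[phrase])
line = gloss_line(phrase)
```

Each annotation attaches at its own level: the free translation belongs to the phrase, while glosses belong to morphemes.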
XML model for aligned text
Aligned text: Problems
• Various types of input:
  – Text strings with space and/or tabs (Shoebox)
  – Formatted text (e.g. Word tables)
  – Structured data (e.g. Spinoza database)
• Type of processing varies
  – Text strings need a lot of parsing
  – Structured data needs access to multiple tables
• Ideally, time alignment to AV source should be included also
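To give a sense of the parsing that space-aligned text strings require, here is a rough sketch. It assumes two lines padded with spaces so that each gloss starts in the same column as its word and does not run into the next column; real input is likely to be messier, and the function name and sample lines are hypothetical.

```python
import re


def align_by_column(text_line, gloss_line):
    """Pair each word with the gloss that occupies the same column range."""
    starts = [m.start() for m in re.finditer(r"\S+", text_line)]
    # Each column ends where the next word starts; the last runs to end of line.
    ends = starts[1:] + [max(len(text_line), len(gloss_line))]
    pairs = []
    for start, end in zip(starts, ends):
        word = text_line[start:end].strip()
        gloss = gloss_line[start:end].strip()
        pairs.append((word, gloss))
    return pairs


pairs = align_by_column("dogs   bark", "dog-PL bark")
```

The alignment is recovered from character offsets alone, which is why this kind of input is fragile compared with structured data.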
What is gained
• Interoperability within the project
  – Data can be imported to the archive file from one format and exported to another format
• Interoperability outside the project
  – People who wish to share data with a group will define transformations from their data formats
  – A bottom-up approach to developing standards
• Improved data modeling
  – Encourages members of the project to revise their data formats
  – Gives us help in developing high-level models for linguistic data
Future work
• Processing aligned text formats
• Using schemas rather than DTDs: data validation
• Improved version control, especially checking for duplicate or conflicting records
Some details
• This work is part of the project Endangered Maluku Languages: Eastern Indonesia and the Dutch Diaspora
• Funding:
  – Hans Rausing Endangered Languages Project
  – Australian Research Council
  – Faculty of Arts, Monash University
• Contacts:
  – [email protected]
  – http://www.arts.monash.edu.au/ling/maluku