TEI FOR LANGUAGE RESOURCES: A MISSED CHANCE OR A COMING OPPORTUNITY?

Tomaž Erjavec
Dept. of Knowledge Technologies
Jožef Stefan Institute
Ljubljana, Slovenia


Overview
1. Some history
2. Why TEI isn't used for LRs (as much as expected)
3. MULTEXT-East and other case studies
4. Conclusions


History
At its inception TEI was meant to cover CL/NLP LRs, esp. corpora:
• ACL was one of the supporting associations
• Modules for corpora, linguistic analysis, feature-structures, graphs
• BNC in TEI
• At the time CL/NLP did not use SGML: clear playing field


The age of XML and LRs
The release of XML (more or less) corresponds to the beginning of the era of language resources:
1998: XML 1.0, first LREC conference

But the LRs that were developed (mostly) did not use TEI. Why?


Reason 1: (X)CES
• EAGLES Corpus Encoding Standard
• "constraining or simplifying the TEI specifications in order to ensure interoperability" (Ide 1998)
• So, more compact and easier to apply than TEI
• Almost TEI, but not quite
• No methods for extension


Reason 2: Comp Sci attitude
• I don't care about the data format, I want to develop algorithms! (... I even hate XML...)
• If I use XML I will roll my own schema, optimal for my experiments / application (...that's what 'X' means...)
• I won't spend weeks (months, years) just getting to know TEI (...I need only 4 different elements anyway...)


Reason 3: General gripes
• Missing modules for syntactic analysis & lexical databases
• Not prescriptive / precise enough
• Too general elements
• Too book oriented


Result
• Project-local proposals:
  • TIGER treebank format
  • Concede lexical database format
  • GENIA NER format
  • ...
• Semantic Web: DC, RDF, OWL
• ISO TC 37 SC4:
  • LMF, ISOcat,
  • LAF, MAF, SynAF, ...


MyTEI
• MULTEXT-East: multilingual corpora and lexica
• Fida(PLUS): Slovene Reference Corpus
• IJS-ELAN, SVEZ-IJS: en-sl parallel corpora
• jaSlo: Japanese-Slovene L2 dictionary
• eZISS: Scholarly Digital Editions of Slovene Literature
• JRC-ACQUIS: Parallel corpus of EC laws
• SDT: Slovene Dependency Treebank
• SBL: Slovene Biographic Lexicon
• AHLib: DL/corpus of 19th century Slovene books
• JOS: Slovene gold-standard corpus for HLT
• MULTEXT-East...


MULTEXT-East
• EU project 1995-97: MULTEXT sequel
• Development of standardised language resources for Central and Eastern European languages + English hub
• Corpora, lexica, morphosyn. specifications
• V1: 1998, 7 languages, LaTeX + CES/SGML
• V4: 2010, 16 languages, TEI P5
• http://nl.ijs.si/ME/


MULTEXT-East Version 4 by language and resource type


Why TEI for MTE?
• Because I like TEI
• Varied resources:
  • Metadata / Documentation
  • "Document" corpus: rich annotation structure
  • Linguistically annotated "1984" corpus
  • Sentence alignments: stand-off markup
  • Morphosyntactic specifications: book-like

Either choose several (moving target) schemas or use TEI.


Documentation


TEI Header
• v4
• v3
• v2
• v1
• eci
• ota
• soas


Annotated 1984
<text xml:id="Osl." xml:lang="sl">
 <body>
  <div type="part" xml:id="Osl.1">
   <div type="chapter" xml:id="Osl.1.2">
    <p xml:id="Osl.1.2.2">
     <s xml:id="Osl.1.2.2.1">
      <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
      <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
      <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
      <c xml:id="Osl.1.2.2.1.4">,</c> ← sorry!
      <w xml:id="Osl.1.2.2.1.5" lemma="mrzel" ana="#Agpmsnn">mrzel</w>
      <w xml:id="Osl.1.2.2.1.6" lemma="aprilski" ana="#Agpmsny">aprilski</w>
      <w xml:id="Osl.1.2.2.1.7" lemma="dan" ana="#Ncmsn">dan</w>
      <w xml:id="Osl.1.2.2.1.8" lemma="in" ana="#Cc">in</w>
      <w xml:id="Osl.1.2.2.1.9" lemma="ura" ana="#Ncfpn">ure</w>
      <w xml:id="Osl.1.2.2.1.10" lemma="biti" ana="#Va-r3p-n">so</w>
      <w xml:id="Osl.1.2.2.1.11" lemma="biti" ana="#Va-p-pf">bile</w>
      <w xml:id="Osl.1.2.2.1.12" lemma="trinajst" ana="#Mlc-pa">trinajst</w>
      <c xml:id="Osl.1.2.2.1.13">.</c>
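To illustrate how such word-level annotation can be consumed, here is a minimal Python sketch that reads the id, word form, lemma and MSD of each token. It is not part of the MULTEXT-East tooling. The function name read_tokens and the shortened, closed-off sample sentence are assumptions made for the example. Element names are compared by local name, so the sketch should also work when the elements are in the TEI namespace, as they would be in a complete TEI P5 file.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the fixed XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened, closed-off version of the sentence on the slide (illustration only).
SAMPLE = """<s xml:id="Osl.1.2.2.1">
 <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
 <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
 <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
 <c xml:id="Osl.1.2.2.1.4">,</c>
</s>"""


def read_tokens(xml_text):
    """Return (id, form, lemma, msd) tuples for <w> and <c> elements, in document order."""
    rows = []
    for el in ET.fromstring(xml_text).iter():
        local_name = el.tag.rsplit("}", 1)[-1]  # drop a namespace prefix if present
        if local_name == "w":
            ana = el.get("ana", "")
            msd = ana[1:] if ana.startswith("#") else ana  # "#Va-p-sm" -> "Va-p-sm"
            rows.append((el.get(XML_ID), el.text, el.get("lemma"), msd))
        elif local_name == "c":
            # Punctuation carries neither lemma nor MSD.
            rows.append((el.get(XML_ID), el.text, None, None))
    return rows


for row in read_tokens(SAMPLE):
    print(row)
```

The ana attribute is a pointer, which is why the leading "#" is stripped before the value is used as an MSD.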


Whitespace
• A long time ago "1984" lost its spaces
• Whitespace is brittle but important:
  • Retokenisation
  • Reading
• TEI <space> no good!
• So <mte:space> </mte:space>, 24:1?
• Sitting on the fence JOS solution: <S/>
• <mte:g/>?
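A small Python sketch of why an explicit whitespace marker helps with reading: given tokens separated by an empty marker element, the running text can be restored without guessing where spaces belong. The marker name S, the function name detokenise and the sample sentence are assumptions for illustration. This is not the JOS or MULTEXT-East implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentence: an empty <S/> stands for one space between tokens.
SAMPLE = (
    "<s><w>Bil</w><S/><w>je</w><S/><w>jasen</w><c>,</c><S/>"
    "<w>mrzel</w><S/><w>dan</w><c>.</c></s>"
)


def detokenise(sentence_xml, space_tag="S"):
    """Rebuild the surface string from tokens and explicit space markers."""
    parts = []
    for el in ET.fromstring(sentence_xml):
        if el.tag == space_tag:
            parts.append(" ")            # marker element: emit one space
        else:
            parts.append(el.text or "")  # word or punctuation token
    return "".join(parts)


print(detokenise(SAMPLE))
```

Without the markers, the comma and the full stop could not be attached to the preceding word without language-specific rules.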


Sentence alignments

In MTE V3:
<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE cesAlign SYSTEM "xcesAlign.dtd">
<cesAlign version="4.1">
 <linkList id="Oruen">
  <linkGrp type="body" targType="s" domains="Oru Oen">
   <link xtargets="Oru.1.1.1.1 ; Oen.1.1.1.1"/>
   <link xtargets="Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6"/>
   <link xtargets="Oru.1.3.4.1 ; Oen.1.3.4.1 Oen.1.3.4.2"/>
   <link xtargets=" ; Oen.1.3.4.3"/>

TEI P5 Alignments
• TEI way is with two-level indirection: 1st grouping, 2nd alignment
• Too complicated, esp. as 98% of alignments are 1-1
• Chose fence-sitting one-level:

<linkGrp type="alignment" corresp="oana-mk.xml oana-sl.xml">
 <link n="1:1" targets="oana-mk.xml#Omk.1.1.1.1 oana-sl.xml#Osl.1.2.2.1"/>
 <link n="2:1" targets="oana-mk.xml#Omk.1.1.2.6 oana-mk.xml#Omk.1.1.2.7 oana-sl.xml#Osl.1.2.3.6"/>
 <link n="1:2" targets="oana-mk.xml#Omk.1.1.2.8 oana-sl.xml#Osl.1.2.3.7 oana-sl.xml#Osl.1.2.3.8"/>
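The relation between the two encodings can be sketched in a few lines of Python: an XCES-style xtargets value is split at the semicolon, and each id is turned into a file-qualified pointer, with the n attribute recording the alignment ratio. This is only an illustration of the mapping, not the conversion actually used for MULTEXT-East. The function name ces_link_to_tei and the file names oana-ru.xml and oana-en.xml are assumptions, modelled on the file names in the example above.

```python
def ces_link_to_tei(xtargets, src_file, tgt_file):
    """Map an XCES xtargets string to the (n, targets) pair of a one-level TEI <link>."""
    src_part, tgt_part = xtargets.split(";")
    src_ids = src_part.split()  # may be empty for a 0:1 alignment
    tgt_ids = tgt_part.split()  # may be empty for a 1:0 alignment
    n = f"{len(src_ids)}:{len(tgt_ids)}"
    pointers = [f"{src_file}#{i}" for i in src_ids]
    pointers += [f"{tgt_file}#{i}" for i in tgt_ids]
    return n, " ".join(pointers)


# File names are hypothetical; the ids come from the V3 example.
n, targets = ces_link_to_tei(
    "Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6", "oana-ru.xml", "oana-en.xml"
)
print(n)        # 2:1
print(targets)
```

Because every pointer names its file, the one-level link needs no separate grouping step, which is the simplification the slide argues for.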


Morphosyntactic specifications
• Define categories (PoS) and their features
• Map feature-structures to morphosyntactic descriptions (MSD tagsets)
• Specify which languages have which features and tagsets
• E.g. [Category=Adverb Type=general Degree=superlative] ≡ Rgs ∈ Tagset_sl
• Complex morphology → complex specifications
• MSD tagsets are grounded in lexicon and corpus
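The mapping between an MSD and its feature structure is positional: the first character gives the category, and each following character gives the value of the attribute at that position, with a hyphen marking an attribute that does not apply. A minimal Python sketch follows. The SPEC table is a tiny hypothetical fragment containing only values that appear on these slides, not the real specifications, and decode_msd is an illustrative name.

```python
# Hypothetical fragment: category code -> (category name, ordered attributes).
# Each attribute is (name, {value code: value name}).
SPEC = {
    "R": ("Adverb", [("Type", {"g": "general"}),
                     ("Degree", {"s": "superlative"})]),
    "Q": ("Particle", [("Type", {"z": "negative", "q": "interrogative"})]),
}


def decode_msd(msd):
    """Expand a positional MSD string into a feature dictionary."""
    category, attributes = SPEC[msd[0]]
    codes = msd[1:]
    if len(codes) > len(attributes):
        raise ValueError(f"MSD {msd!r} has more positions than the specification")
    features = {"Category": category}
    for (name, values), code in zip(attributes, codes):
        if code == "-":  # hyphen: attribute not applicable
            continue
        features[name] = values[code]
    return features


print(decode_msd("Rgs"))
```

For Rgs this yields the feature structure shown in the example bullet above. An unknown code raises a KeyError, which is a crude but useful check of a tag against the specification.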


Example: common specifications
<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
 <head>Common specifications for Particle</head>
 <row role="type">
  <cell role="position">0</cell>
  <cell role="name">CATEGORY</cell>
  <cell role="value">Particle</cell>
  <cell role="code">Q</cell>
  <cell role="lang">ro</cell>
  <cell role="lang">sl</cell>
  ...
 </row>
 <row role="attribute">
  <cell role="position">1</cell>
  <cell role="name">Type</cell>
  <cell>
   <table>
    <row role="value">
     <cell role="name">negative</cell>
     <cell role="code">z</cell>
     <cell role="lang">ro</cell>
    </row>
    <row role="value">
     <cell role="name">interrogative</cell>
     <cell role="code">q</cell>
     <cell role="lang">bg</cell>
     <cell role="lang">hr</cell>
     ....


Language particular specifications
<div type="section" select="sl" xml:id="msd.Q-sl">
 <head>Slovene Particle</head>
 <table n="msd.cat" select="sl" xml:id="msd.cat.Q-sl">
  <head>Slovene Specification for Particle</head>
  <row role="type">
   <cell role="position">0</cell>
   <cell role="name" xml:lang="sl">besedna_vrsta</cell>
   <cell role="value" xml:lang="sl">členek</cell>
   <cell role="code" xml:lang="sl">L</cell>
   <cell role="name" xml:lang="en">CATEGORY</cell>
   <cell role="value" xml:lang="en">Particle</cell>
   <cell role="code" xml:lang="en">Q</cell>
  </row>
 </table>
 <p xml:lang="sl">Opombe:
  <list>
   <item>kot členki so označene le pojavnice, ki so navedene v leksikonu</item>
  </list>
 </p>
 <divGen xml:id="msd.Q-sl.lexicon" type="msd.lex" select="sl"/>
</div>

MTEsl = JOS


Encoding
• TEI provides needed elements, also for commentary, bibliography, ...
• TEI XSLT used to render as HTML
• Tables retained from MULTEXT
• Several XSLT scripts for MSD conversions, e.g. to collating sequence, to fvLib and fsLib
• Interesting challenge: conversion to ISOcat (Adam P. for Polish tagset), OWL
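As an illustration of what a conversion to TEI feature structures might produce, the Python sketch below builds an fs element with one f child per feature and a symbol child for each value. The actual conversions are done with XSLT, and their exact output format is not shown on the slides, so the layout here is an assumption. The function name features_to_fs is hypothetical. The fs, f and symbol elements themselves are standard TEI P5.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this key as the attribute xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def features_to_fs(msd, features):
    """Build a TEI-style <fs> element, identified by the MSD, from a feature dict."""
    fs = ET.Element("fs", {XML_ID: msd})
    for name, value in features.items():
        f = ET.SubElement(fs, "f", {"name": name})
        ET.SubElement(f, "symbol", {"value": value})
    return fs


fs = features_to_fs(
    "Rgs", {"Category": "Adverb", "Type": "general", "Degree": "superlative"}
)
print(ET.tostring(fs, encoding="unicode"))
```

Using the MSD as the identifier means a corpus token whose ana attribute points to it can be resolved to the full feature structure.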


MTE specifications in OWL (by Christian Chiarcos)


Morals, 1
• TEI good for in-place markup of richly annotated resources with varied structure:
  • Readable
  • Updatable (validation)
• Not good for huge datasets with shallow annotation:
  • Processable
  • Read only

→ useful for (small, medium size) gold standard hand-corrected language resources / "new" languages → localisation /


IMPACT @ JSI
• EU IP "Improving Access to Text"
• Make better OCR and IR for historical texts
• JSI: Developing a lemmatisation (+ modernisation) module for XIX century Slovene
• Background: Lexicon, Tagging and Lemmatisation for modern Slovene + FSA rewrite patterns
• Current dataset: AHLib (~100 books)
• AHLib marked up in TEI


AHLib Digital Library


IMPACT Lexicon


Mark-up challenges
• Text-critical apparatus vs. linguistic annotation
• "Parallel" corpora of transcriptions and modernisations
• Layered linguistic annotations: tokenisation, tagsets
• Lexicon (+dictionary) encoding


Morals, 2
• Text-critical editions use TEI anyway
• Ditto for DLs of historical texts
• HLT increasingly applied also to such texts
• TEI provides a good basis to join the two views


Current EU Projects: FlareNet
• Fostering Language Resources Network (2008-11)
• WG4 - Harmonisation of Formats and Standards
• D4.1 Identification of problems in the use of LR standards and of standardisation needs (M12):
  • "For academic purposes the TEI Guidelines (current version P5) has been a well established and widely used resource of LR-specific standards mainly for corpus analysis, markup and annotation. But TEI is hardly known in industrial communities (with a few exceptions) and completely foreign to professional groups such as localizers and translators. We see great potential in using TEI Guidelines in industrial contexts." /underlined by T.E./
• D4.2 Proposal of a European Language Resource Standards Framework (M24 / 2010-09-01)


Research Infrastructures for the Humanities
• DG Research funded RIs; pilot phase, 2008-2010
• DARIAH: ask Lou...
• EU RI CLARIN: Common Language Resources and Technology Infrastructure
  • WP5 Language Resources and Technologies Overview
  • D5C-3: Interoperability & Standards: "Due to the versatile nature of TEI, most of the following chapters include details on encoding digital text by following the P5 guidelines and conversion methods."


Morals, 3
• TEI is firmly acknowledged in current work on LR encoding standardisation
• But it is not prescriptive enough and lacks modules for many types of LRs

→ Need for constrained solutions & linkages to ISO/W3C standards:
  • Cross-walks
  • Roma & Schema "namespace" catalogue to DC, LMF, MAF, ...


TEI for LR: SWOT
• Strengths: Universality, Maturity, Community, Extensibility (compare ISO)
• Weaknesses: Vagueness, Learning curve, ISO/W3C linkage
• Opportunities: HLT (Humanities Language Technologies), New languages
• Threats: Marginalisation, Technical obsolescence


Conclusions
• Frontiers: DL+HLT, Gold standard LRs
• Priority: Instantiated connections to other standards and languages
• Connection with linguistics? SIG will tell...
