TEI FOR LANGUAGE RESOURCES: A MISSED CHANCE OR A COMING OPPORTUNITY?
Tomaž Erjavec
Dept. of Knowledge Technologies
Jožef Stefan Institute
Ljubljana, Slovenia
TEI for Language Resources 2/36
Overview
1. Some history
2. Why TEI isn't used for LRs (as much as expected)
3. MULTEXT-East and other case studies
4. Conclusions
History
At its inception TEI was meant to cover CL/NLP LRs, esp. corpora:
• ACL was one of the supporting associations
• Modules for corpora, linguistic analysis, feature structures, graphs
• BNC in TEI
• At the time CL/NLP did not use SGML: a clear playing field
The age of XML and LRs
The release of XML (more or less) coincides with the beginning of the era of language resources:
1998: XML 1.0, first LREC conference
But the LRs developed since then (mostly) did not use TEI. Why?
Reason 1: (X)CES
• EAGLES Corpus Encoding Standard
• „constraining or simplifying the TEI specifications in order to ensure interoperability“ (Ide 1998)
• So, more compact and easier to apply than TEI
• Almost TEI, but not quite
• No methods for extension
Reason 2: Comp Sci attitude
• I don't care about the data format, I want to develop algorithms! (... I even hate XML...)
• If I use XML I will roll my own schema, optimal for my experiments / application (...that's what 'X' means...)
• I won't spend weeks (months, years) just getting to know TEI (...I need only 4 different elements anyway...)
Reason 3: General gripes
• Missing modules for syntactic analysis & lexical databases
• Not prescriptive / precise enough
• Elements too general
• Too book-oriented
Result
• Project-local proposals:
  • TIGER treebank format
  • CONCEDE lexical database format
  • GENIA NER format
  • ...
• Semantic Web: DC, RDF, OWL
• ISO TC 37 SC4:
  • LMF, ISOcat,
  • LAF, MAF, SynAF, ...
MyTEI
• MULTEXT-East: multilingual corpora and lexica
• Fida(PLUS): Slovene Reference Corpus
• IJS-ELAN, SVEZ-IJS: en-sl parallel corpora
• jaSlo: Japanese-Slovene L2 dictionary
• eZISS: Scholarly Digital Editions of Slovene Literature
• JRC-ACQUIS: Parallel corpus of EC laws
• SDT: Slovene Dependency Treebank
• SBL: Slovene Biographic Lexicon
• AHLib: DL/corpus of 19th century Slovene books
• JOS: Slovene gold-standard corpus for HLT
• MULTEXT-East...
MULTEXT-East
• EU project 1995-97: MULTEXT sequel
• Development of standardised language resources for Central and Eastern European languages + English hub
• Corpora, lexica, morphosyntactic specifications
• V1: 1998, 7 languages, LaTeX + CES/SGML
• V4: 2010, 16 languages, TEI P5
• http://nl.ijs.si/ME/
MULTEXT-East Version 4 by language and resource type
Why TEI for MTE?
• Because I like TEI
• Varied resources:
  • Metadata / Documentation
  • „Document“ corpus: rich annotation structure
  • Linguistically annotated „1984“ corpus
  • Sentence alignments: stand-off markup
  • Morphosyntactic specifications: book-like
Either choose several (moving target) schemas or use TEI.
Documentation
TEI Header
- v4
- v3
- v2
- v1
- eci
- ota
- soas
Annotated 1984
<text xml:id="Osl." xml:lang="sl">
  <body>
    <div type="part" xml:id="Osl.1">
      <div type="chapter" xml:id="Osl.1.2">
        <p xml:id="Osl.1.2.2">
          <s xml:id="Osl.1.2.2.1">
            <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
            <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
            <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
            <c xml:id="Osl.1.2.2.1.4">,</c> ← sorry!
            <w xml:id="Osl.1.2.2.1.5" lemma="mrzel" ana="#Agpmsnn">mrzel</w>
            <w xml:id="Osl.1.2.2.1.6" lemma="aprilski" ana="#Agpmsny">aprilski</w>
            <w xml:id="Osl.1.2.2.1.7" lemma="dan" ana="#Ncmsn">dan</w>
            <w xml:id="Osl.1.2.2.1.8" lemma="in" ana="#Cc">in</w>
            <w xml:id="Osl.1.2.2.1.9" lemma="ura" ana="#Ncfpn">ure</w>
            <w xml:id="Osl.1.2.2.1.10" lemma="biti" ana="#Va-r3p-n">so</w>
            <w xml:id="Osl.1.2.2.1.11" lemma="biti" ana="#Va-p-pf">bile</w>
            <w xml:id="Osl.1.2.2.1.12" lemma="trinajst" ana="#Mlc-pa">trinajst</w>
            <c xml:id="Osl.1.2.2.1.13">.</c>
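A minimal Python sketch of how such token markup might be read with the standard library. The sample is a shortened copy of the slide excerpt, closed here so that it parses; the function name `read_tokens` is hypothetical, and the sample is namespace-free, whereas real TEI P5 files declare the TEI namespace and would need it in the tag names.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predeclared in XML; ElementTree exposes it in this form.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened, closed copy of the slide excerpt (closing tag added for parsing).
SAMPLE = """
<s xml:id="Osl.1.2.2.1">
  <w xml:id="Osl.1.2.2.1.1" lemma="biti" ana="#Va-p-sm">Bil</w>
  <w xml:id="Osl.1.2.2.1.2" lemma="biti" ana="#Va-r3s-n">je</w>
  <w xml:id="Osl.1.2.2.1.3" lemma="jasen" ana="#Agpmsnn">jasen</w>
  <c xml:id="Osl.1.2.2.1.4">,</c>
</s>
"""


def read_tokens(xml_text):
    """Return (id, text, lemma, msd) tuples for <w> and <c> elements."""
    root = ET.fromstring(xml_text.strip())
    tokens = []
    for el in root.iter():
        if el.tag == "w":
            # ana holds a pointer such as "#Va-p-sm"; drop the leading "#".
            msd = el.get("ana", "").lstrip("#")
            tokens.append((el.get(XML_ID), el.text, el.get("lemma"), msd))
        elif el.tag == "c":
            # Punctuation carries no lemma or MSD in the excerpt.
            tokens.append((el.get(XML_ID), el.text, None, None))
    return tokens


if __name__ == "__main__":
    for token in read_tokens(SAMPLE):
        print(token)
```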
Whitespace
• A long time ago „1984“ lost its spaces
• Whitespace is brittle but important:
  • Retokenisation
  • Reading
• TEI <space> no good!
• So <mte:space> </mte:space>, 24:1?
• Sitting on the fence JOS solution: </S>
• <mte:g/>?
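To illustrate why the lost spaces matter, here is a small Python sketch of what a reader or retokeniser is left with once only tokens survive. It is a naive heuristic written for this illustration, not the MULTEXT-East or JOS procedure: it assumes every punctuation token attaches to the preceding word. The function name `naive_detokenise` and the second example are hypothetical.

```python
def naive_detokenise(tokens):
    """Rebuild text from (text, is_punctuation) pairs.

    Assumption: punctuation always attaches to the preceding token.
    This is exactly the kind of guess that explicit whitespace markup
    would make unnecessary.
    """
    pieces = []
    for text, is_punct in tokens:
        if pieces and not is_punct:
            pieces.append(" ")
        pieces.append(text)
    return "".join(pieces)


# The sentence from the annotated "1984" excerpt: the guess happens to work.
sentence = [
    ("Bil", False), ("je", False), ("jasen", False), (",", True),
    ("mrzel", False), ("aprilski", False), ("dan", False), ("in", False),
    ("ure", False), ("so", False), ("bile", False), ("trinajst", False),
    (".", True),
]

# A made-up counter-example: an opening quote is glued to the wrong side.
quoted = [("said", False), ('"', True), ("yes", False), ('"', True)]

if __name__ == "__main__":
    print(naive_detokenise(sentence))
    print(naive_detokenise(quoted))
```

The second call yields `said" yes"` rather than `said "yes"`, which is why the original spacing cannot be recovered reliably from the tokens alone.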
Sentence alignments
In MTE V3:
<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE cesAlign SYSTEM "xcesAlign.dtd">
<cesAlign version="4.1">
  <linkList id="Oruen">
    <linkGrp type="body" targType="s" domains="Oru Oen">
      <link xtargets="Oru.1.1.1.1 ; Oen.1.1.1.1"/>
      <link xtargets="Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6"/>
      <link xtargets="Oru.1.3.4.1 ; Oen.1.3.4.1 Oen.1.3.4.2"/>
      <link xtargets=" ; Oen.1.3.4.3"/>
TEI P5 Alignments
• The TEI way uses two levels of indirection: first grouping, then alignment
• Too complicated, esp. as 98% of alignments are 1-1
• Chose a fence-sitting one-level solution:
<linkGrp type="alignment" corresp="oana-mk.xml oana-sl.xml">
  <link n="1:1" targets="oana-mk.xml#Omk.1.1.1.1 oana-sl.xml#Osl.1.2.2.1"/>
  <link n="2:1" targets="oana-mk.xml#Omk.1.1.2.6 oana-mk.xml#Omk.1.1.2.7 oana-sl.xml#Osl.1.2.3.6"/>
  <link n="1:2" targets="oana-mk.xml#Omk.1.1.2.8 oana-sl.xml#Osl.1.2.3.7 oana-sl.xml#Osl.1.2.3.8"/>
  <!--link n="0:1" targets="oana-sl.xml#Osl.4.12.2"/-->
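The step from the V3 `xtargets` notation to the one-level TEI P5 links can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the conversion actually used for MULTEXT-East: the function name `ces_to_tei_link` is hypothetical, and the file names `oana-ru.xml` and `oana-en.xml` are only guessed by analogy with `oana-mk.xml` and `oana-sl.xml` on the slide.

```python
def ces_to_tei_link(xtargets, src_file, tgt_file):
    """Convert one CES 'xtargets' value to TEI-style (n, targets).

    xtargets has the form "srcId srcId ... ; tgtId tgtId ...", where
    either side may be empty. The result mirrors the slide's one-level
    encoding: n gives the alignment type (e.g. "2:1") and targets lists
    file#id pointers, source side first.
    """
    src_part, tgt_part = xtargets.split(";")
    src_ids = src_part.split()
    tgt_ids = tgt_part.split()
    n = f"{len(src_ids)}:{len(tgt_ids)}"
    pointers = [f"{src_file}#{i}" for i in src_ids]
    pointers += [f"{tgt_file}#{i}" for i in tgt_ids]
    return n, " ".join(pointers)


if __name__ == "__main__":
    # File names are assumptions made for this example.
    for value in (
        "Oru.1.1.1.1 ; Oen.1.1.1.1",
        "Oru.1.1.16.6 Oru.1.1.16.7 ; Oen.1.1.15.6",
        " ; Oen.1.3.4.3",
    ):
        n, targets = ces_to_tei_link(value, "oana-ru.xml", "oana-en.xml")
        print(f'<link n="{n}" targets="{targets}"/>')
```

Zero-to-one links come out with a single pointer; the commented-out `0:1` link on the slide suggests such cases were treated specially, which this sketch does not attempt.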
Morphosyntactic specifications
• Define categories (PoS) and their features
• Map feature-structures to morphosyntactic descriptions (MSD tagsets)
• Specify which languages have which features and tagsets
• E.g. [Category=Adverb Type=general Degree=superlative] ≡ Rgs Tagset ∈ sl
• Complex morphology → complex specifications
• MSD tagsets are grounded in lexicon and corpus
Example: common specifications
<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
  <head>Common specifications for Particle</head>
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
    ...
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>
          ....
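Because the specifications are ordinary tables with `role` attributes, they can be read mechanically. The following Python sketch shows one way this might be done; it is not the XSLT used in the project. The sample repeats only what is visible on the slide, closed so that it parses, so the language lists are incomplete where the slide elides them. The names `cell_text` and `read_spec` are hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Closed copy of the slide excerpt; elided cells are simply absent.
SAMPLE = """
<table n="msd.cat" xml:lang="en" xml:id="msd.cat.Q">
  <head>Common specifications for Particle</head>
  <row role="type">
    <cell role="position">0</cell>
    <cell role="name">CATEGORY</cell>
    <cell role="value">Particle</cell>
    <cell role="code">Q</cell>
    <cell role="lang">ro</cell>
    <cell role="lang">sl</cell>
  </row>
  <row role="attribute">
    <cell role="position">1</cell>
    <cell role="name">Type</cell>
    <cell>
      <table>
        <row role="value">
          <cell role="name">negative</cell>
          <cell role="code">z</cell>
          <cell role="lang">ro</cell>
        </row>
        <row role="value">
          <cell role="name">interrogative</cell>
          <cell role="code">q</cell>
          <cell role="lang">bg</cell>
          <cell role="lang">hr</cell>
        </row>
      </table>
    </cell>
  </row>
</table>
"""


def cell_text(row, role):
    """Texts of the direct <cell> children of a row with the given role."""
    return [c.text for c in row.findall("cell") if c.get("role") == role]


def read_spec(xml_text):
    """Turn one category table into a nested dictionary."""
    table = ET.fromstring(xml_text.strip())
    spec = {"attributes": {}}
    for row in table.findall("row"):
        if row.get("role") == "type":
            spec["category"] = cell_text(row, "value")[0]
            spec["code"] = cell_text(row, "code")[0]
            spec["langs"] = cell_text(row, "lang")
        elif row.get("role") == "attribute":
            name = cell_text(row, "name")[0]
            values = {}
            # Attribute values sit in a table nested inside an unlabelled cell.
            for vrow in row.findall("cell/table/row"):
                values[cell_text(vrow, "code")[0]] = {
                    "name": cell_text(vrow, "name")[0],
                    "langs": cell_text(vrow, "lang"),
                }
            spec["attributes"][name] = values
    return spec


if __name__ == "__main__":
    print(read_spec(SAMPLE))
```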
Language-particular specifications
<div type="section" select="sl" xml:id="msd.Q-sl">
  <head>Slovene Particle</head>
  <table n="msd.cat" select="sl" xml:id="msd.cat.Q-sl">
    <head>Slovene Specification for Particle</head>
    <row role="type">
      <cell role="position">0</cell>
      <cell role="name" xml:lang="sl">besedna_vrsta</cell>
      <cell role="value" xml:lang="sl">členek</cell>
      <cell role="code" xml:lang="sl">L</cell>
      <cell role="name" xml:lang="en">CATEGORY</cell>
      <cell role="value" xml:lang="en">Particle</cell>
      <cell role="code" xml:lang="en">Q</cell>
    </row>
  </table>
  <p xml:lang="sl">Opombe:
    <list>
      <item>kot členki so označene le pojavnice, ki so navedene v leksikonu</item>
    </list>
  </p>
  <divGen xml:id="msd.Q-sl.lexicon" type="msd.lex" select="sl"/>
</div>
MTEsl = JOS
Encoding
• TEI provides the needed elements, also for commentary, bibliography, ...
• TEI XSLT used to render as HTML
• Tables retained from MULTEXT
• Several XSLT scripts for MSD conversions, e.g. to collating sequence, to fvLib and fsLib
• Interesting challenge: conversion to ISOcat (Adam P. for Polish tagset), OWL
MTE specifications in OWL (by Christian Chiarcos)
Morals, 1
• TEI good for in-place markup of richly annotated resources with varied structure:
  • Readable
  • Updatable (validation)
• Not good for huge datasets with shallow annotation:
  • Processable
  • Read only
→ useful for (small, medium-size) gold-standard hand-corrected language resources / „new“ languages → localisation /
IMPACT @ JSI
• EU IP „Improving Access to Text“
• Make better OCR and IR for historical texts
• JSI: Developing a lemmatisation (+ modernisation) module for XIX century Slovene
• Background: Lexicon, Tagging and Lemmatisation for modern Slovene + FSA rewrite patterns
• Current dataset: AHLib (~100 books)
• AHLib marked up in TEI
AHLib Digital Library
IMPACT Lexicon
Mark-up challenges
• Text-critical apparatus vs. linguistic annotation
• „Parallel“ corpora of transcriptions and modernisations
• Layered linguistic annotations: tokenisation, tagsets
• Lexicon (+dictionary) encoding
Morals, 2
• Text-critical editions use TEI anyway
• Ditto for DLs of historical texts
• HLT increasingly applied also to such texts
• TEI provides a good basis to join the two views
Current EU Projects: FlareNet
• Fostering Language Resources Network (2008-11)
• WG4 - Harmonisation of Formats and Standards
• D4.1 Identification of problems in the use of LR standards and of standardisation needs (M12):
  • „For academic purposes the TEI Guidelines (current version P5) has been a well established and widely used resource of LR-specific standards mainly for corpus analysis, markup and annotation. But TEI is hardly known in industrial communities (with a few exceptions) and completely foreign to professional groups such as localizers and translators. We see great potential in using TEI Guidelines in industrial contexts.“ /underlined by T.E./
• D4.2 Proposal of a European Language Resource Standards Framework (M24 / 2010-09-01)
Research Infrastructures for the Humanities
• DG Research funded RIs; pilot phase, 2008-2010
• DARIAH: ask Lou...
• EU RI CLARIN: Common Language Resources and Technology Infrastructure
  • WP5 Language Resources and Technologies Overview
  • D5C-3: Interoperability & Standards: „Due to the versatile nature of TEI, most of the following chapters include details on encoding digital text by following the P5 guidelines and conversion methods.“
Morals, 3
• TEI is firmly acknowledged in current work on LR encoding standardisation
• But it is not prescriptive enough and lacks modules for many types of LRs
→ Need for constrained solutions & linkages to ISO/W3C standards:
  • Cross-walks
  • Roma & Schema „namespace“ catalogue to DC, LMF, MAF, ...
TEI for LR: SWOT
• Strengths: Universality, Maturity, Community, Extensibility (compare ISO)
• Weaknesses: Vagueness, Learning curve, ISO/W3C linkage
• Opportunities: HLT (Humanities Language Technologies), New languages
• Threats: Marginalisation, Technical obsolescence
Conclusions
• Frontiers: DL+HLT, Gold standard LRs
• Priority: Instantiated connections to other standards and languages
• Connection with linguistics? SIG will tell...