ISO TC 37 / SC4: Language Resources
An overview
(Amended 2-5 February 2002)
Laurent Romary
Standards for language processing
Primary resources (text, dialogues)
  Structural mark-up
  Basic annotations
  [TEI, MPEG7, TMX (XHTML…), etc.]
NLP structures (annotations)
  POS tagging
  Chunks (cf. Named Entities)
  Deep syntactic structures
  Co-references etc.
  [Eagles/ISLE, CES, MATE, …]
Knowledge structures
  Hierarchies of types
  Relations between concepts (subjects/topics etc.)
  Links to primary resources
  [Topic Maps, OIL, RDF]
Lexical structures (language models)
  Terminologies
  Transfer lexica
  LTAG/HPSG/LFG lexica
  [TBX, OLIF, Eagles/ISLE (Genelex)]
Links
Meta-data [Dublin core, OLAC, ISLE, MPEG7, RDF]
Access protocols [Corba, SOAP]
Context
ISO TC37 - Terminology and other language resources
  SC3 - Computer applications in terminology
    ISO 12200 - Martif
      Latest version of TEI Terminology chapter
    ISO 12620 - Data categories
    ISO CD (DIS: under ballot) 16642 - TMF (Terminological Markup Framework)
  SC4 - Language resources
TC37/SC4 details
Scope:
  Platform for designing and implementing linguistic resource formats and processes
  Multi-layer annotation of linguistic resources
  Exchange of information between NLP modules
General strategy
  Involve a wide community from academia and industry
  Identification of experts in the various work items
  Involvement through national standardizing bodies
Agenda
  Current: identification of possible work items and working groups
  Constituency meeting and technical workshop at LREC (May 2002)
Organization
Secretary: Prof. Key-Sun Choi, Korea
Chair: Laurent Romary, France
International Advisory Committee
  Permanent Chair: Prof. Antonio Zampolli, Italy
SC4 and other standardizing bodies
W3C - basic protocols and formats
  XML (Schemas)
  XPath
  XPointer
  + RDF, SVG, SMIL, SOAP
MPEG - Multimedia, XML based
  e.g. MPEG7-4
  Word and phone lattices
ISO TC37/SC4 - language resources, NLP perspective
  e.g. linguistic annotations, lexical formats
TEI - text representation
  Reference for primary sources
  e.g.: text archives
Text
Audio/Speech
Technical background
What about gestures?
  Kinetic in the TEI
  SMIL?
Oscar
Contributing organizations
Working groups
WG1: Basic descriptors and mechanisms for language resources Convener: Laurent Romary
WG2: Representation schemes Convener: Kiyong Lee
WG3: Multilingual text representation Convener: Alan K. Melby
WG4: Lexical databases Convener: ??
WG5: Workflow of Language Resource Management Convener: Christian Galinski
TC37/SC4 Work Items
WG1/WI-0: Terminology of Language Resources
WG1/WI-1: Linguistic annotation framework
WG1/WI-2: Meta-data for multimodal and multilingual information
WG2/WI-3: Structural content representation scheme
WG2/WI-4: Multimodal content representation scheme
WG2/WI-5: Discourse level representation scheme
TC37/SC4 Work Items - cont.
WG3/WI-6a: Translation Memory, Alignment of parallel corpora
WG3/WI-6b: Segmentation and counting algorithms (characters, words, sentences etc.)
WG3/WI-6c: Meta-markup for GIL (Globalization, Internationalization and Localization)
WG4/WI-7: NLP Lexica
WG5/WI-8: Validation of language resources
WG5/WI-9: Net-based distributed cooperative work for the creation of LRs
WI-0
Terminology of Language Resources
Basic terminology of the various sub-fields of language resources and general methodology
Project leader: Klaus-Dirk Schmitz
Sources:
  ISO 1087
  LREC proceedings + KAIST English dictionaries in Linguistics?
Support from GTW
WI-1
Linguistic annotation framework
Basic mechanisms and data structures for linguistic annotation and representation [data architecture]
Methods and principles for the design of an annotation scheme
  Structural nodes and information units, Data category specification
  Linking and pointing mechanisms, Feature Structures, Meta-Markup
  « Stand-off » and « in-line » views - equivalences, combining levels.
  Administrative data categories
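The equivalence between « in-line » and « stand-off » views can be illustrated with a minimal Python sketch. It is not a format proposed by the work item: the element names `s` and `w`, the `pos` attribute and the character-offset representation are all assumptions chosen for illustration. The in-line view embeds annotations in the text, while the stand-off view keeps the primary text untouched and points into it by offsets.

```python
import xml.etree.ElementTree as ET


def inline_to_standoff(inline_xml):
    """Turn an in-line annotated sentence into (primary text, stand-off annotations)."""
    root = ET.fromstring(inline_xml)
    parts = []
    annotations = []
    offset = 0

    def emit(chunk):
        # Append a piece of primary text and advance the character offset.
        nonlocal offset
        if chunk:
            parts.append(chunk)
            offset += len(chunk)

    emit(root.text)
    for w in root:
        start = offset
        emit(w.text)
        # The annotation points at the text by offsets instead of wrapping it.
        annotations.append({"start": start, "end": offset, **w.attrib})
        emit(w.tail)
    return "".join(parts), annotations


def standoff_to_inline(text, annotations):
    """Rebuild the in-line view from primary text plus non-overlapping annotations."""
    root = ET.Element("s")
    pos = 0
    last = None
    for ann in sorted(annotations, key=lambda a: a["start"]):
        gap = text[pos:ann["start"]]
        if last is None:
            root.text = gap
        else:
            last.tail = gap
        attrs = {k: v for k, v in ann.items() if k not in ("start", "end")}
        last = ET.SubElement(root, "w", attrs)
        last.text = text[ann["start"]:ann["end"]]
        pos = ann["end"]
    if last is None:
        root.text = text[pos:]
    else:
        last.tail = text[pos:]
    return ET.tostring(root, encoding="unicode")


inline = '<s><w pos="DET">The</w> <w pos="NOUN">cat</w> <w pos="VERB">sleeps</w>.</s>'
primary_text, standoff = inline_to_standoff(inline)
```

Converting to stand-off and back returns the same in-line string for this flat, non-overlapping example; overlapping or nested annotation levels would need a richer linking mechanism than this sketch provides.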
WI-1 - cont.
Project leader: Nancy Ide (TBC)
Contributors: Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary…
Possible sources:
  TMF, iso12620-revised, Mate (general methodology)
  TEI (Linking mechanisms, feature structures)
  Link with Linguistic DS
WI-2
Meta-data for multimodal and multilingual information
Description of a meta-data representation scheme to document linguistic information structures and processes
  General content description
  Local content description
Project leader: Peter Wittenburg, MPI (Nijmegen, NL)
Participants: Steven Bird, a TEI-aware person
Possible sources: OLAC, Mile, TEI Header
Liaison: TC46 (SC9), MPEG7/MDS, SCORM
WI-3
Structural content representation scheme
Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes
  Meta-model for morpho-syntactic annotation
  Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures)
  + corresponding Data category registries
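One way to picture how a data category registry could constrain morpho-syntactic annotations is the Python sketch below. The registry contents (`partOfSpeech`, `number` and their values) are hypothetical examples, not categories defined by the work item, and the flat dictionary is only a stand-in for a real registry.

```python
# Hypothetical registry: data category name -> set of permitted values.
REGISTRY = {
    "partOfSpeech": {"noun", "verb", "determiner"},
    "number": {"singular", "plural"},
}


def validate(annotation, registry=REGISTRY):
    """Return a list of problems found in one annotation (empty if it conforms)."""
    errors = []
    for category, value in annotation.items():
        if category not in registry:
            errors.append(f"unknown data category: {category}")
        elif value not in registry[category]:
            errors.append(f"invalid value {value!r} for {category}")
    return errors
```

For instance, `validate({"partOfSpeech": "noun", "number": "plural"})` returns an empty list, while an annotation using a category absent from the registry is reported.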
WI-3 - cont.
Project leader: John Carroll ??
Participants: Nuria Bell, … representatives from existing TreeBanks initiatives
Possible sources:
  Eagles, TAGML, Linguistic DS
  SIGPARSE
WI-4
Multimodal meaning representation scheme
Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural)
  Meta-model for content representation (Events, participants, etc.)
  Data category registry for multimodal content
Project leader: Harry Bunt (id=“1”)
Possible sources:
  SIGSEM working group on semantic content
    Chair: #1
  « Liaison » Semantic web activities
WI-5
Discourse level representation scheme
  Meta-model for discourse and dialogue representation
  Meta-model for discourse level annotation (e.g. reference annotation)
  + corresponding Data category registry
Possible sources:
  SIGDIAL
  DRI - Discourse Resource Initiative
  Mate
WI-6a
Translation Memory, Alignment of parallel corpora
Provides formats for the representation of multilingual textual data as produced in translation activities or constructed from existing primary sources
Sources:
  OSCAR/TMX for translation memories
  TEI based linking mechanism (or see WI-1) for Parallel texts
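As a rough illustration of the kind of data such formats carry, the Python sketch below pairs aligned segments into translation units and performs an exact-match lookup. It does not reproduce TMX or the TEI linking mechanism; the class `TranslationUnit`, the functions `align_segments` and `lookup`, and the one-to-one alignment assumption are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationUnit:
    """One aligned unit: the same segment in several languages."""
    unit_id: str
    variants: dict = field(default_factory=dict)  # language code -> segment text


def align_segments(source, target, src_lang, tgt_lang):
    """Pair segments position by position (assumes a one-to-one alignment)."""
    if len(source) != len(target):
        raise ValueError("one-to-one alignment needs equal segment counts")
    return [
        TranslationUnit(f"tu{i}", {src_lang: s, tgt_lang: t})
        for i, (s, t) in enumerate(zip(source, target), start=1)
    ]


def lookup(memory, text, src_lang, tgt_lang):
    """Exact-match lookup of a source segment; returns None when nothing matches."""
    for unit in memory:
        if unit.variants.get(src_lang) == text:
            return unit.variants.get(tgt_lang)
    return None
```

Real parallel corpora also contain one-to-many and many-to-one alignments, which is one reason a linking mechanism between segments is more general than the positional pairing used here.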
WI-6b
Segmentation and counting algorithms (characters, words, sentences etc.)
Provide methods for segmenting streams of text with markup and means for counting the corresponding segments
Possible sources: OSCAR
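A deliberately naive Python sketch of segmenting marked-up text and counting the resulting segments follows. The regular expressions are simplifying assumptions (tags are anything between angle brackets, sentences end at `.`, `!` or `?` followed by whitespace) and are not the OSCAR algorithms; abbreviations, entities and languages without spaces would all need more careful rules.

```python
import re

TAG = re.compile(r"<[^>]+>")                # assumed markup: anything in angle brackets
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")  # whitespace following end punctuation
WORD = re.compile(r"\w+(?:[-']\w+)*")        # letters/digits, with inner hyphen or apostrophe


def strip_markup(marked_up):
    """Remove tags so that only the text stream is segmented and counted."""
    return TAG.sub("", marked_up)


def segment_sentences(text):
    """Split text into sentence segments at end punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def count_segments(marked_up):
    """Count characters, words and sentences in the text content."""
    text = strip_markup(marked_up)
    return {
        "characters": len(text),
        "words": len(WORD.findall(text)),
        "sentences": len(segment_sentences(text)),
    }
```

Applied to `"<p>Standards help. <b>Do they?</b> Yes!</p>"`, the markup is ignored and the counts cover only the text between the tags.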
WI-6c
Meta-markup for GIL (Globalization, Internationalization and Localization)
Identification of the specific markup modules needed to perform GIL activities
Possible sources: OSCAR/OpenTag
WI-7
NLP lexica
Lexicon representation formats for the various types of NLP applications (Machine Readable Lexica)
  Define a set of meta-models (classes of applications)
  Specific data categories (derivation, phonology, etc.)
  Based on the work done in other work items
Possible sources:
  Eagles
  Multext
  ISLE Computational lexicon Working group
  OLIF
WI-8
Validation of language resources
Defines guidelines and requirements for producing and distributing high quality language resources
Contacts: ELRA, TEI
Possible sources: To be defined
WI-9
Net-based distributed cooperative work for the creation of LRs
Principles and methods for designing collaborative and cooperative compilation of LRs
Define what is specific to LRs with regard to:
  Traceability of resources, version control, validation, quality management
  Protocols (Corba, SOAP), Workflow standards, Data management
Contacts: Christian Galinski, Remi Zajac, …
Sources: To be defined
Liaison - OSCAR (AKM)
Brief history of LR exchange standards
Parallel events since 1997
Open Tag - meta-markup (XML vs. Others)
Major current OSCAR activities
  TMX - Translation Memory eXchange
  Counting and segmentation algorithms
  TBX (Terminologies) and OLIF (MT lexica)
  XLIFF and CGS - Annotation of source code and localisation of web sites
  xml:lang etc.: J. DeCamp and S.-E. Wright
Liaison - TEI (LR)
General architecture and data modeling: WI-1
Annotations (paragraph level, external annotations): WI-1
TEI Header: WI-2
NLP lexica (with regard to terminologies and dictionaries): WI-7
Feature structures: WI-1