
ISO TC 37 / SC4 Language Resources

An overview

(Amended 2-5 February 2002)

Laurent Romary

Standards for language processing

Primary resources (text, dialogues)
• Structural mark-up
• Basic annotations
• [TEI, MPEG7, TMX (XHTML…), etc.]

NLP structures (annotations)
• POS tagging
• Chunks (cf. Named Entities)
• Deep syntactic structures
• Co-references etc.
• [Eagles/ISLE, CES, MATE, …]

Knowledge structures
• Hierarchies of types
• Relations between concepts (subjects/topics etc.)
• Links to primary resources
• [Topic Maps, OIL, RDF]

Lexical structures (Language models)
• Terminologies
• Transfer lexica
• LTAG/HPSG/LFG lexica
• [TBX, OLIF, Eagles/ISLE (Genelex)]

Links

Meta-data
• [Dublin Core, OLAC, ISLE, MPEG7, RDF]

Access protocols
• [Corba, SOAP]

Context

ISO TC37 - Terminology and other language resources
• SC3 - Computer applications in terminology
  • ISO 12200 - MARTIF (latest version of the TEI Terminology chapter)
  • ISO 12620 - Data categories
  • ISO CD (DIS: under ballot) 16642 - TMF (Terminological Markup Framework)
• SC4 - Language resources

TC37/SC4 details

Scope:
• Platform for designing and implementing linguistic resource formats and processes
• Multi-layer annotation of linguistic resources
• Exchange of information between NLP modules

General strategy
• Involve a wide community from academia and industry
• Identification of experts in the various work items
• Involvement through national standardizing bodies

Agenda
• Current: identification of possible work items and working groups
• Constituency meeting and technical workshop at LREC (May 2002)

Organization

Secretary: Prof. Key-Sun Choi, Korea

Chair: Laurent Romary, France

International Advisory Committee Permanent Chair: Prof. Antonio Zampolli, Italy

--------------------

SC4 and other standardizing bodies

• W3C - basic protocols and formats: XML (Schemas), XPath, XPointer + RDF, SVG, SMIL, SOAP
• MPEG - multimedia, XML based, e.g. MPEG7-4; word and phone lattices
• ISO TC37/SC4 - language resources, NLP perspective, e.g. linguistic annotations, lexical formats
• TEI - text representation; reference for primary sources, e.g. text archives

Text

Audio/Speech

Technical background

What about gestures?
• Kinetic in the TEI
• SMIL?

Oscar

Contributing organizations

Working groups

WG1: Basic descriptors and mechanisms for language resources Convener: Laurent Romary

WG2: Representation schemes Convener: Kiyong Lee

WG3: Multilingual text representation Convener: Alan K. Melby

WG4: Lexical databases Convener: ??

WG5: Workflow of language Resource Management Convener: Christian Galinski

TC37/SC4 Work Items

WG1/WI-0: Terminology of Language Resources
WG1/WI-1: Linguistic annotation framework
WG1/WI-2: Meta-data for multimodal and multilingual information

WG2/WI-3: Structural content representation scheme
WG2/WI-4: Multimodal content representation scheme
WG2/WI-5: Discourse level representation scheme

TC37/SC4 Work Items - cont.

WG3/WI-6a: Translation Memory, Alignment of parallel corpora
WG3/WI-6b: Segmentation and counting algorithms (characters, words, sentences etc.)
WG3/WI-6c: Meta-markup for GIL (Globalization, Internationalization and Localization)

WG4/WI-7: NLP Lexica

WG5/WI-8: Validation of language resources
WG5/WI-9: Net-based distributed cooperative work for the creation of LRs

WI-0

Terminology of Language Resources
• Basic terminology of the various sub-fields of language resources and general methodology
• Project leader: Klaus-Dirk Schmitz
• Sources:
  • ISO 1087
  • LREC proceedings + KAIST
  • English dictionaries in Linguistics?
• Support from GTW

WI-1

Linguistic annotation framework
• Basic mechanisms and data structures for linguistic annotation and representation [data architecture]
• Methods and principles for the design of an annotation scheme
• Structural nodes and information units, Data category specification
• Linking and pointing mechanisms, Feature Structures, Meta-Markup
• "Stand-off" and "in-line" views - equivalences, combining levels
• Administrative data categories
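The equivalence between "in-line" and "stand-off" views can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `<w pos="...">` tag, the offset-based annotation records and both function names are hypothetical and are not taken from any SC4, TEI or TMF specification.

```python
import re

# Hypothetical in-line annotation: each token is wrapped in a <w pos="..."> tag.
INLINE = '<w pos="DET">The</w> <w pos="NOUN">cat</w> <w pos="VERB">sleeps</w>'

TAG = re.compile(r'<w pos="([^"]+)">([^<]*)</w>')


def inline_to_standoff(inline):
    """Split an in-line string into plain text plus offset-based annotations."""
    text_parts = []
    annotations = []
    pos = 0      # read position in the in-line string
    offset = 0   # write position in the plain text
    for m in TAG.finditer(inline):
        gap = inline[pos:m.start()]   # untagged text between two tags
        text_parts.append(gap)
        offset += len(gap)
        token = m.group(2)
        annotations.append(
            {"start": offset, "end": offset + len(token), "pos": m.group(1)}
        )
        text_parts.append(token)
        offset += len(token)
        pos = m.end()
    text_parts.append(inline[pos:])
    return "".join(text_parts), annotations


def standoff_to_inline(text, annotations):
    """Rebuild the in-line view from plain text and non-overlapping annotations."""
    out = []
    pos = 0
    for a in sorted(annotations, key=lambda a: a["start"]):
        out.append(text[pos:a["start"]])
        out.append('<w pos="%s">%s</w>' % (a["pos"], text[a["start"]:a["end"]]))
        pos = a["end"]
    out.append(text[pos:])
    return "".join(out)


text, annotations = inline_to_standoff(INLINE)
print(text)
print(annotations)
print(standoff_to_inline(text, annotations) == INLINE)
```

In this toy case the two views carry the same information, since the round trip returns the original string. The stand-off view leaves the primary text untouched, which is what makes it possible to combine several annotation levels over the same source.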

WI-1 - cont.

• Project leader: Nancy Ide (TBC)
• Contributors: Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary…
• Possible sources:
  • TMF, ISO 12620 (revised), Mate (general methodology)
  • TEI (linking mechanisms, feature structures)
  • Link with Linguistic DS

WI-2

Meta-data for multimodal and multilingual information
• Description of a meta-data representation scheme to document linguistic information structures and processes
  • General content description
  • Local content description
• Project leader: Peter Wittenburg, MPI (Nijmegen, NL)
• Participants: Steven Bird, a TEI-aware person
• Possible sources: OLAC, Mile, TEI Header
• Liaison: TC46 (SC9), MPEG7/MDS, SCORM

WI-3

Structural content representation scheme
• Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes
• Meta-model for morpho-syntactic annotation
• Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures)
• + corresponding Data category registries

WI-3 - cont.

• Project leader: John Carroll ??
• Participants: Nuria Bell, … representatives from existing TreeBank initiatives
• Possible sources:
  • Eagles, TAGML, Linguistic DS
  • SIGPARSE

WI-4

Multimodal meaning representation scheme
• Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural)
• Meta-model for content representation (events, participants, etc.)
• Data category registry for multimodal content
• Project leader: Harry Bunt (id=“1”)
• Possible sources:
  • SIGSEM working group on semantic content (Chair: #1)
• Liaison: Semantic web activities

WI-5

Discourse level representation scheme
• Meta-model for discourse and dialogue representation
• Meta-model for discourse level annotation (e.g. reference annotation)
• + corresponding Data category registry
• Possible sources:
  • SIGDIAL
  • DRI - Discourse Resource Initiative
  • Mate

WI 6a

Translation Memory, Alignment of parallel corpora
• Provides formats for the representation of multilingual textual data as produced in translation activities or constructed from existing primary sources
• Sources:
  • OSCAR/TMX for translation memories
  • TEI based linking mechanism (or see WI-1) for parallel texts
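As an illustration of what a translation memory format has to represent, the sketch below builds a small TMX-style document from aligned sentence pairs. It is a simplified example, not a conformant TMX file: a real TMX header carries further attributes that are omitted here, and the function name `build_tmx` and the sample pair are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Build a simplified TMX-style tree: one translation unit per aligned pair."""
    tmx = ET.Element("tmx", version="1.4")
    # Simplified header: only a few of the attributes a real TMX header has.
    ET.SubElement(tmx, "header", srclang=src_lang,
                  segtype="sentence", datatype="plaintext")
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")            # translation unit
        for lang, seg_text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})  # one language variant
            ET.SubElement(tuv, "seg").text = seg_text
    return tmx


pairs = [("Hello world.", "Bonjour le monde.")]
tmx = build_tmx(pairs, "en", "fr")
xml_text = ET.tostring(tmx, encoding="unicode")
print(xml_text)
```

Each translation unit groups the variants of one segment, so an alignment of parallel corpora reduces to a list of such units.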

WI 6b

Segmentation and counting algorithms (characters, words, sentences etc.)
• Provide methods for segmenting streams of text with markup and means for counting the corresponding segments
• Possible sources: OSCAR
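A naive sketch of segmentation and counting over marked-up text is given below. It is only meant to show why agreed rules are needed: the tag-stripping regular expression, the sentence boundary rule and the definition of a word are all assumptions of this example, not the OSCAR algorithms.

```python
import re

MARKUP = re.compile(r"<[^>]+>")               # assumed: tags never contain ">"
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")   # assumed: ., ! or ? ends a sentence


def segment_and_count(marked_up):
    """Strip markup, segment into sentences, and count characters and words."""
    text = MARKUP.sub("", marked_up)
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    words = re.findall(r"\w+", text)          # assumed: a word is a run of \w
    return {
        "characters": len(text),              # spaces and punctuation included
        "words": len(words),
        "sentences": len(sentences),
        "segments": sentences,
    }


sample = "<p>Standards matter. <b>Counting</b> words is harder than it looks!</p>"
counts = segment_and_count(sample)
print(counts)
```

Changing any of the three assumptions (for instance counting characters without spaces, or treating abbreviations such as "etc." as non-final) changes the counts, which is the interoperability problem this work item addresses.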

WI 6c

Meta-markup for GIL (Globalization, Internationalization and Localization)
• Identification of the specific markup modules needed to perform GIL activities
• Possible sources: OSCAR/OpenTag

WI-7

NLP lexica
• Lexicon representation formats for the various types of NLP applications (Machine Readable Lexica)
  • Define a set of meta-models (classes of applications)
  • Specific data categories (derivation, phonology, etc.)
  • Based on the work done in other work items
• Possible sources:
  • Eagles
  • Multext
  • ISLE Computational Lexicon Working Group
  • OLIF

WI-8

Validation of language resources
• Defines guidelines and requirements for producing and distributing high quality language resources
• Contacts: ELRA, TEI
• Possible sources: to be defined

WI-9

Net-based distributed cooperative work for the creation of LRs
• Principles and methods for designing collaborative and cooperative compilation of LRs
• Define what is specific to LRs with regard to:
  • Traceability of resources, version control, validation, quality management
  • Protocols (Corba, SOAP), workflow standards, data management
• Contacts: Christian Galinski, Remi Zajac, …
• Sources: to be defined

Liaison - OSCAR (AKM)

• Brief history of LR exchange standards
  • Parallel events since 1997
  • Open Tag - meta-markup (XML vs. others)
• Major current OSCAR activities
  • TMX - Translation Memory eXchange
  • Counting and segmentation algorithms
  • TBX (Terminologies) and OLIF (MT lexica)
  • XLIFF and CGS - Annotation of source code and localisation of web sites
  • xml:lang etc.: J. DeCamp and S.-E. Wright

Liaison - TEI (LR)

• General architecture and data modeling: WI-1
• Annotations (paragraph level, external annotations): WI-1
• TEI Header: WI-2
• NLP lexica (with regard to terminologies and dictionaries): WI-7
• Feature structures: WI-1


Documents