
ISO TC 37 / SC4 Language Resources

An overview

(Amended 2-5 February 2002)

Laurent Romary


Standards for language processing

Primary resources (text, dialogues)
• Structural mark-up
• Basic annotations
[TEI, MPEG7, TMX (XHTML…), etc.]

NLP structures (annotations)
• POS tagging
• Chunks (cf. Named Entities)
• Deep syntactic structures
• Co-references etc.
[Eagles/ISLE, CES, MATE, …]

Knowledge structures
• Hierarchies of types
• Relations between concepts (subjects/topics etc.)
• Links to primary resources
[Topic Maps, OIL, RDF]

Lexical structures (Language models)
• Terminologies
• Transfer lexica
• LTAG/HPSG/LFG lexica
[TBX, OLIF, Eagles/ISLE (Genelex)]

Links

Meta-data [Dublin core, OLAC, ISLE, MPEG7, RDF]

Access protocols [Corba, SOAP]


Context

ISO TC37 - Terminology and other language resources
• SC3 - Computer applications in terminology
  • ISO 12200 - Martif
    • Latest version of TEI Terminology chapter
  • ISO 12620 - Data categories
  • ISO CD (DIS: under ballot) 16642 - TMF (Terminological Markup Framework)
• SC4 - Language resources


TC37/SC4 details

Scope:
• Platform for designing and implementing linguistic resource formats and processes
• Multi-layer annotation of linguistic resources
• Exchange of information between NLP modules

General strategy
• Involve a wide community from academia and industry
• Identification of experts in the various work items
• Involvement through national standardizing bodies

Agenda
• Current: identification of possible work items and working groups
• Constituency meeting and technical workshop at LREC (May 2002)


Organization

Secretary: Prof. Key-Sun Choi, Korea

Chair: Laurent Romary, France

International Advisory Committee
• Permanent Chair: Prof. Antonio Zampolli, Italy


SC4 and other standardizing bodies

W3C - basic protocols and formats
• XML (Schemas)
• XPath
• XPointer
• + RDF, SVG, SMIL, SOAP

MPEG - Multimedia, XML based
• e.g. MPEG7-4
• Word and phone lattices

ISO TC37/SC4 - language resources, NLP perspective
• e.g. linguistic annotations, lexical formats

TEI - text representation
• Reference for primary sources
• e.g.: text archives

Text

Audio/Speech

Technical background

What about gestures?
• Kinetic in the TEI
• SMIL?

Oscar

Contributing organizations


Working groups

WG1: Basic descriptors and mechanisms for language resources
• Convener: Laurent Romary

WG2: Representation schemes
• Convener: Kiyong Lee

WG3: Multilingual text representation
• Convener: Alan K. Melby

WG4: Lexical databases
• Convener: ??

WG5: Workflow of Language Resource Management
• Convener: Christian Galinski


TC37/SC4 Work Items

WG1/WI-0: Terminology of Language Resources
WG1/WI-1: Linguistic annotation framework
WG1/WI-2: Meta-data for multimodal and multilingual information

WG2/WI-3: Structural content representation scheme
WG2/WI-4: Multimodal content representation scheme
WG2/WI-5: Discourse level representation scheme


TC37/SC4 Work Items - cont.

WG3/WI-6a: Translation Memory, Alignment of parallel corpora
WG3/WI-6b: Segmentation and counting algorithms (characters, words, sentences etc.)
WG3/WI-6c: Meta-markup for GIL (Globalization, Internationalization and Localization)

WG4/WI-7: NLP Lexica

WG5/WI-8: Validation of language resources
WG5/WI-9: Net-based distributed cooperative work for the creation of LRs


WI-0

Terminology of Language Resources
• Basic terminology of the various sub-fields of language resources and general methodology
• Project leader: Klaus-Dirk Schmitz
• Sources:
  • ISO 1087
  • LREC proceedings + KAIST
  • English dictionaries in Linguistics?
• Support from GTW


WI-1

Linguistic annotation framework
• Basic mechanisms and data structures for linguistic annotation and representation [data architecture]
• Methods and principles for the design of an annotation scheme
• Structural nodes and information units, Data category specification
• Linking and pointing mechanisms, Feature Structures, Meta-Markup
• « Stand-off » and « in-line » views - equivalences, combining levels.
• Administrative data categories
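The equivalence between « stand-off » and « in-line » views can be illustrated with a minimal sketch. It assumes character-offset spans that do not overlap; the `Span` class, the `to_inline` function, the `<w tag="…">` element and the tag values are all hypothetical names chosen for illustration, not part of any SC4 work item.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A hypothetical stand-off annotation pointing into the primary text."""
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # illustrative data category value, e.g. a POS tag


def to_inline(text: str, spans: list[Span]) -> str:
    """Render non-overlapping stand-off spans as in-line tags.

    Assumption: spans do not overlap and the text needs no XML escaping.
    """
    out = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor or span.end > len(text) or span.start >= span.end:
            raise ValueError(f"overlapping or invalid span: {span}")
        out.append(text[cursor:span.start])  # untouched primary text
        out.append(f'<w tag="{span.label}">{text[span.start:span.end]}</w>')
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


text = "Time flies"
spans = [Span(0, 4, "NOUN"), Span(5, 10, "VERB")]
inline = to_inline(text, spans)
```

Here `inline` is `<w tag="NOUN">Time</w> <w tag="VERB">flies</w>`. The stand-off view leaves the primary text unchanged, which is what allows several annotation levels to be combined over the same source.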


WI-1 - cont.

• Project leader: Nancy Ide (TBC)
• Contributors: Alan Melby, Koiti Hasida, Lee Gillam, Yves Savourel, Laurent Romary…
• Possible sources:
  • TMF, ISO 12620-revised, Mate (general methodology)
  • TEI (Linking mechanisms, feature structures)
  • Link with Linguistic DS


WI-2

Meta-data for multimodal and multilingual information
• Description of a meta-data representation scheme to document linguistic information structures and processes
  • General content description
  • Local content description
• Project leader: Peter Wittenburg, MPI (Nijmegen, NL)
• Participants: Steven Bird, TEI aware person
• Possible sources:
  • OLAC, Mile, TEI Header
• Liaison: TC46 (SC9), MPEG7/MDS, SCORM


WI-3

Structural content representation scheme
• Definition of annotation/representation scheme(s) for morpho-syntax and syntax, to be used for annotation and interchange purposes
• Meta-model for morpho-syntactic annotation
• Meta-model(s) for syntactic annotation (lexicalized grammar, elementary trees, dependency structures)
• + corresponding Data category registries


WI-3 - cont.

• Project leader: John Carroll ??
• Participants: Nuria Bell, … representatives from existing TreeBanks initiatives
• Possible sources:
  • Eagles, TAGML, Linguistic DS
  • SIGPARSE


WI-4

Multimodal meaning representation scheme
• Representation scheme for the semantic content of multimodal information (textual, spoken, graphical and gestural)
• Meta-model for content representation (Events, participants, etc.)
• Data category registry for multimodal content
• Project leader: Harry Bunt (id=“1”)
• Possible sources:
  • SIGSEM working group on semantic content
    • Chair: #1
  • « Liaison » Semantic web activities


WI-5

Discourse level representation scheme
• Meta-model for discourse and dialogue representation
• Meta-model for discourse level annotation (e.g. reference annotation)
• + corresponding DatCat registry
• Possible sources:
  • SIGDIAL
  • DRI - Discourse Resource Initiative
  • Mate


WI 6a

Translation Memory, Alignment of parallel corpora
• Provides formats for the representation of multilingual textual data as produced in translation activities or constructed from existing primary sources
• Sources:
  • OSCAR/TMX for translation memories
  • TEI based linking mechanism (or see WI-1) for Parallel texts
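As an illustration of what a translation memory format carries, the sketch below reads aligned segments from a simplified TMX-like fragment. The element names (`tu`, `tuv`, `seg`) follow TMX, but the sample omits the header that real TMX files carry, and the `aligned_pairs` helper and the sample data are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Simplified, TMX-like sample (illustrative data, header omitted).
SAMPLE = """<tmx>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Language resources</seg></tuv>
      <tuv xml:lang="fr"><seg>Ressources linguistiques</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Translation memory</seg></tuv>
      <tuv xml:lang="fr"><seg>Mémoire de traduction</seg></tuv>
    </tu>
  </body>
</tmx>"""


def aligned_pairs(tmx_text, source, target):
    """Return (source, target) segment pairs from each translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Index the variants of this unit by language code.
        by_lang = {
            tuv.get(XML_LANG): tuv.findtext("seg", default="")
            for tuv in tu.findall("tuv")
        }
        if source in by_lang and target in by_lang:
            pairs.append((by_lang[source], by_lang[target]))
    return pairs


pairs = aligned_pairs(SAMPLE, "en", "fr")
```

Each translation unit groups language variants of one segment, so alignment is implicit in the container structure. A linking mechanism over parallel texts would instead express the same pairing with pointers between two separate documents.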


WI 6b

Segmentation and counting algorithms (characters, words, sentences etc.)
• Provide methods for segmenting streams of text with markup and means for counting the corresponding segments
• Possible sources: OSCAR
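A minimal sketch of segmentation and counting over text with markup is given below. The rules are deliberately naive assumptions (tags are anything between angle brackets, words are runs of word characters, sentences end at `.`, `!` or `?` followed by whitespace) and do not reproduce any OSCAR algorithm; the function names are hypothetical.

```python
import re

# Assumption: markup is any <...> tag; entities and CDATA are not handled.
TAG = re.compile(r"<[^>]+>")


def strip_markup(marked_up: str) -> str:
    """Remove tags so that only the text stream is counted."""
    return TAG.sub("", marked_up)


def count_segments(marked_up: str) -> dict:
    """Count characters, words and sentences in the text content."""
    text = strip_markup(marked_up)
    words = re.findall(r"\w+", text)
    # Naive sentence boundary: terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
    }


counts = count_segments("<p>Hello <b>world</b>. How are you?</p>")
```

For this input `counts` is `{"characters": 25, "words": 5, "sentences": 2}`. Different tools disagree on exactly these rules (whether spaces count as characters, how hyphenated words are split), which is the motivation for agreeing on a common method.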


WI 6c

Meta-markup for GIL (Globalization, Internationalization and Localization)
• Identification of the specific markup modules needed to perform GIL activities
• Possible sources:
  • OSCAR/OpenTag


WI-7

NLP lexica
• Lexicon representation formats for the various types of NLP applications (Machine Readable Lexica)
  • Define a set of meta-models (classes of applications)
  • Specific data categories (derivation, phonology, etc.)
  • Based on the work done in other work items
• Possible sources
  • Eagles
  • Multext
  • ISLE Computational lexicon Working group
  • OLIF


WI-8

Validation of language resources
• Defines guidelines and requirements for producing and distributing high quality language resources
• Contacts:
  • ELRA, TEI
• Possible sources:
  • To be defined


WI-9

Net-based distributed cooperative work for the creation of LRs
• Principles and methods for designing collaborative and cooperative compilation of LRs
• Define what is specific to LRs with regard to:
  • Traceability of resources, version control, validation, quality management
  • Protocols (Corba, SOAP), Workflow standards, Data management
• Contacts: Christian Galinski, Remi Zajac, …
• Sources: To be defined


Liaison - OSCAR (AKM)

• Brief history of LR exchange standards
  • Parallel events since 1997
  • Open Tag - meta-markup (XML vs. Others)
• Major current OSCAR activities
  • TMX - Translation Memory eXchange
  • Counting and segmentation algorithms
  • TBX (Terminologies) and OLIF (MT lexica)
  • XLIFF and CGS - Annotation of source code and localisation of web sites
• xml:lang etc.: J. DeCamp and S.-E. Wright


Liaison - TEI (LR)

General architecture and data modeling WI-1

Annotations (paragraph level, external annotations) WI-1

TEI Header WI-2

NLP lexica (with regard to Terminologies and dictionaries) WI-7

Feature structures WI-1