

N. Calzolari, Multilingual Web, Madrid, October 2010

Nicoletta Calzolari

Istituto di Linguistica Computazionale - CNR - Pisa


Language Resources: a pillar of Language Technology

In the Multilingual Web perspective


A new paradigm of R&D in LRs & LT

For a few years now

Open & distributed linguistic infrastructures for LRs & LT

Adopting the paradigm of accumulation of knowledge, so successful in more mature disciplines, based on sharing LRs, tools & results

Ability to build on each other's achievements, allowing controlled & effective cooperation of many groups on common tasks (see the Human Genome Project)

e.g. initiatives to achieve international consensus on annotation guidelines

Emerging concept of collective intelligence

Emphasize interoperability among LRs & LT

Some steps for a “new generation” of LRs


From huge efforts building static, large-scale, general-purpose LRs

To dynamic LRs rapidly built on demand, tailored to specific user needs

From closed, locally developed and centralized resources

To LRs residing in distributed places, accessible on the web, choreographed by agents acting over them

From Language Resources To Language Services

Need for an infrastructure that makes this vision operational

Lexical WEB

As a critical step for semantic mark-up in the Semantic Web


[Diagram. Resource labels: ComLex, NomLex, SIMPLE, WordNets, FrameNet, BioLexicon, Lex_x, Lex_y. Other labels: "with intelligent agents"; Global WordNet GRID; SIMPLE-WEB; Distributed Language Services]

Standards for Content Interoperability

Enough??


content interoperability

standards

supra-national cooperation

architectures enabling accessibility

Collaborative & collective/social development & validation, cross-resource integration & exchange of information

Create new resources on the basis of existing ones

Exchange & integrate information across repositories

Compose new services on demand


Cultural issues: Language … and cultural identity

Economic, social issues

Applications, Services

Technical, scientific issues

Political issues: e.g. a commonly agreed list of minimal requirements for “national” LRs: BLARK

We need to consider together technical, organisational, strategic, economic, cultural, legal and political issues with respect to LRs & LTs

EU Network FLaReNet

Sensitive

Why a Network of LRs & LTs?

Many dimensions around the notion of language

Fostering Language Resources Network

FLaReNet at a glance

An international Forum to facilitate interaction, to:

Overcome the fragmentation in LR & LT & recreate a community

Anticipate the needs of new types of Language Resources (LR) and Technologies (LT) & Language Infrastructures

Create a shared policy in the field of LRs & LT for the next years

Foster a European strategy for consolidating the sector


http://www.flarenet.eu


81 Institutional Members from 31 countries

332 Individual Subscribers

Essential Community mobilisation (also to prepare the ground for a RI)


FLaReNet Mission & Impact: structure the area of LRs of the future

A “roadmap”: a plan of actions as input to policy development

For the EU, national organisations & industry

Strengthen the language product market

Identify new language policies supporting linguistic diversity

Identify areas where consensus is achieved/emerging vs. areas where more discussion & testing is required

Indicate priorities

A (EU) model for the LRs/LTs of the coming years

Ambitious!


Create a shared repository of data formats, annotations, etc. as a major help to achieve standardisation

Common repositories for tools & language data should be established that are universally and easily accessible by everyone

Coordinate input to ISO/W3C standardisation work

Results from Vienna & Barcelona Forum: Shaping the Future of the Multilingual Digital Europe

Standards, Interoperability & Metadata are topics to be approached in cooperation

Access to LRs is critical & should involve all the community

Need to create the means to plug together different LR & LT, in a web-based resource and technology “grid”

For a new world-wide language infrastructure


Which Communities?

Language Resources, Language Technologies, Standardisation, Content/Ontologies, System developers, Integrators, SSH, …

EC, National funding agencies, Industry, …

Many tasks & application domains: MT, CLIR, …, e-government, content industry, intelligence, e-culture, e-health, domotics, …

Focus on cooperation

Many LRs & LTs exist, but a global vision, policy & strategy is needed

[Diagram labels: core; EU Forum; "with"; "for"; CLARIN for SSH; FLaReNet Network; META-NET NoE]

META-SHARE: an Open Resource Exchange and Sharing Facility

META-SHARE: an open, integrated, secure, and interoperable language data and tools exchange facility for the HLT (Human Language Technologies) domain and other application domains (e.g. digital libraries, cognitive systems, robotics, etc.)

− ever-evolving, scalable, incl. free and for-a-fee LRs/LTs and services;

− including legacy, contemporary and emerging datasets, tools and technologies

− based on distributed networked repositories accessible through common interfaces

− standards-compliant, overcoming format, terminological and semantic differences; allowing/enabling service offerings

− complying with legal and security-related restrictions

A marketplace where language (and related) data and tools are documented, uploaded and stored in repositories, catalogued and announced, downloaded, exchanged, combined, etc., aiming to support a data economy
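
The idea of "distributed networked repositories accessible through common interfaces" can be illustrated with a minimal Python sketch. This is not META-SHARE's actual design or API: the `ResourceRecord` fields, the `Repository` protocol, `InMemoryRepository`, `federated_search`, and the node names are all hypothetical, chosen only to show how shared metadata and one common query interface let independent nodes be searched as if they were one catalogue.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class ResourceRecord:
    """Minimal shared metadata for one resource (hypothetical fields)."""
    name: str
    resource_type: str  # e.g. "corpus", "lexicon", "tool"
    language: str       # e.g. a language code such as "it"
    licence: str        # sharing is not necessarily free of charge
    repository: str     # which node holds the resource


class Repository(Protocol):
    """The common interface every node is assumed to expose."""

    def search(self, language: str) -> Iterable[ResourceRecord]: ...


class InMemoryRepository:
    """Toy stand-in for one networked repository."""

    def __init__(self, name: str, records: List[ResourceRecord]) -> None:
        self.name = name
        self._records = records

    def search(self, language: str) -> Iterable[ResourceRecord]:
        return [r for r in self._records if r.language == language]


def federated_search(nodes: Iterable[Repository], language: str) -> List[ResourceRecord]:
    """Query every node through the same interface and merge the results."""
    results: List[ResourceRecord] = []
    for node in nodes:
        results.extend(node.search(language))
    return sorted(results, key=lambda r: (r.resource_type, r.name))


# Two illustrative nodes with placeholder resources.
node_a = InMemoryRepository("node-a", [
    ResourceRecord("Lex_x", "lexicon", "it", "free", "node-a"),
    ResourceRecord("Corpus_a", "corpus", "en", "for-a-fee", "node-a"),
])
node_b = InMemoryRepository("node-b", [
    ResourceRecord("Lex_y", "lexicon", "it", "for-a-fee", "node-b"),
])

for record in federated_search([node_a, node_b], "it"):
    print(record.name, record.licence, record.repository)
```

The caller never needs to know where a resource lives or how a node stores it; agreeing on the record fields and the `search` signature is the interoperability step.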


On the communication/mobilisation side

A change of culture

Convincing arguments that data assets and their value do not necessarily grow if locked in the drawer

Incentives and models that can convince data holders that there is life after the announcement of data existence and/or sharing (sharing does not necessarily mean free of charge, nor unrestricted use)

Interoperability, common metadata, formats, etc.

In other words we need to create/reinforce a data economy based on widely agreed principles and rules, mutual understanding, sustainable and adaptive models, simplified copyright rules and licensing models

The present time window seems appropriate

Challenges


Collaborative iResources

LR building as collaborative “common shared task”

New methodology of work

Assemble a comprehensive “map of language data and mechanisms” for the planet’s languages (LRE Map)

Interoperability acquires even more value

Needs consensual planning of common strategies towards shared objectives

Not just the sum of many individual efforts

But an organised, well-structured, collective enterprise

Similar to more mature sciences: physicists’/astronomers’ experiments … of X,000 people working on the same big enterprise


META-SHARE is a big step that needs a real paradigm shift

Collaborative iResources iKnowledge

Enhance “content” interoperability

Towards Knowledge Resources

In a Unified Framework for LRs & (old?) Semantic Web


Next step:


Everyone contributes to building the LRE Map!

When submitting a paper provide info about resources used/created

At LREC, COLING, EMNLP, ....

A collective enterprise of the LR community, as a step towards the creation of a broad community-built, Open Resource Infrastructure

It will become an essential instrument to monitor the field & identify shifts in production, use, evaluation of LRs and LTs over the years, in adoption of standards, …

The LRE Map: www.resourcebook.eu


How many LRs at LREC?

Corpora: 785

Lexicons: 289

Tagger/Parser: 181

Annotation tool: 134

Ontology: 73

Evaluation data: 40

Annotation Guidelines: 35

...

Submissions: 1288

Language Resource forms: 1994
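
A quick arithmetic check on these figures, as a minimal Python sketch. It uses only the numbers on the slide; the variable names are illustrative, and it assumes the per-type counts are tallied over the 1994 resource forms, which the slide does not state explicitly. Because the list ends with "...", the listed types do not add up to the total, so the remainder is reported as unlisted.

```python
# Counts per resource type, copied from the slide.
counts = {
    "Corpora": 785,
    "Lexicons": 289,
    "Tagger/Parser": 181,
    "Annotation tool": 134,
    "Ontology": 73,
    "Evaluation data": 40,
    "Annotation Guidelines": 35,
}
submissions = 1288
forms = 1994

# Sum of the types that the slide names explicitly.
listed = sum(counts.values())

# Forms belonging to types hidden behind the slide's "...".
unlisted = forms - listed

# Share of each listed type, as a percentage of all forms (assumption: same base).
shares = {kind: round(100 * n / forms, 1) for kind, n in counts.items()}

# Average number of resource forms filled in per submission.
forms_per_submission = round(forms / submissions, 2)

print(f"listed: {listed}, unlisted: {unlisted}")
for kind, pct in shares.items():
    print(f"{kind}: {pct}%")
print(f"forms per submission: {forms_per_submission}")
```

Under that assumption the seven named types cover 1537 of the 1994 forms, corpora alone account for roughly two in five, and each submission describes about one and a half resources on average.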


Languages: 170!!

But obviously …

image courtesy of Wordle (http://www.wordle.net)


From no infrastructure ...

To many infrastructures

We were complaining there was no infrastructure ...

Have we been too successful??

Now many infrastructural initiatives

Very good opportunity

But only if we are able to act in a coordinated & coherent way

Otherwise we spoil & confuse the field
