N. Calzolari, Multilingual Web, Madrid, October 2010
Nicoletta Calzolari
Istituto di Linguistica Computazionale - CNR - Pisa
Language Resources: a pillar of Language Technology
In the Multilingual Web perspective
A new paradigm of R&D in LRs & LT
For the past few years
Open & distributed linguistic infrastructures for LRs & LT
Adopting the paradigm of accumulation of knowledge, so successful in more mature disciplines, based on sharing LRs, tools & results
Ability to build on each other's achievements, allowing controlled & effective cooperation of many groups on common tasks (see the Human Genome Project)
e.g. initiatives to achieve international consensus on annotation guidelines
Emerging concept of collective intelligence
Emphasize interoperability among LRs & LT
Some steps for a “new generation” of LRs
From huge efforts building static, large-scale, general-purpose LRs
To dynamic LRs rapidly built on demand, tailored to specific user needs
From closed, locally developed and centralized resources
To LRs residing over distributed places, accessible on the web, choreographed by agents acting over them
From Language Resources To Language Services
Need for an infrastructure that makes this vision operational
Lexical WEB
As a critical step for semantic mark-up in the Semantic Web
[Diagram: existing lexicons (ComLex, SIMPLE, WordNets, FrameNet, NomLex, Lex_x, Lex_y) linked with intelligent agents]
Standards for Content Interoperability: enough??
Global WordNet GRID, BioLexicon, SIMPLE-WEB
Distributed Language Services
content interoperability
standards
supra-national cooperation
architectures enabling accessibility
Collaborative & collective/social development & validation, cross-resource integration & exchange of information
Create new resources on the basis of existing ones
Exchange & integrate information across repositories
Compose new services on demand
Cultural issues
Language … and cultural identity
Economic, social issues
Applications, Services
Technical, scientific issues
Political issues
e.g. a commonly agreed list of minimal requirements for “national” LRs: BLARK
We need to consider together technical, organisational, strategic, economic, cultural, legal, political issues wrt LRs & LTs
EU Network FLaReNet
Sensitive
Why a Network of LRs & LTs?
Many dimensions around the notion of language
Fostering Language Resources Network
FLaReNet at a glance
An international Forum to facilitate interaction, to
Overcome the fragmentation in LR & LT & recreate a community
Anticipate the needs of new types of Language Resources (LR) and Technologies (LT) & Language Infrastructures
Create a shared policy in the field of LRs & LT for the next years
Foster a European strategy for consolidating the sector
http://www.flarenet.eu
81 Institutional Members
From 31 countries
332 Individual Subscribers
Essential Community mobilisation (also to prepare the ground for a RI)
FLaReNet Mission & Impact: structure the area of LRs of the future
A “roadmap”: a plan of actions as input to policy development
For the EU, national organisations & industry
Strengthen the language product market
Identify new language policies supporting linguistic diversity
Identify areas where consensus is achieved/emerging vs. areas where more discussion & testing is required
Indicate priorities
A (EU) model for the LRs/LTs of the next years
Ambitious!
Create a shared repository of data formats, annotations, etc. as a major help to achieve standardisation
Common repositories for tools & language data should be established that are universally and easily accessible by everyone
Coordinate input to ISO/W3C standardisation work
Results from Vienna & Barcelona Forum: Shaping the Future of the Multilingual Digital Europe
Standards, Interoperability & Metadata are topics to be approached in cooperation
Access to LRs is critical & should involve all the community
Need to create the means to plug together different LR & LT, in a web-based resource and technology “grid”
For a new world-wide language infrastructure
Which Communities?
Language Resources, Language Technologies, Standardisation, Content/Ontologies, System developers, Integrators, SSH, …
EC, National funding agencies, Industry, …
Many tasks & application domains: MT, CLIR, …, e-government, content industry, intelligence, e-culture, e-health, domotics, …
Focus on cooperation
Many LRs & LTs exist, but a global vision, policy & strategy is needed
[Diagram labels: core; EU Forum; CLARIN for SSH; FLaReNet Network; METANET NoE]
META-SHARE
an Open Resource Exchange and Sharing Facility
META-SHARE: an open, integrated, secured, and interoperable language data and tools exchange facility for the HLT (Human Language Technologies) domain and other applicative domains (e.g. digital libraries, cognitive systems, robotics, etc.)
− ever-evolving, scalable, incl. free and for-a-fee LRs/LTs and services;
− including legacy, contemporary and emerging datasets, tools and technologies
− based on distributed networked repositories accessible through common interfaces
− standards-compliant, overcoming format, terminological and semantic differences; allowing/enabling service offerings
− complying with legal and security-related restrictions
A marketplace where language (and related) data and tools are documented, uploaded and stored in repositories, catalogued and announced, downloaded, exchanged, combined, etc., aiming to support a data economy
On the communication/mobilisation side
A change of culture
Convincing arguments that data assets and their value do not necessarily grow if locked in the drawer
Incentives and models that can convince data holders that there is life after the announcement of data existence and/or sharing (share does not necessarily mean for free, nor for unbridled use)
Interoperability, common metadata, formats, etc.
In other words we need to create/reinforce a data economy based on widely agreed principles and rules, mutual understanding, sustainable and adaptive models, simplified copyright rules and licensing models
The present time window seems appropriate
Challenges
Collaborative iResources
LR building as collaborative “common shared task”
New methodology of work
Assemble a comprehensive “map of language data and mechanisms” for the planet’s languages (LRE Map)
Interoperability acquires even more value
Needs consensual planning of common strategies towards shared objectives
Not just the sum of many individual efforts
But an organised, well-structured, collective enterprise
Similar to more mature sciences: physicists’/astronomers’ experiments … of X,000 people working on the same big enterprise
META-SHARE is a big step that needs a real paradigm shift
Collaborative iResources iKnowledge
Enhance “content” interoperability
Towards Knowledge Resources
In a Unified Framework for LRs & (old?) Semantic Web
Next step:
Everyone contributes to building the LRE Map!
When submitting a paper, provide info about resources used/created
At LREC, COLING, EMNLP, …
A collective enterprise of the LR community, as a step towards the creation of a broad community-built, Open Resource Infrastructure
It will become an essential instrument to monitor the field & identify shifts in production, use, evaluation of LRs and LTs over the years, in adoption of standards, …
the LRE Map
www.resourcebook.eu
How many LRs at LREC?
Corpora: 785
Lexicons: 289
Tagger/Parser: 181
Annotation tool: 134
Ontology: 73
Evaluation data: 40
Annotation Guidelines: 35
...
Submissions: 1288
Language Resource forms: 1994
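To put these counts in proportion, a minimal Python sketch follows. It assumes (the slide does not say so explicitly) that the per-type counts are a breakdown of the 1994 Language Resource forms, and that the "..." stands for further types not itemised on the slide. The variable names are illustrative only.

```python
# Counts copied from the slide. Assumption: they break down the 1994 forms.
listed = {
    "Corpora": 785,
    "Lexicons": 289,
    "Tagger/Parser": 181,
    "Annotation tool": 134,
    "Ontology": 73,
    "Evaluation data": 40,
    "Annotation Guidelines": 35,
}
SUBMISSIONS = 1288  # papers submitted
FORMS = 1994        # Language Resource forms filled in

listed_total = sum(listed.values())
# Forms belonging to types hidden behind the slide's "..." (under the assumption above).
unlisted = FORMS - listed_total
# Average number of resource forms described per submitted paper.
forms_per_submission = FORMS / SUBMISSIONS

for name, count in listed.items():
    share = 100 * count / FORMS
    print(f"{name:<22}{count:>5}{share:>7.1f}%")
print(f"{'Other (not itemised)':<22}{unlisted:>5}{100 * unlisted / FORMS:>7.1f}%")
print(f"Forms per submission: {forms_per_submission:.2f}")
```

Under that reading, corpora and lexicons together account for more than half of the forms, and each submission describes on average more than one resource.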
Languages:
But obviously …
170 !!
image courtesy of Wordle (http://www.wordle.net)
From no infrastructure ...
To many infrastructures
We were complaining there was no infrastructure ...
Have we been too successful??
Now many infrastructural initiatives
Very good opportunity
But only if we are able to act in a coordinated & coherent way
Otherwise we spoil & confuse the field