
META-NET has received funding from the EU’s Horizon 2020 research and innovation programme through the contract CRACKER (grant agreement no.: 645357). Formerly co-funded by FP7 and ICT PSP through the contracts T4ME (grant agreement no.: 249119), CESAR (grant agreement no.: 271022), METANET4U (grant agreement no.: 270893) and META-NORD (grant agreement no.: 270899).

Language Resources for Multilingual Europe

Georg Rehm
META-NET Network Manager – CRACKER Coordinator

DFKI, Germany
[email protected]

LT Innovate Summit – LR Dialogue Workshop, Panel “Language Resource Supply”
Brussels, Belgium, June 25, 2015

META-NET and META

•  60 research centres in 34 countries (via four EU-funded projects: T4ME, CESAR, METANET4U, META-NORD)

•  Multilingual Europe Technology Alliance, 794 members in 68 countries

http://www.meta-net.eu/members

http://www.meta-net.eu

META-SHARE

•  Pan-European infrastructure, bringing together providers and consumers of language data, tools and services.

•  LRs are documented, uploaded, stored, catalogued, downloaded, shared – to improve visibility, documentation, identification, availability, interoperability.

•  Caters for datasets, tools, services for LT research and development (both academic and commercial); META-SHARE includes repository software, a metadata model, licensing kit, statistics.

•  29 distributed repositories maintained by 37 organisations in 25 countries.

•  2,500+ resources (corpora: 49%, lexical: 38%, tools/services: 12%), covering ca. 100 languages.

•  7,000+ downloads in total; ca. 70% of all LRs have been downloaded.

MT: level of support by language

Excellent: (none)
Good: English
Moderate: French, Spanish
Fragmentary: Catalan, Dutch, German, Hungarian, Italian, Polish, Romanian
Weak or no support: Basque, Bulgarian, Croatian, Czech, Danish, Estonian, Finnish, Galician, Greek, Icelandic, Irish, Latvian, Lithuanian, Maltese, Norwegian, Portuguese, Serbian, Slovak, Slovene, Swedish, Welsh

Resources: level of support by language

Excellent: (none)
Good: English
Moderate: Czech, Dutch, French, German, Hungarian, Italian, Polish, Spanish, Swedish
Fragmentary: Basque, Bulgarian, Catalan, Croatian, Danish, Estonian, Finnish, Galician, Greek, Norwegian, Portuguese, Romanian, Serbian, Slovak, Slovene
Weak/no support: Icelandic, Irish, Latvian, Lithuanian, Maltese, Welsh

[Figure: level of support for resources per language, on a scale of Weak/none, Fragmentary, Moderate, Good, Excellent. Languages shown: Welsh, Maltese, Lithuanian, Latvian, Icelandic, Irish, Croatian, Serbian, Estonian, Slovene, Slovak, Romanian, Norwegian, Greek, Galician, Danish, Bulgarian, Basque, Swedish, Portuguese, Finnish, Catalan, Polish, Hungarian, Czech, Italian, German, Dutch, Spanish, French, English.]

Languages with names in red have little or no MT support

Language White Paper Series
Europe’s Languages in the Digital Age (2011/2012)

Summary: “At Least 21 European Languages in Danger of Digital Extinction!”

http://www.cracker-project.eu • http://www.meta-net.eu

LR-Related Activities

[Timeline 2015–2017 (M1, M12, M24, M36): Kick-off meeting for all ICT-17 projects; translate5; WMT 2016, WMT 2017; IWSLT 2015, IWSLT 2016, IWSLT 2017; QT Marathon 2015, QT Marathon 2016; Roadmap for European MT Research; Survey on the State of HQMT in Industry and LSPs; SRIA (initial version), SRIA (update), SRIA (final); version 1, version 2]

•  Production of resources (e.g., for WMT 2016 and 2017, IWSLT 2015-2017)

•  Tools for resources (quality control, evaluations; towards the idea of a smart workbench for translators)

•  Strategies and roadmaps for resources (SRIA, Roadmap for European MT Research)

•  Exchange and sharing facility for resources (META-SHARE)

Maintenance of Operations and Outreach

•  Provide services, adapt them to evolving user requirements and licensing landscape

   •  adapt, streamline and extend the metadata schema;
   •  adapt licensing toolkit to new international licensing setups;
   •  streamline and simplify operations for repository providers and data depositors.

•  Technical support and bug fixing


•  Federation of projects – core seed: the group of H2020-ICT17 projects.

•  Multi-lateral Memorandum of Understanding, ca. 20 projects in total (including FP7 and H2020-ICT15), to be approached in two phases (first phase almost completed).

•  Selected areas of collaboration: data management and repositories (including Data Management Plan), tools and technologies; shared tasks and evaluations.

•  http://www.cracking-the-language-barrier.eu will be launched soon.

MT Use Cases and Language Resources

•  “Usability” is an unusual generic dimension for the evaluation of a resource.

•  Reason: the majority of LRs can be used in many different research or application scenarios.

•  More relevant dimensions: quality, availability, coverage, maturity, sustainability, adaptability, size, format, license, language, style etc. – depending on the use case.

•  When talking about LRs for MT, it’s important to be specific in terms of the respective use case.

•  Reason: the use case puts specific requirements on the type of LR and relevant dimensions.

Scenario: Inbound Translation (written texts)
   MT Use Case: Gist translation, provide an idea of a text’s contents
   Maturity of Technology: Deployed (Google Translate), research ongoing
   Human Involvement: –
   Relevance of Quality: Quality of MT secondary
   Methods: Statistical MT
   LR Requirements: Very large aligned data sets (the more data, the better)

Scenario: Outbound Translation (written texts)
   MT Use Case: Production quality, for publication
   Maturity of Technology: Research on HQMT has started, no POCs yet
   Human Involvement: –
   Relevance of Quality: Quality of MT extremely important, ideally HQ
   Methods: New approach needed, SMT, RBMT, hybrid systems (needs quality estimation methods)
   LR Requirements: Deeply annotated data sets with quality information (also needs more research)

Scenario: Outbound Translation (written texts)
   MT Use Case: Production quality, for publication
   Maturity of Technology: Deployed, usable via LSPs
   Human Involvement: Post-editing
   Relevance of Quality: Quality of initial MT step important but secondary
   Methods: MT, followed by post-editing, ideally with smart translation workbenches (CAT)
   LR Requirements: Translation memories and term databases (large coverage, high quality etc.)

Scenario: Speech to Speech Translation
   MT Use Case: Enable face-to-face conversations
   Maturity of Technology: Research ongoing but POCs exist (Skype)
   Human Involvement: –
   Relevance of Quality: Quality of MT secondary
   Methods: Recognition and generation of spoken language; statistical MT etc.
   LR Requirements: Several additional technologies and LR types needed (such as very large speech databases)


META-NET SRA LR Roadmap

•  Infrastructure – maintain and extend sharing facility; promote documentation through metadata; intensify cooperation

•  Coverage, Quality, Adequacy – increase number of LRs for all European languages to address application needs; promote evaluation and validation to improve LR quality constantly

•  Acquisition – define best practices for LR production; automate production; distributed production (crowd-sourcing, social media, gamification etc.); bridge acquisition methods with LOD, big data

•  Openness – elaborate simple and harmonised licensing solutions; promote openness and sharing of LRs

•  Interoperability – promote and encourage use of standards

FLaReNet is a project funded under the eContentplus programme, grant agreement ECP-2007-LANG-617001. eContentplus is a multiannual Community programme to make digital content in Europe more accessible, usable and exploitable.

The Strategic Language Resource Agenda

Nicoletta Calzolari, Valeria Quochi, Claudia Soria

CNR - Istituto di Linguistica Computazionale “A. Zampolli”, Italy

with the contribution of

Núria Bel, University Pompeu Fabra, Spain

Gerhard Budin, Universität Wien, Austria

Khalid Choukri, ELDA, France

Joseph Mariani, LIMSI/IMMI-CNRS, France

Monica Monachini, CNR-ILC, Italy

Jan Odijk, Universiteit Utrecht, Netherlands

Stelios Piperidis, ILSP/“Athena” R.C., Greece


We need an LT Masterplan

•  In 2015, LT is simply everywhere: search, interactive assistants (phones, cars, appliances), big data, social media analytics, etc. The potential is huge!

•  Europe needs to follow a Language Technology Masterplan. Resources are only one piece of the puzzle; the plan also needs to reflect technologies, tools, research, innovation, platforms, infrastructures, services, language policy making, the language communities, flagship initiatives (CEF, DSM), etc.

•  Europe is only starting to recognise the potential of LT.

•  LT will be a key ingredient of our future IT – with or without Europe.

•  Europe has a unique opportunity for a strategic investment into our future growth.


DECLARATION OF COMMON INTERESTS

We, the undersigned, declare here, at the Riga Summit on the Multilingual Digital Single Market, encouraged by the letter Vice President Andrus Ansip sent to its participants, that we stand united in our goal and interest to:

- support multilingualism in Europe by employing language technology in business, society and governance, to create a truly Multilingual Digital Single Market,

- exchange and share information in our efforts to promote our goals and interests at local, national and European levels,

- raise awareness in society at large using channels available to our associations, alliances and societies.

In the near future, we foresee the establishment of a Memorandum of Understanding among our organisations towards a “Coalition for a Multilingual Europe”, to better serve our members address the language barrier challenges towards establishing a truly integrated Multilingual Digital Single Market.

Riga, 29. April 2015

Signed by (in alphabetical order):

BDVA Laure Le Bars

CITIA Steve Renals

CLARIN Steven Krauwer

EFNIL Sabine Kirchmeier-Andersen, Tamás Váradi

ELEN Davyth Hicks, Claudia Soria

ELRA Nicoletta Calzolari, Khalid Choukri

GALA Laura Brandon, Robert E. Etches, Sergey Gladkov

LT Innovate Jochen Hummel, Philippe Wacker

META-NET Jan Hajic, Josef van Genabith, Georg Rehm, Andrejs Vasiljevs

NPLD Meirion Prys Jones

TAUS Jaap van der Meer

W3C Richard Ishida, Felix Sasaki

For any questions, please contact [email protected].

Strategic Agenda for the Multilingual Digital Single Market

Technologies for Overcoming Language Barriers towards a truly integrated European Online Market

DRAFT

Version 0.5 – April 22, 2015

The key ingredients are in place: the communities are ready, and several strategic research agendas have been prepared, e.g.:

META-NET SRA, MDSM SRIA, Riga Summit Declaration

Translingual Cloud

•  Enable multilingual communication through web scale platform (also: Multilingual Digital Single Market)

•  Software engineering project; “one size fits all” approach; low risk of failure; increased security and data protection

•  Web service (including APIs) that makes use of SMT methods and large data sets

•  Web service platform for LT/MT research and innovation (hybrid research, continuous development and operations)

•  Enable the testing of new methods and avant-garde approaches with very large amounts of users

•  European research and innovation platform for novel LT/MT ideas and specialised services (e.g., genres, styles, registers etc.)

•  Web service platform for human translators and LSPs

•  Enable hand-in-hand operations of MT and human translation; enable high-quality human translation

•  Establish a sustainable technological link between human and machine (e.g., via human-generated and human-annotated data sets)


Thank you!

http://www.meta-net.eu
http://www.facebook.com/META.Alliance
