D8.2b – Blueprint of Actions and Infrastructures

ECP-2007-LANG-617001

FLaReNet

Deliverable 8.2b

Blueprint of Actions and Infrastructures

eContentplus

This project is funded under the eContentplus programme¹, a multiannual Community programme to make digital content in Europe more accessible, usable and exploitable.

¹ OJ L 79, 24.3.2005, p. 1.

Deliverable number/name: D8.2b – Blueprint of Actions and Infrastructures

Dissemination level: Public

Delivery date: 15 September 2010

Status: Final

Author(s): Nicoletta Calzolari, Claudia Soria, Valeria Quochi, Núria Bel, Gerhard Budin, Tommaso Caselli, Khalid Choukri, Joseph Mariani, Monica Monachini, Jan Odijk, Stelios Piperidis

Table of Contents

1 INTRODUCTION: LANGUAGE RESOURCES AND TECHNOLOGIES: A BLUEPRINT FOR ACTION IN EU AND BEYOND

2 RECOMMENDATIONS AT A GLANCE

DIRECTION 1: INFRASTRUCTURAL ASPECTS

ISSUE 1.1: FOUNDATIONS OF AN INFRASTRUCTURE OF LANGUAGE RESOURCES AND LANGUAGE TECHNOLOGIES
ISSUE 1.2: METADATA
ISSUE 1.3: DOCUMENTATION
ISSUE 1.4: STANDARDS
ISSUE 1.5: LEGAL, IPR ISSUES
ISSUE 1.6: EVALUATION

DIRECTION 2: RESEARCH AND DEVELOPMENT

ISSUE 2.1: A REFERENCE MODEL FOR CREATING THE LANGUAGE RESOURCES OF THE FUTURE

DIRECTION 3: POLITICAL AND STRATEGIC DIMENSIONS

ISSUE 3.1: (INTERNATIONAL) COOPERATION
ISSUE 3.2: FUNDING AGENCIES POLICIES
ISSUE 3.3: LR CITATION

1 Introduction: Language Resources and Technologies: a Blueprint of Actions in EU and beyond

This document is the second in the series of Blueprints of Actions and Infrastructures, and consists of a substantial revision of the previous Blueprint. The document synthesizes the outcomes of the discussions held both within and outside FLaReNet during the second year of activity and integrates the achievements of the first year. It outlines a preliminary picture of where we are now in the field of language resources (LRs), of where we wish to be in five to ten years’ time and how we could get there.

The content of this document has to be understood as the expression of discussions and ideas of the community revolving around LRs. Of course, some individual positions may not be reflected as our goal here is mainly to express shared visions for constructing our future.

As with the first document, the addressees of this Blueprint belong to a large set of players and stakeholders in Language Resources and Technologies (LRTs), ranging from individuals to research and education institutions, to policy-makers, funding agencies, SMEs and large companies, service and media providers. Its main goal is thus to serve as an instrument (or “blueprint”) to support organizations, institutions, funding agencies, companies, and individuals in planning for and addressing the urgencies of the LRTs of the future. The recommendations contained in the present document should therefore be taken into account by any player, whether on a European, national, local, or private level, wishing to draft a program of activities for their own communities.

It covers a broad range of topics and activities, spanning the production and use of language resources, licensing, maintenance and preservation issues, infrastructures for information and resource identification and sharing, evaluation and validation, interoperability and policy issues.

The Language Resources under consideration here cover all language data sets and basic tools, including speech collections, textual corpora, images, audio-visual recordings, sign language collections, taggers, parsers, annotation tools, etc.

Recognizing that the development of the sector of LRTs is conditioned by various factors, all interested stakeholders need to work closely together and forge partnerships to advance LRTs. FLaReNet has already tackled this issue by bringing together different stakeholders, who have started discussing several key topics, such as the lack of infrastructures for the domain, and trying to foster joint plans, projects and roadmaps. Some early results are already visible: an EC-funded Network of Excellence, META-NET, started in 2010, with one of its main goals being the design and set-up of an infrastructure for language resource and technology sharing at large, META-SHARE. This was the main recommendation of FLaReNet in its first year.

Together, and under the umbrella of a shared view of today’s priorities, a future can be shaped in which full deployment of Language Resources and Technologies is consolidated through coordination of programs, actions and activities. While there has been considerable progress in technology developments in the last decade, the significant challenge of overcoming current fragmentation and imbalance inside the LRTs community for all languages still remains to be faced.

We consider the process of developing this Blueprint as important as the resulting document itself; it is through the act of engaging the community in discussing, challenging, justifying, and reconciling their individual and collective views, experience, and concerns that the Blueprint has come to fruition.

As such, the production process of the FLaReNet Blueprints collection is the result of a permanent and cyclical consultation that FLaReNet has initiated inside the community it represents – with more than 300 members – and outside it, through connections with neighbouring projects, associations, initiatives, funding agencies and government institutions.

As a result of this process, these Blueprints reflect a consensus on the active living interests, needs, concerns, and values of a wide spectrum of players in LRTs.

This Blueprint is organised along three main “directions”: Infrastructural Aspects, Research and Development, Political and Strategic Issues. They reflect three major development directions that can boost or hinder the growth of the field of Language Resources and Technologies.

Infrastructural aspects concern all issues related to the set-up and functioning of an infrastructure for Language Resources, such as metadata, production models, archiving, standardisation, documentation, sharing, distribution and repackaging of LRs, as well as the set-up and use of evaluation infrastructures.

R&D aspects are related to the development of new paradigms for the cost-effective production of high quality (new types of) Language Resources but also to the development and deployment of killer-applications to highlight the vital contribution of LRs.

Political and Strategic aspects regard the definition and implementation of long-term orientations to ensure the constitution of a joint European strategy on Language Resources shared by all countries, acting on multiple levels such as the consolidation of efficient infrastructures as designed by the community, the promotion of international cooperation, and securing adequate funding.

Altogether these directions are intended to contribute to the creation of a sustainable LRT ecosystem.

1.1 Plan for the Roadmap: next steps

In the third year of activity, the content of this deliverable will again be opened to the community for improvement and validation, and the community will be called on to form a consensus on the top priorities. This community consultation will result in a document that will then be sent to FLaReNet institutional members for adoption and endorsement.

Meanwhile, expert groups will be put together to develop a Roadmap on the basis of the content of this Blueprint, together with other FLaReNet specific reports and the results of new meetings. Discussions and recommendations on the various issues, challenges and actions covered by this report will be organised. Finally, at or before the next FLaReNet Forum, experts will sit together and work towards the definition of a concrete roadmap for the following five to ten years that would lead to an empowerment and improvement of the LRTs sector.

The outcomes of these expert groups will be presented in summary at the end of the Forum and then refined to feed into the final document of the series: The FLaReNet Roadmap for LRTs.

2 Recommendations at a glance

The following tables synthesize – for the three directions – the main challenges and the corresponding recommendations that have been elaborated during the second year of activity and that are briefly described in the rest of the document.

Direction: Infrastructural aspects

Issue: Foundations of an infrastructure of LRTs
  Challenge: Principles of an infrastructure
    Recommended actions:
      Create consensus on core operational features
      Guarantee agreed common BLaRK(s)
      Find ways to attract huge numbers of users
  Challenge: Sustainability of infrastructure
    Recommended actions:
      Ensure implementation of sustainable infrastructure
      Ensure sustainability of Language Resources
  Challenge: Common basic services and functionalities for enabling/increasing visibility and usability of LRTs
    Recommended actions:
      Build and share basic tools for metadata repository creation
      Exploit, reuse, build upon existing repositories and catalogues
      Develop simple mechanisms for accessing/browsing, discovery/identification, sharing

Issue: Metadata
  Challenge: Interoperability of Metadata sets
    Recommended actions:
      Set up a global infrastructure of common and uniform and/or interoperable metadata sets
  Challenge: Metadata usable both by humans and by machines
    Recommended actions:
      Create machine-understandable metadata with formal syntax and clear semantics
      Automate the process of metadata creation
      Develop structured metadata

Issue: Documentation
  Challenge: Reliable documentation of LRs according to common best practices
    Recommended actions:
      Collect all possible and existing LR documentation
      Devise and adopt a widely agreed standard documentation template for all types of resources

Issue: Standards
  Challenge: An interoperability framework for LRTs
    Recommended actions:
      Invest in standardisation activities
      Identify new mature areas for standardisation and promote joint efforts between R&D and industry
      Make standards operational and put them in use

Issue: Legal, IPR issues
  Challenge: Legal framework adequate to support sharing of resources world-wide
    Recommended actions:
      Educate the key players with basic legal know-how
      Elaborate specific, simple, and harmonised licensing solutions for data resources
      Promote the adoption of a “fair use act” (of copyrighted resources) at the European Level

Issue: Evaluation
  Challenge: A framework to take care of LRT evaluation in Europe
    Recommended actions:
      Establish common and standard Language Technology evaluation procedures
      Devise new methods for LR quality check
      Creation of an infrastructure for coordinated LRTs evaluation

Direction: R&D

Issue: A reference model for creating LRs of the future
  Challenge: Sufficient Language Resources for all languages
    Recommended actions:
      Implement BLaRKs for all languages, especially less-resourced languages
  Challenge: More high quality resources
    Recommended actions:
      Provide sustainable high-quality resources for all European languages
      Improve best practices to ensure good quality annotations
      Boost semantic annotation
      Encourage applications based on strong theoretical foundations, avoiding addressing only short-term development issues
  Challenge: Resources creation on demand and at affordable costs
    Recommended actions:
      Encourage full automation of LR-data production
      “Go green”: enforce recycling of LRs, i.e. favour re-use and re-purposing
  Challenge: Resource building through shared and/or new social means
    Recommended actions:
      Invest in Web 2.0/3.0 methods for collaborative creation and extension of high-quality LRs, also for BLaRKs creation
      Start an open community-effort initiative for a large Language Knowledge Repository
      Foster the debate and experiments on new outsourcing trends over the web

Direction: Political and Strategic dimensions

Issue: (International) Cooperation
  Challenge: World-wide cooperative and coordinated programs
    Recommended actions:
      Boost cooperation among countries and programs
      Organise community-wide cooperation among infrastructural initiatives
  Challenge: New reference models for LR production
    Recommended actions:
      Encourage shared constructions of resources as a means to achieve better coverage
      Develop a mixed-funding framework (with national/regional funders, the EC, science and industry joining forces)

Issue: Funding Agencies policies
  Challenge: Devise models to allow different types of players easy access to resources
    Recommended actions:
      Ensure that publicly funded resources are publicly available either free of charge or at a small distribution cost
      Encourage/enforce use of best practices or standards in LR production projects
      Make sustainability and sharing/distribution plans mandatory in projects concerning LR production

Issue: LR citation
  Challenge: Appropriate citation of Language Resources like traditional publications
    Recommended actions:
      Develop a standard protocol for citing language resources

Direction 1: Infrastructural Aspects

Issue 1.1: Foundations of an infrastructure of Language Resources and Language Technologies

Issue: Foundations of an infrastructure of LRTs
  Challenge: Principles of an infrastructure
    Recommended actions:
      Create consensus on core operational features
      Guarantee agreed common BLaRK(s)
      Find ways to attract huge numbers of users
  Challenge: Sustainability of infrastructure
    Recommended actions:
      Ensure implementation of sustainable infrastructure
      Ensure sustainability of Language Resources
  Challenge: Common basic services and functionalities for enabling/increasing visibility and usability of LRTs
    Recommended actions:
      Build and share basic tools for metadata repository creation
      Exploit, reuse, build upon existing repositories and catalogues
      Develop simple mechanisms for accessing/browsing, discovery/identification, sharing

At present, existing resources are very often difficult to access, for various reasons. A number of valuable and useful resources are accessible, downloadable or purchasable from different sources and in different ways. Some are available through distribution centres (notably ELRA and LDC), others from the portals of projects or associations, others directly from the web pages of the laboratories or researchers who developed the resource, and others still on request from the owner. In the current state of affairs researchers must still consult multiple catalogues with different approaches, structures and terms, wasting time and sometimes failing to find relevant LRs. In many such cases, unless potential users already know something about the resource they might want to use (its name, owner, project, etc.), it is difficult for them to discover new or unknown resources. Enabling identification and discovery of “missing” resources is a priority, together with enabling the new trend and promoting the new culture of sharing and collaborative creation of resources.

In addition, many resources are no longer available or seem to have disappeared. This is mainly because small labs and individual researchers have little interest in depositing/storing and sharing resources, chiefly for lack of incentives to do so and of rewards for researchers and their institutions. Preservation and maintenance are serious problems in the current situation. For some languages, high-level applications have been developed and then lost over the years. For example, a TTS application for Gaelic was developed in the mid-nineties, but the tool is now lost, as it was never archived outside the team that developed it (which no longer exists).

The picture is further complicated by the intellectual property rights of LRs, which are often not simple to attribute; distribution and use rights are therefore difficult to define in appropriate and safe ways.

In order to overcome these problems, a few distinct infrastructural initiatives are emerging. Without proper coordination and synergies, however, this could have the undesired effect of increasing the risk of fragmentation of the community. Another major problem is that most of today’s “infrastructures” are projects of limited duration, still without broad involvement and endorsement of the community.

Challenge 1.1.1: Principles of an infrastructure

The basic principles of an infrastructure for language resources and technologies¹ need to be established at community level, possibly bringing together and building on the various ongoing experiences.

This entails a definition of, and common agreement on, the basic criteria and dimensions for proper governance, the main addressees, and the services and functionalities (basic and advanced) of such an infrastructure, as well as the definition of the basic data and software resources that should start populating the infrastructure. Multilingual coverage, the capacity to attract providers of useful and usable resources, and an improvement of the sharing mechanisms and of the collaborative ways of working between R&D and commercial users are among the building criteria of a common infrastructure, together with the ability to offer a setting that enhances ease of commercial exploitation of resources, ease of licensing with proper governance, ease of access, ease of conversion into uniform formats, etc. An essential property is that it should be pragmatic: that is, really useful and widely used. It should also be discussed and decided whether the infrastructure should be community-driven (conceptually similar to Wikipedia or YouTube).

Recommended Actions

1 Create consensus on core operational features

Operational specifications must follow from the basic principles, such as:

Define consensual exchange formats and description requirements for resources and services.

Define (and implement) core functionalities: starting with basic ones (e.g. download, upload, search, storage, etc.) and then more advanced ones (like versioning, workflows, online shops, but also compilation of best practices, offer of services related to LRs, etc.).

2 Guarantee agreed common BLaRKs

Collect/Create Language Resources (sufficient in quantity and quality) for each language.

Collect/Develop (open source) tools for each language (esp. statistically based tools that can be trained for different languages).

¹ This is an ongoing activity within META-SHARE.

3 Find ways to attract huge numbers of users

The infrastructure’s success is tied to the extent to which it is able to attract huge numbers of users. The basic principles for the construction of the infrastructure should be defined in such a way that users can immediately understand the advantages it brings them. The benefits for the users must be properly highlighted.

Design the right infrastructure at the community level, also taking into account its impact factor, and ensure endorsement by key players.

Challenge 1.1.2: Sustainability of infrastructure

When setting up an infrastructure the community has to find ways of ensuring its sustainability, i.e. its durability and maintenance over time.

Currently the building and distribution of resources (data and software) are mainly project-oriented, a strategy that frequently leads to a loss of resources once the projects end.

Collecting and preserving knowledge, i.e. the LRs created, should be a top priority task.

The challenge here is to carefully establish rules for allowing adequate archiving and preservation without discouraging resource creation and sharing.

Recommended Actions

1 Ensure implementation of sustainable infrastructure

Sustainability has different aspects (preservation, accessibility, operability among others) that are not mutually exclusive but influence each other. A top priority is to develop an analytic model of sustainability in which extrinsic and intrinsic factors are taken into account, together with new modalities of sharing and collaborative working.

The model must be followed by appropriate initiatives towards its implementation.

2 Ensure sustainability of Language Resources

Long-term access to LRs must be a priority. This implies ensuring:

Appropriate means for data archiving and preservation, both by the production unit and off-site (e.g. archiving/data centres with longevity records).

Appropriate means for maintenance of LRs.

Sustainability of linguistic tools and resources, e.g. by requesting accessibility and usability of resources for a given time frame.

Challenge 1.1.3: Common basic services and functionalities for enabling/increasing visibility and usability of LRTs

An important challenge that needs to be faced soon is the implementation and availability through the infrastructure of basic services and functionalities that give the infrastructure and the resources wide visibility and easy access.

Recommended Actions

1 Build and share basic tools for metadata repository creation

Ensure that a “repository software package” (one that allows a repository/archive/catalogue of LRs to be set up) is made available to the LRTs community in open source/free mode.

Ensure that such a package allows for the use of common metadata schemas.

Ensure that all archiving centres using such software can be easily interconnected.

2 Exploit, reuse, build upon existing repositories and catalogues

Cataloguing efforts should be internationally coordinated. Existing repositories such as the ELRA Universal Catalogue, LDC and SHACHI must represent the common ground on which services and functionalities are to be built.

Ensure that such repositories are interoperable, in particular through mechanisms of metadata harvesting and aggregation, in order to make their records simultaneously available.

Ensure that searching/browsing functions are incorporated and the catalogues are linked and searchable from a single starting point.

Offer a package that connects such repositories with e-licensing, e-sharing, e-distribution, e-commerce, and other related services.

3 Develop simple mechanisms for accessing/browsing, discovery/identification, sharing

New means (such as the LRE Map¹) for discovery and identification of relevant LRs that are not available through existing catalogues must be designed and consolidated. The whole community must be involved in such initiatives.

Simple mechanisms for searching/accessing information about resources and browsing/exchanging/sharing resources themselves must be designed and implemented.

¹ http://www.resourcebook.eu/LreMap
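
As an illustration of the harvesting-and-aggregation mechanism recommended above, the following minimal Python sketch merges records harvested from two catalogues and searches them from a single starting point. It is only a sketch under stated assumptions: the catalogue names, the record fields (id, name, language) and the Aggregator class are hypothetical and do not describe any existing repository software.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """One harvested metadata record (field names are hypothetical)."""
    catalogue: str   # repository that supplied the record
    identifier: str  # identifier local to that repository
    name: str
    language: str


@dataclass
class Aggregator:
    """Merges records harvested from several catalogues into one index."""
    records: dict = field(default_factory=dict)

    def harvest(self, catalogue, raw_records):
        # Key on (catalogue, identifier) so that harvesting the same
        # catalogue again updates records instead of duplicating them.
        for raw in raw_records:
            rec = Record(catalogue, raw["id"], raw["name"], raw["language"])
            self.records[(rec.catalogue, rec.identifier)] = rec

    def search(self, language):
        # Single entry point over all harvested catalogues.
        matches = [r for r in self.records.values() if r.language == language]
        return sorted(matches, key=lambda r: (r.catalogue, r.identifier))


aggregator = Aggregator()
aggregator.harvest("catalogue-a", [
    {"id": "a1", "name": "Spoken corpus", "language": "it"},
    {"id": "a2", "name": "Treebank", "language": "nl"},
])
aggregator.harvest("catalogue-b", [
    {"id": "b1", "name": "Lexicon", "language": "it"},
])
italian = aggregator.search("it")
```

Keying the index on the pair (catalogue, identifier) is one possible design choice: it keeps each provider's identifiers intact while making repeated harvesting safe.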

Issue 1.2: Metadata

Issue: Metadata
  Challenge: Interoperability of metadata sets
    Recommended actions:
      Set up a global infrastructure of common and uniform and/or interoperable metadata sets
  Challenge: Metadata usable both by humans and by machines
    Recommended actions:
      Create machine-understandable metadata with formal syntax and clear semantics
      Automate the process of metadata creation
      Develop structured metadata

Metadata is a hot topic both in LRTs infrastructures, such as the one envisaged in the previous section, and at a more general level of information accessibility.

One of the main causes of the difficulty of finding resources appropriate for specific needs and languages is metadata incompatibility. Distinct sub-communities, data distribution centres, archiving institutions and projects, and the various other types of providers describe their data using their own metadata sets, which very often differ. Moreover, resources are not only described differently but also at different levels of granularity.

Another limitation of the present situation is that it is very difficult, if not impossible, to combine data from multiple sources to create new data sets for specific uses, an important functionality given that it is amply demonstrated that it is impossible to create one single good-for-all resource.

If we consider that the future internet will have a huge amount of data continuously stored and accessed, and that the amount is expected to grow tenfold in the next 4 years, it becomes clear that the current technology is not going to work and that we need to rethink everything at this scale. We need to automate as much as possible. The key words for this future scenario thus are: scalability, trust, security, privacy, manageability, accessibility, usability, and representativity. In such a scenario metadata play a fundamental role, as they are the key technology for digital object management: discovery, composition, maintenance. So, the future internet will consist of services heavily depending on metadata. At present, however, most metadata sets are machine-readable but not machine-understandable.

Challenge 1.2.1: Interoperability of Metadata sets

The first priority and challenge in this field is then to make efforts towards a real interoperability of metadata sets. Because of multiple metadata sets and the proliferation of search engines, harmonisation is a central problem that requires community-wide discussion. Wide efforts to describe and catalogue language resources, such as the ELRA Universal Catalogue and the ELRA Catalogue¹, the LDC catalogue² and the collaboratively built LRE Map³, are good starting points. Abstract metadata hosting should become a reality, with different compliant catalogues displaying different views (records and fields) of the merged data.

¹ http://universal.elra.info/search.php, http://catalog.elra.info/search.php

² http://www.ldc.upenn.edu/Catalog/

³ http://www.resourcebook.eu/LreMap

Moreover, metadata must be open, freely available to everyone.

Recommended Actions

1 Set up a global infrastructure of common and uniform and/or interoperable metadata sets

Agree on a core set of metadata for LRTs: harmonise metadata and descriptions; implement a standard set of well-documented metadata types, tags and relations (consisting of a super-set of existing metadata elements, instead of the usual common small set à la OLAC).

Ensure that large data centres are involved in the definition of such a common metadata set and adopt it.

Encourage the development of sets of metadata that allow description of resource portions (and hence pave the way to combining parts of existing resources to build new ones).

Reinforce community-wide collaborative initiatives (like the LRE Map) encouraging individual involvement in resource description through a common set of metadata.
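
To make the idea of a common super-set of metadata elements more concrete, here is a minimal Python sketch that maps records from two providers onto one shared set of fields. All names in it are assumptions introduced for illustration: the providers, their field names and the common field names are hypothetical and are not taken from OLAC, ELRA, LDC or any other existing schema.

```python
# Hypothetical provider-specific field names mapped onto common ones.
FIELD_MAPPINGS = {
    "provider_a": {"title": "name", "lang": "language", "kind": "resource_type"},
    "provider_b": {"resourceName": "name", "iso639": "language", "licence": "licence"},
}

# The common set is the union (super-set) of every element that any
# provider can express, rather than only the elements they all share.
COMMON_FIELDS = sorted(
    {common for mapping in FIELD_MAPPINGS.values() for common in mapping.values()}
)


def to_common(provider, record):
    """Translate one provider record into the common super-set schema."""
    mapping = FIELD_MAPPINGS[provider]
    # Elements the provider does not supply stay explicit as None.
    common = {name: None for name in COMMON_FIELDS}
    for source_field, value in record.items():
        if source_field in mapping:
            common[mapping[source_field]] = value
    return common


converted = to_common(
    "provider_a", {"title": "Treebank", "lang": "nl", "kind": "corpus"}
)
```

With a super-set, no provider loses information in translation; the cost is that many records carry empty elements, which a small common core would avoid.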

Challenge 1.2.2: Metadata usable both by humans and by machines It is important to establish a change in metadata culture and push toward the creation of machine-understandable metadata, i.e. pieces of information about (esp. digital) resources that can be automatically processed. This will make metadata browsable/accessible from various tools and for various purposes. The development of techniques to automate the process of metadata creation would also help to spread the adoption of machine-understandable metadata.

Recommended Actions

1 Create machine-understandable metadata with formal syntax and clear semantics

Since metadata are the key component for the discovery, composition and management of digital objects, they must have a formal syntax and declared semantics (at least first-order logic).

2 Automate the process of metadata creation

Promote research on the automation of the process of metadata creation.

3 Develop structured metadata

Build structured metadata sets, with relations among them.
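As a minimal sketch of what structured, machine-checkable metadata with relations could look like, the example below defines typed records and a consistency check. The class names, the relation vocabulary (`isPartOf`, `isAnnotationOf`, `isVersionOf`) and the sample identifiers are assumptions made for illustration; the check covers syntax and referential integrity only and does not amount to formally declared semantics.

```python
from dataclasses import dataclass, field

# Hypothetical closed vocabulary of relation types between resources.
RELATION_TYPES = {"isPartOf", "isAnnotationOf", "isVersionOf"}


@dataclass(frozen=True)
class Relation:
    kind: str    # expected to be one of RELATION_TYPES
    target: str  # identifier of the related resource


@dataclass
class ResourceMetadata:
    identifier: str
    resource_type: str
    languages: list
    relations: list = field(default_factory=list)


def validate(records):
    """Return a list of problems; an empty list means the set is consistent."""
    known = {r.identifier for r in records}
    problems = []
    for r in records:
        if not r.languages:
            problems.append(f"{r.identifier}: no language declared")
        for rel in r.relations:
            if rel.kind not in RELATION_TYPES:
                problems.append(f"{r.identifier}: unknown relation {rel.kind}")
            if rel.target not in known:
                problems.append(f"{r.identifier}: dangling target {rel.target}")
    return problems


corpus = ResourceMetadata("corpus-1", "corpus", ["en"])
layer = ResourceMetadata(
    "layer-1", "annotation", ["en"],
    relations=[Relation("isAnnotationOf", "corpus-1")],
)
broken = ResourceMetadata(
    "layer-2", "annotation", [],
    relations=[Relation("isAnnotationOf", "corpus-9")],
)

print(validate([corpus, layer]))          # consistent set
print(validate([corpus, layer, broken]))  # reports two problems for layer-2
```

Because relations are explicit and typed, a tool can follow them automatically, for instance to find every annotation layer attached to a given corpus.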

Issue 1.3: Documentation

Issue: Documentation

Challenge: Reliable documentation of LRs according to common best practices

Recommended actions: (1) Collect all possible and existing LR documentation; (2) Devise and adopt a widely agreed standard documentation template for all types of resources

Documentation is what makes language resources usable by others than those who created/designed them, and it is what makes it possible to build new resources compatible with best practices/specifications. Documentation must include information about data format and data content, focusing also on the context of production and on actual/possible applications.

In practice, however, LRs are too often poorly documented, or not documented at all. When documentation is available, it is not easy to find.

Users, however, need information to:

a) Find a resource and assess its usefulness for a given application

b) Understand the production process, its use of best practices and its intended exploitation

c) Assess the quality of a resource

d) Replicate processes and results

e) Deal with idiosyncrasies or documented errors.

Machines also need (machine-understandable) information to:

a) Discover and compare resources

b) Validate formats and annotations

c) Appropriately process annotations

d) Possibly retrieve relevant parts of a resource for a given use

e) Other new usages not yet foreseen

Guidelines are useful instruments that enable replication and extension of resources. They may redirect language resource efforts in a more coherent way (see the experience in SPEECHDAT (www.speechdat.org) and the usage of its guidelines to develop similar resources for other languages well past the end of the project). Exhaustive and reliable guidelines should therefore exist for every resource type, building on the experience of successful projects.

Finally, even when documentation is provided, practices vary greatly, thus hampering dissemination and replication, as well as readability and comparability.

Challenge 1.3.1: Reliable documentation of LRs according to common best practices

Common best practices for writing documentation and guidelines need to be established and enforced. There is thus a need to develop standard specifications for the documentation of resources.

Recommended Actions

1 Collect all possible and existing LR documentation

An effort must be made to collect all possible and existing LR documentation and make it easily available. To this end, design and build a (virtual) repository of specifications, guidelines, and documentation of LRs, starting with reference resource models or widely known and used resources (e.g. WordNet, Penn Treebank, …).

2 Devise and adopt a widely agreed standard documentation template for all types of resources

Given that documentation (as well as metadata) is part of the practice of LR creation, a common documentation template should be defined, and subsequently promoted at large and enforced in contracts for publicly funded projects.

Documentation should include:

a) A high-level description giving the interested non-expert reader a good idea of what is in the resource, including general information such as owner/copyright holder, format and encoding of the data and the files, language(s), domains, intended applications, applications in which the data was used, and quality details such as a basic quality assessment (in particular for availability/reliability of the encoded information).

b) Information on the theoretical framework, background, and/or the “philosophy” of the resource.

c) Specification of the methodology used to create the resource, specific enough to enable others to replicate the process.

d) Annotation specification (with data categories and their semantics) and annotation guidelines, i.e. guidelines used by annotators.

e) Information on adherence to standards (at all levels: production, annotation, validation, etc.).

f) Specification of the methodology or guidelines used to assess the quality of the resource (if validation is conducted) and the report on such validation.

g) Estimate of the effort required to create the resource (in any reproducible unit, e.g. person-months).
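A common template also makes it possible to check documentation automatically for completeness. The sketch below assumes one section per item a) to g) above; the section keys, the function `missing_sections` and the sample documentation record are hypothetical and serve only to illustrate the idea.

```python
# Section keys paraphrase items a) to g) of the list above; the key names are hypothetical.
TEMPLATE_SECTIONS = [
    "high_level_description",
    "theoretical_framework",
    "creation_methodology",
    "annotation_specification",
    "standards_adherence",
    "quality_assessment",
    "effort_estimate",
]


def missing_sections(documentation):
    """List template sections that are absent or left empty."""
    return [
        section
        for section in TEMPLATE_SECTIONS
        if not str(documentation.get(section, "")).strip()
    ]


# Hypothetical, partially filled documentation record.
example = {
    "high_level_description": "Hypothetical corpus of news text.",
    "creation_methodology": "Collected and annotated manually.",
    "effort_estimate": "12 person-months",
    "standards_adherence": "   ",  # present but empty, so reported as missing
}

print(missing_sections(example))
```

A check of this kind could be run by a repository at deposit time or by a funder at project review. It verifies only that each section is present, not that its content is adequate.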

Issue 1.4: Standards

Issue: Standards

Challenge: An interoperability framework for LRTs

Recommended actions: (1) Invest in standardisation activities; (2) Identify new mature areas for standardisation and promote joint efforts between R&D and industry; (3) Make standards operational and put them in use

Challenge 1.4.1: An interoperability framework for LRTs

Standards are the key to resource sharing, re-usability, maintainability and long-term preservation. However, there still seems to be little understanding of why they are needed for the representation of data, and especially of the advantages their adoption might bring. Basic standardisation is thus lacking for many types of resources and many levels of information/annotations.

The fact that most existing resources have their own, different representation formats and conventions makes their use by others difficult: one first has to study and understand the format, and then build ad hoc conversions in order to use the data in one's own activities. This especially hampers a scenario in which different sources are exploited to build resources on demand, a scenario that current and near-future web technologies would otherwise make possible. The lack of proper standardisation also makes it difficult to evaluate the quality and particular value of resources for a given application. Moreover, the requirements of so-called Less-Resourced Languages make basic standardisation even more crucial than for other languages.

A solution to these problems would be to work towards the establishment of a wide and agreed framework for interoperability of language resources and language technologies, seriously involving industry. Awareness should be spread that standards are necessary for resource producers/managers to 'join the club' of open access, which is becoming a trend, and to increase the utilization of their resources, consequently bringing them visibility, more users and therefore further funding.

The community and funding agencies need to join forces in order to push the use of standards, both existing ones and those in an advanced state of preparation, at least for the areas in which some degree of consensus has been or can be reached (e.g. external descriptive metadata, meta models, POS and morpho-syntactic information, etc.). Only their actual use would lead to useful feedback and thus substantial improvement and advancement of the field.

The enforcement of the use of standards, however, cannot be done purely top-down. It should instead be accompanied by a specific view on, and contribution by, the different user communities. Moreover, as most users may not be interested in knowing that they are using standards, standards should come equipped with tools that help users apply them while hiding most technicalities. Standards should operate in the background and be “inherent” to the language technology tools, or more generic tools, that users work with.

At the same time, there should be a regular analysis of new areas “mature” enough for starting a standardisation initiative.

Recommended Actions

1 Invest in standardisation activities

Enforce and promote the use of standards at all stages: from basic standardisation for less-resourced languages (such as orthography normalization, transcription of oral data, etc.) to more complex areas (such as syntax, semantics, etc.).

2 Identify new mature areas for standardisation and promote joint efforts between R&D and industry

Industry involvement in standardisation initiatives is essential to ensure wide adoption of standards. A joint effort between academia and industry should also be promoted in order to identify new areas that are mature for standardisation activities (such as semantic roles and spatial language).

3 Make standards operational and put them in use

It is not enough to define standards; it is also important to have “usable” standards and to provide tools that enable their use in a transparent way.

Issue 1.5: Legal, IPR issues

Issue: Legal, IPR issues

Challenge: Legal framework adequate to support sharing of resources world-wide

Recommended actions: (1) Educate the key players with basic legal know-how; (2) Elaborate specific, simple, and harmonised licensing solutions for data resources; (3) Promote the adoption of a “fair use act” (of copyrighted resources) at the European Level

Challenge 1.5.1: Legal framework adequate to support sharing of resources world-wide

IPR issues are among the key factors that may facilitate or hamper the evolution of our sector. This is a very delicate matter. On the one hand, IPR, especially authorship, needs to be protected; on the other, the accessibility restrictions it imposes often prevent real usability of language resources.

The handling of legal issues, especially across countries, is not yet adequate to support the sharing of resources. While LRs need to be legally protected against improper reuse, copying, modification, etc., the legislation on sharing resources currently differs from country to country, even within the EU. The Berne Convention for the Protection of Literary and Artistic Works extends copyright protection to creators in countries other than their own, but its enforcement is still a national matter and is therefore implemented in different ways.

The availability of huge quantities of data on the web, and of technologies that make it possible to discover, download and collect them into useful resources, now creates a novel situation that poses even more problems from the legal perspective. Legislation is lagging behind technology. The current general trend is toward a free culture with less powerful rights holders. Creative Commons, for example, seems to be the most used license type for linguistic resources (see e.g. Google, Wikipedia, Whitehouse.gov, Public Library of Science, Flickr). There is also a debate on the open source approach, which has strong advantages especially for software products, while it is possibly less adequate for data. On the one hand, open source data could allow enterprises to easily exploit data in new applications without legal complications; on the other, there would be less control over the resource, so the resources used might be of low quality. Additionally, open source options are not suitable for LRs with proprietary content, because they imply source code sharing and redistribution.

The LR community now also faces important questions about how to use blogs, newsgroups, web video, SMS and social networking sites, but there are virtually no laws, regulations or court decisions governing these modalities. As a result, most resources are kept in-house for internal research or use only, or individuals and organisations risk lawsuits because of some kind of infringement.

The challenge both for the LR & LT community and for policy makers here is to push the development of a common legal framework that would facilitate sharing of resources and efforts without breaking laws.

Recommended Actions

1 Educate the key players with basic legal know-how

It is crucial that some legal knowledge is disseminated as part of the education of all (major) players in the LRT area.

It is also crucial to inform a number of lawyers about the community concerns so they can develop adequate frameworks to address such issues.

Moreover, it is important that such legal experts are asked to intervene in the initial phases of resource production, to ensure that all legal aspects, as well as ethical, privacy and other related aspects, are taken into consideration with a view to long-term sharing and distribution.

2 Elaborate specific, simple, and harmonised licensing solutions for data resources

Avoid one-size-fits-all solutions: study different ad hoc licensing solutions for different resource types and user sub-communities. For instance, making use of existing raw data such as websites, blogs, and other data that can be easily collected without the clear consent of the rights holders poses new problems of copyright infringement. This may poison the relationship between our community and publishers, media agencies, broadcasters, newspapers, bloggers, and other data aggregators.

In addition, a large number of licensing schemas are in use today. Some are backed by strong players (ELRA, LDC, open source communities such as Creative Commons, GPL, etc.), while others are drafted bilaterally, in some cases by the legal departments of data providers. It is crucial that such licensing is harmonised and even standardised.

Simplify the licensing schemas through well agreed-upon solutions for R&D and for Industry.

Adopt electronic licensing (e-licenses) and adapt current distribution models to new media (web, mobile devices, etc.).

For mixed-funded initiatives (private/public), ensure that an agreement to make resources available at fair market conditions is designed from the start.

3 Promote the adoption of a “fair use act” (of copyrighted resources) at the European Level

LRs must be legally protected against improper reuse, copy, modification etc. However, the legislation about sharing resources is currently different in every country, even within the EU.

Europe is lagging behind the US and Japan regarding the use of copyrighted resources for research purposes without the copyright holder's consent (known in US law as “the fair use of a copyrighted work”). Such use is not considered an infringement of copyright as long as it serves non-profit research or educational purposes, with the assessment also considering the “amount and substantiality of the portion used in relation to the copyrighted work as a whole; and the effect of the use upon the potential market for or value of the copyrighted work”.

Such regulation is already in place in the US and is being debated within the Japanese institutions for adoption by the end of 2010.

The paradox with the “Research Fair use” act is that players located in the US or Japan will have access to resources all over the world without any copyright infringement, while Europeans will be prevented from such access. This may affect the development of HLT within Europe.

Issue 1.6: Evaluation

Issue: Evaluation

Challenge: A framework to take care of LRTs evaluation in Europe

Recommended actions: (1) Establish common and standard Language Technology evaluation procedures; (2) Devise new methods for LR quality check; (3) Creation of an infrastructure for coordinated LRTs evaluation

Challenge 1.6.1: A framework to take care of LRTs evaluation in Europe

Evaluation in Europe is currently taken care of by individual institutions (such as ELDA, CELCT) and by short-term projects (e.g. the TC-STAR and CHIL campaigns), but no stable coordination exists at the international (EU) level, as there is in the US or in Japan (NIST and NII). In some specific areas, the community may organise itself to conduct regular evaluations (e.g. CLEF 2000-2010, Semeval), although with limited funding and a lot of community goodwill.

In the US, by contrast, the role that NIST plays in coordinating technology evaluation is very important, since it allows control over the LRs produced and streamlines the research and development of applications that may have interesting commercial developments.

Evaluation should encompass different topics: technologies, resources, guidelines and documentation. However, evaluation evolves constantly, just as the technologies do, and is therefore often not stable (e.g. for Machine Translation). For example, new and more specific measures based on innovative methodologies are required, especially for evaluating the reliability of semantic annotations.

Current evaluation campaigns sometimes create rather artificial settings in order to be 'academically clean', which makes the tasks not very realistic. One of the most critical challenges is to devise and set up new types of campaigns, possibly based on task-based evaluation. For practical purposes, guidelines and lists of dos and don'ts would be helpful.

Recommended Actions

1 Establish common and standard Language Technology evaluation procedures

Promote LRT evaluation as a major topic for research (research on metrics, methodologies, etc.), for instance proposing this topic as a subject for PhD research.

Promote the dissemination of information through LRT evaluation portals (e.g. the ELRA HLT evaluation portal, http://www.hlt-evaluation.org/).

Ensure proper distribution of evaluation packages.

2 Devise new methods for LR quality check

Define new agreed methods for validation of LRs.

Ensure that a validation procedure is specified before the LR production phase starts.

Ensure that, for existing resources, at least a quick quality check is conducted.

3 Creation of an infrastructure for coordinated LRTs evaluation

Draw a clear roadmap for LRT evaluation for Europe with the most promising technologies.

Promote the identification of the existing evaluations and of the missing ones (e.g. languages that are not covered), and make available an estimate of the cost of an evaluation campaign.

Ensure that key technologies are evaluated, at least every 24 months, within publicly funded projects (using the right funding instruments e.g. public procurement with 100% funding schemes).

Ensure that all publicly funded evaluation projects deliver at least an evaluation package (resources, metrics, methodologies, evaluation reports, and descriptions of participating systems reflecting the state of the art).

Promote the use of web-services as platforms for HLT evaluation (ensure that such web-services are maintained and kept online).

Promote international cooperation within the evaluation campaigns.

Encourage the set up and/or consolidation of HLT evaluation centres.

Direction 2: Research and Development

Direction: R&D

Issue: A reference model for creating LRs of the future

Challenge: Sufficient Language Resources for all languages

Recommended actions: (1) Implement BLaRKs for all languages, especially less-resourced languages

Challenge: More high quality resources

Recommended actions: (1) Provide sustainable high-quality resources for all European languages; (2) Improve best practices to ensure good quality annotations; (3) Boost semantic annotation; (4) Encourage applications based on strong theoretical foundations, avoiding addressing only short-term development issues

Challenge: Resources creation on demand and at affordable costs

Recommended actions: (1) Encourage full automation of LR-data production; (2) “Go green”: enforce recycling of LRs, i.e. favour re-use and re-purposing

Challenge: Resource building through shared and/or new social means

Recommended actions: (1) Invest in Web 2.0/3.0 methods for collaborative creation and extension of high-quality LRs, also for BLaRKs creation; (2) Start an open community-effort initiative for a large Language Knowledge Repository; (3) Foster the debate and experiments on new outsourcing trends over the web

Issue 2.1: A reference model for creating the Language Resources of the future

Despite the vast amount of both academic and industrial investment, existing and available resources are not enough to satisfy the various needs of all the different languages. The shortage is both quantitative and qualitative: we are missing the right amount of resources, of the right type and of adequate quality. Since language resources are costly, it is necessary to start preparing now the resources that will serve the applications of the future and can have a positive impact on the development of multilingual technologies such as Machine Translation, cross-lingual and Web 3.0 applications.

The importance of proper management of the “life cycle” of language resource creation has attracted little attention and has been largely overlooked in our community.

A reference model for creating the Language Resources of the future is thus necessary, addressing the current shortage of resources in breadth (languages and applications) and depth (quality and amount of data). The reference model must also include an estimate of the production costs.

Challenge 2.1.1: Sufficient Language Resources for all languages

Universal Linguistic Rights require the provision of language services for all people in their own mother tongue. Allocating funding to cover all languages (also the less-well represented languages of the world) remains a high priority for ensuring multilingual applications in the future, and therefore language resources for all (also less-resourced) languages must be developed. At the same time, it must be borne in mind that many undocumented languages that represent our cultural legacy may become extinct in the digital age. Minority and fringe languages should be comprehensively documented in the form of spoken and written corpora. Similarly, manuscripts should be digitized.

Recommended Actions

1 Implement BLaRKs for all languages, especially less-resourced languages

It is highly recommended that BLaRKs are implemented for all languages, with priority given to the less-resourced ones. First, the BLaRK concept needs to be worked out in detail, raising it to the level of a standard.

Second, regular BLaRK surveys must be conducted in order to yield a clear picture of technology evolution trends, to establish – and regularly update – a Roadmap with a focus on all aspects of LRTs.

Finally, the production of resources should be funded according to BLaRK-like criteria, i.e. giving priority to the development of missing types of resources for each language.
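Funding according to BLaRK-like criteria amounts to a gap analysis over a matrix of languages and resource types. The sketch below illustrates that reasoning only: the list of required resource types, the language labels and the inventory are hypothetical placeholders, and ranking languages by the number of missing types is one possible prioritisation assumed here, not a criterion stated in this document.

```python
# Hypothetical resource types and inventory; a real BLaRK defines its own matrix.
REQUIRED_TYPES = ["monolingual corpus", "lexicon", "POS tagger", "speech corpus"]

INVENTORY = {
    "language A": {"monolingual corpus", "lexicon", "POS tagger"},
    "language B": {"monolingual corpus"},
    "language C": set(),
}


def gaps(inventory, required):
    """Map each language to the required resource types it still lacks."""
    return {
        language: [t for t in required if t not in available]
        for language, available in inventory.items()
    }


def priority_order(inventory, required):
    """Languages with the most missing resource types come first."""
    missing = gaps(inventory, required)
    return sorted(missing, key=lambda lang: (-len(missing[lang]), lang))


print(gaps(INVENTORY, REQUIRED_TYPES)["language B"])
print(priority_order(INVENTORY, REQUIRED_TYPES))
```

Rerunning such an analysis after each regular survey would show how the gaps evolve over time and which resource types remain missing for each language.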

Challenge 2.1.2: More high quality resources

High quality resources should be regarded as a key booster for the deployment of effective technology that impacts large sectors of activities (e-content, media, health, automotive, telecoms, etc.). In addition, human language technology may help as a facilitator for the effective documentation of the linguistic and cultural heritage of humanity.

Recommended Actions

1 Provide sustainable high-quality resources for all European languages

Work on the production of a durable reserve of “high quality” resources for all European languages, at least for the types mentioned in BLaRKs.

Secure funding and investment for the most sensitive and innovative resources, for instance for 2010-2015.

Examples of key resources are:

Aligned textual corpora for Machine Translation development (areas and genres different from the EC Parliament and institutions jargon, JRC-Acquis, EuroParl)

Broadcast news (audiovisual) with high quality annotations (transcriptions, speaker and turn identification, audio-visual summarisations, etc.), the same resources with subtitles, dubbing, etc.

High quality annotated corpora (from named entities to deep semantic annotations)

Corpora for evaluation (Information Retrieval, Question answering, Machine translation, speech transcription, etc.)

A European corpus of sign languages (for translation purposes)

LRTs for new trends, such as emotion detection, sentiment, opinion analysis, etc.

Multimodal corpora

2 Improve best practices to ensure good quality annotations

Updated best practices are needed to ensure good quality annotations. The community should be made aware of the “success stories” in annotation, and knowledge about best practices should be made readily and easily available.

3 Boost semantic annotation

Boost more sophisticated resources with multiple layers of deeper semantic annotation.

Explore high-level semantic annotation formalisms that build upon more surface-based annotation levels. Make it possible to combine and extract this information on demand. New and more specific measures based on innovative methodologies for evaluating the reliability of semantic annotations are required.

4 Encourage applications based on a strong theoretical foundation, avoiding addressing only short-term development issues

Avoid addressing only short-term development of a specific product or service for a language (as a kind of simple toy). Instead, favour demonstrating applications based on a strong foundation.

Support both mid-term applied research and long-term basic research on all issues related to LR design, production and quality assessment.

Challenge 2.1.3: Resource creation on demand and at affordable costs

The need for large LRs encoding complex, high-level and high-quality information is indisputable for the advancement of the LRT field. The high cost of their production, both in terms of time and manpower, hampers their creation. Automation methods and tools are to be established and integrated in the process of LR production. Similarly, in order to reduce costs but also to encourage reusability of resources, it is important that resources be recycled as much as possible.

Recommended Actions

1 Encourage full automation of LR-data production

The coverage problem is of such a nature that strategies approaching or envisaging the full automation of (high quality) LR-data production have to be promoted. Existing tools and new automation techniques must be improved, especially for higher-level tasks such as semantic or content-related and multilingual ones. In parallel, evaluation in real-life applications is to be fostered so that research can approach progressively the characteristics of the materials needed by the industry in size and granularity of the information contained.

2 “Go green”: enforce recycling of LRs, encourage re-use and re-purposing

The creation of new resources from scratch should be discouraged when resources are already available for a given language and/or application: re-use and re-purposing should be promoted instead. A “recycling” culture should be developed, in terms of reuse of development methods, existing tools, use of translation/transliteration tools, etc. The experience gained for one language can be used to process others. It is encouraging to see high-level applications for Less-Resourced Languages (instead of just the usual “taggers”): for instance, ASR for Amharic may pave the way to technical approaches for designing baseline systems for these Less-Resourced Languages.

Similarly, most language technologists use existing language resources as input and create, as by-products, materials that could become useful language resources for others, but few of them make these resources available for others at the end of the production cycle.

Challenge 2.1.4: Resource building through shared and/or new social means

Given the high cost of language resource production, the power of social/collaborative means to build resources should be considered, in particular for those languages for which expert-built language resources do not exist. This is particularly important for less-resourced languages, so that the related language technology can advance rapidly and help minority language speakers to access education and the Information Society. Shared or social means of building resources seem well suited to collecting raw data that are crucial for the development of LT applications, and they are increasingly being used also to annotate data. However, it is still far from clear whether all types of LRs can be obtained collaboratively by using naïve annotators; more research is needed on this topic, addressing both technical and ethical aspects.

Recommended Actions

1 Invest in Web 2.0/3.0 methods for collaborative creation and extension of high-quality LRs, also for BLaRKs creation

An investment in Web 2.0/3.0 methods for collaborative creation, extension and annotation of high-quality LRs should be made. In particular, acquisition of raw data, regardless of quality, could readily be achieved by engaging massive amounts of people through social means. The investment should also consider an accurate analysis of the respective quality and content of collaboratively-built and expert-built resources.

2 Start an open community-effort initiative for a large Language Knowledge Repository

All existing and available resources and sources are insufficient to solve the problem of creating free and complete resources for the world's languages, even for those that have some web presence. One proposal is to build a Web 2.0 site that uses the same community computing power that generates millions of blogs to solve the problem of creating basic language resources, and their annotations, for all the world’s languages, starting with the 446 present on the web.

3 Foster the debate and experiments on new outsourcing trends over the web

Since in many cases manual work cannot be avoided, e.g. if accurate models are required or if a reliable system evaluation is planned, a new trend of outsourcing jobs (e.g. speech recording, text translation, text annotation) over the web is gradually emerging (following the Amazon Mechanical Turk paradigm). This paradigm seems to allow a drastic reduction in the cost of producing LRs: given some post-processing, human expert quality can be achieved. However, this may raise ethical, sociological and practical issues for the community. Conclusive reflection and experimentation on this problem must be supported in the community.
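As an illustration of the kind of post-processing mentioned above, the following minimal Python sketch aggregates redundant crowd judgements by majority vote and sets aside items on which annotators disagree, so that they can be passed to an expert. It is only one possible approach, not a method prescribed by this document; the function name, the agreement threshold and the sample labels are assumptions made for the example.

```python
from collections import Counter


def aggregate_labels(judgements, min_agreement=0.6):
    """Majority-vote aggregation of crowd annotations (illustrative sketch).

    judgements: dict mapping an item id to the list of labels given by
    different non-expert annotators. The 0.6 threshold is an assumption.
    Returns (accepted, disputed): accepted maps item ids to the winning
    label; disputed lists item ids that need expert review.
    """
    accepted, disputed = {}, []
    for item_id, labels in judgements.items():
        if not labels:
            disputed.append(item_id)
            continue
        # Most frequent label and the number of annotators who chose it.
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[item_id] = label
        else:
            disputed.append(item_id)
    return accepted, disputed


# Hypothetical part-of-speech judgements from three annotators per token.
judgements = {
    "tok-1": ["NOUN", "NOUN", "VERB"],
    "tok-2": ["ADJ", "NOUN", "VERB"],
    "tok-3": ["VERB", "VERB", "VERB"],
}
accepted, disputed = aggregate_labels(judgements)
```

Items that end up in the disputed list are the ones where expert time is best spent, which is how redundancy plus limited expert review can keep costs low.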


Direction 3: Political and Strategic dimensions

Issue 3.1: (International) Cooperation

Issue: (International) Cooperation

Challenge: World-wide cooperative and coordinated programs
Recommended actions: Boost cooperation among countries and programs; Organise community-wide cooperation among infrastructural initiatives

Challenge: New reference models for LR production
Recommended actions: Encourage shared constructions of resources as a means to achieve better coverage; Develop a mixed-funding framework (with national/regional funders, the EC, science and industry joining forces)

Challenge 3.1.1: World-wide cooperative and coordinated programs

Cooperation among countries and programs is essential (in particular on infrastructural issues) so as to make the field advance in a coordinated way and at the same time to avoid duplication of efforts and fragmentation.

A coordinated effort at the international level would greatly help by providing less advanced countries/languages with examples and best practices, such as the definition of a commonly agreed basic set of Language Resources which have already been proven necessary to correctly produce the corresponding technologies for other better represented languages. Such international effort should also aim at dynamically identifying the gaps and defining the roadmap to fill them.

International cooperation among infrastructural initiatives is important for three reasons: first, to avoid duplicating efforts; second, to make standards truly international; third, to encourage the free exchange of ideas.

Recommended Actions

1 Boost cooperation among countries and programs

Cooperation among countries and programs must be encouraged and promoted at all levels, not just within Europe.

Consider and discuss the creation of a network of funding agencies, in relationship with the EC, in order to develop a joint approach to support LRTs.

2 Organise community-wide cooperation among infrastructural initiatives

Current – and future – infrastructural initiatives should not proceed in isolation. Instead, coordination among ongoing and future similar initiatives should be required for every infrastructure.


To this end, conduct an inventory of existing infrastructures, to identify synergies, duplications, gaps and shortages.

Challenge 3.1.2: New reference models for LR production

Another area of long-term intervention at the global, transnational level is the creation of an ecosystem of LRs, both data and tools.

The current model of the LR-data market is mostly based on purely competitive terms. This leads to non-adherence to standards, duplication of work and effort, etc.

Not only academia but also the LRT industry has to undergo cultural changes and recognise the high added value of participating in the creation of common pools of LR-data. Such a change requires all stakeholders to move in unison, the creation of rules and guidelines for new forms of cooperation, and the fostering of a culture of mutual respect and fairness. Actions towards fostering such a change are a first priority for the field.

New models for the production and sharing of LRs have to be devised, and at present collaborative strategies seem the most promising ones. In this respect, the TAUS case, where translation memories are shared/exchanged among the association's members, is a success story that offers lessons for any kind of language resource.

Recommended Actions

1 Encourage shared constructions of resources as a means to achieve better coverage

It is foreseen that the collaborative accumulation/creation of data is the best and most practicable way to achieve better and faster language coverage and that, in purely economic terms, the return on investment could be greater than expected.

2 Develop a mixed-funding framework (with national/regional funders, the EC, science and industry joining forces)

To this end, a mixed-funding framework in which national/regional funders, the EC, science and industry can join forces is considered more appropriate than single-source funding.

For instance, set up a specific European investment fund for the production of LRs as a public-private partnership.

Design accurate roadmaps for LR production with the strong involvement of local players but also with endorsement by international organisations (UNESCO, UN, etc.).


Issue 3.2: Funding Agencies Policies

Issue: Funding Agencies policies

Challenge: Devise models to allow different types of players easy access to resources
Recommended actions: Ensure that publicly funded resources are publicly available either free of charge or at a small distribution cost; Encourage/enforce use of best practices or standards in LR production projects; Make sustainability and sharing/distribution plans mandatory in projects concerning LR production

Challenge 3.2.1: Devise models to allow different types of players easy access to resources

At present, the usability of resources is limited by restrictions on access to them. An effort is needed to make resources more easily accessible. “Easy access” means many different things. It means that users can be granted exploitation rights for a resource, or that a resource can be shared/distributed free of charge, or even that a resource can be easily repackaged and repurposed and the derived “product” then made available as well.

Obviously, this holds for all language resources, but it is particularly crucial for “basic” language resources that represent the developmental foundations for many language technology applications.

The development of such resources is expensive, and it is not feasible for each group that needs such a resource to pay for it or to develop it by itself.

Language resources should be considered an important part of the cultural assets and heritage of Europe and, for this reason, public funders should take care of their development, maintenance and sustainability whenever the private sector fails to play such a role.

Recommended Actions

1 Ensure that publicly funded resources are made publicly available either free of charge or at a small distribution cost

Funding agencies can play a valuable role in facilitating access to Language Resources, for instance by ensuring that publicly funded resources are shared/made publicly available, either free of charge or at a small distribution fee. A common feeling is that the whole of society should freely benefit from the resources developed with public money.

For European and National projects developing LRs, for instance, it must be made mandatory to release the data as deliverables together with the reports describing them (assuming the Language Resources are fully funded). As a top priority, this should apply first to basic language resources (defined on BLaRK-like criteria).

This recommendation must also take into account the data distributor/data centre. Most LRs that are distributed by large data centres are created with at least some public funds (if not entirely so). A balance needs to be achieved between making those resources readily available for academic research and sustaining the infrastructure that archives and distributes these data sets (in perpetuity, theoretically). The community generally agrees that such centres are needed, but if all data were made available at no cost to academic researchers (the largest LR user group) those data centres could not survive. Actually, there are examples in the US and Europe where LRs are made available at no cost or on very favourable terms for academic/not-for-profit use.

2 Encourage/enforce use of best practices or standards in LR production projects

Accessibility of resources will greatly benefit from encouraging and/or enforcing the use of best practices or standards in projects having to do with LRs, at all levels of applicability. Such practices should be promoted and made mandatory in publicly funded projects.

3 Make sustainability and sharing/distribution plans mandatory in projects concerning LR production

Funding agencies must ask funded projects to make a detailed plan for subsequent sustainability and sharing/distribution of the resources created. Such plans should not resemble the usual business plans that used to be part of EC proposals and were often drafted by researchers without any market knowledge.


Issue 3.3: LR citation

Issue: LR citation

Challenge: Appropriate citation of Language Resources like traditional publications
Recommended actions: Develop a standard protocol for citing language resources

Challenge 3.3.1: Appropriate citation of Language Resources like traditional publications

Language Resources (both data and software) are time-consuming and costly to produce and increasingly account for a large share of research effort. As such they deserve high credit and need to be citable in a way similar to that of scientific papers. A standard way of citing LRs would be highly desirable.

Recommended Actions

1 Develop a standard protocol for citing language resources

Set up a standard citation framework for resources: develop a mechanism for citing language resources in a uniform way. This will also enforce use of minimal metadata descriptions. Providers will be responsible and credited for that.
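To make the link between uniform citation and minimal metadata concrete, the Python sketch below builds a citation string from a small metadata record and refuses to produce one when a field is missing. The field names, the citation layout and the sample record are all hypothetical choices made for illustration; this document does not define a citation format.

```python
from dataclasses import dataclass

# Hypothetical minimal metadata set; an actual protocol would define its own.
REQUIRED_FIELDS = ("creators", "year", "title", "version", "distributor", "identifier")


@dataclass(frozen=True)
class ResourceRecord:
    creators: tuple   # e.g. ("Doe, J.", "Roe, R.")
    year: int
    title: str
    version: str
    distributor: str
    identifier: str   # persistent identifier supplied by the provider


def cite(record):
    """Return a uniform citation string, or raise if metadata is incomplete."""
    missing = [name for name in REQUIRED_FIELDS if not getattr(record, name)]
    if missing:
        raise ValueError("incomplete metadata: " + ", ".join(missing))
    authors = "; ".join(record.creators)
    return (f"{authors} ({record.year}). {record.title}, "
            f"version {record.version}. {record.distributor}. {record.identifier}")


# Invented example record, not a real resource.
example = ResourceRecord(
    creators=("Doe, J.", "Roe, R."),
    year=2010,
    title="Example Treebank",
    version="1.0",
    distributor="Example Data Centre",
    identifier="hdl:0000/example",
)
citation = cite(example)
```

Because a citation cannot be generated without the required fields, a scheme of this kind would push providers to supply the minimal metadata, in line with the recommendation above.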