
The 2nd European Language Resources and Technologies Forum

Language Resources of the future – the future of Language Resources

Barcelona, 11-12 February 2010

Proceedings

Edited by: N. Calzolari, P. Baroni, M. Monachini, C. Soria

Istituto di Linguistica Computazionale del CNR - Pisa, ITALY


Table of Contents

Introduction

Program

Opening Session

Session 1 – Metadata and Documentation

Session 2 – Services and Functionalities for an Open Resource Infrastructure

Session 3 – Sharing or not Sharing: Availability and Legal Issues

Session 4 – Social Networking and Web 2.0 Methods for Language Resources

Session 5 – Language Resources of the Future

Session 6 – International Cooperation

Closing Session

Organisation


Introduction

Nicoletta Calzolari – FLaReNet Coordinator – ILC-CNR [email protected] – http://www.flarenet.eu

Language Technologies, together with their backbone, Language Resources, provide essential support to the challenge of Multilingualism and the ICT of the future. The main task of Language Technologies is to bridge language barriers and to help create a new environment where information flows smoothly across frontiers and languages, regardless of the country or language of origin. To achieve this, we need to act as a community able to join forces on a set of shared priorities.

The European Language Resources and Technologies Forum promoted by FLaReNet is an international forum aiming at developing the necessary common vision by facilitating the interaction among Language Resources stakeholders and providing strategic recommendations and priorities for future initiatives in the field.

The 2010 edition of the forum built on the success of the 2009 edition, stimulating lively debate and discussion and gathering a large number of players in Language Resources and Technologies. The forum participants had the unique opportunity to:

- network with leading academic experts and industry stakeholders from across Europe and worldwide;

- debate and make key policy makers aware of what needs to be done to improve the field of Language Resources and Technology;

- assist in outlining a collaborative and shared infrastructure for European Language Resources;

- be alerted to new initiatives, developments and trends;

- participate in the definition of the FLaReNet Blueprint for Actions and Infrastructures, which will set priorities for fostering Language Resources and Language Technology, thus helping to shape the future of the field.

The topics chosen for this second edition of the forum correspond to hot topics of today and tomorrow in the field of Language Resources and Technology:

S1 – Metadata and Documentation;

S2 – Services and Functionalities for an Open Resource Infrastructure;

S3 – Sharing or Not Sharing: Availability and Legal Issues;

S4 – Social Networking and Web 2.0;

S5 – LRs of the Future;

S6 – International Cooperation.

The intention was to approach each of these topics by trying to identify controversial aspects, risks, what is missing, gaps to be filled, and what can or cannot be achieved. Multilingual, multimodal and multimedia issues were taken into account as different points of view in each of the forum sessions. The chosen formula was that of a collaborative workshop composed of a series of working sessions on specific topics. The forum aimed to raise discussion in a highly interactive and creative way, thus creating a breeding ground for open questions, new ideas, and visions for the field towards a multilingual digital Europe.

The event was shaped as a two-day workshop with plenary and break-out sessions. The spirit of the topical sessions was that of working meetings with a specific focus and extensive discussion, following the philosophy that “everybody is a player”. Each session was briefly introduced by its Chair, followed by interventions from invited speakers from both within and outside the FLaReNet consortium, who gave short presentations of their views. Other participants actively contributed to the sessions with their views, reflections, and supporting material. Plenty of time was devoted to discussion, with the help of a number of predefined discussants. For each session, a Rapporteur presented summaries, feedback and conclusions in the final plenary session, and raised additional discussion and debate.

The meeting thus provided a unique opportunity to establish a common forum among the various actors and a lively intellectual framework for starting specific theoretical and practical work in the longer term. Topics that emerged from the presentations and discussions in the various sessions could become the object of specific meetings to be organised in the near future. The results of the discussion will also serve as input for a new Roadmap for LRs/LTs and will be integrated into the recommendations issued by FLaReNet to the EC, national organizations, and industry.


Program

Thursday 11th February 2010

9:00 Registration

10:00 Opening Session

11:00 Coffee Break

11:30 S1 Metadata and Documentation

13:30 Lunch

14:30 S2 Services and Functionalities for an Open Resource Infrastructure

16:30 Coffee Break

17:00 S3 Sharing or not Sharing: Availability and Legal Issues

19:00 End 1st Day

20:30 Dinner

Friday 12th February 2010

9:00 S4 Social Networking and Web 2.0 Methods for Language Resources

11:00 Coffee Break

11:30 S5 Language Resources of the Future

13:30 Lunch

14:30 S6 International Cooperation

16:30 Closing Session

17:30 End 2nd Day


Thursday 11th February 2010

Opening Session 10:00-11:00

Chair: Salvador Giner

Nicoletta Calzolari (CNR - ILC / FLaReNet Coordinator)

Bernat Joan (Secretari de Política Lingüística de la Generalitat de Catalunya)

Salvador Giner (President of Institut d’Estudis Catalans)

Núria Bel (Universitat Pompeu Fabra / FLaReNet) and Joan Soler i Bou (IEC / Local Host)

Roberto Cencioni (European Commission - INFSO - E.1 / Head of Unit)


OPENING OF THE FLaReNet FORUM, Barcelona, 11 February (Institut d’Estudis Catalans)

Thank you very much for choosing Barcelona to host this meeting of the FLaReNet network. Our national academy provides a fitting setting for it. For Catalans and, more specifically, for the Government of Catalonia, investment in language technology has been a strategic commitment for many years. The Catalan language went through very difficult periods over the course of the twentieth century, and the recovery of autonomy, the implementation of linguistic officiality and the launch of a language policy of our own made it necessary to pay attention to information and communication technologies. Catalan had to adapt quickly to the new technologies and make cutting-edge technological advances its own in order to regain the place it deserves among the languages of Europe. Our objective, in this respect, is for Catalan to have a level of technological resources comparable to that of the official languages of the European Union.

As early as the 1990s we began to invest in the development of machine translation systems for Catalan. We currently offer, as an online public service, machine translation between Catalan, Spanish, English, French and German; in 2009 this service handled more than 10.5 million requests. We also offer a public machine translation service between Occitan, Catalan and Spanish (this one as free software), which handled some 60,000 requests in 2009. We are very proud to also provide technological resources (and linguistic officiality) for our historical linguistic minority, the Occitan-speaking community of the Vall d’Aran. We have developed an online Catalan course, parla.cat, which makes it possible to learn Catalan autonomously or with a tutor and to reach levels A2, B1, B2 and C1 of the Council of Europe. Parla.cat incorporates innovative technology and allows oral and written interaction between students and teachers; it currently has more than 40,000 registered users from more than ninety countries around the world. We have likewise developed a linguistic search engine, Optimot, which automatically answers language queries online and handled more than 27 million queries in 2009. And we have developed Plats a la carta, an automatic tool for preparing restaurant menus in Catalan, Spanish, English, French, German and Italian, which currently has nearly eight thousand users.

It is therefore a source of satisfaction for the Government of Catalonia that FLaReNet has gathered in Barcelona to open its forum on language technology once again. I hope these days will be conducive to exchanging experiences, generating ideas and creating fruitful synergies for each language in particular and for multilingualism in general.

Bernat Joan, Secretari de Política Lingüística de la Generalitat de Catalunya


LETTER FROM THE REAL ACADEMIA ESPAÑOLA

Since the end of the twentieth century, the Real Academia Española, RAE (Royal Spanish Academy), has taken a strong interest in all the applications that computer technology offers to professionals devoted to the study of and research on natural languages. In this scientific field, the RAE has built diachronic corpora, such as the Corpus Diacrónico del Español (CORDE), as well as synchronic corpora, such as the Corpus de Referencia del Español Actual (CREA), which at present amount to more than 400 million word-forms. In the field of lexicography, the RAE has created a database from its Diccionario de la Lengua Española (2001) and has built a very large collection of images from the dictionaries of the Spanish language from 1492 to the present day, called the Tesoro Lexicográfico de la Lengua Española. Since the Academia’s foundation in 1713, public service has been its most important concern. This principle has led the RAE, from the beginning, to make this rich and interesting data repository available on the Internet to all researchers (www.rae.es). And the number of queries received repays the effort made: the standard Diccionario alone receives 230 million queries per year.

The technical advances of the first decade of the twenty-first century have led the Corporation to continue the lines of work begun in the last century (CORPES, Corpus del Español del Siglo XXI) and to launch new projects that, because of their complexity, can only be tackled by means of new and powerful computer programs. The Diccionario Histórico de la Lengua Española, a State project that will exist only on the Internet, stands out among these new undertakings. The advances mentioned above, as well as the success of the Academia’s website, have led to the creation of a new portal that will offer users all the current resources and those to be created in the near future; access to this new portal will be provided through the new telematic systems.

All the completed projects and the new ongoing ones, as well as their open access for researchers, lead the Real Academia Española to strongly support the activities carried out by the European Fostering Language Resources Network (FLaReNet), in the firm conviction that FLaReNet’s activities and agreements will decisively benefit solid research on, and the wide dissemination of, language resources.


S1. Metadata and Documentation 11:30-13:30

Chair: Gerhard Budin – Rapporteur: Nancy Ide

Introduction by the Session Chair

Introductory Talks

“Metadata in Context” Keith Jeffery (STFC Rutherford Appleton Laboratory, UK)

“From Road Maps to Plans: Towards the Design and Cost-Benefit Analysis of a Universal Language Resource Catalog” Christopher Cieri (University of Pennsylvania - LDC, USA) and Khalid Choukri (ELRA / ELDA, FR)

Contributions

“Aspects of LRs management: creation and utilisation” Takenobu Tokunaga (TITech, JP)

“Is there any need for guidelines for future speech corpora?” Isabel Trancoso (INESC-ID / IST, PT)

“Language tagging using the new RFC 5646” Richard Ishida (W3C, FR)

“Best Practices for Resource Documentation: Results of SILT Meeting on Operationalizing Interoperability” Nancy Ide (Vassar College, USA)

“The Metadata Harvesting Day” Marta Villegas and Carla Parra (Universitat Pompeu Fabra, SP)

“…” Key-Sun Choi (KAIST, KR)

Discussants

Dafydd Gibbon (Universität Bielefeld, DE)

Ineke Schuurman (Katholieke Universiteit Leuven - CCL, BE)

Monica Monachini (CNR - ILC, IT)

Victoria Arranz (ELDA, FR)


Metadata in Context

Keith J. Jeffery

Introduction

The position taken here is informed by my position, responsibilities and experience in a large multidisciplinary research laboratory with >100 servers, 360,000 users and ~10 Pb of data per year. I am also President of ERCIM (www.ercim.org) and euroCRIS (www.eurocris.org), Chair of the Alliance for Permanent Access to the Records of Science (www.alliancepermanentaccess.eu), and a board member of EOS, Enabling Open Scholarship (http://www.openscholarship.org/).

Metadata

Metadata is in some ways a meaningless term since, depending on context, it may be used as data (i.e. processed directly) or used as metadata (to assist in using or interpreting other data). Metadata can be classified into:

1. schema metadata, which constrains the object and controls integrity (like a database schema or an XML schema);
2. navigational metadata, which locates the object, e.g. a URI;
3. associative metadata, which assists in using or interpreting the object and is subdivided into:
   a) descriptive metadata, which describes the object, e.g. a library catalogue card, DC (Dublin Core) or MARC;
   b) restrictive metadata, which constrains the use of the object, e.g. rights, payment, curation and preservation;
   c) supportive metadata, which relates not to the object but to the domain of discourse in which the object is located; examples are lexicons, dictionaries, thesauri and domain ontologies.

An important aspect is that, for metadata to be machine-understandable as well as machine-readable (and not only human-understandable), it must have a formal syntax and declared semantics.
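To make the classification above concrete, here is a minimal illustrative sketch (not part of the original talk; the class names and example values are assumptions) of how metadata items attached to a single resource could be tagged with these categories:

```python
from dataclasses import dataclass
from enum import Enum, auto

class MetadataKind(Enum):
    SCHEMA = auto()        # constrains the object, controls integrity
    NAVIGATIONAL = auto()  # locates the object (e.g. a URI)
    DESCRIPTIVE = auto()   # associative: describes the object (e.g. Dublin Core)
    RESTRICTIVE = auto()   # associative: constrains use (rights, payment, ...)
    SUPPORTIVE = auto()    # associative: domain of discourse (lexicons, ontologies)

@dataclass
class MetadataItem:
    kind: MetadataKind
    key: str
    value: str

# Hypothetical metadata items for a single language resource.
items = [
    MetadataItem(MetadataKind.SCHEMA, "xml_schema", "corpus-header.xsd"),
    MetadataItem(MetadataKind.NAVIGATIONAL, "uri", "http://example.org/lr/42"),
    MetadataItem(MetadataKind.DESCRIPTIVE, "dc:title", "Sample Treebank"),
    MetadataItem(MetadataKind.RESTRICTIVE, "license", "research-only"),
    MetadataItem(MetadataKind.SUPPORTIVE, "domain_ontology", "GOLD"),
]

# Group the items by kind; only items with formal syntax and declared
# semantics would be machine-understandable rather than merely readable.
for kind in MetadataKind:
    print(kind.name, [i.key for i in items if i.kind == kind])
```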

CERIF

CERIF (Common European Research Information Format) was developed by a government-nominated group of national experts. It is a data model for information exchange, but it can also be used as the data model for a system. It is used both as data and as metadata. It covers research information, with the major entities being person, organizational unit, project, funding, publication, product, patent, skills, CV, facility, equipment, service and event. Its key characteristics include: multilinguality, multimedia, the concept of base entities and link entities (where the link entities express N:M relationships between base entities with an associated role and temporal period), and optionally multiple semantic classifications with crosswalks.
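The base-entity/link-entity idea can be illustrated with a small sketch. This is not CERIF's actual schema, and the entity and field names are simplified assumptions, but it shows how a link entity carries an N:M relationship together with a role and a temporal period:

```python
from dataclasses import dataclass
from datetime import date

# Two base entities, heavily simplified.
@dataclass
class Person:
    person_id: str
    name: str

@dataclass
class Project:
    project_id: str
    title: str

# A link entity: expresses an N:M relationship between base entities,
# qualified by a role and a temporal period.
@dataclass
class PersonProjectLink:
    person_id: str
    project_id: str
    role: str            # e.g. "coordinator", "participant"
    start: date
    end: date

people = {p.person_id: p for p in [Person("p1", "A. Researcher"), Person("p2", "B. Engineer")]}
projects = {pr.project_id: pr for pr in [Project("prj1", "Universal LR Catalog")]}
links = [
    PersonProjectLink("p1", "prj1", "coordinator", date(2009, 1, 1), date(2010, 12, 31)),
    PersonProjectLink("p2", "prj1", "participant", date(2009, 6, 1), date(2010, 6, 30)),
]

# Who was attached to project prj1, and in what role, on 11 February 2010?
for link in links:
    if link.project_id == "prj1" and link.start <= date(2010, 2, 11) <= link.end:
        print(people[link.person_id].name, "-", link.role, "on", projects[link.project_id].title)
```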

Services

In the Future Internet it is not enough to have metadata describing objects such as (hypermedia) documents. We have to describe software and processes too. It is emerging that this is best done by having metadata describing services. The classification of metadata outlined above serves this purpose well too. With millions of nodes and a free market in offered services, the systems administration and systems development task becomes too great and too human-intensive. Thus we need to automate. Recent work on self-* (self-managing, self-tuning, self-repairing, ...) services and systems indicates that, with appropriate metadata, this is a possibility. However, serious challenges remain due to the complexity of self-composition of services and possible emergent properties. Furthermore, such an environment has dangers in security, privacy and trust. The problem of specifying and managing a distributed and parallel environment virtualized as GRIDs or CLOUDs is in its infancy.

Conclusion

The take-home message is:

• Metadata is the key technology for digital object management: discovery, composition, management.
• For systems to be scalable, the metadata must have a formal syntax and declared semantics, and be processable by (at least) first-order logic.


From Road Maps to Plans: Towards the Design and Cost-Benefit Analysis of a Universal Language Resource Catalog

Christopher Cieri and Khalid Choukri
University of Pennsylvania, Linguistic Data Consortium
European Language Resource Association

Introduction

The large and growing number of language resources (LRs) distributed by an increasing number of organizations, coupled with the growing need for such resources, their cost and, thus, the penalty for failing to identify them when needed, combine to create demand for improved meta-resources for resource discovery, description and exploitation. There are today more than 1000 LRs distributed by just the two largest data centers, the European Language Resource Association (ELRA) [1] and the Linguistic Data Consortium (LDC) [2]. Sometimes following their example, sometimes inventing new models, a number of national and regional data centers have arisen, including BAS [3], GSK [4], CSLU [5], Chinese LDC [6], and the LDC for Indian Languages [7]. Other organizations such as the NICT Language Grid [8] have innovated new approaches to resource sharing. National corpora projects, including British [9], Dutch [10], American [11], Czech [12], Slovak [13], and Russian [14] efforts, have arisen, sharing their resources sometimes through the large data centers and sometimes independently. Smaller projects continue to mushroom, sometimes distributing LRs via data centers but also via project-specific sites or via yet other hosts such as SourceForge. Although researchers working in a few combinations of language and technology type can hope to find sufficient resources to meet their needs at the largest data centers, the probability of identifying all relevant resources for a given project is small, and probably decreasing in the current environment. Two prior harmonization efforts have made progress toward the goal of a universal catalog for language resources. The Open Language Archives Community (OLAC) [15] designed and implemented the infrastructure by which such a catalog could exist and convinced more than 40 separate data centers to share their metadata, bringing the total number of resources catalogued to approximately 35,000. The Networking Data Centers project added to OLAC the metadata for ELRA and LDC LRs available at that time and established procedures through which new records may be added; for example, LDC metadata continues to be updated in OLAC on a daily basis. Despite this progress, researchers must still master multiple metadata sets and search multiple locations to find needed resources, or else risk failing to note the existence of critical LRs and then either recreating them or doing without them.

[1] http://www.elra.info
[2] http://www.ldc.upenn.edu
[3] http://www.phonetik.uni-muenchen.de/Bas/
[4] http://www.gsk.or.jp/index_e.html
[5] http://cslu.bme.ogi.edu/corpora/
[6] http://www.chineseldc.org/
[7] http://www.ldcil.org/
[8] http://langrid.nict.go.jp/en/index.html
[9] http://www.natcorp.ox.ac.uk/
[10] http://lands.let.ru.nl/cgn/
[11] http://www.americannationalcorpus.org/
[12] http://ucnk.ff.cuni.cz/english/index.php
[13] http://korpus.juls.savba.sk/index.en.html
[14] http://www.ruscorpora.ru/en/
[15] www.language-archives.org


Although OLAC provides specifications for OAI (Open Archives Initiative [16]) compliant metadata as well as routines for harvesting, interchanging and searching their metadata, major data centers (e.g. ELRA, LDC) continue to maintain their own separate catalogs using somewhat different categories and terms, exporting only subsets to OLAC. Recently, a number of initiatives have sought to extend existing LR cataloging along a number of new dimensions. ELRA’s Universal Catalog (UC) and the NICT Shachi [17] catalog both intend to serve as union catalogs like OLAC. The UC focuses on resources intended for HLT R&D but includes a greater percentage of ELRA metadata fields and exploits data mining to discover resources not produced or distributed by ELRA. Shachi differs from OLAC in that catalog records are scraped rather than harvested as part of a bilateral negotiation, but it also uses data mining technologies to discover information about LRs that may not be present in their home catalog entries. The LREC Map initiative exploits the biennial LREC (Language Resources and Evaluation Conference [18]) abstract submissions to increase the contribution of LR metadata: each author is asked to complete a simple template answering questions about the LRs described within the proposed paper. The LDC LR Wiki [19] identifies LRs, including interactive web-based dictionaries and sources of raw text, especially for less commonly taught languages, organized by language and LR type, with individual sections edited by area experts. Some resources, especially plain and parallel text and lexicons, are identified and even harvested via automated scraping activities. Believing that the community currently lacks the knowledge required to normalize the metadata for LCTLs, the wiki permits free-text description and intends to attempt normalization as an activity under the current proposal. Finally, the LDC LR Papers Catalog [20] enumerates research papers that introduce, describe, discuss, extend or rely upon another LR. Currently, LDC focuses on papers dealing with LDC data resources and includes full bibliographic information on the paper plus at least one link to the unique identifier of an LDC data resource referenced in the paper. Table 1 summarizes the nature of these organizations and projects.

[Table 1. Catalog efforts compared. Columns: OLAC, ELRA UC, LREC Map, NICT Shachi, LDC LR Wiki, LDC Papers Catalog. Rows: external resources, normalized metadata deferred, raw resources, scraping, data mining, papers as LRs. The individual cell markings are not preserved in this version.]

Continuing to pursue these individual cataloging efforts independently carries the considerable risk of perpetuating the current state of affairs, in which researchers must still consult multiple catalogs with their different approaches, structures and terms, wasting time and sometimes failing to find relevant LRs. It would be preferable to coordinate these activities so that cataloging efforts are linked and searchable from a single starting point. Accomplishing this goal requires the completion of two major tasks: 1) merging or otherwise rendering interoperable the catalogs that carry the same type of metadata, so that their records are, or appear to be, simultaneously available, and 2) designing a structure that integrates catalog types that have previously been seen as different and independent (raw resources, papers). The structure of the universal catalog envisioned here combines the best features of those described above and adds new capabilities: a) metadata records relating to LRs are shared or linked as equals through a bilateral, cooperative arrangement; b) relevant LRs that are not available through such cooperative arrangements are nevertheless cataloged via scraping; c) relations among LRs such as “abbreviates”, “extends”, “corrects”, “describes” and “uses” are encoded to integrate papers and newer versions of basic LRs. With respect to the first task of rendering separate catalogs interoperable, metadata languages will be said to be interoperable if, when used to encode identical metadata, a filter can be written such that a query in one returns the same results as the filtered query run against the filtered version of the second. More formally: interoperability of metadata languages L1 and L2 describes the capability of two metadata providers to interchange metadata records m1 written in L1 and m2 written in L2 for a single LR r via a function f that maps L1 to L2, such that a query q that returns r in L2 also returns r in f(L1) when issued as f(q).

[16] www.openarchives.org
[17] www2.shachi.org
[18] www.lrec-conf.org
[19] lrwiki.ldc.upenn.edu
[20] The LDC LR Papers Catalog is currently a local effort undertaken with LDC discretionary funds. Once the Catalog has reached an appreciable size, it will be opened to the community and additions from remote authors will be accepted.
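A toy illustration of the informal criterion above (the field names, the mapping and the query format are invented for the example, not drawn from any of the catalogs discussed): a query and its mapped form should retrieve the same resource from the two encodings of the same record.

```python
# Toy check of the interoperability criterion: a query against a record in L1
# should retrieve the same resource as the mapped query f(q) against the
# mapped record f(m1). Field names and values are invented for illustration.

# The same LR described in two metadata vocabularies (L1 and L2).
m1 = {"title": "Sample Treebank", "lang": "en", "type": "corpus"}           # written in L1
f_fields = {"title": "dc:title", "lang": "dc:language", "type": "dc:type"}  # f: L1 -> L2

def f_record(record):
    """Map an L1 record into L2 by renaming its fields."""
    return {f_fields[k]: v for k, v in record.items()}

def f_query(query):
    """Map an L1 query (field -> required value) into the L2 vocabulary."""
    return {f_fields[k]: v for k, v in query.items()}

def matches(query, record):
    """A record satisfies a query if every queried field has the required value."""
    return all(record.get(k) == v for k, v in query.items())

q = {"lang": "en", "type": "corpus"}   # query expressed in L1
m2 = f_record(m1)                      # the same LR encoded in L2

# Interoperable (for this record and query) if both sides retrieve it.
assert matches(q, m1) == matches(f_query(q), m2)
print("query retrieves the LR in both encodings:", matches(q, m1))
```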

The central position of LRs in HLT development and of metadata in LR distribution and exploitation suggests that this initiative be integrated within a broader range of harmonization activities. A use case that focused not only on metadata but on the entire process of LR identification, acquisition and exploitation would help assure the longevity of this approach. The case we envision is an infrastructure that supports the automatic discovery and processing of resources needed to build an HLT, for example the training of a parser from morphologically and syntactically annotated data. Metadata harmonized across data centers that fully describe the relevant LR is a necessary but not sufficient condition. The corpora themselves must also be described in a structured way that can be read by machines in order to identify the location and format of files and relevant annotations within them. To assure that non-identical annotations are compatible, the structured, machine readable description of the corpus must be linked to a data category registry. Finally, the corpora will need to have been created and distributed according to best practices and standards including a thorough human readable description of the methodology, a specification, which is, itself, versioned and cataloged.

Design

We envision, then, one or more universal catalogs that:

1. Implement a standard set of well-documented metadata types, tags and relations.
2. Gracefully accept metadata tags outside the set defined in 1.
3. Abstract metadata hosting from presentation so that different compliant catalogs may display different projections (records and fields) of the merged data.
4. Instantiate that design via open source technologies that handle hosting, conversion, interchange and presentation of metadata.
5. Merge the contents of at least:
   a. LDC Corpus Catalog
   b. LDC Papers Catalog
   c. LDC tools and specifications
   d. ELRA Catalog
   e. ELRA Universal Catalog
   f. LREC Map
   g. NICT Shachi Catalog
6. Include one or more host systems that accept compliant metadata from small-group or individual providers.


In the process of developing these universal catalogs we propose the creation of a set of best-practice documents to be distributed to the LR providers whose metadata are to be catalogued. These new resources, along with a specification of best metadata practices, will be made available to other data centers and individual data creators to use in the creation of their own catalogs. To promote the sustainability of LRs held outside data centers, we will provide a centralized metadata repository with a harvesting protocol. To address the range of competences among LR searchers, the search engine will permit both the use of controlled-vocabulary fields and relevance-based search over entire catalog records. A novel contribution of this effort will be searcher assistance based upon the relations among metadata categories (dictionary ≅ lexicon) and prior search behavior (those who searched for “Gigaword” also searched for “news text corpora”). Coupled with searcher assistance, we will provide metadata creator assistance based on searcher behavior and the behavior of other metadata providers (“93% of searchers include a language name in their search”, but “87% of all providers include ISO 639-3 language codes” and “the metadata you have provided so far also characterize 32 other resources”). In order to effectively manage the harmonization of data center catalogs and the provision of metadata resources, we will constitute a governance body specifically for this project. The group will include representatives of the project partners, sponsors, individual and small-group resource providers, and LR users. We further propose to expand the scope of the universal catalog to include two important and frequently overlooked kinds of LR at either end of the processing spectrum: raw unprocessed data and the most carefully processed LRs, research papers. Our intent is to enhance the universal catalog with links to raw resources for under-resourced languages, including web sites rich in monolingual and parallel text and lexicons built for interactive use. These resources are necessary to advance the universal catalog toward the very apt goal of true universality as it affects languages whose representation among formal LRs is insufficient with respect to their global importance. Those who would create HLTs for these languages must resort to primary LR creation based upon harvests of these raw resources. With respect to papers related to LRs, our goal is to establish an infrastructure that we will seed with an initial batch of (thousands of) papers, and later to integrate the creation of links between papers and the LRs they use as a regular part of the publication process. Some of this work has already begun in a number of individual efforts that have not been coordinated across this same span of data centers and LR creators. Specifically, a Less Commonly Taught Language (LCTL) Resource wiki was developed by LDC within the REFLEX program. Similar efforts to harvest papers describing LRs are underway at LDC using human effort, at ELRA using the LREC Map, and within the Rexa project (rexa.info) using data mining technologies. Our proposal is to accomplish this expansion by data type while integrating into the workflow methodologies such as social networking, web sourcing, and data mining.
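The "those who searched for X also searched for Y" style of assistance could, for instance, be driven by simple co-occurrence counts over past search sessions. The sketch below is purely illustrative; the session data and the ranking are assumptions, not part of the proposal:

```python
from collections import Counter
from itertools import combinations

# Hypothetical past search sessions (each a set of query terms used together).
sessions = [
    {"gigaword", "news text corpora"},
    {"gigaword", "news text corpora", "arabic"},
    {"treebank", "parser"},
    {"gigaword", "lm training"},
]

# Count how often pairs of terms co-occur within a session.
pair_counts = Counter()
for s in sessions:
    for a, b in combinations(sorted(s), 2):
        pair_counts[(a, b)] += 1

def also_searched(term, top_n=3):
    """Terms most often searched together with `term`."""
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == term:
            related[b] += n
        elif b == term:
            related[a] += n
    return related.most_common(top_n)

print(also_searched("gigaword"))
# e.g. [('news text corpora', 2), ('arabic', 1), ('lm training', 1)]
```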

Costs & Benefits Unlike the traditional costs benefit analysis, the costs and benefits described here must necessarily be abstract.

The principal cost types for this effort are:

1. creation and documentation of the metadata standard: types, tags and relations;
2. creation and maintenance of hosting, conversion, interchange, mining, search and display technologies;
3. conversion of metadata from existing catalogs;
4. integration of metadata from new LR types;
5. coordination across a number of data and metadata centers (ELRA/ELDA, LDC, NICT, and OLAC) and related projects;
6. outreach to interested professional organizations (ACL, LSA, LinguistList, SIL, ISCA), journal editors (LRE, LILT), and conference organizers (LREC, AFLR).

Funding for some of these efforts is already in place (FLaReNet, T4ME, SILT). Other efforts continue using discretionary funds (the LDC LR Wiki and Papers Catalog). Furthermore, some of the required capabilities already exist elsewhere. OLAC implements much of the infrastructure needed to organize, host, convert and display metadata. Shachi, the ELRA Universal Catalog and the Rexa project already use data mining to enhance metadata records. The ISO LR Map and the GOLD Ontology already define many of the objects included in LR creation. To the extent possible, exploiting these capabilities will reduce the overall cost of the project. Benefit in this context should perhaps be estimated in terms of the funding saved when projects succeed in using existing resources rather than creating new ones. Although reused resources will frequently need enhancement for a given project, avoiding the initial cost of LR creation can result in significant efficiencies in many cases. LR development costs range from a few tens of thousands of dollars on the rare low end to over a million dollars for the largest efforts; a few instances of reuse should therefore exceed the cost of developing the universal catalog. In addition, there are benefits associated with the universal catalog that cannot be easily quantified. In some cases, a research program is inspired, and made plausible, by an LR (for example the Penn Treebank or the British National Corpus). The contribution of research that might never have been considered, or that might have been delayed, is an additional benefit of the UC. There is also evidence that ignorance of a related LR blocks or alters a research agenda or delays it in the expensive task of re-invention. An informal survey of papers accepted at an international conference on Arabic HLT in 2009 found that fully one third failed to note the existence of a relevant LR: this subset either explicitly asserted the absence of an LR the reviewer knew to exist or else discussed creation of a similar LR without referencing one the reviewer knew to exist. An additional 30% found the needed LRs but failed to refer to them in a way that leads readers directly to the same LR. If we assume that this is a general pattern, we can see how the cost of developing the UC is quickly offset by the time and funding wasted in recreating already existing LRs, and is further motivated by the research inspired by the LRs it promotes.


Aspects of Language Resource Management: Creation and Utilisation

Tokunaga, Takenobu – Department of Computer Science, Tokyo Institute of Technology, [email protected]

I would like to raise issues concerning language resource management from two viewpoints: (1) creating language resources, and (2) utilising language resources.

Language Resource Creation

Language resource creation involves various kinds of concrete and abstract entities, spanning from project managers and annotators to documents and the annotation schemata/guidelines used in annotating them. The relations between these entities are also far from trivial. For instance, the creation of a language resource might involve the use of several annotation schemata (defined as tagsets); an annotator might be assigned to work on different annotation tasks (using different/multiple tagsets); likewise, a document might be annotated with different/multiple annotation schemata; it is also possible that annotations might span multiple documents, and so on.

It is no longer an uncommon practice to create a new language resource on top of an existing one by adding an additional layer of annotation suited to a new task, e.g. adding named entity annotations to an already morpho-syntactically annotated corpus. Such layered annotation further complicates the relationships between the schematic entities, and between the schematic entities and the real-world ones that manage them. Proper management of these entities and their relations can therefore substantially contribute to keeping the resulting language resources consistent, and improve their overall quality. As an example, a schematic constraint stating that named entity tags may be annotated only on elements with certain types of pre-existing annotations (e.g. np tags) would prevent many careless annotator mistakes. Despite this, the importance of proper management of the language resource creation process has attracted little attention and has been largely overlooked in our community. A standard, or at least a reference model, for the management of the language resource creation process should be considered.
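As a hedged illustration of such a schematic constraint (the span representation and tag names are assumptions made for this example, not a proposed standard), a validator could check that every named-entity annotation falls inside an np element of the pre-existing syntactic layer:

```python
# Toy check for a layered-annotation constraint: named-entity (NE) spans may only
# be annotated over existing `np` spans of the base syntactic layer.
# Spans are (start, end) character offsets; tag names are illustrative only.

syntactic_layer = [
    {"tag": "np", "start": 0, "end": 14},
    {"tag": "vp", "start": 15, "end": 30},
    {"tag": "np", "start": 31, "end": 45},
]

ne_layer = [
    {"tag": "PERSON", "start": 0, "end": 14},   # fine: inside the first np
    {"tag": "ORG", "start": 20, "end": 28},     # violation: inside a vp, not an np
]

def inside_some_np(ne, base_layer):
    """True if the NE span is contained in at least one np span of the base layer."""
    return any(b["tag"] == "np" and b["start"] <= ne["start"] and ne["end"] <= b["end"]
               for b in base_layer)

for ne in ne_layer:
    if not inside_some_np(ne, syntactic_layer):
        print("constraint violated:", ne)
```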

Language Resource Utilisation

Corpus-based approaches have been the mainstream of Human Language Technology (HLT) for the last two decades. These approaches always involve language resources and machine learning techniques to utilise them. Having relevant language resources has therefore become just as crucial as inventing a new method to achieve a given research goal. In some cases, the preparation of relevant language resources alone, used with an off-the-shelf machine learning technique, may be sufficient for the task. In this context, consolidating an environment in which researchers can easily find language resources relevant to their purpose is a must. Many attempts are currently being made to standardise the metadata of language resources and to catalogue them. The metadata currently under discussion, however, may be insufficient to meet the demands of research. Looking into the future, we need to realise a query mechanism in which relevant language resources can be retrieved by queries specified in natural language, such as “I would like to have a dialogue corpus in which such and such information is annotated.”

A related but different issue is making language resources citable. As described above, language resources are of growing importance in research, and thus also in the research papers that describe its results. Unfortunately, we have no standard way of citing them. As a result, in many cases language resources are cited by indicating the URLs of their distribution sites or by referring to the research papers describing them. It is high time we made language resources citable first-class objects.


Is there any need for guidelines for future speech corpora?

Isabel Trancoso INESC-ID / IST

We are witnessing the creation of too many small speech databases, without any concertation effort. They pop up in different languages for the same purpose, but grow totally different. I think guidelines would be very useful. The types of databases I'm talking about are built for different purposes:

- diagnosis/therapy of speech handicaps/pathologies
- foreign accents
- seniors requiring assistance
- kids involved in games/social networks

Is LREC/ELRA making some effort to help organize this database creation effort? If you think that this is worth discussing, or if ISCA can help in any way, just let me know.

When asked to expand this short message into an abstract, I felt uncomfortable. Usually, when writing an abstract one presents ideas and facts that consolidate those ideas. All I had was a general feeling that is reinforced each time I'm asked to review papers on one of these topics in conferences such as LREC, INTERSPEECH or ICASSP. A couple of months ago, during a consultation meeting in Luxembourg on written and spoken language technologies, this feeling was again present. In fact, discussing future models of interaction raised the question of whether we already have the necessary speech resources to launch these efforts. The question may seem awkward in a world in which we are trying hard to cope with a virtually unbounded amount of data [Riccardi 2009], from different streams and media, facing variability in terms of domain, form of communication, language, style, genre, etc. For instance, personalized interfaces that learn with growing usage may be right for older people needing to control assistive technology [Moore 2009] or needing to keep in touch with the outside world and feel less lonely. Are there already speech corpora of elderly people adequate for this type of study? The same question could be asked of speech corpora of people suffering from different sorts of pathology or handicap. This type of resource could be very useful for training diagnosis or therapy tools. Databases with children's voices are also scarce, mainly for languages other than English. But we can only anticipate a growing use of speech technologies by this young population, eager to be involved in games and social networks. Second language acquisition is another area where, in my opinion, the potential of speech and language technologies is very far from being realized. Adequate databases are fundamental for research in this area.


Finally, so much can be learned by studying first language acquisition in children, an area where privacy-preserving, non-intrusive data collection methods are more and more needed [??? 2009]. Posing the question about the availability of spoken corpora for all these different areas leads me to a single answer: yes, there are spoken corpora for all of them, but they exist in very few languages, and they seem to me totally heterogeneous, thus leading to non-comparable results even for the same language. This impression of heterogeneity is reinforced each time I review papers on one of these topics. Maybe I'm totally wrong, and there is no way to design adequate databases for each of these categories of SLT users that are distinct from the normal adult speaking his/her native language. For instance, in CALL applications one may only target teaching intonation to a foreign speaker, or else target teaching the distinction between the pronunciation of two close consonants, so why build a one-size-fits-all corpus? I am not advocating the construction of such global corpora, just the existence of global guidelines that may redirect our corpora efforts in a more coherent way, that would simultaneously enable joint evaluation efforts in different languages and help young researchers starting their work in these promising application areas in a language for which these corpora do not yet exist. This abstract may give the wrong impression that we already have adequate resources for adult L1 speakers. Privacy may be a great bottleneck. In fact, we need massive amounts of dialogs that would allow learning strategies from data [Young 2009], but such data are not normally available because of privacy issues. Hence we need privacy-preserving frameworks for processing such massive amounts of data. A final word about multilinguality: one cannot avoid discussing it when talking about corpora. My country was not part of Europe when the first spoken corpora initiatives began. I used the guidelines of previously designed corpora (such as EUROM.1) in order to later build a Portuguese version. Much later, I was one of the SPEECHDAT partners, and helped build guidelines that were then used for building extended SPEECHDAT corpora in many other countries and languages. These experiences made me a strong believer in designing corpora guidelines that may help other national initiatives for the same purpose.


Language tagging using the new RFC 5646

Richard Ishida
W3C – France

Language tags are used to indicate the language of text or other items on the Web and in other information spaces. They are used in Web content to indicate the language of a range of text (the text-processing language), for example in an HTML lang attribute or an XML xml:lang attribute, or for metadata about resources, such as the HTTP Content-Language header. Language tags can be used to describe a wide variety of resources; their use is not limited to HTML and XML. Work at the IETF has provided a flexible and wide-ranging solution for tagging languages. BCP 47 is the persistent name for a series of RFCs describing the syntax and use of language tags, published by the IETF. BCP 47 is actually a concatenation of two RFCs: currently these are RFC 5646, Tags for the Identification of Languages, and RFC 4647, Matching of Language Tags. The latest RFC describing language tag syntax, RFC 5646, obsoletes the older RFCs 4646, 3066 and 1766. The current version of the language tagging specification was published in the second half of 2009. Unfortunately, people are often unaware that these RFCs have been updated.

In the past it was necessary to consult lists of codes in various ISO standards to find the right subtags, but now you only need to look in the IANA Language Subtag Registry. This registry contains almost 8000 subtags for languages, scripts, regions, and variants, which can be combined in various ways. It also contains mechanisms for extension and private use subtags.
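As a rough illustration of how subtags combine, the sketch below splits a tag into its main subtag types using only the shape of each part. This is a simplification of the RFC 5646 grammar, and a real implementation would validate each subtag against the IANA Language Subtag Registry; the example tags are ordinary BCP 47 tags, while the parsing rules are deliberately minimal assumptions.

```python
import re

def split_language_tag(tag):
    """Very simplified split of a BCP 47 tag into primary language, script,
    region and remaining subtags, based only on subtag shape (RFC 5646 defines
    the full grammar; real validation needs the IANA Language Subtag Registry)."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None, "region": None, "rest": []}
    for p in parts[1:]:
        if result["script"] is None and re.fullmatch(r"[A-Za-z]{4}", p):
            result["script"] = p.title()            # e.g. Hant, Latn
        elif result["region"] is None and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", p):
            result["region"] = p.upper()            # e.g. TW, CH, 419
        else:
            result["rest"].append(p.lower())        # variants, extensions, private use
    return result

for tag in ["en", "zh-Hant-TW", "de-CH-1996", "es-419"]:
    print(tag, "->", split_language_tag(tag))
```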

Language tagging is not, however, a precise science, and there are some deliberations that have to be made when choosing subtags. For example, thought is needed in the use of collection, macrolanguage and extended-language subtags.

This brief talk will propose that interoperability will be served best by widening the adoption of the language tags specified by BCP 47, and to that end will review the various types of subtag described by the RFC 5646 syntax, and look at some of the choices that need to be made when selecting subtags.


Best Practices for Resource Documentation
Results of the SILT Meeting on Operationalizing Interoperability

Nancy Ide
Department of Computer Science
Vassar College
[email protected]

1 Introduction

This report focuses on some of the results of a recent workshop organized by the US National Science Foundation-funded project Sustainable Interoperability for Language Technology (SILT), whose overall goal was to arrive at an operational definition of interoperability for several sub-areas. An operational definition identifies one or more specific observable conditions or events and then tells the researcher how to measure them; it must be valid (does it measure what it is supposed to measure?) and reliable (the results should be repeatable). The workshop considered interoperability over four thematic areas:

1. Metadata for describing language resources

2. Data categories and their semantics

3. Requirements for publication of data and annotations

4. Requirements for software sharing

This report describes the results of the discussions concerning “requirements for publication of data and annotations”. The primary result was a set of recommendations for documenting resources, described below.

2 Publication of Resources

2.1 Motivation

Currently, no guidelines or even common practices exist for creating, documenting, and evaluating language resources, including text and speech corpora, linguistic annotations, lexicons, grammars, and ontologies, that are “published”, i.e., made available for use by others. Some standard practices for resource publication through established data distribution centers such as LDC or ELRA exist, but even these are not completely consistent among different centers, and they are not comprehensive. More crucially, many resources are made available via web distribution, and the format of the resource and the information about creation methodology and resource quality is highly variable and in some cases non-existent. Given the recent increase in resource production, the need for standardized procedures for publishing resources is rising. Users need information to assess the quality of a resource, to replicate processes and results, and to deal with idiosyncrasies or documented errors. This kind of documentation is very often unavailable or difficult to acquire.

Clear guidelines for resource publication will impact the resource creation process by specifying requirements for quality assurance and implicitly establishing baselines for cost, time frame, and requisite facilities, all of which are for the most part unknown at this time. Furthermore, such guidelines will inform standard procedures for resource evaluation, by establishing both clear specifications for documenting a published resource that will figure into the evaluation itself and self-evaluation metrics that should accompany the published resource. Therefore, a set of standards for resource publication bears on several aspects of interoperability, some of which were addressed by other working groups at the Brandeis meeting.

Due to the lack of established procedures and practices, the fundamental question addressed by this group was therefore “What set of requirements for the release or publication of a data resource maximizes the potential usefulness and interoperability of that resource?” To answer this question, the working group identified two broad types of requirements: (1) formats and access, and (2) documentation (taken in the broadest sense). Because (1) has received considerable attention from other groups and efforts and some best practices are already established, specifications for resource documentation proved to be most in need of consideration in order to ensure that published resources are immediately usable and interoperable. Requirements for documentation are addressed in the sections that follow.

2.2 Documentation of Resources

We recognize several different kinds of documentation, which may exist in one or several physical documents or in header(s) associated with data and annotations. Each is designed to meet the needs of certain users of the resource. Not all document types are applicable to all resources. Also, some documentation types overlap with metadata as covered in Section ??. The documentation types identified are:

i. High-level description: provides the non-expert, interested reader with a good idea of what is in the resource.

ii. Annotation/resource creation guidelines: the guidelines directly used by annotators, validators, or creators of the resource (e.g. creators of lexicon or ontology entries). These may be in the local language; a more global version in English should be provided when possible.

iii. Background: information on the theoretical framework, background, and/or the “philosophy” of the resource.

iv. Methodology: a precise specification of the methodology used to create the resource. This information should be specific enough to enable others to replicate the process and obtain the same results. It should include:

• full documentation of tools used in any phase of the creation process, including software with version, source of the software, software documentation or publication, and an indication of the platform the tools were run on;

• data preparation methods, including information about the data source, normalizations/corrections performed, etc.;

• error rates and manual validation/corrections for automatically produced annotations;

• description of annotation and quality control procedures, including standard inter-annotator agreement statistics (Kappa, P&R, TBD) if more than one annotator annotated the same document.

v. Description of category semantics: a prose specification of the data categories and their semantics, with substantial examples from the resource. This should include documentation of the evolution of the specifications (if versioned), illustrating the learning process.

vi. Formal specifications: XML Schema, RDF schema, formal metadata specifications, grammar for annotation syntax, etc.

vii. Project documentation: project description, location, personnel, contact. Statistics: funding source, costs in person hours to create the resource.

viii. Data documentation: corpus information: source, original format, errors in the data, trustability of the source, OCR error rate (if applicable), copyright notice for data documents included in the resource (if different from the copyright for the entire corpus), and a specification of which annotations may apply. Speech resources: how the signal was acquired, participants, etc. Much of the information needed here can be found in the TEI Header.

ix. Resource documentation: release date, version history, usage restrictions/copyright notice, availability, LDC or ELRA catalog number.

x. Supporting materials: tutorials, presentations, published papers.

Crucially, some documentation must be provided in a machine-tractable form in order to discover and compare resources, validate formats and annotations, appropriately process annotations, and retrieve relevant parts of the resource for a given use. These may include (at least) the following:

• XML schema, RDF schema for documents

• Formal metadata specifications

• Physical and logical organization of the resource

– File organization
– Identifying naming conventions (e.g. name contains wjnc for written, journal, noun chunks)


• Annotation layers/tiers

– Identifiers (e.g. morphosyntactic layer = msd)

– Relations to other layers (e.g., parent/child, refers to timeline/base segmentation)

• Annotation type labels (e.g. PTB for Penn Treebank), used in annotation to identify source

Within ISO TC37 SC4 Working Group 1 (Linguistic Annotation Framework), mechanisms to provide machine-tractable information in the headers associated with a resource are under development. To accomplish this, it is necessary to involve representatives of the speech and multimedia communities to ensure that requirements for resources of those types are accommodated. We also propose a review of the literature on methodology for language resource creation, insofar as it exists. Potential sources include ELRA’s specifications for production, validation, distribution, and maintenance of language resources; LDC’s data creation methods¹; reports from earlier projects such as EAGLES and Eurotra; and “The Production of Speech Corpora” Cookbook². For other media, a list of de facto or best practice standards must be compiled.
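As a purely illustrative aside, the following sketch shows what a minimal machine-readable header covering some of the items listed above (formal schemas, naming conventions, annotation layers and type labels) might look like. The element names and file layout are invented for this example; they do not follow the LAF header format under development in ISO TC37 SC4.

# Illustrative only: a hypothetical machine-readable resource header covering
# the kinds of information listed above. It does NOT follow the ISO LAF header
# format, which was still under development at the time of writing.
import xml.etree.ElementTree as ET

def build_resource_header():
    header = ET.Element("resourceHeader", id="example-corpus-v1.0")

    schemas = ET.SubElement(header, "formalSpecifications")
    ET.SubElement(schemas, "schema", type="XML-Schema",
                  href="schemas/document.xsd")          # hypothetical path
    ET.SubElement(schemas, "schema", type="RDF-Schema",
                  href="schemas/metadata.rdf")          # hypothetical path

    organization = ET.SubElement(header, "organization")
    ET.SubElement(organization, "namingConvention",
                  pattern="{modality}{genre}{layer}",   # e.g. "wjnc"
                  example="wjnc = written, journal, noun chunks")

    layers = ET.SubElement(header, "annotationLayers")
    msd = ET.SubElement(layers, "layer", id="msd",
                        description="morphosyntactic description")
    ET.SubElement(msd, "refersTo", layer="tok")         # base segmentation
    ET.SubElement(layers, "layer", id="tok",
                  description="token segmentation")

    labels = ET.SubElement(header, "annotationTypeLabels")
    ET.SubElement(labels, "label", code="PTB",
                  meaning="Penn Treebank bracketing")

    return header

if __name__ == "__main__":
    print(ET.tostring(build_resource_header(), encoding="unicode"))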

¹ http://www.ldc.upenn.edu/Creating/
² http://www.phonetik.uni-muenchen.de/forschung/BITS/TP1/Cookbook/


The Metadata Harvesting Day  

Marta Villegas and Carla Parra IULA‐UPF 

{marta.villegas; carla.parra}@upf.edu 

In 2003, the ENABLER project showed very clearly that the general information about the existence and the nature of most language resources was very poor. “Only a small fraction of them is visible for interested users” (Enabler Declaration: 2003). Accordingly, since then other initiatives such as ELRA’s Universal Catalogue, CLARIN Virtual Language World, DFKI’s Natural Language Software Registry, etc. are involved in gathering information about resources and technologies, the most urgent need being to identify and locate available language resources.

Six years after the completion of ENABLER, the situation has improved, but many linguistic resources and tools are still difficult to locate, although they are mentioned in existing catalogues. During 2009, we reviewed about 800 resources and, as stated in FLaReNet Deliverable D6.1a:

“The compilation of information for this first survey was harder than expected because of the lack of documentation for most of the resources surveyed. Besides, the availability of the resource itself is problematic: Sometimes a resource found in one of the catalogues/repositories is no longer available or simply impossible to be found; sometimes it is only possible to find a paper reporting on some aspects of it; and, finally, sometimes the information is distributed among different websites, documents or papers at conferences. This made it really difficult to carry out an efficient and consistent study, as the information found is not always coherent (e.g. not every corpus specifies the number of words it has) and sometimes it even differs from the one found in different catalogues/repositories.”

In our opinion, notwithstanding the important efforts made by the main resource and tool catalogues and observatories (ELDA/ELRA, CLARIN, etc.), the costs of curating these catalogues and keeping them up to date are considerably high, as the data they need to gather is usually hard to find. As a direct consequence, despite all these efforts the data we have is not always as up to date as we would wish. Resource and tool providers and developers must be aware of the importance of guaranteeing the visibility of their resources, and tools for making this easy must be made available.

From our perspective, a new step is needed: We propose to start a decentralized effort of resource description and to launch an automatic, periodical information gathering routine. Each developer must enhance and guarantee the visibility of their resources by minimally describing them with a Basic Metadata Description (a BAMDES). This information will have to be made automatically harvestable on a server (easy-to-use ways of doing this will be provided) by different robots in what we will call The Harvesting Day. To ease the setting up of this initiative we will provide an online form that will automatically create the required XML for harvesting the information about resources and tools. Self-executable packages for setting up harvestable servers will also be provided.

Basically, a provider just needs to fill in the online form, save the BAMDES XML file that describes their resources and place it on a server with the self-executable package we also offer.
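To make the proposal concrete, here is a rough sketch of what such a record and its preparation could look like. The element names below are invented for illustration and are not the official BAMDES schema produced by the online form.

# Purely illustrative: a hypothetical minimal BAMDES-style record for one
# resource, written to a file that a web server could expose so a harvesting
# robot can later fetch it. Element names are assumptions, not the official
# BAMDES schema.
import xml.etree.ElementTree as ET

record = ET.Element("bamdes")
resource = ET.SubElement(record, "resource")
ET.SubElement(resource, "name").text = "Example Spanish Morphological Lexicon"
ET.SubElement(resource, "type").text = "lexicon"
ET.SubElement(resource, "language").text = "spa"
ET.SubElement(resource, "provider").text = "Example University NLP Group"
ET.SubElement(resource, "url").text = "http://example.org/resources/lexicon"
ET.SubElement(resource, "license").text = "research-only"

# Write the record where a web server could expose it for harvesting.
ET.ElementTree(record).write("bamdes.xml", encoding="utf-8",
                             xml_declaration=True)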


The automatic harvesting of metadata will then be possible and will supply the main catalogues and observatories with results, enhancing and guaranteeing the visibility of your resources and tools and ensuring that the information available about them is always up to date, as the harvesting will take place periodically.

To sum up, these are the key points of our proposal, the Harvesting Day (www.TheHarvestingDay.eu):

• WHAT? The Harvesting Day will be a routine in which a robot will collect the Basic Metadata Descriptions (BAMDES) describing resources and tools, as published on their websites.

• WHY? To allow LR developers to enhance and ensure the visibility of their language resources and tools.

• WHEN? The first Harvesting Day will take place on July 21, 2010 and will then be repeated periodically in an automatic manner.

• WHERE? You need to make your resources visible on your server. A self-executable package will be made available.

• WHO? Every resource and/or tool developer/provider is invited to participate. Harvesting results will be provided to the main resource and tool catalogues and observatories (ELDA/ELRA, CLARIN, T4ME...).

Enhance and guarantee the visibility of your resources. Get ready for the Harvesting Day!


S2. Services and Functionalities for an Open Resource Infrastructure 14:30-16:30

Chair: Stelios Piperidis – Rapporteur: Khalid Choukri

Introduction by the Session Chair

Introductory Talks

“The challenge of multilinguality in Europeana: Web services as language resources”
Luca Dini (CELI, IT) and Vivien Petras (Humboldt Universität zu Berlin, DE)

“Tomorrow's Language Resources”
Jochen L. Leidner (Thomson Reuters Corp., USA)

Contributions

“We desperately need linguistic resources! - based on the users’ point of view” Satoshi Sekine (New York University, USA)

“Language resources for information extraction: demands and challenges in practice” Christos Tsalidis (Neurolingo, GR)

“Infrastructures - Shooting at a moving target” Peter Wittenburg (MPI, NL)

“Growing resources, raising tools, breeding collaboration” Hans Uszkoreit (DFKI, DE)

“Linguistic Awareness on the Web” Maria Teresa Pazienza (University of Rome “Tor Vergata”, IT)

“Will Language as a Service (LaaS) increase the interoperability in language resources and applications?” Virach Sornlertlamvanich (NECTEC, TH)

“Toward a standardized set of language service Web APIs” Yoshihiko Hayashi (Osaka University, JP)

Discussants

Eric de la Clergerie (INRIA, FR)

Tamás Váradi (Hungarian Academy of Sciences, HU)

António Branco (Universidade de Lisboa, PT)

Claudia Soria (CNR - ILC, IT)


The Challenge of Multilinguality in Europeana: Web Services as Language Resources

Luca Dini (CELI, IT) & Vivien Petras (Humboldt University Berlin, DE)

Abstract: Europeana has to face the tremendous challenge of providing multilingual functionalities for at least 10 languages within the project phase of EuropeanaConnect, and ultimately for as many as the 23 official European languages. It should be noted that these functionalities are particularly relevant for a multimedia collection such as Europeana: while accessing the full text of documents might raise obstacles due to language understanding, this is not the case for image or music material, which is nevertheless still only searchable in the language of the descriptions provided. The challenges that Europeana has to face concern several aspects of the lifecycle of language resources (the production phase being outside Europeana’s scope) and can be grouped in the following way: (i) quality assessment and selection, (ii) integration, (iii) maintenance, (iv) licensing.

Challenges

Quality assessment and selection

In the first phase of the project an extensive scan of the available language resources was performed: these included both “free” resources and licensable resources available from project partners. These resources included both “low-level” resources such as lexicons and morphological analyzers, and “high-level” resources such as thesauri or bilingual dictionaries. Such a list immediately poses the problem of selection: which resource is better suited to achieve Europeana’s goals? For some resource types, the criteria are set by a commonly agreed gold standard, e.g. POS tagging benchmarks. However, anybody who has ever participated in an evaluation experiment is aware of how much time is spent on making in-house tools (let alone third-party resources) compliant with competition standards. Even worse, for some crucial resources, such as bilingual lexicons, gold standards are simply missing, and probably even difficult to conceive. The selection is therefore dependent on an application-based evaluation, which implies at least two steps:

• To set up an application-dependent gold standard, trying to foresee application requirements: time-consuming but feasible.
• To integrate each resource for testing in the evaluation workflow: this is where the real problems start, as will emerge from the next section.

Integration

Facing the integration problem for such a number of languages immediately raises the problem of lack of homogeneity/standards.

For processing modules:
o Different capabilities;
o Different output formats;
o Different operating systems;
o Different programming languages.

For static resources:
o Different linguistic assumptions;
o Different tag sets;
o Different syntax;
o Different coverage.

If these problems can be overcome in the context of the integration of several resources with a single application, they become hardly tractable when integrating different resources with each other, which is often the case when dealing with individual open source resources or for the integration of general-purpose and domain-specific resources.

Maintenance

It is a well-known fact that maintaining a single resource is already an expensive task. In the case of Europeana, long-term maintenance could become a critical factor. Even at a “low” computational level, the presence of processing modules written in different programming languages and running on different operating systems is already a technical challenge. Maintenance becomes, however, a real challenge when taking into account that a digital library is by default a changing environment. Even without pursuing perfect “up-to-date-ness”, resources need to be updated and enriched at regular intervals, especially as far as terminology is concerned. Assuming the existence of a perfect lifecycle for resource maintenance (which, for some types of resources, is not the case), this means, in the long run, the existence of a department of at least 23 mother-tongue linguists who will maintain each resource in their language.

Licensing

The heterogeneity of licensing schemas also raises problems for integration. In the case of not-for-free resources, negotiations are often difficult even for single resources, with a lot of contracting work behind them. The case of “free” resources is, however, not much easier, as it not only requires finding a reasonable path across the constellation of different “free” licenses, but also understanding what the legal consequences of the integration of two resources with different licensing schemas are.

Can Web Services support Multilinguality in Europeana?

Work package 2 of the EuropeanaConnect project has the goal of providing multilingual access to Europeana. In order to face the complex situation described above, a radically service-oriented view of language resources is proposed. The basic idea is that any access of the Europeana infrastructure to language information is mediated by a standardized web service: this includes both multilingual information stricto sensu (needed for indexing, searching, managing classification schemas, etc.) such as lemmatization, Named Entity extraction, thesaurus look-up, etc., and cross-lingual information (basically bilingual dictionary look-up). For any high-level functionality there should be a public WSDL describing the interface that a web service must expose in order to be compliant with Europeana. In this way we expect to trigger a dynamic and distributed involvement of both project partners and external parties. While in architectural terms the advantages of such an approach are evident, the focus of this paper is to analyse in what sense they help to overcome part of the obstacles mentioned in the previous section.
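As a hedged illustration of this service-oriented view, the following sketch shows how a client might call one such language service. The endpoint, parameter names and response format are hypothetical, since the actual interfaces are meant to be fixed by public WSDL descriptions.

# Hedged sketch only: what a call to a hypothetical Europeana-compliant
# lemmatization web service might look like from the client's side. The URL,
# parameters and response fields are invented for illustration.
import json
import urllib.parse
import urllib.request

def lemmatize(text, language, endpoint="http://lr.example.org/lemmatizer"):
    """Send text to a remote lemmatization service and return its answer."""
    query = urllib.parse.urlencode({"text": text, "lang": language})
    with urllib.request.urlopen(f"{endpoint}?{query}") as response:
        return json.loads(response.read().decode("utf-8"))

# Example use (would require the hypothetical service to be running):
# print(lemmatize("bibliothèques numériques", "fr"))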

Quality assessment and choice

From a technical point of view, the adoption of Web services does not change the complexity and fuzziness of the evaluation process. However, it allows loosely coupled resource wrapping, a less invasive and more agile process than local resource wrapping or, even worse, resource transformation. Moreover, in a competitive setting, Web services simplify the process of benchmarking: anyone proposing a better solution for Europeana can prove their claim just by implementing the relevant Web service, which could then be directly connected to the evaluation workflow, without any additional overhead.

Integration

Concerning the integration of processing modules, it is clear that the Web service paradigm will solve most of the problems raised above, in particular:

• Different capabilities. Web services are able to declaratively encode the capabilities they have. The consuming application can therefore be tailored to that.
• Different output formats. Translating a proprietary/unusual format into a public specification and exposing it to the external world (under whatever licensing schema) is a much more convenient operation than producing a wrapper for the sake of one application.
• Different operating systems or programming languages: these two barriers to integration are overcome by definition, as the service can stay on the most adequate platform without enforcing any requirement on the calling application.

More interesting is the case of “static” resources integrated into a Web service, thus accessible as if they were a processing module. In this case the biggest advantage brought by the paradigm is in terms of cleanness and reversibility of operations. The resource doesn’t need to be integrated, but stays as it is. The wrapping web service just takes care of mapping the resource into the desired Web service format. The advantages of this process are:

• The resource is not transformed in any sense, which allows seamless integration of successive versions and minimizes conversion errors.
• Any upgrade of the resource is immediately reflected in the quality of the service.


• Native (i.e. third-party delivered) wrappers can be directly used without jeopardizing the non-functional features of the Europeana central system.

• There is the possibility that a Europeana-compliant service is produced directly by third parties willing to participate in the common effort.

• Mapping among different tag sets can be realized as a public harmonization service.

These advantages do not constitute an answer to the problem of integration of different resources: for that we probably have to look towards web service composition. This implies a shift from the view of language resource integration as a merging process to one of a business flow. Under this view the process of language integration is “reduced” to the identification of possible preconditions of flow and access priorities to different functionally equivalent resources. The interesting thing is that these processes can be modelled in a declarative way thanks to the availability of several Web Service Composition standards such as BPEL4WS, BPML, WSCI, DAML-S, etc.: changing the integration flow (e.g. for testing different combinations or answering different functionalities) becomes consequently a matter of modifying an XML file describing the business logic.
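The following toy sketch illustrates this declarative-flow idea in the simplest possible terms: the order of (stand-in) language services is read from an editable description rather than hard-coded. A real deployment would use one of the composition standards named above rather than this ad-hoc format.

# Toy illustration of a declarative integration flow: the sequence of
# (hypothetical) language services is data, not code.
from typing import Callable, Dict, List

# Stand-ins for remote services; in practice each would be a web service call.
def tokenize(data: dict) -> dict:
    data["tokens"] = data["text"].split()
    return data

def lemmatize(data: dict) -> dict:
    data["lemmas"] = [t.lower() for t in data["tokens"]]  # placeholder logic
    return data

SERVICES: Dict[str, Callable[[dict], dict]] = {
    "tokenize": tokenize,
    "lemmatize": lemmatize,
}

# The "business logic": just an editable list of service names.
flow: List[str] = ["tokenize", "lemmatize"]

def run_flow(text: str) -> dict:
    data = {"text": text}
    for step in flow:               # changing the flow = editing the list
        data = SERVICES[step](data)
    return data

print(run_flow("Digital libraries need multilingual access"))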

Maintenance

Web services cannot provide a solution to the maintenance problem on technical grounds, but they can ease the process of outsourcing. As a service is by definition maintained by a service provider, Europeana could benefit from the opportunity of selecting which resources can be maintained internally and which ones can simply be assigned to external web services, which might reduce maintenance costs and increase quality.

Licensing

One of the typical weak aspects in trading language resources is the lack (or the fuzziness) of licensing schemas. This is partly due to language resources being “strange” objects with a niche market and little competition, which inhibits the consolidation of acknowledged licensing schemas. With the vision of “language resources as Web services”, some light could be shed on the licensing issues simply by importing practices from the business web service community, especially in the areas of multimedia, business information and geographic information.

Conclusions

Besides the obvious conclusion that web services could prove to be beneficial for initiatives of the linguistic complexity of Europeana, there is one aspect that needs to be made more explicit: the desperate need for standards. Europeana could invent its own standards for communication among language resources or adapt some of the already available “protocols” (such as UIMA, which offers only a syntactic layer), but this should rather be discouraged, for an evident reason: Web services need an environment and a community, and in order to be effective they need to be re-used: a Web service with only one client (however big) makes little sense. It would be a waste of time and money if the initiatives currently under discussion and execution could not find a way of harmonizing data formats and protocols.


Tomorrow’s Language Resources: What Could an Open Language Infrastructure (for Evaluation) Look Like, and How Do We Get There?

Jochen L. Leidner (Thomson Reuters Corporation, USA & Linguit Ltd., UK)

To date, Language Resources (LRs) such as corpora, machine-readable lexicons, wordnets (i.e. lexical resources similar to WordNet, for languages other than English), etc. have been curated within a limited number of setups; perhaps the two most common modes of operation in the past are: (1) a research project, such as an EU FP7 framework project, develops a new corpus by collecting text or speech data, adds one or more layers of annotations and offers the result to other researchers, either directly or via distributors like ELDA or LDC; and (2) a student needs a dataset for his or her Ph.D. research on a new task; since the research is original, he or she does not find suitable existing data sources, and a new dataset is thus developed. Sometimes, results are available from the developer of the resource; however, often the result is not shared, especially if the student's subsequent career is outside academia.

On the consumer side, the situation is more diverse: academic researchers who want to test a hypothesis (often based on a spontaneous idea outside of any official project); research scientists in industry or natural language engineers in small and medium enterprises (SMEs) who are tasked with building a new system or component, who have to assess the feasibility and estimate time and cost (usually within days), and then go about building it, usually in a very short period of time (a few months). Last but not least, due to the pervasiveness of the Internet there are perhaps many other (often unanticipated) use cases of “LR consumers”, ranging from school teachers picking example sentences of real data for their exams to curious hobbyists interested in language. Two special use cases are worth singling out: replication of experiments and use of prior art (an algorithm published as a paper) as a baseline.ⁱ

To make researchers more productive, we should increase access to LRs without red tape, and here two alternative methods are discussed:

Web service based evaluation against LRs: For LRs used to evaluate systems, build a Web service from which (parts of) a LR can be downloaded, annotated by a research group’s system, and uploaded with that system’s output. The Web service would then compute a very large set of evaluation metrics (more than the group would themselves have time to implement) and return the resulting scores to the researchers for convenient use in their publications. For information retrieval evaluations, this has been proposed to NIST by the present author (personal communication, 2006), and for speech systems, this has been proposed by Höge (2009) [in Calzolari et al. (2009), p. 67].
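A rough sketch of the client side of such an evaluation service is given below; the URL, payload fields and returned metric names are hypothetical and only illustrate the download-annotate-upload-score cycle described above.

# Hedged sketch of the client side of the evaluation web service imagined
# above: download a test set, annotate it locally, upload the system output,
# and receive a batch of scores. Everything service-specific is invented.
import json
import urllib.request

BASE = "http://eval.example.org/api"   # hypothetical evaluation service

def download_testset(resource_id: str) -> dict:
    with urllib.request.urlopen(f"{BASE}/testsets/{resource_id}") as r:
        return json.loads(r.read().decode("utf-8"))

def submit_run(resource_id: str, system_output: dict) -> dict:
    payload = json.dumps({"resource": resource_id,
                          "output": system_output}).encode("utf-8")
    req = urllib.request.Request(f"{BASE}/evaluate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.loads(r.read().decode("utf-8"))   # e.g. {"precision": ...}

# Typical use (requires the hypothetical service):
# testset = download_testset("ner-news-2010")
# scores = submit_run("ner-news-2010", my_system.annotate(testset))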

Sharing results of computations: In Leidner et al. (2003), we proposed a multi-layered annotation of corpora with output of multiple annotators (say, a set of five different POS taggers and three different parsers). In this model, an evaluation can pick the tag layer produced by a particular tool without having access to that tool itself, because its output has been made available. Furthermore, the availability of alternatives (e.g. multiple parse trees) reduces the risk of tool-specific artifacts, thus strengthening support for hypotheses tested. Whereas the implementation in the aforementioned paper was based on inline XML, today the use of stand-off annotation would be more prudent.
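As a minimal sketch of the stand-off idea, consider several tools annotating the same text, with each layer referring to character offsets rather than being inlined; the layer names and structure below are invented for illustration and are not taken from Leidner et al. (2003).

# Minimal stand-off sketch: multiple tools annotate the same text, and each
# layer refers to character offsets instead of modifying the text itself.
text = "Top official is fatally shot ."

standoff = {
    "text": text,
    "layers": {
        "pos/tagger-A": [          # hypothetical tool name
            {"start": 0, "end": 3, "tag": "JJ"},
            {"start": 4, "end": 12, "tag": "NN"},
        ],
        "pos/tagger-B": [          # alternative layer from a second tool
            {"start": 0, "end": 3, "tag": "NNP"},
            {"start": 4, "end": 12, "tag": "NN"},
        ],
    },
}

# An evaluation can then select whichever layer it needs:
chosen = standoff["layers"]["pos/tagger-A"]
for span in chosen:
    print(text[span["start"]:span["end"]], span["tag"])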


The first approach “hides” the LR behind a Web service API for evaluation (all computation, no data), whereas the second approach does the converse (no computation, all data). The second approach would still benefit from an upload/download Web site for user-contributed LRs.

Questions

1. Awareness: How can we make sure that somebody searching the WWW for “Mexican Spanish Treebank” either gets lucky quickly, or learns swiftly that there is no such thing?

2. Cycle speed: How can the dissemination of language resources from the developer to its potential consumer be made more efficient?

3. Politics: How can EU project consortia and Ph.D. students be incentivized to make their resources publicly available?

4. Science: How can replication of experiments on reference data be facilitated?

5. Automation: How can evaluation against a set of LRs be offered as an automatic Web service?

6. Red tape: Researchers do not like administrative activities; how can overhead be cut? For instance, can sample documents be offered for download without paperwork to permit a quick check whether a LR is actually what the researchers are looking for?

7. Cost: How can crowd-sourcing be used to cut cost in LR production?

8. Democratization: How could the users have a vote in what LRs are funded next? Would it be possible to launch a market where people can submit proposals of what they would like to have (“parallel medical corpus English-French”, “Treebank for Pashtu”, “POS tagged corpus of Chinese”, “Cebuano lexicon with POS and sub-categorization frames”) and then fund the proposals that get most support from the community?

References

Calzolari, N., P. Baroni, N. Bel, G. Budin, K. Choukri, S. Goggi, J. Mariani, M. Monachini, J. Piperidis, V. Quochi, C. Soria and A. Toral (2009). The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe. FLaReNet Forum, Vienna, 12-13 February 2009. Technical Report, Istituto di Linguistica Computazionale del CNR, Pisa, Italy.

Leidner, Jochen L., Tiphaine Dalmas, Bonnie Webber, Johan Bos and Claire Grover (2003). “Automatic Multi-Layer Corpus Annotation for Evaluating Question Answering Methods: CBC4Kids”. In Proceedings of the Third Workshop on Linguistically Interpreted Corpora (LINC-3), held at the Tenth Annual Meeting of the European Chapter of the Association for Computational Linguistics (EACL 2003), Budapest, Hungary, pp. 39-46.

ⁱ Both are paramount to empiricism, the hallmark of all sciences since Francis Bacon first championed the use of experimentation, which led to the foundation of Britain's Royal Society exactly 350 years ago.


We desperately need linguistic resources! - based on the users' point of view -

Satoshi Sekine
Computer Science Department

New York University

[email protected]

1 Introduction

The best implementations of large-scale NLP applications (machine translation, information extraction, question answering, sentiment analysis, summarization, dialogue systems, etc.) use a wide variety of linguistic knowledge. Such linguistic knowledge includes, but is not limited to, knowledge about named entities (e.g., people or award names), coreference (e.g., aliases, synonyms and possible lists of nouns which can refer to a particular entity), paraphrases, hypernyms-hyponyms, semantic class labels, selectional restrictions for predicates, textual entailment, inference, word sense disambiguation knowledge, knowledge of event sequences, scripts and so on. (Note that we interpret “linguistic knowledge” broadly, and include material that others might classify as “world knowledge”.)

Recently, there has been much work on both the creation and exploitation of linguistic knowledge. Unfortunately, not many researchers share the knowledge they create. This results in a lot of duplicate effort because researchers working on the same domain must create the same or similar kinds of knowledge to get started. An obvious, but difficult step in this direction would be for the community to share such knowledge on a very wide scale. We believe the problem lies in: 1) lack of places to contribute and obtain the resources, 2) lack of explicit motivation for contributors, 3) corpus size problems, 4) lack of time and money to prepare resources for distribution (e.g., quality control, format standardization, provisions for user feedback/revision), 5) few opportunities to combine and reconcile related knowledge sources and 6) lack of communication between knowledge producers and users.

2 Example

First, we will demonstrate why the NLP community needs a broad spectrum of linguistic knowledge. The following newspaper article is from The New York Times, June 6, 2009, reporting an attack on the Interior Minister of the Russian Republic of Dagestan.

Top Official Is Fatally Shot in North Caucasus

The highest-ranking law enforcement official in the Russian Republic of Dagestan and one of his deputies were fatally shot when a gunman strafed their car with automatic weapon fire as they left a restaurant on Friday, Russia's chief prosecutor announced. Interior Minister Adilgerei Magomed Tagirov, left, and his chief of logistics died in the hospital. The attack underlines continuing violence in the North Caucasus. Though separatism has been suppressed in Chechnya, clashes between armed militants and authorities are reported on a weekly basis in the neighboring republics of Dagestan and Ingushetia.

One of the goals of NLP technology is to “understand” this kind of article, i.e., “create a model of text meaning” as determined by some theory or for use by some set of applications. For most models of “text understanding”, it is essential to identify the entities, such as people and locations, appearing in the article and to extract the relationships between them. In this article, there are at least 4 people and, for example, there is a relationship between the two victims, the minister and his deputy. Also we need to identify events among those people: “shot and killed”, “announcement” and “die in a hospital”. These are not easy tasks for machines because of the following problems.

Coreference: For example, there are 5 mentions of Mr. Tagirov, which are “top official”, “the highest-ranking law enforcement official”, “his” (in “his deputies”), “Interior Minister Adilgerei Magomed Tagirov” and “his” (in “his chief of logistics”). Resolving the links between those mentions is a task of coreference resolution. Much research in this area confirms that, in order to adequately address this coreference problem, we need extensive linguistic knowledge. Among the types of knowledge which have been individually demonstrated as beneficial for coreference are gender information (Ge et al. 1998), lexical dependencies (Ge et al. 1998, Yang et al. 2005), nominals used to refer to a named individual (Yang and Su 2007), and event-event argument dependencies (Bean and Riloff 2004). For our example, we need the knowledge that the “Interior Minister” can be referred to as the “highest-ranking law enforcement official”. It is unrealistic to believe that machine learning systems trained on small amounts of data can capture this level of detail.

PP-attachment: Prepositional phrase attachment (and structural ambiguity resolution in general) is a difficult problem, which needs a vast amount of linguistic knowledge. For example, the sentence “a gunman strafed their car with [automatic weapon] fire” has two interpretations: “their car was strafed by a gunman with fire” or “their car with fire was strafed”. It is obviously the first case, but in order to understand it, we need knowledge that “fire” is a typical instrument of “strafe”. This kind of knowledge can be extracted from a large un-annotated corpus (Hindle and Rooth 1993) (Ratnaparkhi 1998). For example, the n-gram search engine shows that “strafed * with * fire” has much higher frequency than “car with * fire”.
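A minimal sketch of this frequency heuristic, with invented counts standing in for real n-gram statistics, is given below.

# Illustration of the frequency heuristic described above: compare how often
# the preposition attaches to the verb versus the noun and pick the more
# frequent pattern. The counts are invented placeholders; a real system would
# query an n-gram collection or search engine.
def attachment_preference(verb_prep_count: int, noun_prep_count: int) -> str:
    """Return 'verb' or 'noun' depending on which attachment is more likely."""
    return "verb" if verb_prep_count >= noun_prep_count else "noun"

# Hypothetical counts for "strafed * with * fire" vs. "car with * fire":
print(attachment_preference(verb_prep_count=1200, noun_prep_count=15))
# -> "verb": i.e. "with ... fire" modifies "strafed", not "car"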

Synonym, Paraphrase and Entailment: In the above passage, the noun “attack” from the phrase “The attack underlines continuing violence…” refers to the same event as the verb “shot” from “were fatally shot”. In order to understand this, the system needs synonym, paraphrase or entailment knowledge. Also, similar knowledge is needed to understand that “fatally shot” entails “died”. Thesauri or ontologies can be used to solve a part of this problem. For example, WordNet has been used in many NLP systems, because of the coverage, accuracy and availability of the resource for researchers. However, WordNet needs to be supplemented by other methods in order to fully handle synonymy, e.g., the relationship of “attack” and “fatally shot” is not directly found in WordNet. In order to increase coverage, NLP researchers have automatically extended WordNet in various ways, e.g., (Harabagiu and Moldovan 00), (Snow et al. 06) and (Hovy et al. 09). Also, more improvements in terms of the kind of content are needed, such as context sensitivity and cross-category links (Fellbaum 08). On top of that, semantic equivalence is not limited to the word level (i.e. synonyms). There are also equivalences at the phrase level (i.e. paraphrases), which are much harder to handle, as the phrases are longer and there are combinatory variations. Several trials have been made to find paraphrase knowledge automatically from corpora (Lin and Pantel 01) (Barzilay and McKeown 01) (Shinyama and Sekine 03) (Szpektor and Dagan 2008). Also, a hand-created lexical resource, NOMLEX-PLUS (Meyers et al. 2004a), can provide the knowledge that the phrase “clashes between armed militants and authorities” implies that “Armed militants and authorities clash”; this lexicon indicates that a “between” PP modifier of the noun “clash” corresponds to the subject of the verb “clash”. These resources, which are not perfect yet, are very useful for the textual entailment task (Iftene 08).

Word Sense Ambiguity: According to WordNet, there are 9 senses of “attack”, including “offensive against an enemy” and “intense adverse criticism”. It is the first sense if “a gunman” attacked “a politician”, but it would most likely be the second sense if “a politician” attacked “a politician”. The word sense ambiguity should be resolved using local and long-distance contexts. Such word sense disambiguation knowledge may be prepared by unsupervised sense categorization using a large corpus (Yarowsky 95).

Attributes of Entity: “Chief of logistics” is a title of a person who can also be referred to as “his deputy” in terms of his relation to his boss, who is the “Interior Minister”. This is a type of knowledge in which an attribute of a person is related to the relationship of the person to another person. We need to prepare instances of such knowledge in order to understand the text. Also, several kinds of attributes can be used to refer to a person, just like “chief of logistics”; these include age (“the young man”) and award history (“the 4-time U.S. Open champion”). The number of attributes could be very large for some types of entity (Sekine LREC07). The list of possible attributes is an important kind of knowledge.

Named Entity and semantic category list: In general, Named Entities are regarded as a very important component in text understanding (Grishman and Sundheim 1996) (Nadeau and Sekine 2007). Most NE typologies for general domains, such as the newspaper domain, include 3-8 types, such as person, location, GPE (Geo-Political Entity) and organization. However, even in the newspaper domain, those entity types cover only a subset of the named entities that actually occur in the text. For example, there are event names (e.g. the Beijing Olympics), award names (e.g. the Academy Awards) and product names (e.g. iPod). Although we have had great success in developing named entity recognizers using machine learning technologies for a smaller number of entity types, we have not yet investigated if the same technologies would work for a larger number of entity types. For some categories, acquiring a list of names is crucial. Some NE-like tasks (e.g., the ACE entity task) have extended semantic categorization to phrases headed by common nouns. These tasks can be aided by ontologies like WordNet as well as other lexical resources like NOMLEX-PLUS and COMLEX Syntax (Macleod et al. 1998a), which include lists of common nouns that fall into categories including occupations, human, temporal words, etc.

Name aliases: Name aliasing is very important to identify the same entity across mentions in a single document or in multiple documents. For example, “Republic of Dagestan” is also referred to as “Dagestan”. According to Wikipedia, it is also spelled as “Daghestan”. In order to link multiple news sources which may use different spellings, name alias knowledge is needed (Bollegala et al. 08).


Entity Relation Knowledge: Knowledge of the geographical relationships among “Russia”, “North Caucasus”, “Republic of Dagestan” and “Chechnya” is needed to understand the last two sentences in the article, and knowledge of the political relationship between “Russia” and the “Republic of Dagestan” is needed to understand why the Russian prosecutor made an announcement about the death of the Interior Minister of the republic.

Script and verb relations: It is common sense that after a person is shot, the person may be brought to a hospital and he/she may recover or die. This is called script knowledge, or simply a chain of event (or verb) relations. (Chklovski and Pantel EMNLP04) created resources about verb relationships, called VerbOcean. They set five categories (similarity, strength, antonymy, enablement and happens-before) and used lexico-syntactic patterns to extract 29,165 instances which fall into those categories with 65% accuracy. The resource is available to the public. Also, (Chambers and Jurafsky 08) (Chambers et al. 07) proposed to induce narrative event chains from global evidence of the same type of events and to classify temporal relations between events.

3 Conclusion

In short, a wide variety of linguistic knowledge is crucial in order to advance current NLP technologies and create more accurate and useful NLP applications. Many of these types of knowledge have been separately developed at individual NLP research sites. However, because the needed knowledge is so large and diverse, it is not realistic to assume that it can all be developed by a single institute or a small group of people. So, a collaborative and community effort is needed and sharing resources is crucial.

References

R. Barzilay and K. McKeown (2001). Extracting Paraphrases from a Parallel Corpus. In Proc. of ACL/EACL, Toulouse.

D. Bean and E. Riloff (2004). Unsupervised Learning of Contextual Rule Knowledge for Coreference Resolution. In Proc. of HLT/NAACL.

D. Bollegala, Y. Matsuo and M. Ishizuka (2008). A Co-occurrence Graph-based Approach for Personal Name Alias Extraction from Anchor Texts. In Proc. of IJCNLP.

N. Chambers and D. Jurafsky (2008). Unsupervised Learning of Narrative Event Chains. In ACL/HLT.

N. Chambers, S. Wang and D. Jurafsky (2007). Classifying Temporal Relations Between Events. In ACL-07.

T. Chklovski and P. Pantel (2004). VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations. In Proc. of EMNLP.

C. Fellbaum (1998). WordNet: An Electronic Lexical Database. MIT Press.

C. Fellbaum (2008). Identifying, Finding and Encoding Semantic Relations. Symposium on Semantic Knowledge Discovery, Organization and Use.

N. Ge, J. Hale and E. Charniak (1998). A Statistical Approach to Anaphora Resolution. In Proc. of the Sixth Workshop on Very Large Corpora.

R. Grishman and B. Sundheim (1996). Message Understanding Conference - 6: A Brief History. In Proc. of the 16th COLING.

S. Harabagiu and D. Moldovan (2000). Enriching the WordNet Taxonomy with Contextual Knowledge Acquired from Text. In Natural Language Processing and Knowledge Representation: Language for Knowledge and Knowledge for Language, S. Shapiro and L. Iwanska (eds.), AAAI/MIT Press.

D. Hindle and M. Rooth (1993). Structural Ambiguity and Lexical Relations. Computational Linguistics 19(1).

E. Hovy, Z. Kozareva and E. Riloff (2009). Toward Completeness in Concept Extraction and Classification. In Proc. of EMNLP.

A. Iftene (2008). UAIC Participation at RTE4. In Proc. of TAC.

D. Lin and P. Pantel (2001). DIRT - Discovery of Inference Rules from Text. In Proc. of ACM SIGKDD.

C. Macleod, R. Grishman and A. Meyers (1998a). COMLEX Syntax. Computers and the Humanities, 31: 459-481.

D. Nadeau and S. Sekine (2007). A Survey of Named Entity Recognition and Classification. Journal of Linguisticae Investigationes 30:1.

A. Ratnaparkhi (1998). Statistical Models for Unsupervised Prepositional Phrase Attachment. In Proc. of the 36th ACL.

S. Sekine (2008b). Extended Named Entity Ontology with Attribute Information. In Proc. of LREC.

Y. Shinyama and S. Sekine (2003). Paraphrase Acquisition for Information Extraction. In The Second International Workshop on Paraphrasing: Paraphrase Acquisition and Applications.

R. Snow, D. Jurafsky and A. Y. Ng (2006). Semantic Taxonomy Induction from Heterogenous Evidence. In Proc. of COLING/ACL.

I. Szpektor and I. Dagan (2008). Learning Entailment Rules for Unary Templates. In Proc. of COLING 2008.

X. Yang and J. Su (2007). Coreference Resolution Using Semantic Relatedness Information from Automatically Discovered Patterns. In Proc. of ACL.

X. Yang, J. Su and C. L. Tan (2005). Improving Pronoun Resolution Using Statistics-Based Semantic Compatibility Information. In Proc. of ACL.

D. Yarowsky (1995). Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proc. of ACL.


Language resources for information extraction: demands and challenges in practice

Christos Tsalidis
Neurolingo, Greece

Demands on language resources for information extraction: our experience

In the framework of various R&D activities and projects, we develop language resources (LRs) and tools required for information retrieval and extraction (IE). From our experience in projects mostly related to IE in the Modern Greek language, the IE task demands a large variety of LRs, which means that we have to define beforehand:

a. what kind of language data we need and what kind of LRs we have to create for IE,
b. what kind of tools and technology components we need in order to use them for IE,
c. what tasks or applications based on LRs are required in IE.

These LRs are namely:

1. Character sets:

• alphabets, i.e. Greek, English, and Mixed
• identification and classification of valid characters, e.g. letter, dot, symbol, digit, upper/lower case, etc.
• equivalence of characters, for instance optical equivalence between languages, e.g. English capital A vs. Greek capital A
• equivalence of phonemes, e.g. Greek /e/ and /ai/

2. Electronic dictionaries of different types:

• gazetteers, special vocabularies and domain descriptors, e.g. anthroponyms (person names), toponyms (places), company names, professions (job titles), roles, etc. However, a word can belong to more than one domain.
• morphological lexica (i.e. lemma vs. word form): morphosyntactic and stylistic attributes, inflection, syllabification, morphemic segmentation. However, a word can belong to more than one lemma (e.g. different POS) or can have more than one attribute set.
• terminological lexica (i.e. term vs. lemma): single-word and multi-word terms, term variants, domain. It is necessary to assign a unique id to each term and classify it inside a taxonomy or ontology.
• thesauri (i.e. word sense vs. lemma): synonyms, antonyms, synonym sets
• taxonomies and ontologies: semantic categories and relations, inference rules

Our dictionaries are also designed to perform spell checking and fuzzy matching of their entries.

3. Electronic text collections and corpora: especially designed to handle various types of document repositories (i.e. folders, databases, websites) and different types of documents (i.e. text, html, doc, pdf, etc.).

4. Grammar rules for the description and detection of morphosyntactic patterns (KANON formalism), such as: multi-word expressions and terms, Named Entities, specific events.

In addition to LRs, some statistical information is also required, e.g. computation of TF/IDF factors, BM25, usage of Q-grams or words, especially when the expressions to be extracted are not fully linguistic (e.g. biomedical and chemical terms, company names, product names, etc.).
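As a hedged illustration (not the scoring actually used inside MNEMOSYNE), the following sketch shows the kind of BM25 and character Q-gram computations referred to above.

# Illustrative only: textbook BM25 scoring and character Q-grams for fuzzy
# matching of non-linguistic expressions (e.g. product or company names).
# The constants and the tiny toy corpus are placeholders.
import math
from collections import Counter

def bm25(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (a list of terms) against a query, given a corpus."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        numer = tf[term] * (k1 + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)
        score += idf * numer / denom
    return score

def qgrams(s, q=3):
    """Character q-grams, useful for fuzzy matching of names and terms."""
    padded = f"##{s}##"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

corpus = [["greek", "gazette", "company"], ["company", "register"], ["news"]]
print(bm25(["company"], corpus[0], corpus))
print(len(qgrams("Neurolingo") & qgrams("Neurolingo SA")))  # q-gram overlap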


In order to deal with the requirements of advanced NLP applications, such as multi-word term identification, Named Entity Recognition (NER), and information extraction, we developed an integrated document collection processing environment named MNEMOSYNE®. The system incorporates a wide variety of LRs and different types and flows of analyzers, fuzzy matching techniques, and semantic annotations of documents at different levels of abstraction.¹ A graphical representation of the IE pipeline is given below:

[Figure: the MNEMOSYNE IE pipeline]

A case study example

One of our IE projects, already completed, dealt with IE (i.e. Named Entity and Event recognition) from the Greek Government Gazette. An example for the NER <Person> is given below in six steps:

1. Input text (raw)
2. Sentence splitter

3. Tokenisation, lexical identification

<span offset="889" length="165"> <contents>

1. Κυριάκος Μουρατίδης του Θεοφίλου, που γεννήθηκε στη Θεσσαλονίκη το έτος 1952, κάτοικος Θεσσαλονίκης οδός Πλατεία Ναυαρίνου 3, ως Πρόεδρος και Διευθύνων Σύμβουλος.

</contents> <annotations> <tag name="SSEQNO">3</tag> </annotations> </span>

<annotations> <tag name="TTEXT">Κυριάκος</tag> <tag name="VOCABS">PFName+PSName</tag> <tag name="LEXY">{Κυριάκος,MASC+N+NOM+SING}</tag> <tag name="ORTHO">NrWrd+FcWrd+WthLtrs</tag> </annotations>

<annotations> <tag name="TTEXT">Μουρατίδης</tag> <tag name="VOCABS">PSName</tag> <tag name="LEXY"/> <tag name="ORTHO">NrWrd+FcWrd+WthLtrs</tag> </annotations>

<annotations> <tag name="TTEXT">Θεοφίλου</tag> <tag name="VOCABS">PFName+PSName</tag> <tag name="LEXY">{Θεόφιλος,GEN+MASC+N+SING}</tag> <tag name="ORTHO">NrWrd+FcWrd+WthLtrs</tag> </annotations>

¹ The MNEMOSYNE system has already been used in some important R&D projects for text mining and information extraction from free-text documents, with very good scores as regards the size of the input data, the speed of processing and the output accuracy: for instance, after processing a collection of 28,000 documents in less than 4 hours, an accuracy of 90% was achieved on the recognition of ~500,000 events and ~2,600,000 Named Entities.


4. Syntactic rule (through KANON formalism)

[IRULE="PERSON_3_1", TTEXT=TagPerson("PERSON_3_1","PERSON","%n%s%f",$x1,$x2,$x3)] => \ [TTEXT==$x1, ORTHO->AnyOfOAttrs([FcWrd,AcWrd]), LEXY->HasNoneMAttrs([ART]), VOCABS->AnyAndNoneOfVocabs([PFName],[PExcept])], [TTEXT==$x2, ORTHO->AnyOfOAttrs([FcWrd,AcWrd]), VOCABS->NoneOfVocabs([PExcept])], [TTEXT=="του"], [TTEXT==$x3, ORTHO->AnyOfOAttrs([FcWrd,AcWrd]), VOCABS->NoneOfVocabs([PExcept])] / ;

5. Named Entities Recognition
6. Output text (annotated)

<span offset="892" length="32"> Μετά την αντικατάσταση αυτή η νέα σύνθεση του Διοικητικού Συμβουλίου του οποίου η θητεία λήγει την 26.9.2010 έχει ως <contents>

Challenges and perspectives On the one hand, our activities are mainly related - until now - to Modern Greek which is a less-spoken European language and, at the same time, a very demanding one (highly inflected, free word order, different alphabet, etc.). However, there is a serious lack of available resources for “small” languages; thus, we have to create our own LRs (Morphological lexicon, biomedical lexicon, thesaurus, KANON grammar formalism, etc.) in order to meet the needs of R&D projects. This is highly costly and time-consuming, and also limiting enough, when it comes to projects with higher demands on knowledge representation and management. On the other hand, we are interested in broaden up our activities to multilingual applications. However, we will need some LRs on languages other than Modern Greek (e.g. monolingual or multilingual gazetteers, lists of words, spelling and grammatical information, syntactic patterns, grammar rules, etc.).

Κυριάκος Μουρατίδης του Θεοφίλου κατωτέρω: 1. Κυριάκος Μουρατίδης του Θεοφίλου, που γεννήθηκε στη Θεσσαλονίκη το έτος 1952, κάτοικος Θεσσαλονίκης οδός Πλατεία Ναυαρίνου 3, ως Πρόεδρος και Διευθύνων Σύμβουλος.

</contents> <annotations> <tag name="TTEXT">PERSON</tag> <tag name="IRULE">PERSON_3_1</tag> </annotations> </span>


Infrastructures - Shooting at a moving target

Peter Wittenburg, MPI - NL

 

Currently, infrastructure building seems attractive since few constraints have been specified, i.e. creative people can follow their interests. This will change quickly, since funders are necessarily interested in an eco-system of lean infrastructures with little overlap between them, and since the infrastructure landscape will be more and more constrained by standards, interface specifications, common practices, etc. The LRT community is only a small player in this big game.

A time window was open for almost three years, which the LRT community used to influence the agenda in a number of areas such as center requirements specification, distributed authentication, flexible metadata, concept registration, standards definition for interoperability, service orientation frameworks, persistent identifier solutions and principles of data management. This time window is rapidly closing for most of the mentioned topics, but we can state that the LRT community has used its chance and pushed forward solutions.

One example where the LRT domain came up with some nice solutions is the area of service-oriented architectures. It is a hot area where quick glory can currently be gained by solutions of limited scope. The real challenges with respect to interoperability will come when we speak about an open marketplace of interoperable services. The challenges that are specific to the LRT community have not been solved and are primarily not of a technical nature, but include a tedious interaction and harmonization process. For those challenges which turn out to be generic, the LRT community will again only be a small player, i.e. we will need to adopt solutions from others.

In general we can say that the area of research infrastructures is a very dynamic one, where only cost-efficient and robust solutions will survive in the end. No one can yet predict which these will be, but the LRT community is well represented.


Position Paper for the Session “Services and Functionalities for an Open Resource Infrastructure” at the FLaReNet Forum 2010

     

Growing resources, raising tools, breeding collaboration

Hans Uszkoreit, DFKI

In my position statement, I will not add to the long list of desiderata concerning resources, resource infrastructures and organizational schemes. For my own work the list is already long enough. Today I want to say two things: first, I want to argue for a certain modest model of realization based on customer development and, second, I want to predict and advocate 'agricultural' schemes of resource development that may not be suitable for all or most, but certainly for many, types of resources.

The collective deliberation on language resources organized by FLaReNet, sometimes building on results of earlier initiatives and also cooperating with other ongoing actions, keeps producing many thoughtful and sophisticated visions on the nature, organization, maintenance and exploitation of a wide range of data, specifications, standards and tools. In this process, the international dialogue has gotten further and further ahead of reality. This is a good outcome and it is to be expected when, at FLaReNet gatherings and many other meetings, intelligent visionaries meet who do not have sufficient time or financial resources to turn their ideas into reality. This is not to say that the scientists meeting here and before do not also contribute to the growth of the resource landscape. Actually, many do. However, as in any planning process, the gap between vision and reality has widened. This is a perfectly normal development: two types of creative processes drive innovation: ambitious and sophisticated collective visions on the one hand, and courageous but much simpler efforts with tangible disruptive results on the other. Some visions simply are too big to be realized in one big swoop. The visions feed the desires of the pioneers and fertilize the imagination of the doers. The actual steps forward create the real market and therefore also prepare the market pull needed for subsequent steps. Examples of the grand visions are Nelson's and others' ideas of hypertext, in contrast to the first HTML. By the way: for the success of the vision it does not matter whether in the end all its components have been realized. Since new innovative elements come in during the process of stepwise progress, the sophistication keeps growing anyway. Today's web has powerful Java applets without ever having realized the much simpler symmetric hyperlink. A nice example of a courageous practical push is the Penn Treebank, without many features linguists would have liked to add. Another influential push was the simple BLEU score, which now many of us would like to see replaced.

When I combine all the thoughtful visions and concrete ideas I heard at the last FLaReNet meeting and that I am hearing here at this event, plus the largely overlapping and nevertheless somewhat differently focussed plans of CLARIN and other initiatives, I could easily draft a proposal for a 15+ million euro project. But such a project, doing it all at once, would certainly be doomed to failure. If the first step is too big, you may end up at a spot where there are no customers except maybe some visionary enthusiasts. If the step is too small, the value for the customers may be too small for them to change their ways. Not only have many overambitious infrastructure projects failed in this way; many highly ambitious and well-resourced start-ups and product plans have bitten the dust after successful arrival at the first milestone. This is described quite well in an extremely insightful but amateurishly produced book called “The Four Steps to the Epiphany”, authored by a serial entrepreneur who, after several cycles of reincorporation, has apparently reached the highest level of enlightenment ever bestowed on a mover and shaker from the corporate world. Some of you may have seen the book, and I hope that you did not underestimate it as I first did. The author, Steven Gary Blank, explains the pitfalls of traditional product development and introduces a new concept of customer development. Without being able to explain the ingredients of this recipe within my speaking slot, let me just say that his strategy appears to be just made for the process of realizing an open, scalable infrastructure for interoperable language resources. The strategy is built on the notion that large-scale marketing and even company building come after the processes of customer discovery, customer validation and customer creation <sic!>.

One of the goals of a new project called T4ME is to realize at least some of the visions collected, developed and transmitted by FLaReNet. Thirteen experienced organizations from eleven countries have teamed up in a brand new network of excellence, less than two weeks old, funded by DG INFSO under FP7:

• German Research Center for Artificial Intelligence (DFKI) - Coordinator

• Barcelona Media – Centre d’Innovació Media (BM)

• Consiglio Nazionale delle Ricerche, Istituto di Linguistica Computazionale "Antonio Zampolli" (CNR)

• Institute for Language and Speech Processing, R.C. “Athena” (ILSP)

• Charles University in Prague (CUNI)

• Centre National de la Recherche Scientifique, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (CNRS)

• Universiteit Utrecht, Netherlands (UU)

• Aalto University, Finland (AU)

• Fondazione Bruno Kessler (FBK)

• Dublin City University, Ireland (DCU)

• Rheinisch-Westfälische Technische Hochschule Aachen (RWTH)

• Jožef Stefan Institute (JSI)

• Evaluations and Language Resources Distribution Agency (ELDA)

The NoE is dedicated to the technological foundations of the European multilingual information society.

1. By building bridges to neighboring technology areas the NoE will approach open research problems in collaboration with other fields such as machine learning, social computing, cognitive systems, knowledge technologies and multimedia content.

2. By developing a long-demanded open resource infrastructure (ORI), the NoE will create an important prerequisite for developing the necessary technologies and applications for the multitude of European and other relevant languages.


3. By forging an alliance of researchers, commercial technology providers, corporate users, language communities and other stakeholders, and by developing together with these partners a shared vision and a strategic research agenda, the NoE will prepare the large joint effort needed for realizing the open multilingual single European information space.

Although our heart is with all of the visions presented here, I can promise that we will not try to realize them all. Together with advisors from outside, our consortium will try to find the minimally needed realization steps for continuing the process together with the customers. A central strategy will be what Blank calls customer development, a notion I did not know when we laid out this plan in our proposal. We will NOT plan the entire inventory of functionalities at once before starting the implementation process, but will develop our first product together with the development of our customers. This means: first we will plan just a core functionality that will then enable us to work with a hopefully growing customer base on the next steps.

My second point concerns a class of resources and resource/tool stacks that are not first specified and then produced by a team according to a specification (in the hope that they will be accepted by the customers), but that grow and develop organically according to the taste of the customers, starting from some seed. Let me mention as an example the ever-growing base of data and tools created and maintained by the project EuroMatrix(Plus), funded by the EU DG INFSO. It all started with parallel European Parliament data and Philipp Koehn's Moses system. In the meantime, a complex biotope has evolved containing many open-source tools and systems for alignment, pre-processing, translating, visualizing, evaluating etc. Every year, numerous researchers add to these tools. Many additions come from the annual EuroMatrix MT marathons. In addition to the Europarl data, many new data have been added, among them new training and test data and translations of test data by more than 80 research and commercial systems. Many of these translations were evaluated by one or by many evaluation methods and metrics, even including human evaluation. All these data are also freely available for research. The evaluations have been compared and correlated by meta-evaluations. The translations of systems have been combined in combo-systems whose translations are also added to the base. Finally, many of the participating systems are described in papers that are also available.

Smaller but similar biotopes can be found in the area of treebanks, together with search and annotation tools and with multiple layers of additional annotations contributed by a multitude of centers and individuals. I would like to predict that, just as in other disciplines such as biology, astronomy and climate research, there will be an increasing number and volume of resources that grow organically. These could be fed by researchers, companies and even outside participants, such as translators, language teachers and lay contributors. Let us stay for a moment with the agricultural metaphor. The growing resources need seeds, soil, water and sun, and sometimes also some gardening to create the rows of planting, straighten the plants and take out the weeds.


Seeds are usually provided by some customers, i.e. researchers who know what they need. The soil, a suitable IT infrastructure, needs to be provided. Water, such as funding, does not have to come from a central source; a mix of rain and dedicated watering from several sources can be sufficient. Now let sunshine stand for incentives beyond sharing in the eating. Ways have to be designed to give visible recognition to contributors, so that there is some payoff for contributing more than the average participant.

New schemes need to be developed for fostering such organically grown resources, e.g. schemes to use flexible licensing for resources that may grow in unexpected ways, to allow mixed licensing for combinations of commercial and non-commercial providers and consumers, to create incentives and to work with new ways of attribution, and to secure permanence and sustainability. In EuroMatrix we could witness that the organically growing biotope of resources and tools provided a hotbed for new forms of collaboration. Among them were collaborations between researchers within the consortium, but also new collaborations with partners outside. Most of the outside partners are researchers, but some are commercial companies, and others are large users of translation technology such as the Directorate General for Translation of the European Commission.

To conclude my brief statement with a simple program: from real needs via grand visions to carefully planned first steps, and then, together with the customers (i.e., with other people like us but maybe not quite like us), onward to an evolutionary development of a self-adapting structure: a structure that is powerful and flexible enough to accommodate biotopes containing new and dynamically growing types of resources.


Linguistic Awareness on the Web

Maria Teresa Pazienza

ART Research Group, Dept. of Computer Science, Systems and Production (DISP) University of Rome Tor Vergata

[email protected]

In the last decade, the way people think, learn and behave has been revolutionized by the advent of the Internet, which allows facts, events, knowledge and opinions to be shared on the Web. As a consequence, we are witnessing a growing number of mashups providing dynamic views on multimodal data: natural language is the unifying language for multimodal data description. Moreover, the expanded and diversified user base brings new expectations for programmability and usability from a larger, broader, less specialized community of programmers (wishing for a set of services that can be embedded, as needed, in several application contexts).

An overall perspective on the task, as well as a proposal for an architecture providing instruments to support the flow of information on the Web (from the acquisition of knowledge from external resources to its exploitation in several kinds of applications), is still missing. The "information" we refer to mostly consists of diverse forms of "narrative information sources", such as text documents (or other kinds of media, such as audio and video), or of more structured knowledge content, like that provided by machine-readable linguistic resources. These comprise lexical resources (e.g. rich lexical databases such as WordNet), bilingual translation dictionaries or domain thesauri, text corpora (from purely domain-oriented text collections to annotated corpora of documents), and other kinds of structured or semi-structured information sources, such as frame-based resources.

We must move in the direction of a shareable and large-scale exploitation on the Web of:
- Language Processing Services
- Linguistic Data on the Web

The Semantic Web is going to provide us with billions of triples of semantic data about everything. Special resources/vocabularies (ontologies) will provide unified ways of accessing their content; but what about:
- interaction between agents/services based on heterogeneous vocabularies?
- on-the-fly access to published data defined upon vocabularies for which there is no dedicated application, and whose names are written in unknown (natural) languages?

The Linguistic Web may support the Semantic Web in several respects: a) free and distributed linguistic services for linguistic data provision and language analysis support information creation and submission to the Web of Data; b) linguistic services may be reusable in general-purpose systems as well as in application-oriented systems, thus supporting rapid prototyping of new special-purpose systems; c) support for Semantic Coordination between agents through "Linguistic Coordination" services: a set of services that can be embedded, as needed, in several application contexts.

A comprehensive study and synthesis of an architecture for supporting easy access to "information" on the Web, motivated by the acquisition of knowledge from external resources, has not been formalized until now. To facilitate interaction with the user community, interoperability, infrastructure efficiency and sustainability, an open resource infrastructure over which linguistic resources can be exploited would be helpful. Different Linguistic Resources (LRs) are available on the Web. These resources differ in:
- Trustworthiness: from free initiatives to coordinated research projects
- Complexity: quantity and quality of detailed information, adopted model, morphology, ...
- Representation: no standard for the representation of linguistic resources
- Implementation: available as databases, huge XML repositories, proprietary text formats, etc.

We need a solution for that. Let me point to something like the Linguistic Watermark (M. T. Pazienza, A. Stellato, A. Turbati: Linguistic Watermark 3.0: an RDF framework and a software library for bridging language and ontologies in the Semantic Web, SWAP 2008). Initially conceived as a modular classification schema for driving software libraries in accessing LRs (limited to lexical resources) in a coordinated framework, it is currently a sort of ontology-driven framework for:
- characterizing linguistic resources
- describing the "linguistic expressiveness" of ontologies
- describing the integration between ontologies and LRs
with a set of associated software libraries for:
- providing access to heterogeneous LRs
- supporting the integration between ontologies and LRs, and then evaluating such an integration
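To make this more tangible, here is a minimal sketch, using rdflib, of how a linguistic resource and its coupling with an ontology could be described as RDF triples in the spirit of an ontology-driven framework such as the Linguistic Watermark; the vocabulary (lw:LexicalResource, lw:enrichedBy, the namespace URIs) is invented for illustration and is not the actual Linguistic Watermark model.

```python
# A rough sketch with an invented vocabulary, not the actual Linguistic Watermark model.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

LW = Namespace("http://example.org/linguistic-watermark#")   # hypothetical namespace
EX = Namespace("http://example.org/resources#")

g = Graph()
g.bind("lw", LW)
g.bind("ex", EX)

wordnet = EX.PrincetonWordNet
g.add((wordnet, RDF.type, LW.LexicalResource))                # characterizing a linguistic resource
g.add((wordnet, RDFS.label, Literal("Princeton WordNet")))
g.add((wordnet, LW.language, Literal("en")))
g.add((wordnet, LW.implementation, Literal("lexical database")))

onto = EX.MyDomainOntology
g.add((onto, RDF.type, LW.LinguisticallyEnrichedOntology))    # "linguistic expressiveness" of an ontology
g.add((onto, LW.enrichedBy, wordnet))                         # integration between ontology and LR

print(g.serialize(format="turtle"))
```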

Conclusion

While traditional research fields such as Natural Language Processing and Knowledge Representation/Management have now found their standards, cross-boundary disciplines between the two still need to find their way towards real applicability of approaches and proposed solutions. What we need is a framework in which the final linguistic service is not a "product", but something that:
- can be sold at low prices to thousands of users;
- can be easily personalized (semantically);
- can actively involve a company (in a cheap way) in activities of maintenance, monitoring and adaptation.


Will Language as a Service (LaaS) increase the interoperability in language resources and applications?

Virach Sornlertlamvanich

Thai Computational Linguistics Laboratory (TCL), NICT, and National Electronics and Computer Technology Center (NECTEC), Thailand

[email protected]

Today, achievements in research on natural language processing are crucial for handling the drastically increasing amount of information in society. The Internet is already a common platform on which we post and exchange information. A set of powerful language tools is therefore necessary to support our needs. These language tools may cover basic components such as word segmentation, soundex and language identification, up to full solutions such as machine translation or search engines. Basic language resources such as dictionaries, ontologies and corpora are also needed to support this work.

In terms of Language as a Service (LaaS), we propose a framework that provides such language processing as a set of web services. The user can access the required services through a specific web service protocol. A solution provider can also integrate the service components to realize a higher-level service solution. Moreover, such a web service platform can facilitate comparative studies of language tools and resources. By virtue of the availability of an open API for the web services, the resources will be kept interoperable and can be accessed in a more flexible manner. Any emerging standard will be one that others can follow. In the initial stage, we provide services for Thai word segmentation, the Asian WordNet, language identification and soundex at http://www.tcllab.org/nlpws. The services can be used directly on the web, or accessed through the web service after registration.
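As an illustration of the LaaS idea, the following sketch shows how a client might call such a word segmentation web service over HTTP; the endpoint path, parameter names and JSON response shape are assumptions made for the example, not the documented protocol of the TCL services.

```python
# Hypothetical client for a LaaS-style word segmentation service.
# Endpoint path, parameter names and response format are assumptions, not the
# documented protocol of the services at http://www.tcllab.org/nlpws.
import json
import urllib.parse
import urllib.request

def segment(text, lang="th", endpoint="http://www.tcllab.org/nlpws/segment"):
    params = urllib.parse.urlencode({"text": text, "lang": lang})
    with urllib.request.urlopen(f"{endpoint}?{params}") as response:
        return json.loads(response.read().decode("utf-8")).get("tokens", [])

# A solution provider could compose such basic services into a higher-level
# solution, e.g. segmenting text before looking tokens up in the Asian WordNet.
if __name__ == "__main__":
    print(segment("ตัวอย่างข้อความภาษาไทย"))
```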


Toward a Standardized Set of Language Service Web APIs

Yoshihiko Hayashi¹
Osaka University, Japan

LRaaS

The notion of service in the world of software has become more important, as illustrated by terms such as SOA (Service Oriented Architecture) or SaaS (Software as a Service). In parallel with this, service-oriented language infrastructures that push forward the notion of LRaaS (Language Resource as a Service) have come to the fore. Such an infrastructure enables:

• more non-expert users to have access to LR/LT, without being too bothered by cumbersome IPR issues;
• a virtual/dynamic language resource to be realized as a language service through useful combination of the existing language services.

Language services on the Language Grid

The Language Grid is a multilingual language service infrastructure on the Internet whose primary goal is to provide solutions enabling the above-mentioned environment. The majority of the envisaged users are non-experts in LR/LT who are involved in activities of intercultural collaboration. The Language Grid currently accommodates more than 60 Web services, each of which is classified into one of around twenty service types. The service type is very important in the sense that it defines an API. Figure 1 sketches the current set of service types on the Language Grid, from which we can observe the following:

[Figure 1: Language service types. Resource access (dictionary: bilingual, concept; corpus: parallel text, adjacency pairs), with signature f: query → partial resource; Linguistic Analysis (morphological: tokenization, POS tagging; syntactic: dependency structure), with signature f: (annotated) text → annotated text; Translation (general, back translation, multi-hop, with temporal dictionary), with signature f: text → text.]

According to their primary functionality, they can be roughly classified into (1) Translation, (2) Resource access, and (3) Linguistic Analysis. This classification corresponds to the abstract typing of the service input/output data, as also shown in Fig. 1.
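One way to picture this abstract input/output typing is as a set of interface signatures, as in the sketch below; this is only an illustration of the idea and does not reproduce the actual Language Grid APIs or type names.

```python
# Illustration only: abstract typing of the three service categories,
# not the actual Language Grid API.
from typing import Dict, List, Protocol

class TranslationService(Protocol):              # f: text -> text
    def translate(self, text: str, source: str, target: str) -> str: ...

class ResourceAccessService(Protocol):           # f: query -> partial resource
    def lookup(self, query: str) -> List[Dict[str, str]]: ...

class LinguisticAnalysisService(Protocol):       # f: (annotated) text -> annotated text
    def analyze(self, text: str) -> List[Dict[str, str]]: ...

def back_translate(mt: TranslationService, text: str, src: str, tgt: str) -> str:
    """Round-trip translation, used to conjecture the quality of a translation."""
    return mt.translate(mt.translate(text, src, tgt), tgt, src)
```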

Translation services are of particular importance in the Language Grid. Several service types have been introduced that are useful in alleviating the difficulties of MT-based cross-lingual communication, including back translation (to conjecture the translation quality) and multi-hop translation (to support more language pairs than originally provided).

It still lacks some important service types, such as media conversion (e.g. speech recognition or text-to-speech), text generation (from an internal representation), or corpus statistics/machine learning. New service types will, however, be incorporated whenever they are identified by an LR/LT provider and/or a user and acknowledged by the operator.

¹ Acknowledgments: This work was supported by the Strategic Information and Communications R&D Promotion Programme (SCOPE) of the Ministry of Internal Affairs and Communications of Japan. The author also thanks Toru Ishida and Yohei Murakami for their helpful discussions.


A possible use case scenario

Figure 2 depicts a possible near-future use case scenario in which a group of people jointly work to realize tailored translation services. To implement this scenario, the community documents are first analyzed; multilingual term lists are then extracted under user supervision. Moreover, some of the resulting translations may be selected by the user and formulated as a translation-examples resource. These community-based resources are then fed to the translation services to improve their quality. The points here are twofold: (1) the workflow as a whole functions to create new language resources and services by employing existing and newly created language services, and (2) human involvement is crucial. As seen in this example, some language services that have not been considered so far must first be identified. Use case studies like this will reveal missing service functionalities in an infrastructure and expose the dimensions along which to organize services of various kinds, which are essential for organizing the ontological knowledge discussed below. (A minimal sketch of such a workflow follows the figure below.)

[Figure 2: A community-based use case. Community documents feed a workflow of Term Extraction, Multilingual Term-list Creation (using bilingual and concept dictionaries and a web search engine), Translation with Temporal Dictionary, Example-list Creation and Example-Based Translation, producing multilingual term lists, translation examples and translated documents.]
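The sketch below merely illustrates how such a community-based workflow could be composed from individual services; the function names and parameters are hypothetical placeholders, not part of the Language Grid or any existing infrastructure.

```python
# Hypothetical composition of language services for the community use case.
# All functions passed in are illustrative placeholders for deployed web services.
def build_tailored_translation(community_docs, extract_terms, create_termlist,
                               translate, user_accepts):
    term_list = create_termlist(extract_terms(community_docs))   # multilingual term list
    examples = []                                                # translation-examples resource
    for doc in community_docs:
        draft = translate(doc, glossary=term_list)               # translation with temporal dictionary
        if user_accepts(doc, draft):                             # human involvement is crucial
            examples.append((doc, draft))
    return term_list, examples   # new resources that can be fed back to the services
```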

Steps toward a standardized set of language service Web APIs

Figure 3 illustrates the necessary steps toward this goal, where we have to:

• identify possible language service types. To this end, bottom-up activities in the LR/LT field (a nice example is the LREC 2010 language resource map) and application-oriented, top-down activities should be coordinated;

• classify and describe the service types. We first have to clarify the dimensions of classification. Obviously, the input/output linguistic data type and the language processing functionality are two important dimensions. We then need to organize ontological knowledge that includes a taxonomy of application-oriented use intentions as well as LR/LT domain ontologies; these domain ontologies can partly be organized on the basis of the relevant international standards for linguistic data modeling;


• facilitate the Web-servicization. More concretely, we have to provide some support for implementing a wrapper for each language resource. Ontological knowledge to classify and describe language services might be a key to providing a set of wrapper templates at the moment, and to enabling semi-automatic wrapper creation in the future (a minimal wrapper sketch is given after the next paragraph).

To go through these steps, we definitely need to establish a positive loop of service creation and consumption. An inter-infrastructure collaboration should first be initiated to identify a fundamental set of language service types. In parallel, we need to establish more connections with potential user communities of various kinds to discover novel service functionalities.
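A wrapper template of the kind mentioned above could look roughly like the following sketch, which exposes a toy in-memory bilingual dictionary as a small web service; the class names and the plain-WSGI exposure are assumptions for illustration, not an existing wrapper library.

```python
# Assumed design for illustration: a language resource wrapped behind a uniform
# lookup interface and exposed as a tiny JSON web service.
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

class DictionaryWrapper:
    """Wraps a toy in-memory bilingual dictionary (the 'resource access' type)."""
    def __init__(self, entries):
        self.entries = entries                      # e.g. {"dog": ["Hund"]}

    def lookup(self, query):
        return self.entries.get(query, [])

def make_app(wrapper):
    """Exposes any object with a lookup(query) method as a web service."""
    def app(environ, start_response):
        params = parse_qs(environ.get("QUERY_STRING", ""))
        query = params.get("q", [""])[0]
        body = json.dumps({"query": query, "result": wrapper.lookup(query)})
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body.encode("utf-8")]
    return app

if __name__ == "__main__":
    service = make_app(DictionaryWrapper({"dog": ["Hund"], "cat": ["Katze"]}))
    make_server("localhost", 8080, service).serve_forever()   # GET /?q=dog
```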

[Figure 3: Steps toward standardized service APIs. Language resources (data and tools) and communities feed a pipeline of identifying, classifying and describing service types and facilitating Web-servicization; ontological knowledge and metadata/profiles for language services support the process, leading to standardized APIs.]


S3. Sharing or not Sharing: Availability and Legal Issues 17:00-19:00

Chair: Khalid Choukri – Rapporteur: Núria Bel

Introduction by the Session Chair

Introductory Talks

“TAUS Data Association - A non-profit platform for sharing translation memories” Jaap van der Meer (TAUS)

“…” Luis Collado (Google Inc., SP)

“Sharing or not sharing information, that is the legal question?” Isabelle Gavanon and Lina Su (FIDAL, FR)

Contributions

“Actual publisher’s proposal for meeting translators’ current linguistic resource needs” Marie-Jeanne Derouin (Langenscheidt KG, DE)

“Releasing lexical resources as open source - pros and cons” Bolette Pedersen (Københavns Universitet, DK)

“HLT-resources - sharing, access and payment” Torbjørg Breivik (The Language Council of Norway, NO)

“Implications of a Permissions Culture on the Development and Distribution of Language Resources” Denise DiPersio (University of Pennsylvania - LDC, USA)

“Data Collectors, Data Hunters, and IP Issues” Daniel Grasmick (Lucy Software and Services GmbH, DE)

Discussants

Luca Dini (CELI, IT)

Yohei Murakami (NICT, JP)

Alexandros Poulis (European Parliament, LUX)

Steven Krauwer (Universiteit Utrecht, NL)


TAUS Data Association - A non-profit platform for sharing translation memories

Jaap van der Meer, director TDA

The pursuit of dominance is natural in any competitive market. This may be achieved through control of the 'pipes' or the ownership of the resources. The largest players in the translation industry have battled in the past decade for ownership of the pipes or control of the infrastructure (i.e. the workflow, the Globalization Management Systems, the TM technology) as a way to possess customers and market share. Success has been limited. No single company has reached dominance and buyers are constantly seeking independence and flexibility.

We predict that the battle is now redirected towards ownership of resources, or what we call the 'lingua' (i.e. the translation memories and the translators). The 'lingua' resources form the most valuable component for the realization of 'translation out of the wall' and the delivery of high-value personalized communication. Whoever controls the 'lingua' holds the key to the future of the translation industry, and more than that: the success of globalization. The stakes are very high, and contenders realize that they have the Zeitgeist against them. The trend is openness and industry collaboration. Therefore we also predict that this battle will be neutralized and settled through support for a single industry-owned platform for the sharing of language data.

In fact, this platform already exists and is gaining more and more support. The TAUS Data Association (TDA) super cloud of translation memories will turn into a common industry operating system for all localization stakeholders. General industry agreement to support a single common platform will unleash the drive towards innovation and growth in the global translation industry. Established business models will be under pressure to change. New blood and ideas will enter the translation space. Translation business will flourish and grow like never before.

About TAUS Data Association

TDA is a non-profit organization providing a neutral and secure platform for sharing language data. Share your translation memories and in return get access to the data of all other members. TDA is a super cloud for the global translation industry, helping to improve translation quality and automation and to fuel business innovation.

www.tausdata.org


4th February 2010

Sharing Or Not Sharing? That Is a Legal Question

Isabelle GAVANON (Managing Partner, Intellectual Property and Information Technologies, FIDAL; University of Paris II Assas, FRANCE) and Lina SU (Lawyer)

Nowadays, the need to use and share content has become obvious in order to allow the development and improvement of human language technologies. As information search systems evolve, Language Resources have become more and more sophisticated and profuse on the World Wide Web, thus requiring considerable accommodation efforts. As content, Language Resources may be legally protected against certain actions such as reuse, copying, extraction, modification, etc. Prior to any sharing, many legal issues have to be taken into consideration and solved.

Indeed, a myriad of legal issues arise from making language data available to the public, based on European law, international law as well as specific national laws. Clearing all IPR (Intellectual Property Rights) and other legal issues and drawing up best licensing practices are the subject of this presentation. A series of legal issues needs to be considered in the case of reuse of language resources.

1. Legal issues raised by European law and International law

• Protection of original works: the Berne Convention of 1886 was the first agreement to recognize copyright at the international level. Any original work is eligible for protection by copyright laws. Copyright does not cover ideas and information themselves, only the form or manner in which they are expressed. As a general principle, any use of the protected work must be authorized by the copyright holder, although a few exceptions exist which allow uses without prior consent. At an international level, at least two major conceptions of the rights of creators to protect their creations have come to the fore: Copyright in Common Law countries and the French Authors' rights ("Droit d'auteur") in the continental countries. This is important in the light of Part 3 hereunder.

• Protection of databases: a "sui generis" right was introduced in the European Community by Directive 96/9/EC on the legal protection of databases. This protection is granted if a qualitatively and/or quantitatively substantial investment is made to obtain, verify and present the database content. The database rights holder is protected from any unauthorized extraction and/or re-utilization of the whole or a substantial part of the database. In November 2004, the Court of Justice clarified the scope of the sui generis right. The most important aspect of the Court's ruling is that investment in the creation of data does not trigger the sui generis protection. In essence, this means that the creator of a database who is also the creator of the materials contained in it is entitled to claim protection only if he can show substantial investment in obtaining, verifying or presenting the contents, and not in creating the contents of the database.

• Protection of personal data: Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data regulates the processing of personal data within the European Union. "Personal data" means "any information relating to an identified or identifiable natural person". According to this Directive, personal data should be processed only if certain rights of individuals are respected. These conditions fall into the following categories: transparency, legitimate purpose and proportionality.

All rules mentioned herein are applicable in each EU Member State.

2. Legal issues raised by French law

• Protection against unauthorized intrusion into, and remaining within, all or part of an automated information system: unauthorized intrusion into, or remaining within, an automated information system is punished by imprisonment and a fine on the basis of Articles 323-1 et seq. of the French Criminal Code. Any kind of irregular intrusion into an automatic data processing system is considered by the French courts as fraudulent access, regardless of whether or not the system is protected by a firewall or other technical means.

• Protection against free-riding: free-riding occurs when one firm benefits from the investments and efforts of another firm or individual without paying for or sharing the costs. This kind of practice is prohibited by Article 1382 of the French Civil Code. Settled case law has specified that there is no need for the free-rider to be a competitor in trade, or for any likelihood of confusion between two products to be characterized.

• Protection against the theft of information: a French court ruled in 2008 that the fraudulent copying of a firm's secret, non-disclosed computer files constitutes theft of information.

3. New licensing practice


A number of free licenses have been set up to enable the use, study and sharing of specific works, as well as to allow personal additions to an original work and the dissemination of the resulting work in the same free-oriented spirit. Therefore, free licenses tend to allow free dissemination and are sometimes not compatible with restrictive or exclusive distribution patterns. It is worth observing that although this type of license is called "free", this does not necessarily mean that the product distributed under such a license can be obtained free of charge.

We identified Creative Commons licenses as the free licenses most used for language resources. The best known users of Creative Commons licenses are Google, Wikipedia, Whitehouse.gov, the Public Library of Science, Flickr, etc. Creative Commons licenses function as templates. They allow anyone to intuitively draw up his/her own license with his/her own options, instead of looking for several existing licenses to meet one's needs, through a simple yet explicit representation of the available options. There are six major Creative Commons licenses that differ from each other on four major options:

• Attribution: the right to copy, distribute, display, and perform the copyrighted work, but only if the licensee gives credit in the way the original author requests;

• Share Alike: allowing others to distribute derivative works only under the same or a similar license;

• Non-Commercial: requiring that the work not be used for commercial purposes;

• No Derivative Works: allowing others to copy, distribute, display, and perform only verbatim copies of the copyrighted work, not derivative works based upon it.

This presentation will identify the implications of choosing the different options, showing at the same time their limitations and specific requirements in the eyes of European domestic law, such as French law.


Actual publisher’s proposal for meeting translators’ current linguistic resource needs

Marie-Jeanne Derouin

Langenscheidt KG

Times are changing for translators and for dictionary publishers: new needs, new media and especially the increasing role of the Internet have changed the relationship between translators and publishers. Our dictionaries are no longer translators' sole companions, and they now meet only a small part of their needs. The Internet is the most widely used resource and, consequently, dictionary publishers need to review their product and marketing strategies if they wish once again to play a significant role as content providers for translators. This will require a "transformation" in our approach, geared towards addressing the following major challenges by providing:
- a large amount of regularly updated bilingual and multilingual dictionary data, in general and specialist language, on all possible media used by translators, including print and electronic devices;
- a greater range of information, for example term definitions, contextualized examples, references to standards and pragmatic information specific to terms, in addition to the possible translations in the target language;
- reliable dictionary data based on validated sources and compiled by experts in the various subject areas and in lexicography;
- easy access to these data, either in print form or on electronic devices, online or offline, and with a fair pricing policy.
To this end, we have adopted a step-by-step strategy over the course of the past few years, which will be briefly described in this paper.


FLaReNet 2010: Releasing lexical resources as open source – pros and cons

Bolette Sandford Pedersen Center for Language Technology (CST), University of Copenhagen

[email protected]

1. Introduction

This paper discusses some experience that we have gained at the Center for Language Technology at the University of Copenhagen during the last years regarding the application of different sharing policies for lexical resources. In March 2009, the first version of DanNet – a wordnet for Danish currently comprising 50,000 synsets and more than 250,000 semantic relations – was released under an open source license similar to the Princeton WordNet license. This dissemination strategy stands in contrast to former strategies applied by the Center for Language Technology, where we have developed specific payment licenses for each resource product. Both strategies have their pros and cons, which we sketch out in the following.

2. Licenses applied for Danish lexical resources

A lexical resource predating DanNet, namely Sprogteknologisk Ordbase (STO, cf. Braasch & Olsen 2004), is distributed under specific licenses depending on whether the data are meant for research or commercial purposes. The research license includes the following central restrictions:

"The user of STO pays CST the actual expenses related to excerption of data in accordance with the specifications of the user. CST gives the right for the user to use the lexical data for research in relation to project XX for the period XX. Ownership and copy right remain at CST. Data can only be used by the user, XX, and the user is committed to protect data so that they cannot be applied by third party."

Danish commercial enterprises, on the other hand, can experiment freely with STO data for three months, but after this period they pay a yearly maintenance fee for each requested standard excerpt of the data. Special excerpts of data require a specific license agreement. Sale of STO to international customers is performed via ELDA (Evaluations and Language Resources Distribution Agency).

In contrast, the open source license for DanNet, which basically follows the Princeton WordNet license, specifies the following instructions regarding commercial use (and no restrictions regarding use for research):

"(…) Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software, database and documentation, including modifications that you make for internal use or for distribution.

The software and database is provided "as is" and University of Copenhagen and Society for Danish Language and Literature make no representations or warranties, express or implied. By way of example, but not limitation, University of Copenhagen and Society for Danish Language and Literature make no representations or warranties of merchantability or fitness for any particular purpose or that the use of the licensed software, database or documentation will not infringe any third party patents, copyrights, trademarks or other rights."

3. Pros and cons of sharing data without pricing

One of the aspects that motivated us to go open source with the DanNet data was the previously only modest use of STO data in commercial applications: only three STO licenses have been sold during the five years that the resource has been available.

Regarding the license policy for DanNet, it turned out that both background providers to DanNet, namely the University of Copenhagen with the SIMPLE lexical resource as background, and The Society for Danish Language and Literature with The Danish Dictionary (Den Danske Ordbog) as a very central background resource, could agree to make the resource open source. The only restriction concerned the definitions, which had all been compiled from The Danish Dictionary into DanNet. The Society for Danish Language and Literature restricted these to only 25 characters of each definition in order to protect the copyright of The Danish Dictionary, in which definitions constitute a substantial part.

A clear positive effect of DanNet going open source is that other open source enterprises can use and integrate the data in new applications without any legal or economic complications. For instance, DanNet is now integrated in the Danish version of OpenOffice's word processor and used as a flexible synonym look-up facility covering synonymy as well as broader and narrower terms. Another example is the intelligent leisure-time project of a student (at that time unknown to us) who made the freely available DanNet data visible in an intuitive fashion by means of a browser; cf. Figure 1 for a screen dump of a DanNet look-up with this facility. Downloading and filling in licenses and signature forms (even had they been free) would most probably have been enough of an obstacle to prevent this student from getting started in the first place. The browser now has more than 1,000 users per day and has proved to be the triggering factor for several new users of the resource. In this way, knowledge of the resource seems to spread like ripples in a pond, since the data are now both easily and intuitively accessible.

 

Figure 1: DanNet browser visualizing the concept sko (shoe) and its semantic relations.
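To make concrete what a synonym and broader/narrower-term look-up over such a resource involves, here is a toy sketch over a hand-made hypernym table; the data and representation are invented for illustration and do not reflect DanNet's actual format or content.

```python
# Toy wordnet-like lookup: invented data, not DanNet's actual format or content.
hypernym = {
    "sko": "fodtøj",        # shoe -> footwear
    "støvle": "fodtøj",     # boot -> footwear
    "sandal": "fodtøj",
    "fodtøj": "beklædning", # footwear -> clothing
}

def broader_terms(word):
    """Walk up the hypernym chain (thesaurus-style 'broader terms')."""
    chain = []
    while word in hypernym:
        word = hypernym[word]
        chain.append(word)
    return chain

def narrower_terms(word):
    """All words one step below the given word."""
    return sorted(w for w, h in hypernym.items() if h == word)

print(broader_terms("sko"))      # ['fodtøj', 'beklædning']
print(narrower_terms("fodtøj"))  # ['sandal', 'sko', 'støvle']
```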

Negative effects of an open source license like the one we have adopted are that you more or less lose control of the resource, and that you cannot avoid commercial uses of the data that you are perhaps not so happy about, either because you do not find the quality of the product convincing or because you find that the product has not been developed in the right spirit. Even if you have been paid for your effort in the first place, seeing other people earning money from your product can be annoying, especially if you lack funding for maintenance and further development of your resource. To conclude, however, we find that the positive effects of sharing computational lexical resources for free outweigh the negative effects, since it is satisfying to see your resource being of benefit for research as well as for commercial products. Developing Danish language technology is never going to be a gold mine anyway, so if a resource like DanNet can trigger the development of new, useful and hopefully visionary language tools for Danish, we should evaluate it positively. It remains to be discussed to what extent such a strategy is relevant for all languages, or primarily for the lesser-spoken ones where the commercial market alone cannot drive the field of language resources of the future.

4. References

Braasch, Anna & Olsen, Sussi (2004). STO: A Danish Lexicon Resource - Ready for Applications. In: Fourth International Conference on Language Resources and Evaluation, Proceedings, Vol. IV, Lisbon, pp. 1079-1082.

Pedersen, B.S., S. Nimb, J. Asmussen, N. Sørensen, L. Trap-Jensen, H. Lorentzen (2009). DanNet – the challenge of compiling a WordNet for Danish by reusing a monolingual dictionary. Language Resources and Evaluation, vol. 43: 269-299.

DanNet browser (2009) = Anders Johannsen: http://andreord.dk

DanNet website: http://wordnet.dk

Den Danske Ordbog = Hjorth, E., Kristensen, K. et al. (eds.) (2003-2005). Den Danske Ordbog 1-6 ('The Danish Dictionary 1-6'). Gyldendal & Society for Danish Language and Literature.


HLT-Resources – Sharing, Access and Payment

Torbjørg Breivik The Language Council of Norway, [email protected]

- Background

After ten years of planning and discussions, the Norwegian Ministry of Culture and Church Affairs has granted funds for establishing a Norwegian HLT Resource Collection. The Norwegian National Library will be responsible for developing the collection, and the process was initiated on 1 January this year.

- Principle

The HLT Resource Collection is based on public ownership through the Ministry. Those wanting to utilize resources from the collection should not be charged more than the cost of the hours spent on collecting and preparing the material in question. Users/customers are expected to come both from industry and from research institutions.

- Access and availability

On a web page open to the public, potential users may look at short extracts of the contents and read documentation on the types of resources, kinds of tools, file formats, mark-up languages and systems, volume sizes, how to obtain access, agreement forms, etc.

- User rights, not copyright

No user will have exclusive user or owner rights to the resources. Nor will anyone be granted copyright, i.e. no one will be permitted to resell the resources to a third party. If users develop the resources or any of the connected tools further, they will be asked to return the improved result in order to help others improve their results. The idea is that all gain from cooperation, acknowledging that HLT products and services are as expensive for small languages as for the larger ones.

- Pay costs, not the real expenses

The intention is for the users of the HLT collection to pay for some of the running and managing costs of the collection, not for the real expenses of building and developing it. For commercial use there will be one price and for researchers another, but no user is expected to pay the real costs connected with the resources. A collection like this will need new and updated contents and tools to serve its purpose; the Ministry will carry these expenses.

- Competition – and international cooperation

In Norway, the charges for companies who wish to develop products and services for Norwegian using human language technology will be moderate. When it comes to international competition, cooperation and use of Norwegian language resources, there are some challenges to solve: What if an international company wants Norwegian language resources for developing products and services for the Norwegian market? And how do we handle ELDA membership and pricing policy if ELDA wants Norwegian resources in their collection? We sell resources at a very low cost compared to the ELDA prices because the Ministry pays most of the costs, and ELDA may resell them. ELDA has a policy based on what its members have agreed to, and has to cover its own expenses from the income from selling language resources. Our challenges arise when you receive public funding, as we do in Norway. As there are several small languages in the world, such questions must be discussed.


Implications of a Permissions Culture on the Development and Distribution of Language Resources

Denise DiPersio
Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA USA

[email protected]

The Philosophical Construct

The legal issues that language resource creators, distributors and users face today are rooted in a larger, philosophical crisis of information creation and sharing. That crisis is described by U.S. legal scholar Lawrence Lessig in his work Free Culture as one pitting a "free culture" (one that has property but is "as free as possible from content of the past") against a "permissions culture" (dominated by powerful rights holders and rights creators). In Lessig's view, the latter is gaining ground on the former, to the detriment of creative minds in all disciplines. Motivated in large part by the instantaneous global reach of the Internet, rights holders (mainly copyright holders) have successfully narrowed the notions of public domain and fair use to appropriate a growing permissions-ruled space. Lessig posits a diagrammatic relationship between the individual (the dot in the middle) and potential regulators (the ovals), where the regulating factors interact with each other, but law, because it affects all spheres, has the ability to alter the degree to which norms, the market and architecture (defined as the "physical world") impact the individual.

The Language Resources Effect

The implications of Lessig's model for the creation, use and distribution of language resources can be observed in the following areas.

International conventions and national laws. Language resources (LRs) are often multinational; they are created in one place, distributed in another and used in yet another. Who has jurisdiction over the resource, over users, over infringing use? The Berne Convention for the Protection of Literary and Artistic Works, administered by the World Intellectual Property Organization (WIPO), is the principal international convention governing copyright. It extends copyright protections to creators in countries other than their own, but enforcement remains a national issue. The U.S. Digital Millennium Copyright Act and its foreign counterparts are enactments of two other WIPO copyright treaties that were designed to curb online copyright violations. But again, enforcement rests with individual nations.

Licensing options. Many in the language resource community embrace open source/copyleft options such as the GNU General Public License and Creative Commons on the premise that those options promote a social good and that use restrictions and commercial exploitation are social evils. Yet commercial users are contributors to the LR community. One can argue that bringing products to market is a form of sharing, and if the market operates as expected, declining prices over time make products more widely available. Moreover, viral copyleft provisions that require source-code sharing and redistribution are problematic not just for commercial users but also for LRs with proprietary content.

Emerging modes of communication. The law lags behind technology. The LR community is now faced with important questions about how to use blogs, newsgroups, web video, SMS and social networking sites, but there are virtually no laws, regulations or court decisions governing these modalities. Who owns that data? The creator/poster, commenters, the site host? How do websites' terms of use interact with any intellectual property rights, and how are they different? What are the individual privacy and informed consent considerations for LR development?

New LR creation/annotation methods. LR creation and annotation have moved to the virtual community with Amazon's Mechanical Turk and volunteer translation sites such as Translations for Progress. Can source data licensed from third parties be made available for public annotation via such sites?

Conclusion

The LR community's support for, and work toward, leveraging new technologies to increase the availability of language resources are at odds with a legal landscape that is shifting toward less, not more, open access and sharing. Navigating this permissions culture and adapting current distribution models to new modalities are key challenges for the community.

References

Amazon Mechanical Turk. <https://www.mturk.com/mturk/welcome> (23 January 2010).
Lessig, Lawrence. Free Culture. New York: The Penguin Press (2004).
Translations for Progress. <http://www.translationsforprogress.org/ngoguide.php> (23 January 2010).


Data Collectors, Data Hunters, and IP Issues

Daniel Grasmick, Lucy Software and Services GmbH

Numerous articles, papers and blogs have recently been written about the role of mass linguistic resources, a.k.a. data, and all possible superlatives have been used. We nevertheless need to step back and have a closer look at the approaches that have been applied and at issues around the ownership of the data.

What is important? The more, the better? We constantly find that providers of data (be they producers of the source or service providers of the target language versions) rarely have a clear strategy. Memories have been collected over the years, and the main aim seems to be that the volume should help in cutting costs and feeding all sorts of tools, whether for terminology or Machine Translation. It therefore also depends on the purpose for which you are collecting the data: termbases and MT, for example, need different types of information.

Most users of TM technology started to set up memories per project/domain in the nineties, since the tools could handle neither mass data nor concurrent users. When the technology started evolving, the tendency was to merge these fine-granular memories into large pools in order to increase reuse and cut costs. When the source segments had differing translations, the vast majority of TM administrators used either the "overwrite" or the "keep-them-all" method (a minimal sketch of these two policies follows the list below). How could they possibly be in a position to re-qualify tens of thousands of segments, most of them without context and in languages they could not even read? Editing these memories was practically impossible, from an organizational as well as a cost point of view, not to mention the time factor. Thus it is essential to define a clear strategy prior to collecting data. Typical issues in mass data are:

- Adequacy of the translation in a different context (more often does not equate with better)

- Up-to-date and correct terminology
- Varying quality
- Absolute lack of traceability
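The "overwrite" and "keep-them-all" policies mentioned above can be made precise with a small sketch; representing a TM as a mapping from source segments to lists of target segments is a simplification chosen only for illustration.

```python
# Simplified TMs: source segment -> list of target segments (illustration only).
def merge_tms(tms, policy="keep-them-all"):
    merged = {}
    for tm in tms:
        for source, targets in tm.items():
            if policy == "overwrite":
                merged[source] = list(targets)               # later TMs win
            else:                                            # keep-them-all
                merged.setdefault(source, [])
                for target in targets:
                    if target not in merged[source]:
                        merged[source].append(target)
    return merged

project_a = {"Press the button.": ["Drücken Sie die Taste."]}
project_b = {"Press the button.": ["Taste drücken."]}
print(merge_tms([project_a, project_b]))                      # both variants kept
print(merge_tms([project_a, project_b], policy="overwrite"))  # only the last one
```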

Which company has really planned the following topics before starting to use translation tools?

- Structure of the terminology and the translation memories


- Coverage of different text types
- Flexibility (e.g. supporting various releases, updates and new releases in parallel)

The ownership of the data is an additional issue that is very often ignored or at least completely underestimated. Who owns the data? The translator, the agency or the customer? How often have we heard from agencies and translators that they test or even use Machine Translation and recycle the mass data they have collected from their customers over the years? Some agencies even feed Google Translate or other technologies available on the Web. Are they allowed to do this at all? The legal situation is clear. Who has the copyright of a Translation Memory, if it has not been stipulated otherwise by contract?

• The author of the source text has the copyright of the source text.
• The author of the translation has the copyright of the translation.
• The use of the source text in the Translation Memory has to be allowed by the copyright owner.

Furthermore, gathering texts or parts of texts published on the web for a translation database ("TM") without explicit permission from the authors is thus a clear breach of copyright law. Sending TMs to a provider on the Internet (i.e. adding source texts to a text corpus) needs the explicit permission of the copyright owner, who in this case is the author of the source text. Otherwise, this too is a clear breach of copyright law.

In this short presentation, Daniel will touch on all these topics and use pragmatic examples to demonstrate the big legal issues a lot of companies may face, very often without being aware of it.

FLaReNet Forum, February 11, 2010


Friday 12th February 2010

S4. Social Networking and Web 2.0 Methods for Language Resources 9:00-11:00

Chair: Jan Odijk – Rapporteur: Gerhard Budin

Introduction by the Session Chair

Introductory Talk

“Federated Operation for Service-Oriented Language Resource Sharing” Toru Ishida (Kyoto University, JP)

Contributions

“Linguamón Audiovisual, a new interactive, social-oriented portal devoted to languages” Marta Xirinachs (Linguamón - Casa de les Llengües, SP) and Sergi Fernández (i2CAT Foundation - Cluster Audiovisual, SP)

“How can the Web 2.0 be useful in collecting data? The Sorosoro experience” Rozenn Milin (Fondation Chirac, FR)

“How to keep distributed efforts for developing semantic vocabularies together?” Piek Vossen (Vrije Universiteit Amsterdam, NL)

“Extracting social and knowledge networks from LREC registration data” Thierry Declerck (DFKI, DE)

“Proposition for a Web 2.0 version of linguistic resource creation” Gregory Grefenstette (Exalead, FR)

“Social Translation: How Massive Online Collaboration Could Take Machine Translation to the Next Level” Juan Antonio Pérez-Ortiz (Universidad de Alicante, SP)

“What's the difference? - Comparing Expert-Built and Collaboratively-Built Lexical Semantic Resources” Torsten Zesch (Technische Universität Darmstadt - UKP Lab, DE)

“Using the Amazon Mechanical Turk for the production of Language Resources” Gilles Adda (LIMSI - CNRS, FR)

Discussants

Laurent Prévot (Université de Provence - LPL, FR)

Francesco Ronzano (CNR - IIT, IT)


Federated Operation for Service-Oriented Language Resource Sharing

Toru Ishida (1,2) and Yohei Murakami (2)
(1) Department of Social Informatics, Kyoto University, Japan
(2) The Language Grid Project, NICT, Japan
[email protected], [email protected]

The Language Grid

Though there are many language resources (both data and software) on the Internet, most application users have no way of employing the existing language resources, because of complex intellectual property rights, non-standardized application interfaces, and so on. If technologies were available that could provide a platform to share language resources, and to create language services for application users, it is likely that people would start to use language resources more often in daily life. Since the Language Grid takes the collective intelligence approach, the platform can grow only through the voluntary efforts of users. The more users provide language resources, the more they appreciate the benefits of the resources. Thus the platform allows users to create Web services and share them via the Language Grid.

[Figure 1: The Language Grid. Language resources such as dictionaries and machine translators are shared around the world by participating organizations (DFKI, Stuttgart University, National Research Council of Italy, Chinese Academy of Sciences, National Institute of Informatics, NICT, NTT Research Labs, Asian Disaster Reduction Center, Kookmin University, Princeton University, NECTEC, University of Indonesia, Google Inc.), supporting multilingual information sharing in disaster management, education and medical care, e.g. a universal playground and translation services at hospital receptions.]

Centralized Operation Model

To design an operation model, we first collected requirements of the stakeholders: university laboratories and research institutes (typical language service providers), and NPOs, NGOs and public sectors (typical language service users). Most service providers require that the provided services are used solely for non-profit activities. To meet this requirement, we need the Language Grid Service Manager that allows stakeholders to monitor how language services are used. That is, to satisfy the requirements of the service providers, users cannot be anonymous. Note that, however, this does not exclude for-profit contracts concluded outside of the Language Grid that make recourse to commercial language services.

In our centralized operation model, service providers can fully control access to their provided language services. Service providers can select users, restrict the total number of accesses per year/month/day, and set the maximum volume of data transfer per access. On the other hand, service users can allow participants in events or activities organized by the users to utilize the Language Grid. To avoid the fraudulent usage of language services, however, service users should not allow the participants to discover the ID and password of the Language Grid. For example, in the case of an NPO offering medical interpreter services to foreign patients, the NPO should not enter their Language Grid ID and password in front of the patients, but embed them in their patient support systems.
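As an illustration only, the following sketch shows the kind of provider-side access policy this centralized operation model describes (approved users, a cap on accesses per period, a cap on data volume per access). The class and field names are assumptions made for the example, not the Language Grid's actual API.

from dataclasses import dataclass, field

@dataclass
class AccessPolicy:
    allowed_users: set            # user IDs the provider has approved
    max_accesses_per_day: int     # accesses permitted per user per day (simplification)
    max_bytes_per_access: int     # maximum data transfer per access
    usage: dict = field(default_factory=dict)   # user ID -> accesses counted today

    def authorize(self, user_id: str, payload_bytes: int) -> bool:
        """Return True if this access stays within the provider's limits."""
        if user_id not in self.allowed_users:
            return False
        if payload_bytes > self.max_bytes_per_access:
            return False
        if self.usage.get(user_id, 0) >= self.max_accesses_per_day:
            return False
        self.usage[user_id] = self.usage.get(user_id, 0) + 1
        return True

# Example: a provider allowing one NPO 1,000 calls a day, 1 MB per call.
policy = AccessPolicy({"npo-medical-interpreters"}, 1000, 1_000_000)
assert policy.authorize("npo-medical-interpreters", 20_000)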

Since the centralized operation started at Kyoto University in December 2007, the number of participant organizations has steadily increased. So far, 118 groups from 17 countries have signed the agreement and are sharing more than 60 language services. Roughly one third of them are providing services, another third are using services, and the remaining third are inactive. We expected that NPOs, NGOs and the public sector would become the major users, but universities are using the Language Grid more intensively. Researchers and students who are working on Web analyses, CSCW, and multicultural issues are actively using language services to attain their research goals. This is natural in the early stage of introducing a new technology. Several companies have also joined: Toshiba, Oki and Google provide their translation services without any charge.

Federated Operation Model

During its two years of operation, we have discovered that the centralized operation model has several drawbacks. The most serious problem is that the operation center in Kyoto cannot reach local organizations in other countries. As a result, 74% of participant organizations are in Japan. Since we need global collaboration, even for solving language issues in local organizations, this imbalance should be overcome: we need to distribute operation centers across different countries. To encourage users worldwide to share language services, we started designing a federated operation model, where multiple operation centers collaborate with each other.

This is not an easy task, since there are conflicting requirements: a federation of operation centers should provide uniform services, but at the same time, each operation center has to be organized within its own context, including its governing law. It took us almost one year to create a federated operation model. By introducing the affiliated operator and affiliated users, we can naturally extend the centralized operation model: affiliated users can use the services through the affiliated operator, if they sign the same agreement. Though several technical issues remain, we have almost finalized the new agreement through collaboration with researchers in Asia and Europe. NECTEC in Thailand will set up a new operation center from April 2010, working together with the operation center at Kyoto University.

Another problem is that the provided services are to be used solely for non-profit activities. We are often asked the same question: can the services registered to the Language Grid be used in the research labs of companies? According to our original model, for-profit organizations can use language services only for CSR activities, and cannot use them in their research labs. This constraint, however, seems too strict: university researchers are often puzzled by this limitation when working with companies. Indeed, the original model suppresses many of the potential applications of language services. Therefore, we will relax this restriction while preserving the control power of service providers. Companies will be encouraged to join the Language Grid, and to provide high quality services, just as Internet service providers do at present. [i]

[i] This work has been supported by the Strategic Information and Communications R&D Promotion Programme of the Ministry of Internal Affairs and Communications of Japan.


Linguamón Audiovisual, a new interactive, social-oriented portal devoted to languages

Marta Xirinachs and Sergi Fernández
Linguamón – House of Languages
i2CAT Foundation – Cluster Audiovisual

The Linguamón Audiovisual portal is nearing completion. Linguamón Audiovisual is designed to be used by the general public, enabling them to establish online communities for the generation, storage, dubbing, subtitling and exchange of audiovisual material on the world's languages. The portal also features tools that allow users to practise different languages and to form language exchange partnerships with other users. The portal is a groundbreaking platform via which audiovisual material can be managed and distributed safely and efficiently. It establishes a common standard for the exchange of language-related information, and incorporates the latest international trends and state-of-the-art technology for processing information in a web-based environment. A Beta version is already available online and the portal's pilot test is due to end over the next few months. A presentation video for the portal has been produced in Catalan, English and Spanish.

Via Laietana, 46A, pral. 1a, 08003 Barcelona. Tel. +34 932 689 073. Fax +34 933 104 854. [email protected]


How can the Web 2.0 be useful in collecting data? The Sorosoro experience

Rozenn Milin

Fondation Chirac

The Sorosoro programme was launched in 2008 and could be described as a conservancy for endangered languages and cultures. In a few words: there are about 6,000 languages in the world, and half of them are in danger of disappearing in the course of this century. According to UNESCO, a language dies every other week. Since most of these languages are unwritten, once their last speakers die, the whole knowledge they carry is lost forever.

Our first goal is to build a Digital Encyclopaedia of Unwritten Languages and Cultures, a database of images, sounds and texts of the languages and cultures of the world. To accomplish this, we send film crews out into the field alongside linguists and anthropologists. Digitized, stored and classified for the long term, the audiovisual data they collect are saved for future generations. The data can then be used for scientific as well as educational purposes. We also communicate about our work through an Internet site (www.sorosoro.org), launched last October in three major languages: French, English and Spanish. We already have online dozens of pages of text, interactive Google maps covering 5,000 languages and a number of videos, but the work is far from complete. We still lack a lot of data about most of the languages we intend to cover.

There are various ways to collect that type of data. The conventional way is to contact linguists, research labs, universities, libraries, archive centres, etc., for both written and audiovisual material. It is time-consuming but efficient, and the data we collect is, hopefully, accurate. Although that type of research has given very good results, it cannot cover the whole ground we want to explore: many languages in the world are not studied by anybody, and we have to find other ways to collect data. The Internet has proven very useful in that respect. Our website is interactive, and we ask Internet users to feed back information, data, anything that can help us complete the picture. Anybody can contribute in one way or another:

- We ask the general public to help us locate or relocate languages on our Google maps when we do not have reliable information about their location;

- Anybody can fill out a form online to describe a language or a language family they know;

- We also ask for audiovisual material in any type of format, which we can digitize, secure, and possibly put online;

- And to make this all playful and attractive, we post a new word every other week to be translated into all possible languages by Internet users.

The answers we get mostly come from local people, actual speakers of the language, interested in saving it and/or studying it. It is mainly “rough” information, not processed through the prism of Western scientists.


Of course, this way of collecting data through Internet "calls" cannot replace academic contributions, but it is a precious additional source of information. In this way we have obtained data about languages we knew little about, we have been offered audio and audiovisual material that we would never have found otherwise, and we have frequently been asked to send film crews into the field to help safeguard data. The website has thus been very useful, as has the newsletter we have started to send around broadly. We still have to explore social networks such as Facebook further, but from our first results these do not look as successful as the website itself.


How to keep distributed efforts for developing semantic vocabularies together?

Piek Vossen
Vrije Universiteit Amsterdam

In recent years, Wiki systems have proven to be a powerful tool for building open and free knowledge resources through social communities.

Wiki systems come in many different flavors and have been set up for generic resources and for specific domains. Although Wikis have not been set up for making lexical resources, their databases have nevertheless been used as lexical resources. Independently, specialists in various areas have been developing vocabularies and thesauri for their domain. Well-known examples are MeSH and SNOMED in the medical domain, but also industrial vocabularies for products and technological terms. Again, these resources are not developed as lexicons and usually lack linguistic information on the usage of their terms. Furthermore, the resources remain distributed and are not aligned with each other. This leads to duplication, inconsistencies and a lack of interoperability. Consequently, technologies are being developed to align such resources a posteriori.

In my presentation, I will present the architecture of the KYOTO project, which is a platform for developing domain vocabularies in distributed social communities for different languages. As such, KYOTO tries to achieve semantic interoperability of these domain vocabularies across languages, but also with generic language resources (i.e. wordnets) and any background vocabularies that are already available in the domain. The development of the domain vocabulary is done by the experts in the domain, who are supported through an easy-to-use Wiki platform and through automatic acquisition of terminology. Acquired terminology and background vocabularies are automatically aligned to existing resources through word-sense-disambiguation techniques. Through this platform, social groups in specific domains can maintain their vocabulary over time and still contribute to the overall interoperability of lexical semantic resources and share the results of their effort. These groups themselves will also benefit from available generic resources (possibly in different languages) and from domain resources of related domains. Finally, language technology can use the richer linguistic features present in generic language resources linked to the domain databases.
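The following is a purely illustrative sketch of the kind of overlap-based word-sense-disambiguation step used to align an acquired domain term with candidate senses in a background resource (e.g. a wordnet). It shows the general technique only, not KYOTO's actual alignment component, and all names and data are assumptions.

def align_term(term_context, candidate_senses):
    """Pick the sense whose gloss shares the most words with the term's observed context.

    term_context: set of context words seen with the acquired term.
    candidate_senses: {sense_id: gloss string} from the background resource.
    """
    best_sense, best_overlap = None, -1
    for sense_id, gloss in candidate_senses.items():
        overlap = len(term_context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense

# Example: align the domain term "migration" as used in an environmental corpus.
print(align_term({"birds", "seasonal", "wetlands"},
                 {"migration.n.1": "movement of persons from one country to another",
                  "migration.n.2": "seasonal movement of birds or animals between regions"}))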


Extracting social and knowledge networks from the LREC registration data

Thierry Declerck, DFKI GmbH

As a possible additional contribution to the "LREC2010 Map of Language Resources, Technologies and Evaluation" (http://www.lrec-conf.org/lrec2010/?LREC2010-Map-of-Language-Resources), we investigated the kind of information that can be extracted from the LREC submission database (the 2008 edition), and whether it can be organized into social and knowledge networks. This basically concerns information about the authors of scientific papers, their affiliation, the country they belong to, the topics they address and the language(s) they cover in their work. This should then be combined with the information about language resources and technologies that will be available in the submission data of LREC 2010 (the LREC 2010 Map).

Some details of the approach, as applied to LREC 2008

The on-line proceedings of LREC 2008 (http://www.lrec-conf.org/proceedings/lrec2008/) display various interlinked lists, as shown below (with a few additional comments from us):

– "Sessions" contains introductory messages, keynote speeches, panel information, and the titles of papers organized by session.

– "Papers" contains a list of the papers organized by their title. Additionally, the list contains the abstract, all the authors, the topics (as selected by the authors on the basis of LREC suggestions), the PDF version of the paper, possibly the presentation slides, and the BibTeX file.

– "Authors" contains a list of all authors, their related papers and their affiliation.
– "Workshops" contains a list of all workshops and tutorials.
– "Topics" contains a list of all topics and their related papers.
– "Affiliations" contains a list of all affiliations (including the name of the country) and their related papers.

We map part of this information onto an XML representation, which for the time being has the following format:

<authors>
  <author firstname="Akinori" lastname="Abe"
          affiliation="ATR Knowledge Science Labs." affiliation_location="JAPAN">
    <papers>
      <paper>
        <title>Relationships between Nursing Conversations and Activities</title>
        <coauthors>
          <coauthor firstname="Hiromi" lastname="Itoh Ozaku"/>
          <coauthor firstname="Kaoru" lastname="Sagara"/>
          <coauthor firstname="Kiyoshi" lastname="Kogure"/>
        </coauthors>
        <language>Multiple languages</language>
        <topics>
          <topic>Corpus (creation, annotation, etc.)</topic>
          <topic>Information Extraction, Information Retrieval</topic>
          <topic>Acquisition, Machine Learning</topic>
        </topics>
      </paper>
    </papers>
  </author>
</authors>


On the basis of this data, we can for example extract a network of authors working on similar topics, as sketched in the figure below.

We are thus extracting direct and indirect relations between authors, constrained by topics. Similar work is applied to affiliations and countries of affiliation, also using the covered language(s) as a restriction for the detected relations.
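A minimal sketch of how such a co-topic author network could be derived from the XML format shown above (element and attribute names follow that cleaned-up sketch and are otherwise assumptions, not the actual LREC database schema):

import itertools
import xml.etree.ElementTree as ET
from collections import defaultdict

def author_network(xml_path):
    """Return a mapping (author_a, author_b) -> set of topics they share."""
    topic_to_authors = defaultdict(set)
    for author in ET.parse(xml_path).getroot().iter("author"):
        name = f'{author.get("firstname")} {author.get("lastname")}'
        for topic in author.iter("topic"):
            topic_to_authors[topic.text].add(name)
    edges = defaultdict(set)
    for topic, authors in topic_to_authors.items():
        for a, b in itertools.combinations(sorted(authors), 2):
            edges[(a, b)].add(topic)
    return edges

# Example: list author pairs connected through at least one common topic.
# for (a, b), topics in author_network("lrec2008.xml").items():
#     print(a, "--", b, "via", ", ".join(sorted(topics)))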

Related work and resources

We aim at cooperation, integration and interaction with related initiatives, for example LT-World (http://www.lt-world.org/), mainly by using and extending its background ontology, or the "TAKE" project (http://www.dfki.de/lt/project.php?id=Project_539&l=en), especially on the topic of Information Extraction applied to scientific/technological literature in the field of language technology, which among other things deals with the ACL Anthology initiative (http://aclweb.org/anthology-new/), or an approach to mining expertise from scientific publications (see Paul Buitelaar and Thomas Eigner, Mining Expertise Topics from Scientific Literature, SAAKM 2009). In doing this, we will aggregate information from a supervised information portal, information extracted from the content of scientific papers, and information extracted from the submission databases of large conferences.

Possible future work

• Use LT-World to gain additional information (for example related projects, etc) and to eliminate duplicates in the naming of persons, institutions, etc.

• Use the semantic storage developed in the European R&D project "MUSING" to support reasoning and (semantic) querying. Integrate the extracted networks into an ontology. Include the temporal dimension (between editions of LREC and other conferences).

• Aggregation with results of content analysis of papers of conferences (for example project TAKE), or expertise mining (work by Paul Buitelaar and Thomas Eigner).

• Identify possible cooperation links (are there countries working in isolation on a topic, etc.).

Acknowledgements: Thanks to the CNR-ILC team for fruitful discussions, to Brigitte Joerg, DFKI, for support on LT-World, to Andreas Weber, research assistant at DFKI, for implementation support, and to the CLARIN and D-SPIN projects.

[Figure: sketch of the extracted network, linking authors and co-authors through shared papers (Paper_x, Paper_y) and their topics (a, b, c).]


Proposition for a Web 2.0 version of linguistic resource creation

Gregory Grefenstette

Exalead

10 place de la Madeleine, 75008 Paris, France

[email protected]

Position

Linguistic resources should be free. It is in the interest of every language community that language resources for their language are freely available. The existence of language resources for a language can serve two purposes: defending the existence of the language, thus preserving cultural heritage, and providing a means of communicating into and from the language, thus easing trade. Language resources can help ease the problem of language variation, which is an impediment to information access and information transmission.

Dimensions of the problem

Currently, Kevin Scannell of St Louis University monitors 446 languages on the web (Scannell, 2007). His web page http://borel.slu.edu/crubadan/stadas.html lists these languages and the number of words that he has found for each by crawling the web. Many languages (for example, Abua, Akurio, Bashkir, Bhojpuri, Chayahuita, etc.) have only one or two documents (one often being the Universal Declaration of Human Rights [1], currently available in 370 languages) and a few thousand words.

Sources of lexicons

One source for lexicon extraction is online news. Google News exists in 70 national versions, though many are in the same language; for example, Spanish is used in 9 of the national versions. A European Union initiative, the European Media Monitor, EMM [2], monitors news in 43 languages (Steinberger et al., 2007). The Natural Language Processing Group of the Computer Science Department of Leipzig University has downloaded and packaged corpora [3] from public sources in different sizes (100k, 300k, 1 million and 3 million sentences). They provide sentence corpora for research purposes for Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Sorbian, Swedish, and Turkish (Biemann et al., 2007). As for Wikipedia, there are currently 269 language versions of Wikipedia, though some entire sites are very restricted. For example, the Chichewa Wikipedia contains 65 articles for this language spoken by 9.3 million people in Zambia and Malawi.

[1] http://www.ohchr.org/EN/UDHR/Pages/Introduction.aspx
[2] http://emm.newsbrief.eu/overview.html
[3] http://corpora.informatik.uni-leipzig.de/download.html

Proposal

All these resources and sources are insufficient to solve the problem of creating free and complete resources for the world's languages, even for the 446 that have some web presence. We propose creating a Web 2.0 site that uses the same community computing power that generates millions of blogs to solve the problem of creating basic language resources for all the world's languages, starting with the 446 present on the web. Though the CLARIN project (Boves et al., 2009) aims at providing high-quality language resources for the panoply of NLP activities from tokenization to speech, there is a need for simpler and complete resources for all languages, and we believe that harnessing the power of web users can make this dream a reality. We think that a website can be created that will allow end users to adopt a certain number of words, in packets of ten, for example. Following the example of the construction of the Oxford English Dictionary, when James Murray invited volunteers from around the world to submit evidence of word usages, we can create a site where users can take a certain number of words and provide meaningful resources. For example, we could give the users a number of words such as the following French words: étrangères, libération, mauvaise, avantage, représentent, and ask the users to return the surface form, the lemma, the major part of speech and a few English [4] translations. For example:

étrangères ; étranger ; ADJ ; foreign, stranger
libération ; libération ; N ; liberation
mauvaise ; mauvais ; ADJ ; bad
avantage ; avantage ; N ; advantage
représentent ; représenter ; V ; represent

Users' responses could be controlled by any of the known user rating systems available on the web. For example, the same words could be given to multiple users, and the users who reply most like other users would be more highly ranked than those that give outlying answers (Surowiecki, 2004). Users could further be ranked by the number of words they have "solved" for a given language. These simple representations of words could be used as a springboard for much wider resource creation, for example by adding language-dependent frequencies to each word from search engine probing. Another example would be the generation of multiword expressions and their translations (Grefenstette, 1999). We will defend this approach and sketch how it can be implemented.
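As an illustration of the agreement-based control mentioned above, here is a minimal sketch in which the same words are assigned to several users and each user is scored by how often their answer matches the majority answer; the data and function names are invented for the example, not part of the proposal itself.

from collections import Counter

def rank_users(submissions):
    """submissions: {word: {user: (lemma, pos)}} -> {user: agreement score}."""
    scores = Counter()
    for word, answers in submissions.items():
        majority = Counter(answers.values()).most_common(1)[0][0]
        for user, answer in answers.items():
            if answer == majority:
                scores[user] += 1
    return dict(scores)

# Example: user_c gives an outlying part of speech and therefore ranks lower.
print(rank_users({
    "libération": {"user_a": ("libération", "N"),
                   "user_b": ("libération", "N"),
                   "user_c": ("libération", "ADJ")},
}))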

References

Boves, L., Carlson, R., Hinrichs, E., House, D., Krauwer, S., Lemnitzer, L., Vainio, M., & Wittenburg, P. (2009). Resources for Speech Research: Present and Future Infrastructure Needs. In Interspeech (pp. 1803-1806). Brighton, UK.

Biemann, C., Heyer, G., Quasthoff, U., and Richter, M. (2007). The Leipzig corpora collection – mono-lingual corpora of standard size. In Proceedings of Corpus Linguistics 2007, Birmingham, UK.

Grefenstette, G. (1999). The World Wide Web as a resource for example-based machine translation tasks. In Proceedings of Aslib Conference on Translating and the Computer 21, London.

Scannell, K.P. (2007). The Crúbadán Project: corpus building for under-resourced languages. In Fairon, C., Naets, H., Kilgarriff, A. and de Schryver, G.-M. (eds.) Building and exploring Web corpora. Proceedings of the WAC3 Conference. Louvain: Presses Universitaires de Louvain, 5-15.

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., Varga, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), Genoa, Italy, pp. 2142-2147.

Surowiecki, J. (2004). The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Doubleday Books.

[4] Some other language than English could be used, for example Chinese, but for the moment, English is the most used language on the Internet.


Social Translation: How Massive Online Collaboration Could Take Machine Translation to the Next Level

Position paper for the FLaReNet Forum 2010

Juan Antonio Pérez-Ortiz

[email protected]
Research group
Departament de Llenguatges i Sistemes Informàtics
Universitat d'Alacant, Spain

Introduction

Internet users, mostly passive consumers in the first years of the web, have quickly become active prosumers¹ of information in the current era of the web 2.0 and the cloud.² However, in spite of the vast amount of content and collaboratively-created knowledge uploaded to the cloud during the last years,³ linguistic barriers still pose a significant obstacle to universal collaboration, as they lead to the creation of "islands" of content, only meaningful to speakers of a particular language. Until fully-automatic high-quality machine translation becomes a reality, massive online collaboration in translation may well be the only force capable of tearing down these barriers [3] and producing large-scale availability of multilingual information. Actually, this collaborative translation movement is happening nowadays, although still timidly, in applications such as Cucumis.org, OneHourTranslation.com or the Google Translator Toolkit.

The resulting scenario, which may be called social translation, will need efficient computer translation tools, such as reliable machine translation systems, friendly postediting interfaces, or shared translation memories. Machine translation technology provides draft translations which are then proofread by people (a process known as postediting), a task which may require less effort than doing the translation from scratch. Remarkably, collaboration around machine translation should not only concern the postediting of raw machine translations, but also the creation and management of the linguistic resources needed by the machine translation systems; if properly done, this can lead to a significant improvement in the translation engines.

Therefore, massive online collaboration could eliminate the language barriers on the web, but for this to happen as many hands as possible are necessary; and this includes involving speakers that, in principle, do not have the level of technical know-how required to improve machine translation systems or manage linguistic resources. Consequently, software that can make those tasks easier and elicit the knowledge of both experts and non-experts must be developed [1]. Note that people's contribution should not necessarily come in an intentional manner; for example, postedited texts are a rich source of information for learning algorithms.

The whole picture shows a world where people, intentionally or unintentionally, contribute to a commons of linguistic resources and translation engines, which, ideally, should be as open and widely accessible as possible to boost synergies and network effects.⁴

Paradigm shift for true social translation

In order to fully accomplish the goals of the scenario that has been drawn up for social translation, my thesis is that a new paradigm that takes into account the following issues should emerge. Note, however, that the real world is affected by a number of pulling and pushing forces, as well as conflicts of interest, which could leave social translation at some distance from the ideal point.

• Data portability. People should be the real owners of their data and be able to reuse them across fully interoperable translation applications (walled gardens restricting data export or access from the outside are the other side of the coin).

• Standard formats. Reusable and portable data implies that standards or common practices are used for their encoding and representation.

¹ Users which are both producers and consumers.
² Two more neologisms which are commonly used as synonyms for the present-day internet.
³ E.g., more than one million messages are sent every hour to the Twitter microblogging service [6].
⁴ When network effects hold, the value of a product or service increases as more people use it.


• Licenses. In order to ease the global effort on social translation and prevent "wheel reinvention", appropriate open licenses should be embraced. Ideally, this kind of license should be recommended as the default to every user.

• Linked data. “Data is relationships” Tim Berners-Lee said in a recent talk on Ted.com; in our case, this means that all the linguistic data should be annotated and properly interconnected.

• Cloud computing. Machine translation should be in the cloud instead of confined to the desktop of a single user. The global flow of translations, posteditings, linguistic data, etc. constitutes the main resource for social translation.

• Scalability. In order to fulfill the demands of global translation, related applications and programming interfaces (APIs) should be scalable [5] and respond with low latency.

• Code availability. As important as open data is the availability of free/open-source software that encourages research and development of applications in the social translation area.

• Multiengine translation. All the available machine translation systems should cooperate and offer public APIs; this could lead to hybrid systems choosing the most appropriate machine translator according to context.

• Standard interfaces. Interfaces for the different ways of interacting with machine translation systems (postediting, management of linguistic resources, etc.) should be made uniform and predictable (as is the case, for example, with word processors).

• Accessibility. As many people as possible should be able to contribute to the social translation sphere.

The Tradubi Web Application for Social Translation

My research group at the Universitat d'Alacant, Spain, is currently developing Tradubi [4], a free/open-source Ajax-based web application for social translation, whose aim is, firstly, to build a platform for collaboratively customizing and improving rule-based machine translation systems and, secondly, to offer an environment for postediting raw machine translations and subsequently sharing the corrected texts. Currently, Tradubi is in its early stages of development and is built upon the free/open-source Apertium machine translation engine [2]. The application can be accessed at tradubi.com, or downloaded from tradubi.org and installed on a different server. With the help of Tradubi, users can create customized dictionaries for Apertium which focus on specific linguistic domains, or which correct translation errors made by the default system. We expect to augment the features of Tradubi so that it becomes a powerful free/open-source application for social translation.

Conclusions

The massive collaboration of internet users, as well as the contribution of science in the development of efficient tools allowing the improvement of machine translation engines, could be fundamental in the abolition of the language barriers that currently restrict universal access to the web. This large-scale collaboration of experts and non-experts implies a change of paradigm in the way linguistic resources are managed. Tradubi is an example, among others, of a web application that hopefully will help to normalize social translation.

Acknowledgments

This work has been partially funded by Spanish Ministerio de Ciencia e Innovación through project TIN2009-14009-C02-01.

References

1. Font-Llitjós, A., J. Carbonell, and A. Lavie. A Framework for Interactive and Automatic Refinement of Transfer-Based Machine Translation. In Proceedings of EAMT 10th Annual Conference, 2005.

2. Forcada, M. L., F. M. Tyers, and G. Ramírez-Sánchez. The Apertium machine translation platform: five years on. First International Workshop on Free/Open-Source Rule-Based Machine Translation 2009, 3–10.

3. Garcia, I. Beyond Translation Memory: Computers and the Professional. The Journal of Specialised Translation, 12:199–214, 2009.

4. Sánchez-Cartagena, V. M. and J. A. Pérez-Ortiz. Tradubi: Open-Source Social Translation for the Apertium Machine Translation Platform. In Open Source Tools for Machine Translation, MT Marathon 2010, 47–56.

5. Sánchez-Cartagena, V. M. and J. A. Pérez-Ortiz. ScaleMT: A Free/Open-Source Framework for Building Scalable Machine Translation Web Services. In Open Source Tools for Machine Translation, MT Marathon 2010, 97–106.

6. Schonfeld, E. Pingdom Says People Are Tweeting 27 Million Times A Day. TechCrunch.com, November 12, 2009.


What's the difference? – Comparing Expert-Built and Collaboratively-Built Lexical Semantic Resources

Torsten Zesch

Ubiquitous Knowledge Processing Lab Computer Science Department, Technische Universität Darmstadt

Hochschulstraße 10, D-64289 Darmstadt, Germany http://www.ukp.tu-darmstadt.de

Introduction

Traditionally, lexical semantic resources like WordNet (Fellbaum, 1998) have been built manually by experts in a time-consuming and expensive manner (Expert-built Semantic Resources or ESRs). Recently, emerging Web 2.0 technologies have enabled user communities to collaboratively construct new kinds of resources like Wikipedia and Wiktionary (Collaboratively-built Semantic Resources or CSRs). It is an open question whether the emerging collaboratively-built resources are really new or if they are just replicating the knowledge that can also be found in expert-built resources.

Comparing Apples to Oranges

A comprehensive comparison of semantic resources is a non-trivial task, as the resources are organized differently (e.g. synsets in WordNet vs. term entries in Wiktionary), use different word sense inventories, encode different types of lexical semantic relations, or use the same type of relation differently. Additionally, collaboratively-built resources usually grow quite fast. Thus, there are no releases (like e.g. WordNet 3.0) that could be used as fixed references. Furthermore, some resources like Wikipedia cannot be easily utilized in their source form, but the contained knowledge has to be extracted in a data mining step (Zesch and Gurevych, 2008). If we use this extracted knowledge for a comparison of resources, should the results be attributed to the crowds or to the extraction methods? And how would the results differ if the knowledge was mined from the Encyclopaedia Britannica instead of Wikipedia? Some collaboratively-built resources lend themselves more naturally to being compared, as they encode lexical semantic knowledge more directly. Meyer and Gurevych (2010) compare three such lexical semantic resources for the German language that are created in a controlled (GermaNet), semi-controlled (OpenThesaurus), and collaborative, i.e. community-based, manner (Wiktionary). [1] GermaNet (Kunze and Lemnitzer, 2002) is similar to the well-known Princeton WordNet. OpenThesaurus (Naber, 2005) is a thesaurus for the German language organized around synsets. Its main focus is collecting synonyms, but some taxonomic relations can also be found in the resource. Wiktionary (Zesch et al., 2008) is a large online dictionary that is available in over 300 languages. Each language edition contains word entries from multiple languages and is interlinked with other language editions.
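A minimal sketch of how the pairwise term overlap visualised in Figure 1 below can be computed, assuming each resource has been reduced to a plain set of lemmas (the actual study uses full GermaNet, OpenThesaurus and Wiktionary dumps; the toy sets here are illustrative only).

from itertools import combinations

def term_overlap(resources):
    """resources: {name: set of lemmas} -> {(name_a, name_b): number of shared lemmas}."""
    return {(a, b): len(resources[a] & resources[b])
            for a, b in combinations(sorted(resources), 2)}

# Toy example with three tiny "resources".
print(term_overlap({
    "GermaNet":      {"Haus", "Baum", "Auto"},
    "OpenThesaurus": {"Haus", "Auto", "Wagen"},
    "Wiktionary":    {"Haus", "Baum", "Wagen", "Katze"},
}))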

What’s the difference? The comparison of the resources reveals that the overlap of terms, word senses, and relations is surprisingly low.

tween the resources.Figure 1 visualizes the term overlap be

2 This shows that each resource contains a

31,866

31,23949,114

30,4

88

18,852

OpenThesaurus(66,754)

GermaNet(67,402)

Wiktionary(90,611)

Figure 1: Term Overlap between resources.

1 The results are available from http://www.ukp.tu-darmstadt.de/data/lexical-resources/ 2 These numbers are based on GermaNet 5.0, a OpenThesaurus dump from July 27, 2009, and a Wiktionary dump from June 18, 2009 accessed using the JWKTL tool.

91

Page 92: Resources and Technologies Forum - FLaReNet · The European Language Resources and Technologies Forum promoted by FLaReNet is an international forum aiming at developing the necessary

large portion of unique lexical-semantic knowledge. For example, GermaNet is the only resource to contain holonymy and meronymy relations. OpenThesaurus contains the highest number of synonyms, while Wiktionary contains the highest number of antonyms and nearly as many synonyms as GermaNet. These findings indicate that the collaborative approach

What needs to be done? The differen y-build and

n

languages, expert-built resources do not

Acknowledgements The author t for valuable

References Fellbaum, C., edito n Electronic

offart, J., Zesch, T., Gurevych, I., 2009. An

on

unze, C., and Lemnitzer, L., 2002. GermaNet –

does not just replicate expert work, but adds a genuine contribution to lexical semantic resources.

ces between the collaborativelexpert-built resources are far from being fully under-stood. Thus, a comprehensive analysis is required. However, the initial results show that the collaboratively-built and expert-built resources contain complementary knowledge, which calls for a combination of resources. The prerequisite for both (comprehensive comparisoand combination) is an alignment of the resources. The hardest part is certainly the alignment of the word sense inventories. More research needs to be done in that direction. An analysis of how the crowds deal with word senses could also help to better understand the nature of word senses. For many exist or are not well developed. In these cases, intelligent user interfaces based on Natural Language Processing techniques (Hoffart et al., 2009) can support the crowds to create more consistent, higher quality resources and to fill the gaps that they are not going to tackle on their own. Wiktionary is an especially promising resource for minor languages, as it (i) grows quickly and (ii) the links between different language editions can be used to bootstrap and augment smaller language editions.

hanks Christian Meyer contributions. This work is funded by the German Research Foundation (GU 798/1-2, 798/1-3, and 798/3-1), the Volkswagen-Foundation (I/82806), and the Klaus Tschira Foundation (00.133.2008).

r. 1998. WordNet: ALexical Database. In: Language, Speech, and Communication. Cambridge, MA: MIT Press. HArchitecture to Support Intelligent User Interfaces for Wikis by Means of Natural Language Processing. In: Proceedings of the 5th International Symposium Wikis and Open Collaboration (WikiSym), Brisbane, Australia. Krepresentation, visualization, application. In: Proceedings of the Third International Conference on

Language Resources and Evaluation. Volume 5., p. 1485-1491, Las Palmas, Canary Islands, Spain. Meyer, C.M., and Gurevych, I., 2010. Worth its Weight in Gold or Yet Another Resource -- A Comparative Study of Wiktionary, OpenThesaurus and GermaNet. In: Proceedings of the 11th International Conference on Intelligent Text Processing and Computational Linguistics, Lecture Notes in Computer Science, (to appear), Berlin/Heidelberg, Springer. Naber, D., 2005. OpenThesaurus: ein offenes deutsches Wortnetz. In: Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen: Beiträge zur GLDV-Tagung, p. 422–433, Bonn, Germany. Zesch, T., Müller, C., Gurevych, I., 2008. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, p. 1646-1652, Marrakech, Morocco.


Using the Amazon Mechanical Turk for the production of Language Resources

Gilles Adda

LIMSI-CNRS, Paris, France

Amazon Mechanical Turk is an online crowdsourcing system that allows users to distribute work to a large number of workers. Work is broken down into simple tasks which workers are paid to complete. Such tasks are typically those that are difficult for computers; this Mechanical Turk paradigm can be viewed as a so-called "artificial artificial intelligence". Requesters create Human Intelligence Tasks (HITs), specifying the amount paid for the job's completion. The production of language resources (LR) via Amazon's Mechanical Turk (or equivalent) has led to a number of papers in scientific conferences over the last two years. These papers address a matter important to scientific production: the cost of manual transcriptions/annotations/translations; in many domains this manual work cannot be avoided if accurate models are required or if a reliable system evaluation is needed. As researchers, we usually do not have an unlimited amount of money available to develop LRs and evaluate systems. These papers show that it is possible to drastically reduce the cost of producing high quality LRs: $10 to transcribe 1 hour of speech, less than $1,000 to create 4 reference translations for 2000 sentences. Moreover, these papers show that, given some post-processing (combination of answers), human expert quality can be achieved.

A first issue I want to address in this talk concerns the ethics of using MTurk, paying very little money for a real piece of work. People in the field who have had to transcribe a few hours of real speech know what a hard job transcription really is. But HIT ("Human Intelligence Task") requesters do not have any direct human or commercial link with the MTurkers, and offering very low wages for tedious tasks may thus be considered normal, or at least acceptable. In scientific papers dealing with MTurk, the problem of the MTurkers' motivation is generally not addressed; the only issues considered are the amount of money and the quality of the transcription. Some demographic and sociological studies reveal that Turkers are usually well-educated US citizens who want to earn some spare cash as a reward. They also suggest that MTurk could be an opportunity for educated people from emerging countries (mainly India) to have access to money from the western countries. However, even if a large number of Turkers do not really need this extra money to live, some others do, and for these people Amazon's MTurk is a sweatshop. A recent study shows that about 18% of Turkers (30% of the Indian Turkers) report sometimes or always relying on MTurk to "make basic ends meet". The question raised in this talk is whether we, as a scientific community, really need to collaborate with this virtual sweatshop because of the limited amount of money available to transcribe/annotate/translate. An open debate should be held in the community on this problem.

Beyond the ethical problems, a more practical problem may arise. In the near future, national or international agencies funding LR production will refuse to finance transcriptions or translations at costs higher than those of MTurk. Even if we are reluctant, we may be forced to adopt the MTurk framework very soon. For the time being, we do not know the volume of HITs the Amazon MTurk system will be able to absorb. In the long term, it is questionable whether Turkers will continue to accept so little reward for tedious work, and we, as a scientific community looking for more and more LRs, may face a shortage.
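As a minimal illustration of the "combination of answers" post-processing mentioned above, the sketch below keeps the majority answer when several workers label the same item. This is a generic illustration, not the method of any particular paper cited here (combining free-form speech transcriptions in practice requires alignment, e.g. ROVER-style voting, rather than exact-match voting).

from collections import Counter

def combine_answers(answers_per_item):
    """answers_per_item: {item_id: [worker answers]} -> {item_id: majority answer}."""
    return {item_id: Counter(answers).most_common(1)[0][0]
            for item_id, answers in answers_per_item.items()}

# Example: three workers annotate two items; disagreements are voted away.
print(combine_answers({"utt1": ["hello", "hello", "hallo"],
                       "utt2": ["yes", "yes", "yes"]}))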


S5. Language Resources of the Future 11:30-13:30

Chair: James Pustejovsky – Rapporteur: Joseph Mariani

Introduction by the Session Chair

Introductory Talks

“India’s language diversity and resources of the future: challenges and opportunities” Girish Nath Jha (Jawaharlal Nehru University, IN)

“Evolving the NICT Concept Dictionary” Kentaro Torisawa (NICT, JP)

“Language Resources of the Future (a speech-based position paper)” Nick Campbell (Trinity College Dublin - The University of Dublin, IRL)

Contributions

“Next Steps: From the Sentence Structure to the Structure of Discourse” Eva Hajičová (Charles University in Prague, CZ)

“Multi-microphone speech corpora for robust applications in real-world environments” Maurizio Omologo (FBK, IT)

“Multimodal Language Resources” Jean Carletta (University of Edinburgh, UK)

“The essential role of Language Resources for the future of the affective computing systems” Laurence Devillers and Björn Schuller (LIMSI - CNRS, FR)

Discussants

German Rigau (University of the Basque Country, SP)

Antonio Pareja-Lora (Universidad Politécnica de Madrid - OEG / Universidad Complutense de Madrid - DSIC, SP)

Mª Antonia Martí (Universitat de Barcelona, SP)

Kiril Simov (Bulgarian Academy of Sciences - IPP - LML, BG)


India’s language diversity and resources of the future: challenges and opportunities

Girish Nath Jha
Special Center for Sanskrit Studies
Jawaharlal Nehru University, New Delhi
[email protected]

Abstract

The paper talks about India's delicate balancing act in maintaining its huge linguistic diversity while planning for future language technology resources and requirements. It highlights the challenges and opportunities for the government and the language technology community. The paper begins with the language situation in India and the constitutional provisions for maintaining and promoting Indian languages. The next section discusses the efforts of the Ministry of Human Resource Development (MHRD) and its agency for language development, CIIL (Central Institute of Indian Languages), and of the Ministry of Communications and Information Technology (MCIT), with its agency TDIL (Technology Development for Indian Languages), to develop Indian languages through various programs and initiatives. In this context, the paper talks about some major ongoing and past projects relating to language corpora and technology.

The main focus of the paper, however, is the future language resources that India would need to build technologies in the key sectors of social development, governance, agriculture, education and health. In this context, the paper discusses the key technologies that need to be developed in each sector and the language resources that will be needed.

Introduction

India has 4 language families, with Indo-Aryan (76.87% of speakers) and Dravidian (20.82% of speakers) being the major ones. These families have contributed 22 constitutionally recognized ('scheduled' or 'national') languages, out of which Hindi has 'official' status in addition to its 'national' status. Besides these, India has 234 mother tongues reported by the recent census (2001), and many more (more than 1600) languages and dialects. Of the major Indian languages, Hindi is spoken in 10 (out of a total of 25) states of India, covering a total population of over 60%, followed by Telugu and Bangla. There are more than 18 scripts in India which need to be standardized and supported by technology; Devanagari is the largest script, used by 6 languages. Indian languages are under the exclusive control of the respective states they are spoken in. Therefore every state may decide on measures to promote its language. However, since these 22 languages are national (constituent) languages, the centre (union of India) also has responsibility towards each of them, though it has certain additional responsibilities towards Hindi, which is the national as well as the official language of the Indian union. From time to time, minor or neglected languages claim constituent status. The situation becomes more complex when such a language becomes the rallying point for the demand for a new state or autonomous region. This complex linguistic scene in India is a source of tremendous pressure on the Indian government not only to have comprehensive language policies, but also to create resources for their maintenance and development. In the age of information technology, there is a greater need for a fine balance in the allocation of resources to each language, keeping in view the political compulsions, the electoral potential of a linguistic community and other issues.


Language promotion and maintenance by the Ministry of HRD

The MHRD's language agency, CIIL, was set up with the following goals:

• advise and assist central as well as state governments in matters of language
• contribute to the development of all Indian languages by creating content and corpus
• protect and document minor, minority and tribal languages
• promote linguistic harmony by teaching major Indian languages to non-native learners.

Some of the ongoing initiatives of the CIIL are –

• New Linguistic Survey of India (NLSI)
• National Translation Service
• Linguistic Data Consortium for Indian Languages (LDC-IL)
• Development and Promotion of Minor Indian Languages
• National Testing Mission (NTM)

Technology Development for Indian Languages (TDIL) by the MCIT

The TDIL program of MCIT was started with the objective of developing Information Processing Tools and Techniques to facilitate human-machine interaction without language barriers, creating and accessing multilingual knowledge resources, and integrating them to develop innovative user products and services. Among the major activities of TDIL have been:

Basic software tools for Indian languages (National Rollout Plan)

• Software tools and fonts for all 22 Indian languages have been released in the public domain
• The CD-ROM typically contains the basic software tools for enabling the linguistic community in the digital age

Ongoing language technology/corpora projects in the consortium mode

26 premier institutes and R&D organizations are working together on projects to develop the advanced technologies & applications:

• Development of English to Indian Languages Machine Translation (MT) System (CDAC, Pune)
• Development of English to Indian Languages Machine Translation (MT) System with Angla-Bharati Technology (IIT Kanpur)
• Development of Indian Language to Indian Language Machine Translation System (IIIT Hyderabad)
• Sanskrit-Hindi Machine Translation (University of Hyderabad, JNU, …)
• Development of Robust Document Analysis & Recognition System for Indian Languages (IIT Delhi)
• Development of On-line Handwriting Recognition System (IISc, Bangalore)
• Development of Cross-lingual Information Access (IIT Bombay)
• Speech Corpora/Technologies (IIIT Chennai)
• Language Corpora (ILCI) (JNU)

Language Resource of the future

India has complex needs in terms of language technology resources. Among others, the following areas are expected to be the focus in future –


• Language & cultural documentation
India has many undocumented languages, which may become extinct in the digital age. Many others may be documented but remain marginal and neglected. Such languages, and those which are important for cultural heritage, need to be comprehensively documented and preserved:

o corpora (written and spoken) for minority and fringe languages
o corpora (written and spoken) for classical languages important for heritage
o corpora of digitized manuscripts

• e-governance
e-governance is going to be a major activity, with a budget of approximately 42,000 crore Indian rupees (10 billion USD) from the government spread across 3 five-year plans. The language resources that would need to be built are:

o land records → OCR (land transfers have been a major activity in rural India)
o handwriting samples of Indians → OHWR (difficulty in writing on computers)
o speech database → ASR/TTS (high illiteracy rates in rural India)
o names database → NER (cultural diversity in names)
o database of judicial documents → expert systems/search/e-library (slow judicial system, language barrier)
o agri database of crop patterns, water tables, pests, nature of soil, climate changes → expert systems (agriculture being the major activity in India)
o commodity prices databases → search engines (commodity buying/selling is a major activity in rural India)
o localization database for popular software (English software will not sell)

• Primary education and health

o database of text books → e-library
o e-lessons → e-learning and LMS systems
o text readers → TTS
o corpora of the health domain → expert systems/translators
o database of Ayurvedic herbs and medicine system → expert systems

• Knowledge transfer and communication

o translation from and to major languages → M(A)T (website, newspapers, publishing)
o translation from major languages (including English) to minor languages → M(A)T (website, newspapers, publishing)
o parallel corpora and dictionaries → M(A)T (domain-specific translations)

Conclusion

Language technology and resources are of critical importance in present-day India, and they will be even more so in the future. Though there are tremendous challenges in building such resources to global standards, there are immense business opportunities. Cost-wise, it is cheaper to develop language resources and technologies in India. With over one billion people, an emerging economy, and only 4% of the population knowing English, one can well imagine the market potential for non-English resources and tools in India, and the profit margin therein.


Evolving the NICT Concept Dictionary

Kentaro Torisawa Language Infrastructure Group, MASTAR Project,

National Institute of Information and Communications Technology (NICT), Japan

[email protected]

Overview: This talk gives an overview of the NICT Concept Dictionary, which is a set of lexical databases and tools for their automatic construction (Torisawa et al., 2008), and some social issues surrounding it. The dictionary is one of the major objectives of the MASTAR project, our five-year project at NICT. These databases and tools are distributed through the Advanced LAnGuage INformation (ALAGIN) Forum, which consists of 70 companies and 96 researchers from universities. Its aim is to share and promote speech and language technologies and resources, including but not limited to our concept dictionary.

Architecture: The NICT Concept Dictionary is a million-word scale semantic network that was mostly extracted automatically from the Web. By the end of the project, our aim is to construct a semantic network covering 2.5 million words. Our primary motivation behind the construction of this network is to expand the knowledge that can benefit individuals in their activities. For instance, the dictionary provides, as part of the semantic network, relatively unknown troubles related to everyday electric appliances, along with their causes and potential prevention measures. It can also recommend local foods popular in a certain region, or extraordinary new ingredients for well-known dishes. This kind of knowledge can be further extended through generalization, analogies and other loose types of inference by using basic relations between words, such as hyponymy, which are also provided as part of the semantic network. This enables users to find connections between concepts that are not explicitly written down in any text but may nonetheless be useful. In short, the NICT Concept Dictionary gives users a chance to find "unknown unknowns" – in the infamous words of D. H. Rumsfeld: things "we don't know we don't know". We regard such "unknown unknowns" as potential triggers of various types of creative activities, including innovation and risk management, and we believe this to be one of the ultimate motivations for humans to interact with machines.


We think this last characteristic is a desirable feature that has not been explicitly pursued in previous attempts at constructing language resources. We argue that such functionality should be seriously explored both in future language resources and in the applications that use them, such as sophisticated search, information extraction, machine translation, and spoken dialog systems. These are also objectives of our MASTAR project.
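
To make the notion of "basic relations between words" concrete, the following is a minimal sketch in Python, assuming the network is simply a set of typed word-to-word relations that can be traversed, e.g. along hyponymy links; the class, relation names and example entries are hypothetical and do not reflect the actual NICT data model or distribution format.

    from collections import defaultdict

    class SemanticNetwork:
        """Toy store for typed word-to-word relations (hypothetical model)."""
        def __init__(self):
            # relation name -> source word -> set of target words
            self.relations = defaultdict(lambda: defaultdict(set))

        def add(self, relation, source, target):
            self.relations[relation][source].add(target)

        def hyponyms(self, word, depth=1):
            """Collect hyponyms of 'word' down to the given depth."""
            found, frontier = set(), {word}
            for _ in range(depth):
                frontier = {h for w in frontier
                            for h in self.relations["hyponym"][w]}
                found |= frontier
            return found

    net = SemanticNetwork()
    net.add("hyponym", "appliance", "washing machine")    # invented entries
    net.add("hyponym", "washing machine", "drum washer")
    print(net.hyponyms("appliance", depth=2))
    # {'washing machine', 'drum washer'}

Even a toy representation like this makes the inference step visible: following hyponymy and other relation types outwards from a known concept is what surfaces the connections a user has never seen written down.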

Liability Issues: We also encountered several liability issues in the construction and distribution of this large-scale semantic network. Given its scale and the sometimes unorthodox semantic relations it contains, the Concept Dictionary is obviously not error-free. Although human annotators are validating some core relations such as hyponymy, a million-word scale semantic network is too large for humans to fully check its validity. Moreover, some relations are extremely difficult to verify, given the unreliability of information sources on the Web. In a sense, we assumed that the dictionary is a summary of the Web rather than a "library of truths". This implies that applications utilizing this dictionary must always provide users with mechanisms for checking the validity of the information themselves. Another related issue is unintended libel. As examples, we show how some basic relations such as hyponymy, and simple inferences using our dictionary, can be a cause of libel. This means that applications utilizing the dictionary must be carefully designed to always include links to the original information source. (This may even suggest the necessity of an NLP interpretation of Asimov's "Three Laws of Robotics", which should not amount to mere censorship by humans.)

Distribution Scheme: In some cases we are distributing the automatic tools for constructing a concept dictionary instead of the databases themselves. We would like to discuss the desirable features of this distribution scheme, including its relation to the issues mentioned previously and the ease of adaptation of the dictionary on the users' side.

References
Kentaro Torisawa, Stijn De Saeger, Yasunori Kakizawa, Jun’ichi Kazama, Masaki Murata, Daisuke Noguchi, and Asuka Sumida. TORISHIKI-KAI, An Autogenerated Web Search Directory. In Proceedings of the Second International Symposium on Universal Communication (ISUC 2008), pp. 179-186, 2008.


Language Resources of the Future (a speech-based position paper)

Nick Campbell Trinity College Dublin

The University of Dublin, [email protected]

Speech Technology

Current speech technology is well capable of processing the mappings between speech and text, but the challenge at the current time is to model the information flow in interactive speech, where propositional content is conveyed along with discourse controls that simultaneously signal the speaker's cognitive, attentional, intentional, and emotional states. Current applications of speech technology include general information services, customer care, robotics, games, and interactive media content, often making use of graphical interfaces which include an avatar, embodied conversational agent, or talking head. Accordingly, the corpora that need to be collected for future speech technology research must be multi-modal, multicultural, and multilingual, incorporating material for the research of interpersonal communication strategies and speech-related bodily movements, as well as the characteristics of interactive speech itself.

Speech Corpora

There is already considerable interest in the collection, annotation, and modelling of multimodal corpora, as evidenced by the number of papers reporting such work at recent Interspeech and LREC conferences, and at the special sessions of each. However, there is still little understanding of the best ways to gather representative samples of interactive speech, and of how to control the necessary variation and range of expressivity without rendering the resulting conversational interaction unnatural or artificial in any way. The corpora of speech samples collected in the past have ranged from read speech to spontaneous speech, from isolated words and numbers to sentences and paragraphs, to whole stories and live telephone conversations and meetings. The focus in these collections has been on the content of the speech material rather than on the style(s) of interaction, and future collections might better concentrate on controlling the interpersonal aspects of the speech rather than the subject matter or content of the speech itself.

Semantic Annotation

The processing of 'meaning' in speech remains a significant challenge. An ISO standard currently being considered for dialogue act annotation (ISO/TC37/SC 4 N442 rev05) proposes eleven (11) classes of discourse act, with "task-related" as the first. Task-related discourse is lexically very complex and will continue to pose a problem for speech technology, but the remaining ten (10) major categories of spoken interaction all feature limited lexical content and are characterised by very complex prosodic variability. Perhaps the most immediate task with respect to data collection for the short-term future of speech processing research is therefore how to address the problem of the sparse coverage of these ten discourse categories.

Multiculturality

Multicultural aspects of speech data collection should also be better covered in the future. It is an important concern for resource collection that the technology should serve all aspects of society, with young and old, technical and professional, educated and naive users all being allowed equal access. However, it is not at all certain that we as a discipline have an adequate knowledge of the different needs of, or even the identity of, these different classes of society. Thus, balancing the sociological aspects of data collection will become a major concern in the near future.

Multilinguality

Universal Linguistic Rights require the provision of language services for all people in their own mother tongue. The provision of governmental, commercial, and individual services in each person's own language is a right that our technology can facilitate. However, the collection of representative resources for the less well represented languages of the world remains a major challenge. Tools and specifications ('blarks') exist for such collections, but allocating funding to cover the less well represented languages of the world must remain a high priority.

Multimodality

Whereas the focus of speech corpora to date has understandably been on the speech itself, there is a growing awareness of the multimodal nature of interactive conversational speech. Consequently, the use of video as well as audio in the collection of speech data is now becoming standard. Alongside conventional video, the use of 360-degree video and motion capture equipment is becoming common, and the amount of data in a corpus is growing at ever-increasing rates. This poses difficulties for the annotation and storage of the material as well as for its distribution, particularly when data collections span several days or months.

Crowd-sourcing

With the growth of the internet we find an increasingly social aspect to this medium and the emergence of crowd-sourcing for a variety of tasks. The annotation of speech data in this way will provide a rapid and reliable source of knowledge incorporating the views of a wide range of people from across the world, but providing easy access to such large amounts of material remains a challenge.

Distribution

Distribution of resources will remain difficult, and will become increasingly more so as the volume of data increases and comes to include multi-source video and biometric information streams. Proper structuring of metadata will allow fast access to specific subsets of the data, and by selectively varying the compression rates and podcasting the material, wider and more rapid methods of distribution should be achievable.

Rights Management

The collection of natural, unstructured conversational speech material necessarily raises the possibility of infringing personal rights when distributing such materials. The legal aspects of ownership of speech material, and the inadvertent release of personal information, pose important problems that remain to be addressed.

Quantal nature of Expressivity

Speech is not a simple or complete one-dimensional information source but functions to signal multiple aspects of underspecified propositional and interpersonal communication simultaneously. The physical sciences have mastered the art of modelling complex quantal phenomena by addressing such uncertainty without requiring any one specific interpretation to be dominant in absolute terms; we might benefit from adopting a similar approach to the processing of meaning and the interpretation of spoken communication.


Next steps: From the sentence structure to the structure of discourse

Eva Hajičová Charles University - Prague

1. At present, it can (hopefully) be taken for granted that the computational-linguistic community interested and/or involved in corpus annotation is aware that efforts in this field should concern, among other things, a study of methods and scenarios for annotating some kind of underlying (deep, syntactico-semantic) structure of the sentence. It also seems to be commonly accepted that there is an important advantage in designing annotation scenarios in a systematic way rather than proceeding from one phenomenon (or class of phenomena) to another, one by one. The latter approach certainly offers richer empirical material for the study of the individual phenomena (if only sentences containing the phenomenon in question are selected, its occurrences are more frequent), but it involves an indisputable danger that, when passing on to other phenomena, the consistency of the annotation and the account of the interaction of different phenomena or structures will be lost.

2. Another point which our community (again, hopefully) shares is the belief that sentence-structure annotation is not the ultimate goal of corpus annotation, because many different applications require data that go beyond the boundary of the sentence, namely what can be called discourse structure. This step is connected with several methodological, linguistic and formal issues that are briefly discussed below. To a certain extent, these issues also go beyond linguistics proper: to deal with them, we would need to consult the results and achievements of cognitive linguistics, argumentative logic, etc.

3. One of the relevant methodological aspects is the issue of the status of the discourse layer. In other words, in a multilayered annotation scenario, is it an additional layer on top of the (underlying, deep, syntactico-semantic, tectogrammatical) annotation, or is it a layer which stands "aside" or is even independent? There are two aspects of this question that have to be taken into account. First, some relations that are usually understood as 'discourse relations' are also inherent to the structure of the sentence and as such must be accounted for in any annotation scenario for the syntactico-semantic structure of the sentence. This becomes apparent especially in dependency-based schemes, in which embedded clauses are customarily represented as dependent (via their verbs) on the verb of the governing clause. This mainly concerns the so-called free modifications (adjuncts), and it is still an open question (for linguistic research) to what extent such dependency relations have the same character as relations between sentences: for example, whether the temporal relation between the main and dependent clause in "When mother came home, she immediately put on an apron and began to prepare dinner." is the same as the relation between the independent sentences in "Mother came home. She immediately put on an apron and began to prepare dinner." The same considerations hold for cases of coordination: in the above situation we can segment the sentence into three independent sentences: "Mother came home. She immediately put on an apron. She began to prepare dinner." Even more difficult is the decision in cases of so-called "parcelation", i.e. expressing sentence parts as independent sentences, as in the following segments: "That dog barks all the time. And even bites. And also scratches." or "He came late. Approximately at seven." One way to deal with such constructions is to treat them as superficial deletions, i.e. to "reconstruct" the deleted elements in the underlying structure of the sentence.


Second, sentences often contain elements that stand somehow "outside" the structure of the given sentence, though they by no means can be treated as auxiliaries (i.e. as elements that can be 'merged' at the level of underlying structure with the elements they 'belong' to, as is the case e.g. for auxiliaries in complex verb forms, prepositions in prepositional groups, etc.). Rather, these elements indicate a relation of the sentence they are part of (or of some structure within that sentence, or even of an item within that sentence) to the preceding context (be it a sentence part or a whole sentence or even a bigger segment). In the annotation of the underlying level of the Prague Dependency Treebank, such items are treated as separate nodes which are assigned their lexical values and the label PREC(eding). This label then serves as the basis for a detailed semantic analysis of these cases with respect to their impact on discourse relations.

4. The above issues represent a challenge that concerns mostly the linguistic and methodological aspects of going "beyond" the boundary of the sentence. However, going beyond the sentence boundary also raises a challenge concerning the formal and technical aspects. What formal object should be chosen for the representation of discourse relations? Should we work with some sort of "mega-trees" (i.e. the interconnection of underlying trees into bigger wholes) or with some kind of relational structures? It should be emphasized that this is not only a formal or technical issue: there are good linguistic reasons to preserve, in the representation of discourse structure, the internal structure of the sentence in some way or another – be it in some enriched form or in some partitioned form. The most important argument for such an approach is the semantic relevance of the scope of quantifiers and negation, and the related issues concerning the nature of entailments (esp. the specification of presuppositions of the given sentence). A closely related issue is the decision whether discourse annotation can be based on pure text (unannotated as for sentence structure), i.e. on strings of surface sentence elements, or whether an annotation of discourse relations can (and even should) benefit from the underlying-level annotation. Our arguments above concerning the relevance of intrasentential relations for the semantic interpretation of discourse speak in favour of the latter approach.

5. Our brief enumeration of challenges connected with the future of annotation is, of course, very selective, and in several respects based on our experience with creating an additional layer of annotation. We have completely disregarded specific issues connected with the annotation of speech, which deserve special attention and certainly belong to the future of annotation, and which would also bring into the foreground issues connected with the representation of dialogues rather than monologue texts. And closely related to the annotation of discourse structure are the relations of coreference and anaphora. There is a lot of work ahead of us and many exciting topics for further research. Let's go ahead.
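
To make the representation question concrete, here is a minimal sketch of how nodes of an underlying dependency tree, including items labelled PREC, might be stored and queried; the class, the functor inventory shown and the toy sentence are illustrative assumptions, not the actual PDT data format.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        lemma: str
        functor: str                 # e.g. "PRED", "ACT", "PAT", "TWHEN", "PREC"
        children: List["Node"] = field(default_factory=list)

    def discourse_connectives(root):
        """Collect all nodes labelled PREC, i.e. candidates for discourse links."""
        found = [root] if root.functor == "PREC" else []
        for child in root.children:
            found.extend(discourse_connectives(child))
        return found

    # Toy underlying tree for "However, she began to prepare dinner."
    root = Node("prepare", "PRED", [
        Node("however", "PREC"),
        Node("she", "ACT"),
        Node("dinner", "PAT"),
    ])
    print([n.lemma for n in discourse_connectives(root)])   # ['however']

A structure of this kind keeps the intrasentential tree intact while exposing exactly those nodes that would need to be connected to the preceding context, which is one way of interpreting the "mega-tree versus relational structure" question above.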


Multi-microphone speech corpora for robust applications in real-world environments

Maurizio Omologo Fondazione Bruno Kessler

Center for Information Technology - irst 38100 Povo (Trento) – Italy

Abstract

The purpose of this position paper is to present a personal outlook on possible future actions aiming to realize new corpora and research tasks related to robust speech recognition and, in general, to acoustically based perceptual technologies for both monitoring and supporting human activities in real life.

Introduction

During the last two decades, research on Automatic Speech Recognition (ASR) technology has made significant advances. High performance speech recognition and understanding systems are available for situations where there is a good match between testing and training conditions. Despite this technological progress, the performance of conventional ASR systems often degrades due to variabilities in the input signal which are not always accounted for in the system design and in the training phase. In particular, environmental noise and reverberation due to a microphone located far from the speaker introduce effects that are very difficult to overcome. More information and studies are necessary to better understand the basic problems and to outline new possible approaches to address them.

In recent years, Acoustic Scene Analysis (ASA) has also been investigated with the goal of sensing an environment by means of a set of microphones distributed in space in order to extract cues regarding speaker location, speaker identity, noise sources, etc., and eventually to recognize speech activities in cocktail-party situations.

Full integration of ASR and ASA represents a challenging research goal, but its potential is very high for a wide range of novel real-world applications. Besides traditional applications of ASR such as automatic dictation, data entry, in-car command and control, voice dialing and call routing, many new human-computer interface applications are envisaged for the future (e.g. for home automation, assistive devices for disabled users, video gaming, surveillance, etc.). To this purpose, standardized corpora and tasks are necessary for progress both at the research and at the benchmarking level.

Resources and tasks

ASR technologies are generally based on statistical methods relying on large speech corpora to represent all the given sources of variability in the best possible way. Many corpora (e.g., TIMIT, Aurora, Switchboard, etc.) were developed over the years for research, for benchmarking innovation results as well as for the development of real products. An important contribution to the progress in the ASR field derives from the availability of those unique corpora and of the related manual annotations.


As far as noisy speech recognition is concerned, in particular in the case of cocktail-party situations where the use of close-talking microphones is not possible, some work was conducted in past years to provide researchers and industry with related multi-microphone corpora and tools for specific application contexts (see, for instance, the activities related to the past EC projects AMI, AMIDA, and CHIL). The resulting corpora represent further unique and valuable tools for addressing specific studies and contexts (e.g., meetings in AMI), but in general they are characterized by the availability of a limited number of synchronized acoustic sensors. In this regard, it is also worth mentioning another interesting example, the more recently developed COSINE corpus (see http://ssli.ee.washington.edu/cosine), which consists of multi-microphone in-situ recordings of multi-person conversations on everyday topics conducted in quite different environments, both indoors and outdoors. Also in this case, however, the number of sensors was quite limited.

In order to address more effectively the convergence between multi-microphone signal processing and noisy speech recognition and understanding techniques, increasing the number of microphones observing a given scene seems to be one of the fundamental steps towards better exploiting knowledge of room acoustics and noise source activities. Multi-channel acquisition platforms and special devices (e.g. spherical arrays, MEMS digital microphone arrays, etc.) are today available at a reasonable cost for organizing challenging and ambitious resource and task creation activities. High-quality synchronized multi-microphone recordings may be performed, for instance, to provide a detailed description of multi-party conversations as well as of acoustic scenes captured in real-world environments. Semi-automatic transcription and labeling approaches can then be adopted to reduce the effort required for annotation, which in general represents another relevant cost for these actions. The final target is to eventually develop and standardize large corpora and tools for advanced studies which can be conducted, from different perspectives but in a unified and coherent way, across the acoustics, speech, and language communities.
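
As a small illustration of what working with such synchronized recordings involves, the following is a minimal sketch that loads a multi-channel recording and computes per-channel short-term energy; the file name, channel layout and descriptor are purely illustrative assumptions, and the numpy and soundfile packages are assumed to be available.

    import numpy as np
    import soundfile as sf

    # Read one multi-channel WAV holding all synchronized microphones.
    signal, rate = sf.read("room_array.wav")        # shape: (samples, channels)
    n_samples, n_channels = signal.shape
    print(f"{n_channels} synchronized channels at {rate} Hz")

    # Crude per-channel energy over 20 ms frames, e.g. to see which
    # microphones pick up most of an active speech or noise source.
    frame = int(0.02 * rate)
    n_frames = n_samples // frame
    frames = signal[:n_frames * frame].reshape(n_frames, frame, n_channels)
    energy = (frames ** 2).mean(axis=1)             # shape: (frames, channels)
    print("most energetic channel per frame:", energy.argmax(axis=1)[:10])

Even this crude per-channel view hints at why more microphones help: the spatial pattern of energy over time carries information about room acoustics and source activity that a single distant microphone cannot provide.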


Multimodal Language Resources

Jean Carletta
University of Edinburgh

Edinburgh, [email protected]

Most current language resources are for text. It's important to start somewhere, but in the age of broadband, audio and video are so easy for citizens to produce and access that they are taking over in areas where text used to serve — such as news media, instruction manuals, even soldiers' letters home. The language resource of the future will be multimodal in two ways: it will support the processing of multimodal language data, and it will use multimodal data to improve the processing of things that are just text.

1 Multimodal data collection

Collecting multimodal data even a couple of years ago was difficult because the equipment was expensive, big, and hard to synchronize. However, now it's possible to put together a portable kit that records language interactions very well. This means the recording of things like corpora and lexical examples in their full context is less of a problem. It does require much more storage space than textual or speech data — one terabyte for the AMI Meeting Corpus, for instance. Even now, the best way to transfer this kind of data in its full detail is by shipping it on firewire disks. However, this is a problem that will likely solve itself over the coming years.

2 Annotation

Corpora are much more useful if they come annotated, and the best way of linking multimodal examples to other kinds of data resources is to treat the links as annotations on some aspect of the signal. We already have at least basic annotation tools. There are many very good tools for placing timestamped labels on video and audio data, and sometimes for adding a little bit of structure over the top of the annotations — aligning them into hierarchies, for instance, or just adding cross-references among them. There are also tools for marking up text in all kinds of ways, some of which can be used on transcribed language and which will play sets of signals and associate signal timestamps with the annotations. The same annotation infrastructure can be used to represent the results of automatic processing on signals. These tools aren't really easy enough to use and it's a pain to translate from one data format to another, but they do exist.
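
For concreteness, here is a minimal sketch of what such stand-off, timestamped annotation can look like as a data structure — labels point into the signal rather than modifying it. The field names and the toy example are assumptions for illustration and do not reproduce any existing annotation schema (e.g. that of the AMI corpus tools).

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Annotation:
        signal_id: str                          # which recording the label points into
        start: float                            # seconds from signal start
        end: float
        layer: str                              # e.g. "dialogue-act", "gesture"
        label: str
        parent: Optional["Annotation"] = None   # cross-references / hierarchy

    @dataclass
    class AnnotatedSignal:
        signal_id: str
        annotations: List[Annotation] = field(default_factory=list)

        def in_span(self, t0, t1):
            """Return annotations overlapping the time window [t0, t1]."""
            return [a for a in self.annotations if a.start < t1 and a.end > t0]

    meeting = AnnotatedSignal("meeting01.video")
    meeting.annotations.append(
        Annotation("meeting01.video", 12.3, 14.1, "dialogue-act", "inform"))
    print(meeting.in_span(10.0, 13.0))

The point of the stand-off arrangement is that automatic tools and human annotators can keep adding layers that reference the same timeline without ever touching the signal or each other's data.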

3 The language resource eco-system

Most language technologists use existing language resources as input and create things that could be useful language resources for others as a byproduct of their work, but few of them then release these resources for others. They'd often like to, but it's just too hard to figure out how to make what they have work with what else is available well enough for other people to get value from it. Most of the framework that we need for this (stand-off annotation with resource dependency tracking, open content share-alike licensing, cloud computing, the data category registry) exists in theory and in small-scale application, but the way the parts come together to make a whole hasn't been tested. Just demonstrating that they can be made to work together, and then looking for gaps, is the best way to move forward. Although this is also true for simpler resources, everything about dealing with multimodal ones is harder — they're bigger, harder to process, with more possibilities for complex structure in the annotations. It's important to take them into account early, because it's inevitable that we'll want them in future, and otherwise there's no guarantee they'll fit into what we design and build.


The Essential Role of Language Resources for the Future of Affective Computing Systems:

A Recognition Perspective  

Laurence Devillers and Björn Schuller LIMSI‐CNRS, France 

Recognition of emotion in speech has recently matured into one of the key disciplines in speech analysis, serving next-generation human-machine and human-robot communication and media retrieval systems. Numerous studies in the last decade have tried to improve on features and classifiers. However, in comparison to related speech processing tasks such as Automatic Speech and Speaker Recognition, practically no standardized corpora and test conditions exist to compare performances under exactly the same conditions. The availability of standard benchmarks has stimulated research in automatic speech recognition. Compared to this field, where several hours of speech from a multitude of speakers in a great variety of different languages are available, sparseness of resources has accompanied emotion research to the present day. Genuine emotion is hard to collect, ambiguous to annotate, and tricky to distribute due to privacy preservation. Acting of emotions was often seen as a solution to the desperate need for data, which often resulted in further restrictions such as little variation of spoken content or few speakers. As a result, many interesting and potentially progress-driving ideas cannot be addressed, such as the clustering of speakers or the influence of languages, cultures, speaker health state, etc. The few available corpora suffer from a number of issues owing to the peculiarities of this young field, such as: often no related task, different forms of modeling ranging from discrete through complex to continuous emotions, and a never fully solid ground truth due to the often highly different perception of the mostly very few annotators. The sparseness manifests itself not only in the small number of corpora, but also in their size: the most widely used ones feature below 30 minutes of speech. Finally, cross-validation without strict test, development, and training partitions and without strict separation of speakers throughout partitioning is the predominant evaluation strategy, which is obviously suboptimal. In this short paper we thus aim to give best practices for overcoming these issues, as follows:

1. The quality of the "emotional" annotations in the corpora is fundamental [1, 2, 3]. Previous LREC workshops on Corpora for Research on Emotion and Affect (at LREC 2006 and 2008) have helped to consolidate the field, and in particular there is now growing experience not only of building databases but also of using them to build systems (for both synthesis and detection). The LREC 2010 follow-up aims to continue the process, and lays particular emphasis on showing how databases can be or have been used for this system building. The main recommendations are:

- To use a rich "emotion" annotation scheme: multiple labels for describing emotions combined into a soft vector of emotion (a minimal sketch of such a vector is given after this list), annotation with labels and dimensions, annotation of the contextual information that triggers emotions, annotation of actions correlated with emotion in social interactive corpora, and provision of metadata concerning the context of the collection, which are at the same time the cornerstone of better interoperability of language resources,

- To train the coders to ensure a rigorous perceptual annotation, to know the personality of the coders, and to try to use different and balanced personalities for annotating the data, and

- To validate the annotations with a rigorous protocol by inter- and intra-annotator agreement considerations and perception tests with large numbers of subjects.
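
The following is a minimal sketch of one way such a soft emotion vector could be derived, assuming it is simply the normalised distribution of the labels assigned by several coders to one utterance; the label set and the coder decisions are invented for illustration and do not correspond to any particular annotation scheme.

    from collections import Counter

    def soft_vector(coder_labels, label_set):
        """Turn raw per-coder labels for one utterance into a normalised soft vector."""
        counts = Counter(coder_labels)
        total = sum(counts.values())
        return {label: counts.get(label, 0) / total for label in label_set}

    LABELS = ["anger", "fear", "joy", "neutral"]     # hypothetical label set
    coders = ["anger", "anger", "neutral"]           # three coders, one utterance
    print(soft_vector(coders, LABELS))
    # {'anger': 0.666..., 'fear': 0.0, 'joy': 0.0, 'neutral': 0.333...}

Keeping the full distribution rather than a single majority label preserves the annotators' disagreement, which is itself informative given the often blended nature of real-life emotion.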

2. The community needs measures for qualifying emotional databases: measures of similarity/dissimilarity or of naturalness in the different databases – related to the task or not. Such measures could be computed from acoustic features and some metadata. At LIMSI, as a first step, we have studied differences between acoustic manifestations of anger across corpora collected in artificial, manipulated or natural contexts, aiming at finding measures of naturalness in emotive corpora. Evaluating the degree of naturalness of a corpus can be challenging unless knowledge about the task is given. In corpora consisting of rather acted data, anger, for example, is often "stronger". Thus, a kind of distance can be computed between anger (or other suited emotions) and the overall corpus data. Such a distance is introduced in [4] and evaluated with state-of-the-art acoustic descriptors in three collected corpora (cf. above): we show the observed differences between the acoustic features obtained from anger samples in these different contexts and propose corresponding measures of naturalness. The more a corpus is acted, the more differences between anger and the entire corpus are found. This work also shows that an acted corpus tends to contain emotions that we are not equally able to detect in a spontaneous one; as a result it thus seems important to work with both types of corpus to maintain a general application viewpoint.
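
As a rough illustration of the kind of distance described above, the following sketch compares the distribution of a single acoustic descriptor (here, an invented mean-F0 value per utterance) for anger samples against the whole corpus; the descriptor, the numbers and the measure itself are illustrative assumptions and not the actual measure evaluated in [4].

    import numpy as np

    def standardized_distance(subset, corpus):
        """Distance of the subset mean from the corpus mean, in corpus standard deviations."""
        subset = np.asarray(subset, dtype=float)
        corpus = np.asarray(corpus, dtype=float)
        return abs(subset.mean() - corpus.mean()) / corpus.std()

    # Hypothetical mean-F0 values (Hz) per utterance.
    acted_anger   = [310, 305, 320, 298]
    natural_anger = [245, 238, 252, 240]
    whole_corpus  = np.random.default_rng(0).normal(220, 25, 500)

    print("acted:  ", standardized_distance(acted_anger, whole_corpus))
    print("natural:", standardized_distance(natural_anger, whole_corpus))
    # The acted subset lies further from the corpus distribution, consistent
    # with the observation that acted anger is often "stronger".

A single scalar of this kind is of course only a starting point; in practice such measures would combine many acoustic descriptors and metadata, but the principle of quantifying how far an emotion class sits from the rest of the corpus carries over.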

3. The general sparseness of emotional data – be it genuine natural or acted data – is the main problem for training generic models. For natural data this is often amplified by the fact that even in large amounts of speech only a small percentage of non-neutral behavior is present [4]. A promising idea is thus to use multiple corpora collected in different contexts (different tasks) for training more generic models and to enrich the variability of the data used [5]. A difficulty particular to the field in this context is the aforementioned ambiguity and the different definitions of labels, even when these are "textually equal", e.g. different definitions of "interested". In [6], we successfully used two corpora collected in different contexts to improve an emotion detection system. However, one cannot be sure to gain from blending by definition, owing to the problems stated above. Thus, a measure describing the similarity or difference between data and emotion definitions as described under 2 could be used in advance to find suitable mixtures or adaptation strategies for multiple corpora.

4. In addition, social emotions differ once facing multilingualism and multiculturalism. To study the impact of these influences, data recorded in the same application context and varying only in this respect will be needed. Here, however, additional measures may be needed before benefiting from a merge of data as described under 3, such as extracting or enhancing the (acoustic) emotional "fingerprint" invariant to language, or balancing the intensity with respect to cultural differences, etc.

5. Emotion detection systems are usually built from annotated data and training processes that are potentially erroneous at different levels: perception of emotion, feature extraction and selection, and statistical model learning. Thus, measures of reliability are needed at the different steps of the process, e.g. by the (temporal) utilization of different pitch detection algorithms or alternative learning schemes.

6. There is worldwide consensus that establishing an open resource infrastructure will enable users to deploy language resources in a broadly conceived sense – thus also encompassing tools – to their full potential, by helping them overcome the problems related to accessibility rights and interchange formats. One such example would be the Munich open-source environment for feature extraction and emotion recognition [7].

7. Further benchmarks and evaluations, such as the first official Emotion Challenge at INTERSPEECH 2009 [8], will be needed and are among the primary concerns of our current activities.

8. Finally, ethical issues need to be dealt with: these concern privacy and legal issues, but also the ethics of the target application and issues arising from erroneous classification or from users' unawareness of potentially wrong recognition.

References
[1] R. Cowie, E. Douglas-Cowie, J.-C. Martin, L. Devillers: "The Essential Role of Human Databases for Learning in and Validation of Affectively Competent Agents", in: A Blueprint for an Affectively Competent Agent – Cross-fertilization between Emotion Psychology, Affective Neuroscience, and Affective Computing, K. Scherer, T. Bänziger, E. Roach (eds.), Oxford University Press, 2010.
[2] Proc. Second International Workshop "Corpora for Research on Emotion and Affect", Satellite of LREC 2008, L. Devillers, J.-C. Martin, R. Cowie, E. Douglas-Cowie, A. Batliner (eds.), ELRA, Marrakech, Morocco, 2008.
[3] Proc. First International Workshop "Corpora for Research on Emotion and Affect", Satellite of LREC 2006, L. Devillers, J.-C. Martin, R. Cowie, E. Douglas-Cowie, A. Batliner (eds.), ELRA, Genoa, Italy, 2006.
[4] L. Devillers, L. Vidrascu, L. Lamel: "Emotion detection in real-life spoken dialogs recorded in call center", Neural Networks, Special Issue on "Emotion and Brain", Elsevier, Vol. 18, No. 4, pp. 407-422, 2005.
[5] M. Tahon, L. Devillers: "Acoustic measures characterizing anger across corpora collected in artificial or natural context", Proc. Speech Prosody 2010, ISCA, Chicago, USA, 2010.
[6] M. Brendel, R. Zaccarelli, L. Devillers: "Building a system for emotion detection from speech to control an affective Avatar", Proc. LREC 2010, ELRA, Valetta, Malta, 2010.
[7] F. Eyben, M. Wöllmer, B. Schuller: "openEAR – Introducing the Munich Open-Source Emotion and Affect Recognition Toolkit", Proc. 4th International HUMAINE Association Conference on Affective Computing and Intelligent Interaction 2009 (ACII 2009), IEEE, Amsterdam, The Netherlands, pp. 576-581, 2009.
[8] B. Schuller, S. Steidl, A. Batliner: "The Interspeech 2009 Emotion Challenge", Proc. Interspeech 2009, ISCA, Brighton, UK, pp. 312-315, 2009.


S6. International Cooperation 14:30-16:30

Chair: Nicoletta Calzolari – Rapporteur: Joseph Mariani

Introduction by the Session Chair and Rapporteur

Contributions

Dafydd Gibbon (Universität Bielefeld, DE / COCOSDA)

Isabel Trancoso (INESC-ID / IST, PT / ISCA)

Girish Nath Jha (Jawaharlal Nehru University, IN)

Kentaro Torisawa (NICT, JP)

Takenobu Tokunaga (TITech, JP)

Toru Ishida (Kyoto University, JP)

Key-Sun Choi (KAIST, KR)

Virach Sornlertlamvanich (NECTEC, TH)

Nancy Ide (Vassar College, USA)

James Pustejovsky (Brandeis University, USA)

Satoshi Sekine (New York University, USA)

Christopher Cieri (University of Pennsylvania - LDC, USA)

Hans Uszkoreit (DFKI, DE)

Steven Krauwer (Universiteit Utrecht, NL)

Khalid Choukri (ELRA / ELDA, FR)

Stelios Piperidis (ILSP - “Athena” R. C., GR)

Roberto Cencioni (European Commission – INFSO - E.1, LUX)


Closing Session 16:30-17:30

Chair: Nicoletta Calzolari

FLaReNet Sessions Rapporteurs
S1. Nancy Ide (Vassar College, USA)
S2. Khalid Choukri (ELRA / ELDA, FR)
S3. Núria Bel (Universitat Pompeu Fabra, SP)
S4. Gerhard Budin (Universität Wien, AT)
S5. Joseph Mariani (LIMSI - CNRS / IMMI, FR)
S6. Joseph Mariani (LIMSI - CNRS / IMMI, FR)

Nicoletta Calzolari (ILC-CNR, IT)

Roberto Cencioni (European Commission – INFSO - E.1, LUX)


Organisation

Scientific Committee

Nicoletta Calzolari (ILC-CNR, Pisa, ITALY)

Khalid Choukri (ELDA, Paris, FRANCE)

Stelios Piperidis (ILSP / “Athena” R. C., Athens, GREECE)

Gerhard Budin (Universität Wien, Wien, AUSTRIA)

Jan Odijk (Universiteit Utrecht, Utrecht, THE NETHERLANDS)

Núria Bel (Universitat Pompeu Fabra, Barcelona, SPAIN)

Joseph Mariani (LIMSI/IMMI-CNRS, Paris, FRANCE)

Organising Committee

Nicoletta Calzolari (ILC-CNR, Pisa, ITALY)

Paola Baroni (ILC-CNR, Pisa, ITALY)

Tommaso Caselli (ILC-CNR, Pisa, ITALY)

Riccardo Del Gratta (ILC-CNR, Pisa, ITALY)

Monica Monachini (ILC-CNR, Pisa, ITALY)

Valeria Quochi (ILC-CNR, Pisa, ITALY)

Irene Russo (ILC-CNR, Pisa, ITALY)

Claudia Soria (ILC-CNR, Pisa, ITALY)

Local Committee

Joan Soler i Bou (IEC, Barcelona, SPAIN)

Judit Feliu i Cortès (IEC, Barcelona, SPAIN)

Roser Sanromà Borrell (IEC, Barcelona, SPAIN)
