ecp 2007 lang 617001 flarenet · 2009-11-02 · d6.1a – survey and assessment of methods for the...
D6.1a – Survey and assessment of methods for the automatic construction of LRs
ECP2007LANG617001
FLaReNet
Deliverable D6.1a
Survey and assessment of methods for the automatic construction of LRs. Report on automatic acquisition, repurposing and innovative proposals for collaborative building of LRs
Deliverable number/name D6.1a – Survey and assessment of methods for the automatic construction of LRs
Dissemination level Public
Delivery date 30 September 2009
Status Final
Author(s) Carla Parra, Núria Bel, Valeria Quochi
eContentplus
This project is funded under the eContentplus programme1, a multiannual Community programme to make digital content in Europe more accessible, usable and exploitable.
1 OJ L 79, 24.3.2005, p. 1.
Table of Contents
1 INTRODUCTION
2 RESULTS OF THE STUDY ON AVAILABLE LINGUISTIC RESOURCES FOR EUROPEAN LANGUAGES AND TECHNOLOGIES, DESCRIPTION AND COMPLIANCE WITH STANDARDS. IDENTIFICATION OF THE MOST DEMANDED AND THE MOST URGENT RESOURCES.
2.1 INITIAL CONSIDERATIONS AND SCOPE OF THE STUDY
2.2 APPLICATIONS RELATED TO WRITTEN TEXT
2.3 BLARKS AND MOST DEMANDED RESOURCES
2.4 MAIN CONCLUSIONS DERIVED FROM THE STUDY. MOST DEMANDED AND MOST URGENT RESOURCES
3 CURRENT TECHNIQUES FOR THE AUTOMATIC PRODUCTION OF THE MOST DEMANDED AND MOST URGENT RESOURCES.
3.1 CURRENTLY AVAILABLE TECHNOLOGIES
3.1.1 Written corpus acquisition and annotation
3.1.2 Parallel corpus technologies
3.1.3 Lexical acquisition
3.2 A SURVEY OF THE LAST ACADEMIC PROPOSALS (2006 TO 2009)
3.2.1 Written corpus acquisition
3.2.2 Parallel treebank acquisition
3.2.3 Lexical acquisition
3.2.4 Grammar acquisition
4 CONCLUDING REMARKS AND RECOMMENDATIONS OF THE FIRST VERSION.
5 BIBLIOGRAPHY
5.1 BOOKS
5.2 ARTICLES (BESIDES THOSE DESCRIBING THE LR STUDIED)
5.3 ELECTRONIC RESOURCES (BESIDES ALL THE INDIVIDUAL WEBSITES OF THE LR STUDIED)
APPENDIX I: SOME REMARKS ON THE DESCRIPTION AND STANDARDIZATION OF RESOURCES NOWADAYS
1 Introduction
This deliverable provides an overview of the current state of the art of LRs in Europe, with a special focus on their automatic construction. The first part of the document presents a survey of the most demanded resources, which are used as the core element of some NLP applications. This survey is complementary to D2.1 and is reported here because it is instrumental to the reflections on the available techniques of LR production. The second part of the document is devoted to an overview of the current techniques for the automatic construction of LRs. In this part, we include a survey of the latest academic proposals for the automatic acquisition and production of LRs, both to confirm the interest that these topics raise in the research community and as the starting point for a classification of the methods and resources addressed. The last section summarises the main conclusions derived from the study, formulated as recommendations for future steps within the FLaReNet consortium and the Language Resources community. Finally, this deliverable starts the drafting of a comprehensive bibliography of methods, techniques and tools for the automatic acquisition of linguistic resources.
2 Results of the study on available linguistic resources for European languages and technologies, description and compliance with standards. Identification of the most demanded and the most urgent resources.
2.1 Initial considerations and scope of the study
Before compiling the inventory of current techniques for the automatic production of LRs, we carried out a survey of existing linguistic resources to identify which are missing and which are most developed, and thus to determine which resources are most in demand and most urgently needed. The purpose of this preliminary survey was to gather the information needed to decide to what extent the resources in demand can be produced automatically. In this section, we discuss the results of the survey and the main conclusions reached. A first element that needs to be pointed out is the complexity and diversity of existing linguistic resources. As figure [1] shows, there are not only different kinds of resources, but these resources may also have different characteristics. For the purposes of this section, we simplify the list of possible resources to eight types of linguistic resources which, in turn, may be subdivided depending on their intrinsic characteristics. The categories we use are grammars, lexical resources, corpora (oral, written, multimodal, transcribed), treebanks, ontologies, acoustic models, translation models and language models.
Figure 1 Overview of the types of existing resources
The compilation of information for this first survey was harder than expected because of the lack of documentation for most of the resources surveyed. Besides, the availability of the resources themselves is problematic: sometimes a resource found in one of the catalogues/repositories is no longer available or simply cannot be found; sometimes only a paper reporting on some aspects of it can be found; and sometimes the information is distributed among different websites, documents or conference papers. This makes it really difficult to carry out an efficient and consistent study, as the information found is not always coherent (e.g. not every corpus specifies the number of words it contains) and sometimes it even differs from the information found in the catalogues/repositories. Bearing in mind these two constraints, time and lack of adequate documentation, we decided to study a significant number of resources taken from the CLARIN2 and ELRA3 repositories. Concretely, we studied a total of 728 resources, representing approximately 46% of the data in the CLARIN repository and 31% of the data in the ELRA repositories4, as consulted in September 2009. Obviously, there is some overlap between the two catalogues, but for our purposes the same resource was counted only once. The resources studied are distributed as follows: 127 resources (17%) are present in both the ELRA and CLARIN repositories; 332 (46%) are found only in CLARIN and 269 (37%) are found only in ELRA.
2 http://www.clarin.eu/view_resources and http://www.clarin.eu/view_tools.
3 http://catalog.elra.info/search.php and http://universal.elra.info/search.php.
4 Please bear in mind that both repositories are regularly updated and maintained and, thus, the percentages are lower than at the time the data were collected and analysed (in November 2008 we took the whole CLARIN repository, but as of September 2009 this corresponds to only 46% of the repository, as more resources have been added to it).
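The shares quoted for the three groups of resources can be re-derived from the raw counts. The following is only a minimal arithmetic sketch: the counts are those given in the text, while the variable names are ours and purely illustrative.

```python
# Counts reported in the text; the labels and variable names are illustrative.
counts = {
    "both ELRA and CLARIN": 127,
    "CLARIN only": 332,
    "ELRA only": 269,
}

# The three groups should add up to the 728 resources studied.
total = sum(counts.values())

# Share of each group, rounded to the nearest whole percent.
shares = {name: round(100 * n / total) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name}: {n} resources ({shares[name]}%)")
```

The three counts add up to 728 and the rounded shares are 17%, 46% and 37%, which agrees with the figures reported above.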
Figure 2 LR distribution in the study carried out
Figure [2] shows the distribution of the 728 resources covered by our study. As can be seen, the most common resources are corpora (written corpora, 39%; oral corpora, 21%) and lexical resources (19%). Since our study does not cover all existing resources in Europe, we have prepared the same chart for the overall CLARIN and ELRA registries (see figures [3] and [4] below).
Figure 3 LR distribution in the CLARIN repository
Figure 4 LR distribution in the ELRA repository
As both charts confirm, written corpora are the most frequent LRs (34% of the CLARIN repository and 51% of the ELRA one), followed by lexical resources (24% and 16%, respectively), oral corpora (13% and 17%) and tools (15% and 5%). However, we were not only interested in finding out which linguistic resources are most frequent. As mentioned before, this study aimed at finding out which LRs are most widely used in applications, in order to focus on the automatic techniques to produce them. The possible applications offered by computational linguistics may be summarised as reported in figure [5]:
Figure 5 Applications overview
Obviously, creating such applications requires linguistic resources, tools and technologies. The linguistic resources involved are those referred to in figure [1], whereas the tools can be seen in figure [6] below:
Figure 6 – Tools overview
Figure [6] is just an overview of the different existing tools5. Each tool may be used independently or in combination with others, depending on its intrinsic characteristics and on the application that makes use of it. Furthermore, some tools may require that another one is run beforehand, and some may already have been designed as workflows of different “atomic” tools in order to produce a concrete result. Finally, these tools are developed using specific technologies and algorithms such as Hidden Markov Models (HMM), n-gram models, etc. Figure [7] shows the main technologies behind the different tools and applications.
5 The inventory of tools follows the ones proposed by Jurafsky, Daniel, and James H. Martin. 2009.
Figure 7 Technologies overview
2.2 Applications related to written text
In this section we will concentrate on the linguistic resources which are crucial for the creation of NLP systems for the analysis of written texts and compare to what extent the required LRs for the good performance of these systems are currently documented as being used. We concentrated on two case studies: machine translation and information extraction systems. Figures [8] and [9] below summarise the results of this comparison.
Figure 8 Resources used in Machine Translation
Looking at the three main approaches to machine translation (rule-based, statistical and hybrid), it is easy to see that lexical resources, written corpora and grammars are essential for the development of machine translation applications. During our study, we also gathered information on the intended or actual usage of the resources under study. However, this information was not usually accessible and few resource providers indicated it. The table at the bottom of figure [8] reports the results of this survey. Although we know that lexical resources, corpora and grammars are required for the implementation of machine translation applications, few resources in our survey indicate that they are used or intended to be used for machine translation (9.7%, 6.04% and 0% respectively). The highest percentage was obtained by treebanks (12.76%). These results may show that the lack of proper documentation hides the use of other types of resources for machine translation, that providers are not aware of the possible exploitation of a resource created for other purposes, and/or that the resources already available are under-exploited. Figure [9] shows the same data for Information Extraction systems. In this case, it is clear that lexical resources are the most demanded resources for the creation of IE applications, and that written corpora are also used in the development of question-answering, topic detection and tracking, and indexing systems.
Figure 9 Resources used in Information Extraction
The table at the bottom of figure [9] reports the percentages of the resources identified in our preliminary survey that indicate that they are used or intended to be used in IE applications. The results are very similar to the ones obtained for the machine translation case study. Only 8.95% of the lexical resources indicate that they can be used for Information Extraction in general; 2.98% indicate that they are or may be used in Information Retrieval, 0% in Summarisation and 1.5% in Question-Answering.
These results show, again, a lack of documentation, a lack of awareness of the further exploitation of a resource created for other purposes, and/or the under-exploitation of the resources already available.
2.3 BLARKs and most demanded resources
Another important source of information besides our own study is the concept of the BLARK, the Basic Language Resource Kit (Krauwer, 1998), and the existing BLARK matrices, which we reproduce below from Krauwer (2003). As expressed by Krauwer (2003), the BLARK is defined “as the minimal set of language resources that is necessary to perform any precompetitive research and education at all. The definition is in principle intended to be language independent, but as specific languages may come with different requirements, instantiations of the BLARK may vary in some respects from language to language”. In principle, BLARKs take the form of matrices in which data (linguistic resources), modules (tools and simple NLP applications) and applications (complete NLP applications) are interrelated, showing the importance they have for each other. Figure [10] shows a “basic” BLARK matrix. It gives an overview of the importance of data for modules and of the importance of modules for applications.
Figure 10 BLARK table6
A closer look at the BLARK matrix shows that the data it includes are basically lexical resources and corpora. If we focus on the relations between data and modules, in both Language and Speech Technology, it is once more clear that the most required linguistic resources are lexical data and corpora. Specifically, almost all modules under Language Technology require annotated corpora and monolingual lexica, and unannotated corpora seem to play an important role too. As far as the modules under Speech Technology are concerned, the most important resources, which appear in all modules, are oral corpora and monolingual lexica. They are followed by annotated corpora, which are also present in all modules but are given less importance. For these kinds of modules, multimodal corpora, multimedia corpora, multilingual lexica and multilingual corpora are also very important.
6 S. Krauwer (2003). The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap.
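To make the idea of a data-by-module matrix more tangible, the sketch below shows one possible way of representing such a matrix and of ranking data types by their overall importance. It is only an illustration: the module names, the "+" marks and their numeric weights are placeholders chosen by us and are not copied from the matrix in Krauwer (2003).

```python
# Illustrative only: module names and importance marks are placeholders,
# not values taken from the BLARK matrix in Krauwer (2003).
IMPORTANCE = {"": 0, "+": 1, "++": 2, "+++": 3}

data_vs_modules = {
    "annotated corpora": {"tokenizer": "++", "POS tagger": "+++", "parser": "+++"},
    "monolingual lexica": {"tokenizer": "+", "POS tagger": "+++", "parser": "++"},
    "unannotated corpora": {"tokenizer": "++", "POS tagger": "+", "parser": ""},
}


def rank_resources(matrix):
    """Rank data types by their summed importance over all modules."""
    totals = {
        resource: sum(IMPORTANCE[mark] for mark in row.values())
        for resource, row in matrix.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = rank_resources(data_vs_modules)
for resource, score in ranking:
    print(f"{resource}: {score}")
```

Summing the marks is just one simple way of aggregating the matrix; other weightings would be equally legitimate.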
2.4 Main conclusions derived from the study. Most demanded and most urgent resources
So far we have described the study carried out and sketched some of the main conclusions that may be derived from it. In this section, we concentrate on the most demanded and most urgent resources according to our survey. As can be concluded from what we have seen so far, the most required resources in NLP applications and technologies are lexical resources and corpora. These two kinds of resources are used by or for almost every NLP application. As they are also the most frequent ones, not only in our survey but also in the CLARIN and ELRA repositories, one might think that the need is duly covered by existing resources and that no further effort for their production is needed. However, things are more complicated when we go into the details of these two types of resources:
- Not all EU official languages are equally covered, let alone minority or regional languages.
- The existing resources have rather low coverage levels (reduced amount of information).
- Not all domains are covered, i.e. resources are not appropriately domain-tuned.
- Language changes continuously: corpora and lexica need to be updated or rebuilt often.
The table below gives the figures obtained in our study for all 23 official EU languages: the number of resources per language, broken down by type.
EU official Language   Written Monolingual Corpus   Written Multilingual Corpus   Monolingual Lexical Resource   Multilingual Lexical Resource
Bulgarian              6                            20                            4                              3
Czech                  1                            15                            2                              2
Danish                 3                            6                             2                              3
Dutch                  16                           15                            10                             5
English                31                           69                            2                              19
Estonian               9                            12                            3                              1
Finnish                6                            11                            5                              2
French                 7                            29                            7                              10
German                 7                            22                            8                              4
Greek                  5                            9                             3                              4
Hungarian              6                            9                             5                              2
Irish                  0                            1                             0                              1
Italian                7                            19                            4                              4
Latvian                8                            4                             3                              2
Lithuanian             4                            10                            1                              2
Maltese                2                            2                             1                              1
Polish                 4                            5                             2                              2
Portuguese             6                            8                             1                              4
Romanian               3                            14                            7                              6
Slovak                 0                            5                             0                              2
Slovene                0                            7                             0                              1
Spanish                7                            16                            3                              10
Swedish                4                            14                            2                              3
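The uneven coverage across languages can be checked directly against the table. The following minimal sketch assumes that the counts are transcribed correctly from the table above; the names COUNTS, COLUMNS and uncovered are ours and purely illustrative. For each resource type, it lists the languages for which our sample contains no resource at all.

```python
# Counts transcribed from the table above (assumed to be read correctly).
# Tuple order follows COLUMNS.
COLUMNS = (
    "Written Monolingual Corpus",
    "Written Multilingual Corpus",
    "Monolingual Lexical Resource",
    "Multilingual Lexical Resource",
)

COUNTS = {
    "Bulgarian": (6, 20, 4, 3), "Czech": (1, 15, 2, 2), "Danish": (3, 6, 2, 3),
    "Dutch": (16, 15, 10, 5), "English": (31, 69, 2, 19), "Estonian": (9, 12, 3, 1),
    "Finnish": (6, 11, 5, 2), "French": (7, 29, 7, 10), "German": (7, 22, 8, 4),
    "Greek": (5, 9, 3, 4), "Hungarian": (6, 9, 5, 2), "Irish": (0, 1, 0, 1),
    "Italian": (7, 19, 4, 4), "Latvian": (8, 4, 3, 2), "Lithuanian": (4, 10, 1, 2),
    "Maltese": (2, 2, 1, 1), "Polish": (4, 5, 2, 2), "Portuguese": (6, 8, 1, 4),
    "Romanian": (3, 14, 7, 6), "Slovak": (0, 5, 0, 2), "Slovene": (0, 7, 0, 1),
    "Spanish": (7, 16, 3, 10), "Swedish": (4, 14, 2, 3),
}


def uncovered(counts, columns):
    """Return, per resource type, the languages with no resource at all."""
    return {
        column: [lang for lang, row in counts.items() if row[i] == 0]
        for i, column in enumerate(columns)
    }


gaps = uncovered(COUNTS, COLUMNS)
for column, languages in gaps.items():
    print(f"{column}: {', '.join(languages) or 'none'}")
```

On these figures, Irish, Slovak and Slovene have neither a written monolingual corpus nor a monolingual lexical resource in our sample, whereas every language has at least one multilingual corpus and one multilingual lexical resource.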
The histograms in figures [11], [12], [13] and [14] offer a clearer picture of the distribution across the 23 official EU languages.
[Histogram: Written Monolingual Corpus, number of resources per EU official language (Bulgarian to Swedish)]
Figure 11 Monolingual Written Corpora
[Histogram: Written Multilingual Corpus, number of resources per EU official language (Bulgarian to Swedish)]
Figure 12 Multilingual Written Corpora
[Histogram: Monolingual Lexical Resource, number of resources per EU official language (Bulgarian to Swedish)]
Figure 13 Monolingual Lexical Resources
[Histogram: Multilingual Lexical Resource, number of resources per EU official language (Bulgarian to Swedish)]
Figure 14 Multilingual Lexical Resources
As we can see, English is in general the most represented language, followed by Dutch. However, we also observe that the distribution varies slightly depending on the type of resource considered. A full interpretation of these data is beyond the initial scope of this survey and may be a good topic for discussion within the FLaReNet community. As our survey cannot be considered to cover all possible language resources in Europe, given that we have used only two, albeit large, catalogues/repositories, we have also surveyed the same kind of data in the SHACHI catalogue7 to confirm the tendencies of the results. SHACHI is a Japanese initiative which claims to cover the main language resource consortia worldwide by cataloguing all their resources (see figure [15] for details). The aim of this initiative was to access and catalogue language resources (including those of ELRA, for instance) by means of metadata, and it therefore gives us another useful overview of the distribution of LRs.
7 http://facet.shachi.org/.
Figure 15 List of major language consortia which SHACHI covers8
Figures [16], [17], [18] and [19] show the results obtained by querying the SHACHI database. As might be expected, English is the language with the most resources, while many of the EU official languages lack some of the resources that are most required for the creation and exploitation of various NLP applications. Moreover, the analysis of the SHACHI catalogue shows once again that the information we have on language resources is very scattered and sometimes incoherent. A special effort should therefore be made to make access to both language resources and their documentation easier and more homogeneous; this recommendation is beyond the scope of this study, but worth making.
[Histogram: Written Monolingual Corpus, number of resources in SHACHI per EU official language (Bulgarian to Swedish)]
Figure 16: Written monolingual corpora in SHACHI
8 Tohyama et al. (2008).
[Histogram: Written Multilingual Corpus, number of resources in SHACHI per EU official language (Bulgarian to Swedish)]
Figure 17: Written multilingual corpora in SHACHI
[Histogram: Monolingual Lexical Resource, number of resources in SHACHI per EU official language (Bulgarian to Swedish)]
Figure 18: Monolingual lexical resources in SHACHI
[Histogram: Multilingual Lexical Resource, number of resources in SHACHI per EU official language (Bulgarian to Swedish)]
Figure 19: Multilingual lexical resources in SHACHI
The last point to be made in this section is that the mere existence of language resources for a language does not necessarily mean that these resources are useful or fully exploitable. Depending on the purpose they are to be used for, corpora and lexica may be rather small, thus creating coverage problems, or may be of insufficient quality, thus impairing system performance. Related to the issue of size is the question of granularity: language resources tend to be either very specific or too general, so domain tuning becomes really hard, if not impossible, when trying to unify several resources into one or to build a derivative resource for a given task by drawing pieces of information from other resources. This is closely related to the further problem of annotation and encoding formats and their granularity. All these issues become even more crucial when tackling the automatic production, maintenance and integration of resources, as this requires certain decisions to be taken from the start. Therefore, issues related to size, granularity and coverage in general deserve thorough discussion within the community, should be borne in mind in the future and, if possible, solutions for them should be found.
3 Current techniques for the automatic production of the most demanded and most urgent resources.
In this section we concentrate on the current techniques for the automatic production of language resources, as well as on methods for the acquisition, extraction and annotation of specific types of linguistic information. In 3.1, we review the state of the art of available technologies; in 3.2, we survey the most recent trends in this field by reviewing the latest research papers presented at the major international conferences on computational linguistics and language resources.
3.1 Currently available technologies
This section focuses on three main areas: corpus acquisition and annotation, technologies for parallel corpus construction, and lexical acquisition. In lexical acquisition, we include the production of bilingual dictionaries and terminologies, as well as the acquisition of specific lexical information (e.g. subcategorization frames). We consider these three main areas precisely because it is there that we can find already tested techniques, which in turn can be taken as a confirmation that these are the most demanded resources: it is not by chance that researchers have tried to reduce the time and costs that their creation requires.
3.1.1 Written corpus acquisition and annotation
When we talk about corpus acquisition, we mainly refer to two different kinds of corpora: monolingual and bilingual/multilingual corpora, which in turn may be aligned or just parallel. In this subsection, we concentrate on the automatic acquisition of monolingual corpora and their subsequent automatic annotation (when needed). It has become widely acknowledged that the World Wide Web offers researchers easy access to data of very diverse types and that these data may be collected in the form of corpora that can be further exploited. This change in the access to data has made it possible for researchers to manually compile ad hoc corpora according to their specific needs. Several authors have already studied the web as a source for corpus creation (Jones and Ghani, 2000; Fujii and Ishikawa, 2000; Kilgarriff and Grefenstette, 2003; Baroni and Bernardini, 2004, to mention just a few). Web crawling at different levels and with different types of data has become very common in computational linguistics for creating new corpora. The latest trends try to explore Web 3.0 and exploit the so-called “semantic web”: they use the metadata and semantic tagging that describe websites and pages to determine what documents are about, and thus to compile better tuned corpora that fit specific needs. Some applications are already available that offer researchers the possibility of automatically compiling corpora from scratch, e.g. CorpusBuilder and BootCat. CorpusBuilder (2004) is an application aimed at constructing corpora from the web for minority languages; it is written in Perl and makes use of van Noord's TextCat system. Another widely known application is BootCat (Bootstrapping Corpora and Terms), a toolkit designed for bootstrapping specialized corpora and terms from the web (Baroni and Bernardini 2004). Finally, the so-called “new text” has also been a subject of study.
During LREC 2006, a workshop9 dealt with this topic, pointing out that wikis, blogs and other dynamic text sources are nowadays a source of text for creating corpora. However, little further work, and especially little evaluation of the exploitation of such sources, can be found. Before introducing the next section, we should also point out that automated text tagging (annotation of POS, syntactic structures, semantic and pragmatic information, etc.) is also an important subject of study and research, and numerous applications are already available, at least experimentally. The problem that may arise when exporting these applications to a real-world scenario is one of input/output requirements and formats.
9 http://www.sics.se/jussi/newtext/.
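The bootstrapping idea behind web-based corpus builders of the kind mentioned in this subsection can be illustrated with its first step: combining a handful of seed terms into search queries. The sketch below is ours and does not reproduce the code of any of the tools cited; the function name, its parameters and the seed terms are hypothetical, and the later steps (submitting the queries to a search engine, downloading and cleaning the pages, extracting new terms and iterating) are deliberately left out.

```python
import itertools
import random


def seed_queries(seeds, tuple_size=3, n_queries=5, rng_seed=0):
    """Draw random combinations of seed terms to be used as search queries.

    Hypothetical helper: it only builds the query strings and performs
    no network access.
    """
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so that runs are repeatable
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


# Illustrative seed terms for a small domain-specific corpus.
seeds = ["corpus", "lexicon", "treebank", "annotation", "alignment"]
queries = seed_queries(seeds)
for query in queries:
    print(query)
```

Each query combines three distinct seed terms. In a full pipeline, the pages retrieved for these queries would form a first corpus from which new candidate terms are extracted and fed back in as seeds.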
3.1.2 Parallel corpus technologies
This section elaborates on parallel corpus technologies, i.e. technologies used for automatically building bilingual and/or multilingual corpora (usually aligned at some level). Parallel corpora are one of the most used resource types in computational linguistics, especially in commercial applications. They are in fact a key element in several cross-language applications such as machine translation, bilingual lexicon extraction and transfer grammar rule induction. Parallel corpora are usually aligned, at least at sentence level, and techniques for sub-sentential alignment have been the subject of extensive research. Sub-sentential alignment refers to the process of aligning translational correspondences at word or chunk level, given a sentence-aligned bilingual corpus: the better the word and chunk alignments, the better the phrase pairs for phrase-based statistical machine translation or the transfer rules. Several techniques for sub-sentential alignment can be distinguished, but most of them fall into three major types: generative models, discriminative word alignment and heuristics-based methods.
Methods based on generative models
The first type of technique is that of generative models. These treat the alignment process as the generation of a word in one language from a word in the other. Among the authors researching this field we can highlight the work of Brown et al. (1993), who describe the five IBM models. Three years later, Vogel et al. (1996) took the approach used in speech recognition to solve the time alignment problem and applied a first-order Hidden Markov Model (HMM) to word alignment. Their aim was to make the alignment probabilities depend on the differences between alignment positions, i.e. on relative position, rather than on absolute position.
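As an illustration of the generative view, the following minimal sketch estimates word translation probabilities with expectation-maximisation in the style of the simplest of the IBM models (Model 1). It is a simplified teaching version written by us, not the implementation of any toolkit: the empty (NULL) word is omitted for brevity, and the function name and toy corpus are illustrative.

```python
from collections import defaultdict


def train_ibm_model1(sentence_pairs, iterations=10):
    """Estimate t(f | e) with EM, in the style of IBM Model 1 (no NULL word)."""
    source_vocab = {f for fs, _ in sentence_pairs for f in fs}
    # Uniform initialisation over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(source_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts for (f, e) pairs
        total = defaultdict(float)  # expected counts for e
        # E-step: distribute each source word over the target words.
        for fs, es in sentence_pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate the translation probabilities.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


# Toy sentence-aligned corpus (source words, target words).
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t = train_ibm_model1(corpus)
print(round(t[("das", "the")], 3), round(t[("Haus", "the")], 3))
```

On this toy corpus the probability mass of "the" shifts towards "das" with every iteration, because "das" is the only source word that co-occurs with "the" in both sentences containing it.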
The most widely used implementation at present is GIZA++, introduced by Och and Ney (2003). It implements the generative models initially developed at IBM (models 1-5) and some extensions of these models. Finally, among the generative models we also point out MTTK (Deng and Byrne 2006), which uses HMM word-to-phrase alignments.
Discriminative word alignment
The second type of technique is discriminative word alignment. This approach allows various features to be encoded in the input data (such as POS tags or syntactic dependency relations). The models under this approach were developed to overcome the shortcomings of generative models and have the advantage that they require a relatively small amount of annotated word alignment data for training.
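One common way of combining such features is a log-linear model, in which each candidate link receives a weighted sum of feature values that is then normalised into a probability. The sketch below is only an illustration under our own assumptions: the three features, the toy lexicon and the weights are hypothetical and set by hand, whereas in a real system the weights would be trained on annotated alignment data.

```python
import math


def link_features(i, j, src, tgt, src_pos, tgt_pos, lexicon):
    """Hypothetical feature functions for linking source word i to target word j."""
    return {
        # Log of a lexical association score; unseen pairs get a small floor.
        "lexical": math.log(lexicon.get((src[i], tgt[j]), 1e-6)),
        # 1 if both words carry the same POS tag, 0 otherwise.
        "pos_match": 1.0 if src_pos[i] == tgt_pos[j] else 0.0,
        # Penalty for links that are far apart in relative position.
        "distortion": -abs(i / len(src) - j / len(tgt)),
    }


def link_probabilities(i, src, tgt, src_pos, tgt_pos, lexicon, weights):
    """Log-linear distribution over target positions for source word i."""
    scores = []
    for j in range(len(tgt)):
        feats = link_features(i, j, src, tgt, src_pos, tgt_pos, lexicon)
        scores.append(sum(weights[name] * value for name, value in feats.items()))
    z = sum(math.exp(s) for s in scores)  # normalisation constant
    return [math.exp(s) / z for s in scores]


# Toy data; all values are illustrative.
src, tgt = ["das", "Haus"], ["the", "house"]
src_pos, tgt_pos = ["DET", "NOUN"], ["DET", "NOUN"]
lexicon = {("das", "the"): 0.9, ("Haus", "house"): 0.8}
weights = {"lexical": 1.0, "pos_match": 1.0, "distortion": 1.0}

probs_das = link_probabilities(0, src, tgt, src_pos, tgt_pos, lexicon, weights)
probs_haus = link_probabilities(1, src, tgt, src_pos, tgt_pos, lexicon, weights)
print(probs_das, probs_haus)
```

Adding a new knowledge source then amounts to adding one more entry to the feature dictionary and one more weight, which is what makes this family of models easy to extend.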
Among the authors who followed this approach, we can name Liu et al. (2005), who argue that finding word alignments is especially hard when two languages differ widely in word order, and that it is therefore necessary to incorporate all useful linguistic information to alleviate the problem. As a solution they propose log-linear models, which allow statistical models to be easily extended with additional syntactic dependencies. Moore (2005) suggests a discriminative approach to training simple word alignment models, which are easy to add features to and allow fast optimization of model parameters using small amounts of annotated data. According to his results, this model is comparable in accuracy to more complex generative models. Finally, Ma et al. (2008) introduce a word alignment framework that facilitates the incorporation of syntax encoded in bilingual dependency tree pairs. Their model consists of two sub-models: an anchor word alignment model, which aims to find a set of high-precision anchor links, and a syntax-enhanced word alignment model, which focuses on aligning the remaining words, relying on the dependency information invoked by those anchor links. Their results highlight again that incorporating syntax into word alignment is beneficial and improves recall.
Heuristics-based methods
In the third type of approach, word alignment is arrived at by using similarity functions. Smadja et al. (1996), for example, propose a system (Champollion) which, given a pair of parallel corpora in two different languages and a list of collocations in one of them, automatically produces their translations using statistical analysis of the corpora. Ker and Chang (1997) propose an algorithm capable of identifying the translation of each word in a bilingual corpus by combining word-based statistics with the exploitation of lexicographic resources.
Based on the results of their research, they argue that a more suc‐cessful alignment can be achieved using a class‐based approach. For this reason, they propose a system in which thesauri and corpora are used in combination to overcome generality and efficiency problems. Finally, Melamed (2000) explores new ways of translation model biases alternative parameter estimation strategies and techniques for exploiting pre‐existing knowledge that may be available about particular languages and language pairs. Alignment at chunk level As far as chunk alignment is concerned, the word packing approach bootstraps word alignments via optimising word segmentations. Ma et al. (2007) investigate this issue and suggest a method to pack words together. Their aim was to give a different and sim‐plified input to automatic word aligners. To this aim, they use a bootstrap approach in which they first extract 1‐to‐n word alignments with an existing aligner and then esti‐mate the confidence of those alignments in order to determine whether the n words have to be grouped. If so, the group is considered a new basic unit. Finally, they re‐apply the word aligner to the updated sentences and evaluate the results, which turn out to be very promising. Tree structure alignment
Finally, there is another type of approach in which syntactically annotated data, either on the source side only or on both the source and target sides, are used to align tree structures. An example is Inversion Transduction Grammar (Wu, 1997), which performs synchronous parsing on bilingual sentence pairs in order to establish translational correspondences. Yamada and Knight (2001) instead propose a tree-to-string alignment that aligns a source tree to a target string, whereas Tinsley et al. (2007) propose a tree-to-tree alignment that aligns a source tree to a target tree directly.
3.1.3 Lexical acquisition
This section is dedicated to methods for the extraction and induction of bilingual dictionaries and terminology on the one hand, and for the acquisition of lexical information on the other, both of which are important for many NLP applications.
3.1.3.1 Extraction and induction of bilingual dictionaries / terminology
As emerges from the first section of this deliverable, bilingual lexica are key resources for various NLP applications. They play a central role in machine translation systems, cross-language information retrieval, multilingual information extraction and other multilingual applications, as they contain the equivalents of words in two languages. However, bilingual lexica such as machine-readable dictionaries are not equally available for all language pairs, and the problem worsens when minority languages are involved.
As manual construction of bilingual lexica is very costly, research on bilingual dictionary construction over the last two decades has focused on the automatic extraction of bilingual lexica using statistical analysis of parallel corpora. Corpora, in fact, offer the necessary linguistic evidence for creating translational equivalents of words, no matter whether they are bilingual, monolingual, parallel or comparable, aligned or annotated. Many authors have carried out relevant research in this field, e.g. Gale and Church 1991, Dagan et al. 1991, Gale et al. 1992, Dagan 1993, Kupiec 1993, Fung 1995, Smadja et al. 1996, Melamed 1997, Kilgarriff 1997, Brown 1997, Tiedemann 1998, Piperidis et al. 2000, and Tufis 2002.
Brown et al. 1993, Gale and Church 1991, Hiemstra 1997 and Smadja et al. 1996 are among those reporting on statistical approaches to building bilingual lexica. The common underlying idea is to use a sentence-aligned parallel corpus to compute an association score between word pairs, which enables the extraction of word correlations. The proposal of efficient and powerful text-alignment algorithms (Brown et al. 1993, Gale and Church 1991), as reported in the previous subsection, has enabled the evaluation of word correspondences, leading to word-level alignment of bitexts and consequently to the extraction of bilingual lexica.
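As a minimal illustration of the association-score idea described above, the following Python sketch counts sentence-level co-occurrences in a toy sentence-aligned corpus and scores each word pair with the Dice coefficient. The choice of Dice is an assumption made for illustration only (the works cited use a range of statistical measures), and the names `dice_scores`, `best_translations` and `toy_bitext`, as well as the data, are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_scores(bitext):
    """Score every co-occurring (source, target) word pair with the Dice coefficient."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in bitext:
        # Count each word once per sentence pair (sentence-level co-occurrence).
        src_types, tgt_types = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_types)
        tgt_freq.update(tgt_types)
        pair_freq.update(product(src_types, tgt_types))
    return {
        (s, t): 2.0 * n / (src_freq[s] + tgt_freq[t])
        for (s, t), n in pair_freq.items()
    }


def best_translations(bitext):
    """For each source word, keep the target word with the highest score."""
    best = {}
    for (s, t), score in dice_scores(bitext).items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return best


# Toy sentence-aligned corpus (hypothetical data, not taken from any cited work).
toy_bitext = [
    ("the house".split(), "la maison".split()),
    ("the car".split(), "la voiture".split()),
    ("a house".split(), "une maison".split()),
]

if __name__ == "__main__":
    for word, (translation, score) in sorted(best_translations(toy_bitext).items()):
        print(word, translation, round(score, 2))
```

In this toy corpus, "house" and "maison" always co-occur and therefore receive the maximum Dice score. On realistic corpora, frequent words tend to produce spurious associations, which is one reason why such scores are usually combined with the alignment models discussed in the previous subsection.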
Finally, recent approaches have also highlighted that the quality of the corpus is more important than its quantity, as it determines the quality of, for instance, the acquired terminological resources (Daille 2007) and of the translation models. The quality of Language Resources that in turn serve as input to techniques that derive new resources should also be addressed in the future, and these uses should also be taken into account when discussing evaluation.
3.1.3.2 Acquisition of lexical information
Another important research topic within lexical acquisition is the acquisition of lexical information, especially in the context of monolingual lexica containing high-level information. Research in this field started in the late 1980s, when several important projects aimed at acquiring lexical information from machine-readable dictionaries (MRDs). As these projects turned out to be quite successful, active work in the field continued. In the 1990s the emphasis shifted towards corpus-based approaches, as large corpora had become available along with the statistical NLP techniques required for their robust and accurate processing. Major advances have since been made in many areas of lexical acquisition, e.g. terms, collocations, subcategorization frames, word senses, (lexical-)semantic classes, semantic relations, selectional preferences and multiword expressions (see McCarthy 2006 for a detailed review).
Furthermore, several EU projects aimed at the automatic creation of large lexica, such as ACQUILEX (footnote 10) (Acquisition of Lexical Knowledge for Natural Language Processing Systems, 1989-1992), which focused on the derivation of lexica from MRDs, and ACQUILEX II (1993-1995), which made considerable use of corpora as a further source of data for the semi-automatic construction of lexical resources. The SPARKLE project (footnote 11) (1995-1996) had among its aims the development of a lexical acquisition system capable of learning aspects of word knowledge from free text, as such tools would make it possible to build sufficiently rich NLP lexicons in a cost-effective manner. The MEANING project (footnote 12) (2002-2005) and VERBMOBIL (footnote 13) (1996-2000) included, besides research on the acquisition of lexical information, research on the acquisition of bilingual lexica. These last three projects showed the feasibility of automatically acquiring LRs, but only for a particular scenario.
3.1.3.2.1 Acquisition of grammatical information
From the linguistic point of view, according to the distributional hypothesis, words are used in context by virtue of their syntactic properties. Syntactic properties would explain why, in a particular language, not every random sequence of words can be considered a well-formed or valid sentence. This principle guides linguists in proposing syntactic features that characterize the behaviour of particular words and group them into word types or categories. The ability to specify the syntactic features of every word would thus allow accurate parsing of the sentences in which these words occur. Such parsing information is also crucial for defining relations among entities, as required by applications such as Machine Translation, Natural Language Interfaces, Question Answering, Topic Detection and Tracking, Information Extraction, Grammars for Language Understanding, etc.
The most successful lexical acquisition systems are based on this linguistic idea that the contexts in which words occur are associated with particular lexical types. The use of analysis tools to mine corpus data is one of the most common methods in lexical acquisition. The different systems can learn very specialized linguistic information such as subcategorization frames (footnote 14) (for example, Brent (1993), Ushioda et al. (1993), Briscoe and Carroll (1997) and Korhonen (2002) for English; Schulte im Walde (2002) for German; and Manning (1993) for the prepositional restrictions of complements). Another type of linguistic feature that may be acquired from corpora is noun countability, addressed for example by Baldwin and Bond (2003). More recently, Carroll and Fang (2004) and Baldwin (2005a) focus on lexical information acquisition for HPSG-based computational grammars.
10 http://www.cl.cam.ac.uk/research/nl/acquilex/
11 http://www.ilc.cnr.it/sparkle/sparkle.html
12 http://www.lsi.upc.es/~nlp/meaning/meaning.html
13 http://verbmobil.dfki.de/overview-us.html
Baldwin (2005b) has classified lexical acquisition methods into two categories: in vitro and in vivo methods. In in vitro methods, a secondary language resource is used to obtain abstract information about the words to be acquired. This knowledge is then used as back-off information. In in vivo methods, a component of the target resource is used to model similarity, against which the created lexicon can be compared. Korhonen (2002) is an example of an in vitro method.
The main problems faced by lexical acquisition concern:
1) The removal of noise from automatically acquired data, specifically because it is difficult to discriminate relevant information by statistical means when dealing with low-frequency phenomena, items or structures, which according to Zipf's law are the most frequent cases.
2) The generalization capability required to predict lexical features that have not been seen in the examined corpus.
Currently, subcategorisation frame (SCF) acquisition systems (Preiss et al. 2007) are capable of learning large-scale, fine-grained SCF (frequency) information at 70-85 F-measure when this information is extracted from raw corpus data using statistical parsers and rule-based classifiers. Systems for Selectional Preference (SP) acquisition, instead, obtain 60-65 F-measure when compared in the context of a pseudo-disambiguation task (Bergsma et al. 2008). This performance level, however, can be obtained using a variety of techniques. Furthermore, current systems for lexical classification (Sun et al. 2007 and Joanis et al. 2007) yield 60-70 F-measure when applied to cross-domain datasets. These systems employ machine learning (ML) techniques to classify syntactic features extracted from corpora using, e.g., part-of-speech tagging or robust statistical parsing techniques. The best systems for acquiring Multiword Expressions (MWEs) vary widely according to the MWE type targeted, but in general achieve accuracy rates similar to those of the above-mentioned methods (Villavicencio et al. 2005).
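The F-measure figures quoted above combine precision and recall over the acquired lexical entries. The following is a minimal sketch of that computation, assuming that both the gold standard and the system output are represented as sets of (verb, frame) pairs. The function name `f_measure` and the frame labels are hypothetical, and the evaluation setups of the cited systems may differ in detail.

```python
def f_measure(gold, predicted, beta=1.0):
    """Return the F-measure (between 0 and 1) of predicted entries against gold entries."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # share of predictions that are correct
    recall = true_positives / len(gold)          # share of gold entries that were found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical subcategorisation frame entries, for illustration only.
gold_entries = {("give", "NP_NP"), ("give", "NP_PP"), ("sleep", "INTRANS")}
system_entries = {("give", "NP_NP"), ("sleep", "INTRANS"), ("sleep", "NP")}

if __name__ == "__main__":
    # Multiply by 100 to obtain the percentage scale used in the text.
    print(round(100 * f_measure(gold_entries, system_entries), 1))
```

With `beta=1.0` precision and recall are weighted equally; other values of `beta` shift the weight towards recall (greater than 1) or precision (less than 1).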
14 Given the argument-adjunct distinction, subcategorization concerns the specification, for a given predicate, of the number and type of arguments required for well-formedness.
3.1.3.2.2 The acquisition of semantic information
The first attempts addressed the question of acquiring the meaning of words, but also included semantic features such as “animate”, “male”, etc., which constituted a description that could be handled by means of similarity measures in a feature space. The collection of features to describe meaning also gave rise to proposals like the one by Charniak (1999), who suggests modelling concepts, i.e. meanings, as vectors and handling them with clustering techniques based on distance functions. These approaches were very much indebted to decompositional semantic approaches, which required the definition of features that could be considered primitives in the model, and the characterization of each lexical entry by means of such features.
To reduce the workload while obtaining the same results, simpler approaches were developed, which use word co-occurrences instead of features to calculate similarity and/or distance. These techniques were very successful in applications such as Information Extraction. Moreover, there were also some psychological results supporting such a view in the theory of “lexical associations” (Lund, Burgess et al. 1995). This was the origin of the techniques known as “bag of words”, which use neighbouring words to describe the meaning of a word, and which were the first step towards acquisition from large quantities of text, i.e. corpora. The different studies differ in the methodology with which the surroundings of a word are analyzed: taking n words to the left and n to the right, removing stopwords, selecting only content words with a certain presence in the domain (i.e. dimensionality reduction), using probabilities, etc.
According to the distributional hypothesis (Harris, 1954), two words are similar if they appear in similar contexts, and this idea guided several attempts to cluster words according to syntactic information, such as the work by Pereira and Tishby (1992). Latent Semantic Analysis (LSA, Landauer et al. 1997) was also first used for information extraction, to compare documents in what was called “conceptual space”; these techniques were also used to find synonyms and hyponyms. However, all these methods failed in the task of lexical acquisition, partly because of the dimensionality problem and partly because a model of the structure of lexical meaning still seems to be missing.
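The co-occurrence approach described above can be sketched in a few lines of Python: each word is described by the counts of its neighbours within a window of n words to the left and right, after stopword removal, and two words are compared by the cosine of their count vectors. This is only an illustrative sketch under simplifying assumptions (raw counts, no weighting, no further dimensionality reduction); the function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2, stopwords=frozenset()):
    """Map each word to a Counter of the words found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        # Stopwords are removed first, so the window spans content words only.
        tokens = [t for t in tokens if t not in stopwords]
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (hypothetical data).
toy_sentences = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the car needs fuel".split(),
]

if __name__ == "__main__":
    vecs = context_vectors(toy_sentences, window=1, stopwords={"the"})
    print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["car"]))
```

In the toy corpus, "cat" and "dog" share the neighbour "drinks" and are therefore judged similar, whereas "cat" and "car" share no context. With a realistic vocabulary the vectors become very large and sparse, which is the dimensionality problem mentioned above.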
Much of the research on lexical acquisition has focussed on small-scale experiments, and its usability in applications therefore largely remains to be demonstrated. The challenges that the acquisition of lexical information faces nowadays are:
- Accurate, large-scale and portable acquisition techniques.
- Multilingual lexical acquisition.
- Large-scale application to build and tune existing lexical resources and aid important real-world application tasks.
Further research on approaches to the domain-tuning of acquisition techniques (Korhonen et al. 2008) is needed. Furthermore, we again face the problem of language coverage: whereas English is widely covered, the multilinguality of the approaches remains an important area to address (Bel et al. 2007, Quochi et al. 2008, Schulte im Walde 2002). However, we should point out that research on the acquisition of lexical information has proliferated in recent years, and a lot of work is therefore currently being carried out in this field.
3.2 A survey of the last academic proposals (2006 to 2009)
In this last section, we report on a first step towards an inventory of the methods and resources addressed by the academic community in general. While there are several catalogues of resources and tools/applications, there is nothing similar that explicitly aims at monitoring the existence and status of methods or tools for the automatic acquisition of linguistic resources. The aim of this last survey is to classify and identify the areas where our community concentrates most of its efforts. In a future update of this document, we will include a classification by the approach/methods applied in each case.
We surveyed the work carried out in recent years (from 2006 to, when available, 2009) as published in some of the major conferences in our field, in order to derive recent research tendencies. We took as relevant the latest main conferences related to Computational Linguistics: COLING 2008; COLING/ACL 2006; ACL 2007; ACL 2008; ACL 2009; LREC 2006; LREC 2008; EACL 2006 and EACL 2009.
We proceeded as follows: first we surveyed all programmes and extracted all relevant papers/presentations related to the automatic acquisition of linguistic resources, and then we classified the articles by the type of resource involved. In this way, we obtained an overview of the development of this research field along the timeline marked by the different conferences. Figure [20] below shows this evolution. Then, we focused on the papers dating from 2008 and 2009 (whenever the proceedings were available).
The rest of this section is divided by type of resource: corpora, treebanks and lexica. Other language resources, such as speech and sign languages, will be dealt with in the next version of this document.
[Figure: bar chart “Overview papers on automatic acquisition 2006-2009”. X axis: year (2006-2007, 2008-2009). Y axis: no. of papers (0 to 50). Series: Treebank Acquisition; Written Corpus Acquisition; Bilingual Lexica/dictionaries/terminology acquisition; Lexical Information Acquisition; Grammar Acquisition.]
Figure 20 Research papers evolution (footnote 15)
3.2.1 Written corpus acquisition
As in the previous section, we may distinguish between technologies for the extraction of monolingual corpora (Mohler and Mihalcea 2008, Huang et al. 2008, Nazar et al. 2008, Evert 2008 and Itamar and Itai 2008) and parallel corpus technologies (Maeda et al. 2008, Halácsy et al. 2008, Lardilleux and Lepage 2008, Kuzman et al. 2008, Lin et al. 2008, Li et al. 2008, Cromieres and Kurohashi 2009 and Vu et al. 2009).
15 As odd years have fewer conferences than even years, we have grouped them on a two-year basis to better reflect the evolution.
Papers on monolingual corpus acquisition include information on:
• web-crawling techniques for low-density languages (Mohler and Mihalcea 2008),
• heuristic approaches to improve the annotation quality of very large corpora (Huang et al. 2008),
• simple web crawling and statistical analysis techniques (Nazar et al. 2008),
• web crawling including cleaning systems by means of n-grams (Evert 2008),
• speech-text alignment using Gale and Church's alignment algorithm (1993) to build corpora out of movie subtitles (Itamar and Itai 2008).
On the other hand, papers working on parallel corpora and alignment techniques report on:
• the creation of sentence-aligned parallel text corpora (Maeda et al. 2008);
• the creation of gigaword corpora for medium-density languages (Halácsy et al. 2008);
• multilingual alignments by monolingual string differences using the Longest Common Subsequence (LCS) (Lardilleux and Lepage 2008) (footnote 16);
• an agreement-constrained expectation-maximization (EM) algorithm for supervised alignment models (Kuzman et al. 2008);
• parallel corpora construction from the web and word alignment algorithms (Lin et al. 2008);
• Incremental Hidden Markov Model (IHMM) alignment (Li et al. 2008);
• the use of the loopy belief propagation algorithm to train and use a simple alignment model (Cromieres and Kurohashi 2009);
• feature-based methods to align documents with similar content across two sets of bilingual comparable corpora from daily news texts (Vu et al. 2009).
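The longest common subsequence (LCS) mentioned in the list above can be made concrete with the textbook dynamic-programming computation sketched below in Python. It is given only to illustrate the notion; it is not the alignment method of Lardilleux and Lepage (2008), and the function name is hypothetical.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the strings a and b."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the characters of one LCS.
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


if __name__ == "__main__":
    print(longest_common_subsequence("AGGTAB", "GXTXAYB"))
```

The characters of the result need not be contiguous in either input, but they occur in the same order in both. The table requires time and memory proportional to the product of the two string lengths.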
3.2.2 Parallel treebank acquisition
We include a section on parallel treebank acquisition even though only one paper reporting on this matter has been found (Zhechev and Way 2008). The paper explains how the need for syntactically annotated data in natural language processing has increased dramatically in recent years because of its use in Statistical Machine Translation, and how this is especially true for parallel treebanks, of which very few exist. As the authors report, the existing ones are mainly hand-crafted and too small for reliable use in data-oriented applications. They introduce a novel platform for the fast and robust automatic generation of parallel treebanks, whose software is capable of handling large data sets. Furthermore, they present not only this platform but also evaluation results, which demonstrate the quality of the derived treebanks.
16 Given two strings A and B, it is always possible to find their longest common subsequence. Such a subsequence is a sequence of not necessarily contiguous characters that appears in the same order in both strings.
3.2.3 Lexical acquisition
This has turned out to be the field in which research has proliferated most in the last four years. Specifically, we will again discuss bilingual lexica/dictionaries and terminology extraction on the one hand, and lexical information acquisition on the other, and provide a general overview of current academic proposals.
3.2.3.1 Dictionaries/Terminology
A total of 33 papers have been found on this matter (25 in 2008 and 8 in 2009 in the conferences reviewed, see section 3.2), which demonstrates the proliferation of research in this field and the interest it has recently attracted among researchers. We will not summarise all 33 articles, but rather give a brief overview of the topics they cover in order to report on the current evolution of studies in this field.
Among the papers surveyed under this subsection, we find approaches in which aligned, linguistically motivated phrases are a useful means of extracting bilingual terminology, and more specifically complex multiword terms (Macken et al. 2008). Others suggest hybrid methods in which linguistic information is combined with heuristics or statistical measures (Boulaknadel et al. 2008, Lefever et al. 2009, Nerima and Wehrli 2008, Baker and Brew 2008, Deoskar and Rooth 2008, Michou and Seretan 2009, Sujai et al. 2009).
Yang et al. have also carried out significant research exploring new approaches to term extraction. In Yang et al. 2008 they propose a new approach to term extraction using minimal resources, as well as another approach based on delimiters, in which terms are identified by finding their predecessors and successors as their boundary words. In Yang et al. 2009 they propose a further extraction approach that uses the relevance between term candidates, calculated by a link-analysis-based method.
Vivaldi et al. 2008 explore the way in which a term extractor may be tuned to a new domain, whereas Zhang et al. 2008 carry out a comparative evaluation of term recognition algorithms. They point out that, of the large number of methodologies available in the literature, only a few are able to handle both single-word and multi-word terms. After comparing them, they propose a combined approach using a voting mechanism. They also highlight that the choice and design of the corpus has a major impact on the evaluation of term recognition algorithms.
Butters and Ciravegna 2008 propose the use of similarity metrics for terminology recognition, and Ha et al. 2008 and Haghighi et al. 2008 explore the use of monolingual corpora to derive bilingual dictionaries.
Other important approaches are those that explore the use of a pivot language to derive bilingual lexica for language pairs with scarce resources (Tsunakawa et al. 2008, Kaji et al. 2008), as well as those that make use of Wikipedia or Wiki entries to construct dictionaries (Mausam et al. 2009, Wentland et al. 2008). Particularly interesting are approaches that focus on exploiting WordNet and FrameNet (Varga and Yokoyama 2009, Vintar and Fiser 2008, Alonso et al. 2008, Bond et al. 2008, Jijkoun and Hofmann 2009).
Finally, we can also mention an attempt to use cross-lingual documents with similar contents to retrieve bilingual verb-noun collocations (Fukumoto et al. 2008), the supervised use of a machine learning algorithm for metric learning in synonym acquisition (Shimizu et al. 2008), and attempts to automatically produce sentiment/affect dictionaries for sentiment analysis (Pitel and Grefenstette 2008 and Bestgen 2008).
3.2.3.2 Lexical Information Acquisition
Research on lexical information acquisition has also proliferated, as confirmed by the 30 papers found on this matter (25 in 2008 and 5 in 2009 in the conferences reviewed, see section 3.2 above). Among the studies carried out in the field of lexical information acquisition, we find several subjects of interest:
- Semantic role assignment (Padó et al. 2008).
- Word Sense Disambiguation (Cuadros and Rigau 2008, Apidianaki 2008). Among the works carried out within this field, we find different approaches such as supervised (Stevenson et al. 2008) and unsupervised models (Brody and Lapata 2008), the use of other linguistic resources such as VerbNet (Abend et al. 2008), and Bayesian Word Sense Induction (Brody and Lapata 2009).
- Case frames (Kawahara and Uchimoto 2008).
- Induction of information about the linguistic characteristics of lexical items (Bel et al. 2008, Zesch et al. 2008, Tufis et al. 2008).
- Lexical associations (Washtell 2009).
- Subcategorisation frames (Korhonen et al. 2008, Ienco et al. 2008, Lenci et al. 2008, Lapshinova-Koltunski and Heid 2008, Mohanty and Bhattacharyya 2008).
- Class-driven attribute extraction (van Durme et al. 2008, Kanzaki et al. 2008, Pasca 2009).
- Surface patterns (Bhagat and Ravichandran 2008), lexical reference rules (Shnarch et al. 2009).
- Relation extraction/induction (Yan et al. 2009, Roth and Schulte im Walde 2008, Potrich and Pianta 2008, Paz et al. 2008, Lemnitzer et al. 2008, Dias-da-Silva et al. 2008).
3.2.4 Grammar acquisition
As appears clearly from the histograms above, the automatic acquisition of grammars seems to be a flourishing new area of research, which did not emerge from the previous surveys based on catalogues. Therefore, we include it as part of the state of the art in the automatic acquisition of linguistic resources. It should be noted that, since research on this matter is usually presented at CoNLL, which we did not include in this first survey, this section has to be considered highly preliminary and will be significantly improved in the next version of the deliverable. Below we briefly describe the approaches to grammar induction.
Snyder et al. 2009 propose an unsupervised multilingual grammar induction model. They investigate unsupervised constituency parsing from bilingual parallel corpora. As they state, their goal is to use bilingual cues to learn improved parsing models for each language and to evaluate these models on held-out monolingual test data. To this end, they formulate a generative Bayesian model that seeks to explain the observed parallel data through a combination of bilingual and monolingual parameters, and they adapt a formalism known as unordered tree alignment to their probabilistic setting. Finally, they perform inference using Markov Chain Monte Carlo and dynamic programming.
Ganchev et al. 2009 focus on dependency grammar induction via bitext projection constraints. They argue that, whereas the broad-coverage annotated treebanks necessary to train parsers do not exist for many resource-poor languages, the availability of parallel texts and of accurate parsers for English has enabled grammar induction through partial transfer across bitext. They consider generative and discriminative models for dependency grammar induction that make use of word-level alignments and a source-language parser (English) to constrain the space of possible target trees. As they also point out, the main difference from previous approaches is that their framework does not require full projected parses, but rather allows partial, approximate transfer through linear expectation constraints on the space of distributions over trees. Finally, it is also worth noting that they evaluate their approach on Bulgarian and Spanish CoNLL shared task data with good results.
The third paper surveyed is based on variational inference for grammar induction with prior knowledge (Cohen and Smith 2009). The authors argue that variational expectation-maximization (EM) has become a popular technique in probabilistic NLP with hidden variables, and they describe a variational EM algorithm that uses a mixture model for the variational model. They then refine the algorithm with an annealing mechanism to avoid local maxima, and finally show the effectiveness of the algorithm on a dependency grammar induction task.
The fourth and last paper on this matter reports on a Gibbs sampler for phrasal synchronous grammar induction (Blunsom et al. 2009). The authors present a phrasal synchronous grammar model of translational equivalence. Instead of using heuristics or constraints from a word alignment model, they directly induce a synchronous grammar from parallel sentence-aligned corpora, using a hierarchical Bayesian prior to bias towards compact grammars with small translation units. Inference is performed using a novel Gibbs sampler over synchronous derivations. Finally, they argue that this sampler side-steps the intractability issues of previous models, which required inference over derivation forests, and that each sampling iteration is instead highly efficient, allowing the model to be applied to larger translation corpora than previous approaches.
4 Concluding remarks and recommendations of the first version.
The first conclusion of our report is that techniques for the automatic acquisition and production of LRs are a very lively area in the research community, as can be observed from the proliferation of research work we have shown. However, the data in Fig. [20] remain to be interpreted, as they show, in addition to a shift of focus (from bilingual dictionaries to monolingual lexica), a slight reduction in the total number of papers, from 96 in the period 2006-07 to 82 in the period 2008-09, for the major conferences considered (see section 3.2).
Recommendation 1. To gather information about the reasons for shifts of interest, or to identify other underlying explanations, such as major conferences focusing on other topics. Some well-founded dissemination of the importance of the field must be undertaken.
The second conclusion is that the evaluation of these techniques is of variable quality. Some areas, for instance grammar induction, can count on comparative competitions (CoNLL), while in others (e.g. acquisition of lexical information) comparison is still very difficult because of differences in data and evaluation methods (for example, clustering cannot be compared with classification or induction). Comparisons among techniques should also be carried out to better assess each of them and their strengths and weaknesses, fostering greater development of research in these fields. Thus, FLaReNet WG6 recommends supporting the development of evaluation methods that cover the different techniques, in order to allow better testing of existing and newly discovered techniques. For the next version of the document, information about evaluation supplied by relevant actors in a workshop, yet to be organized, will be taken into account. The outcome of the workshop shall be:
1) A strategy for assessing the potential and future impact of acquisition techniques.
2) Plans for creating evaluation campaigns for the different techniques, and proposals to already existing evaluation forums (CLEF, CoNLL, etc.).
3) Provision of evaluation materials. EU projects working in this area (which will be invited to present their projects) will be asked to share their evaluation materials so as to start the standardization of evaluation methods in the areas addressed.
Recommendation 2. A common and standard evaluation procedure, taking into account normalization for comparison, is still missing and highly needed. Moreover, beyond evaluation on scientific grounds, we also recommend that techniques be measured by their impact in real scenarios of NLP applications. Special workshops and discussion forums will have to be organized in order to agree on such common and standard methods. As part of the strategy proposed here, FLaReNet WG6 will make a proposal for a special Workshop on “Techniques for the automatic production of language resources and their evaluation methods”. We will also establish contacts with future projects in this area (e.g. those funded by the FP7 4th call on language resources).
Recommendation 2.1. Given that some of the resources produced are used to derive new ones, the quality of the base materials is crucial to achieving good derivatives. Evaluation of the results of automatic techniques must also foresee complex scenarios where the quality of the final results depends on the quality of the partial results.
The fourth conclusion is related to this latter aspect of evaluation materials. Much of the research on acquisition has focussed on small-scale experiments, and its usability in applications therefore largely remains to be demonstrated. While evaluation referring to internal improvements is easy to plan, evaluation of the quality and usability of the resources produced in relation to final applications is, as mentioned before, very important as well. It has been very difficult to find information about the characteristics of the language resources that industrial applications use, as well as about the size and granularity of the information contained.
Recommendation 3. It is urgent to address actual industrial needs in the research agenda. Information about whether the resources acquired are actually used and, conversely, about the particular characteristics of the resources actually in use has to be made public. The involvement of industry itself in the research on automatic methods must be supported.
Another very salient finding of our study, as reported before and although somewhat beyond the scope of this document, is the lack of documentation and clear information about resources and related technologies.
Recommendation 4. It is necessary to harmonize the field with specific metadata and a common vocabulary of categories that describe resources and the means to acquire them, as this would not only facilitate surveys of the kind reported here but also improve and enrich the LR research field. FLaReNet WG6 will also take this issue into account when proposing the development of common methods for evaluating techniques and resources.
Finally, we want to remark once again that our study has not yet addressed speech resources or sign languages. These issues will be addressed in the next version of this report. We also want to point out that the survey on the availability of language resources reported in section 3 can be seen as complementary to deliverable D2.1; it has been reported here because it was performed from the perspective of automatic means of producing, enriching and maintaining language resources.
5 Bibliography
This is intended to be an initial reference bibliography for automatic methods of LR production.
5.1 Books
Jurafsky, D. and Martin, J.H. (2008). Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd ed. Upper Saddle River, New Jersey: Prentice Hall.
Manning, C.D. and Schütze, H. (2003). Foundations of Statistical Natural Language Processing. Sixth printing with corrections. Cambridge, Massachusetts: The MIT Press.
5.2 Articles (besides those describing the LR studied)
Abend, O., Reichart, R. and Rappoport, A. A supervised algorithm for verb disambiguation into VerbNet classes. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Alonso Ramos, M., Rambow, O. and Wanner, L. Using Semantically Annotated Corpora to Build Collocation Resources. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Apidianaki, M. Translation-oriented Word Sense Induction Based on Parallel Corpora. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Baker, K. and Brew, C. Statistical Identification of English Loanwords in Korean Using Automatically Generated Training Data. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Baldwin, T. and Bond, F. Learning the Countability of English Nouns from Corpus Data. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, 2003.
Baldwin, T. Bootstrapping Deep Lexical Resources: Resources for Courses. In: ACL-SIGLEX 2005. Workshop on Deep Lexical Acquisition. Ann Arbor, Michigan, 2005a.
Baldwin, T. General-Purpose Lexical Acquisition: Procedures, Questions and Results. In: Proceedings of the Pacific Association for Computational Linguistics, Tokyo, 2005b.
Banea, C., Mihalcea, R. and Wiebe, J. A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Baroni, M. and Bernardini, S. BootCaT: Bootstrapping Corpora and Terms from the Web. In: Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, Portugal, 2004.
Bel, N., Espeja, S. and Marimon, M. Automatic Acquisition of Grammatical Types for Nouns. In: HLT 2007: The Conference of the North American Chapter of the ACL. Companion Volume, Short Papers. Rochester, New York, 2007.
Bel, N., Espeja, S. and Marimon, M. Automatic acquisition for low frequency lexical items. In: Calzolari, Nicoletta et al. (eds.) Proceedings of the Sixth International Conference on Language Resources and Evaluation. Paris: European Language Resources Association, 2008.
Bestgen, Y. Building Affective Lexicons from Specific Corpora for Automatic Sentiment Analysis. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Binnenpoorte, D., Cucchiarini, C., D’Halleweyn, E., Sturm, J., De Vriend, F. Towards a Roadmap for Human Language Technologies: Dutch-Flemish experience. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Spain, 2002.
Binnenpoorte, D., De Vriend, F., Sturm, J., Daelemans, W., Strik, H., Cucchiarini, C. A Field Survey for Establishing Priorities in the Development of HLT Resources for Dutch. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Spain, 2002.
Blunsom, P. and Baldwin, T. Multilingual Deep Lexical Acquisition for HPSGs via Supertagging. In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, 2006.
Blunsom, P., Cohn, T., Dyer, C. and Osborne, M. A Gibbs Sampler for Phrasal Synchronous Grammar Induction. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Bond, F., Isahara, H., Kanzaki, K. and Uchimoto, K. Boot-Strapping a WordNet Using Multiple Existing WordNets. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Borin, L., Forsberg, M., Lönngren, L. “The Hunting of the BLARK – SALDO, a Freely Available Lexical Database for Swedish Language Technology”. In: Resourceful language technology. Festschrift in honor of Anna Sågvall Hein. (2008). Göteborg: University of Gothenburg. 21-32.
Boulaknadel, S., Daille, B. and Aboutajdine, D. A Multi-Word Term Extraction Program for Arabic Language. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Brent, M. R. Surface cues and robust inference as a basis for the early acquisition of subcategorization frames. Lingua 92:433–470, 1994.
Briscoe, T. and Carroll, J. Automatic extraction of subcategorization from corpora. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, 1997.
Brody, S. and Lapata, M. Bayesian Word Sense Induction. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Brody, S. and Lapata, M. Good neighbors make good senses: exploiting distributional similarity for unsupervised WSD. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.
Brown, R. D. Automated Dictionary Extraction for “Knowledge-Free” Example-Based Translation. In: Proceedings of the 7th International Conference on Theoretical and Methodological Issues in Machine Translation, Santa Fe, New Mexico, 1997.
Butters, J. and Ciravegna, F. Using Similarity Metrics for Terminology Recognition. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Carroll, J. and Fang, A. The automatic acquisition of verb subcategorisations and their impact on the performance of an HPSG parser. In: Proceedings of the 1st International Joint Conference on Natural Language Processing (IJCNLP), Sanya City, 2004.
Carrera, J., Castellón, I., Climent, S. and Coll-Florit, M. Towards Spanish Verbs’ Selectional Preferences Automatic Acquisition: Semantic Annotation of the SenSem Corpus. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Chakrabarti, D., Mandalia, H., Priya, R., Sarma, V. and Bhattacharyya, P. Hindi compound verbs and their automatic extraction. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Charniak, E. and Berland, M. Finding parts in very large corpora. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 57–64, Maryland, 1999.
Cohen, S. and Smith, N. A. Variational Inference for Grammar Induction with Prior Knowledge. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Cromieres, F. and Kurohashi, S. An Alignment Algorithm using Belief Propagation and a Structure-based Distortion Model. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Cuadros, M. and Rigau, G. KnowNet: building a large net of knowledge from the Web. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Cucchiarini, C., Daelemans, W. and Strik, H. Strengthening the Dutch Human Language Technology Infrastructure. In: ELRA Newsletter Vol. 6 N. 4. 2001.
Dagan, I., Church, K. W., Gale, W. A. Robust bilingual word alignment for machine aided translation. In: Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, 1-8, Columbus, 1993.
Dagan, I., Itai, A., Schwall, U. Two languages are more informative than one. In: Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California, 1991.
Deng, Y. and Byrne, W. MTTK: An alignment toolkit for statistical machine translation. Presented in the HLT-NAACL Demonstrations Program, June 2006.
Deoskar, T. and Rooth, M. Induction of Treebank-Aligned Lexical Resources. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Dias-da-Silva, B. C., Di Felippo, A. and das G. Volpe Nunes, M. The Automatic Mapping of Princeton WordNet Lexical-Conceptual Relations onto the Brazilian Portuguese WordNet Database. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Evert, S. A Lightweight and Efficient Tool for Cleaning Web Pages. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Fujii, Atsushi and Tetsuya Ishikawa. Utilizing the world wide web as an encyclopedia: Extracting term descriptions from semi-structured text. In: Proceedings of the 38th Meeting of the Association for Computational Linguistics, Hong Kong, 2000.
Fujita, A. and Sato, S. A probabilistic model for measuring grammaticality and similarity of automatically generated paraphrases of predicate phrases. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Fukumoto, F., Suzuki, Y. and Yamashita, K. Retrieving bilingual verb-noun collocations by integrating cross-language category hierarchies. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Fung, P. Compiling bilingual lexicon entries from a non-parallel English-Chinese corpus. In: Proceedings of the Third Annual Workshop on Very Large Corpora, Boston, Massachusetts, 1995.
Gale, W. A. and Church, K. W. Identifying word correspondences in parallel texts. In: Fourth DARPA Workshop on Speech and Natural Language. Asilomar, CA, pp. 152–157, 1991.
Gale, W. A., Church, K. W. and Yarowsky, D. One sense per discourse. In: Proceedings of the Speech and Natural Language Workshop, San Francisco, 1992.
Ganchev, K., Gillenwater, J. and Taskar, B. Dependency Grammar Induction via Bitext Projection Constraints. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Ganchev, K., Graça, J. V. and Taskar, B. Better Alignments = Better Translations? In: 46th Annual Meeting of the Association for Computational Linguistics, Columbus, 2008.
Graça, J., Pardal, J. P., Coheur, L. and Caseiro, D. Building a Golden Collection of Parallel Multi-Language Word Alignment. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Ha, L. A., Fernandez, G., Mitkov, R. and Corpas, G. Mutual Bilingual Terminology Extraction. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Haddow, B. and Alex, B. Exploiting Multiply Annotated Corpora in Biomedical Information Extraction Tasks. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Haghighi, A., Liang, P., Berg-Kirkpatrick, T., Klein, D. Learning Bilingual Lexicons from Monolingual Corpora. In: 46th Annual Meeting of the Association for Computational Linguistics, Columbus, 2008.
Halácsy, P., Kornai, A., Németh, P. and Varga, D. Parallel Creation of Gigaword Corpora for Medium Density Languages - an Interim Report. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Harris, Z. Distributional structure. Word, 10(2-3): 146-162, 1954.
Heid, U. and Weller, M. Tools for Collocation Extraction: Preferences for Active vs. Passive. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Hiemstra, D. Deriving a bilingual lexicon for cross language information retrieval. In: Proceedings of the Fourth Groningen International Information Technology Conference for Students. Groningen: University of Groningen, 1997.
Huang, C., Lee, L., Hong, J., Qu, W. and Yu, S. Quality Assurance of Automatic Annotation of Very Large Corpora: a Study based on heterogeneous Tagging System. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Ienco, D., Villata, S. and Bosco, C. Automatic Extraction of Subcategorization Frames for Italian. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Itamar, E. and Itai, A. Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Jijkoun, V. and Hofmann, K. Generating a Non-English Subjectivity Lexicon: Relations That Matter. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Jones, Rosie and Rayid Ghani. Automatically building a corpus for a minority language from the web. In: Proceedings of the Student Workshop of the 38th Annual Meeting of the Association for Computational Linguistics, Hong Kong, 2000.
Kaji, H., Tamamura, S. and Erdenebat, D. Automatic Construction of a Japanese-Chinese Dictionary via English. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Kanzaki, K., Bond, F., Tomuro, N. and Isahara, H. Extraction of Attribute Concepts from Japanese Adjectives. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Kawahara, D. and Uchimoto, K. A Method for Automatically Constructing Case Frames for English. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Ker, S. J. and Chang, J. S. A Class-Based Approach to Word Alignment. Computational Linguistics, 23(2): 313-343, 1997.
Kilgarriff, A. and Grefenstette, G. (2003). Introduction to the special issue on the web as corpus. In: Computational Linguistics, 29:333–347.
Kilgarriff, A. I don’t believe in word senses, Computers and the Humanities, Volume 31 (2), 1997.
Korhonen, A. Subcategorization acquisition. Technical Report: UCAM‐CL‐TR‐530, University of Cambridge, UK, 2002.
Korhonen, A., Krymolowski, Y. and Collier, N. The choice of features for classification of verbs in biomedical texts. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Krauwer, S. The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap. In: International Workshop Speech and Computer (SPECOM), Moscow, Russia, 27-29 October 2003.
Kupiec, J. An Algorithm for finding noun phrase correspondences in bilingual corpora. In: Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, 1993.
Landauer, T. K., Laham, D., Rehder, R. and Schreiner, M. E. How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans. In: Proceedings of the 19th Annual Conference of the Cognitive Science Society, pages 412–417, Mahwah, NJ, 1997.
Lapshinova-Koltunski, E. and Heid, U. Head or Non-head? Semi-automatic Procedures for Extracting and Classifying Subcategorisation Properties of Compounds. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Lardilleux, A. and Lepage, Y. Multilingual alignments by monolingual string differences. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Lefever, E., Macken, L. and Hoste, V. Language-Independent Bilingual Terminology Extraction from a Multilingual Parallel Corpus. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Lemnitzer, L., Wunsch, H. and Gupta, P. Enriching GermaNet with verb-noun relations - a case study of lexical acquisition. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Lenci, A., McGillivray, B., Montemagni, S. and Pirrelli, V. Unsupervised Acquisition of Verb Subcategorization Frames from Shallow-Parsed Corpora. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Li, C., He, X., Liu, Y. and Xi, N. Incremental HMM Alignment for MT System Combination. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Lin, D., Zhao, S., Van Durme, B., Pasca, M. Mining Parenthetical Translations from the Web by Word Alignment. In: 46th Annual Meeting of the Association for Computational Linguistics, Columbus, 2008.
Liu, Y., Liu, Q. and Lin, S. Log-linear models for word alignment. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, Ann Arbor, 2005.
Lund, K., Burgess, C. et al. Semantic and Associative Priming in High Dimensional Semantic Space. In: Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society. Erlbaum, Mahwah NJ, 1995.
Ma, Y., Ozdowska, S., Sun, Y., Way, A. Improving Word Alignment Using Syntactic Dependencies. In: Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation (SSST-2), Columbus, 2008.
Ma, Y., Stroppa, N. and Way, A. Bootstrapping Word Alignment Via Word Packing. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, 2007.
Macken, L., Lefever, E. and Hoste, V. Linguistically-based sub-sentential alignment for terminology extraction from a bilingual automotive corpus. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Maeda, K., Ma, X. and Strassel, S. Creating Sentence-Aligned Parallel Text Corpora from a Large Archive of Potential Parallel Text using BITS and Champollion. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Maegaard, B., Krauwer, S., Choukri, K., Damsgaard Jørgensen, L. The BLARK concept and BLARK for Arabic. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, 2006.
Manning, C. D. Automatic acquisition of a large subcategorization dictionary from corpora. In: Proceedings of the 31st ACL, pp. 235-242, 1993.
Mapelli, V. and Choukri, K. Report on a (minimal) set of LRs to be made available for as many languages as possible, and map of the actual gaps. In: ENABLER, Deliverable D5.1, Paris, 2003.
Mausam, Soderland, S., Etzioni, O., Weld, D., Skinner, M. and Bilmes, J. Compiling a Massive, Multilingual Dictionary via Probabilistic Inference. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Melamed, I. D. Models of translational equivalence among words. Computational Linguistics, 26(2):221-249, 2000.
Melamed, I. D. A Word-to-Word Model of Translational Equivalence. In: Proceedings of the 35th Conference of the Association for Computational Linguistics, Madrid, 1997.
Michou, A. and Seretan, V. A Tool for Multi-Word Expression Extraction in Modern Greek Using Syntactic Parsing. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Mohanty, R. and Bhattacharyya, P. Lexical Resources for Semantics Extraction. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Mohler, M. and Mihalcea, R. Babylon Parallel Text Builder: Gathering Parallel Texts for Low-Density Languages. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Moore, R. C. A discriminative framework for bilingual word alignment. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, 2005.
Morin, E., Daille, B., Takeuchi, K., Kageura, K. Bilingual Terminology Mining – Using Brain, not brawn comparable corpora. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Prague, 2007.
Nazar, R., Vivaldi, J. and Cabré, T. A Suite to Compile and Analyze an LSP Corpus. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Nerima, L. and Wehrli, E. Generating Bilingual Dictionaries by Transitivity. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Och, F. and Ney, H. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.
Padó, S., Pennacchiotti, M. and Sporleder, C. Semantic role assignment for event nominalisations by leveraging verbal data. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Pasca, M. Outclassing Wikipedia in Open-Domain Information Extraction: Weakly-Supervised Acquisition of Attributes over Conceptual Hierarchies. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Paukkeri, M., Nieminen, I., Pölä, M. and Honkela, T. A language-independent approach to keyphrase extraction and evaluation. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Pazienza, M. T. and Stellato, A. Clustering of Terms from Translation Dictionaries and Synonyms Lists to Automatically Build more Structured Linguistic Resources. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Pereira, F., Tishby, N. Distributional Similarity, Phase Transitions and Hierarchical Clustering. In: Working Notes, Fall Symposium Series. AAAI, pp. 108-112, 1992.
Pitel, G. and Grefenstette, G. Semi-automatic Building Method for a Multidimensional Affect Dictionary for a New Language. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Potrich, A. and Pianta, E. LISA: Learning Domain Specific Isa-Relations from the Web. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Preiss, J., Briscoe, T. and Korhonen, A. A System for Large-scale Acquisition of Verbal, Nominal and Adjectival Subcategorization Frames from Corpora. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Prague, 2007.
Prys, D. The BLARK Matrix and its relation to the language resources situation for the Celtic languages. In: Proceedings of the 5th International Conference on Language Resources and Evaluation, Genoa, 2006.
Quochi, V., Monachini, M., Del Gratta, R., Calzolari, N. A lexicon for biology and bioinformatics: the BOOTStrep experience. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Bhagat, R. and Ravichandran, D. Large Scale Acquisition of Paraphrases for Learning Surface Patterns. In: 46th Annual Meeting of the Association for Computational Linguistics, Columbus, 2008.
Rao, D. and Ravichandran, D. Semi-Supervised Polarity Lexicon Induction. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Rodríguez, H., Farwell, D., Farreres, J., Bertran, M., Alkhalifa, M. and Martí, M. A. Arabic WordNet: Semi-automatic Extensions using Bayesian Inference. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Roth, M. and Schulte im Walde, S. Corpus Co-Occurrence, Dictionary and Wikipedia Entries as Resources for Semantic Relatedness Information. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Schulte im Walde, S. Evaluating verb subcategorization frames learned by a German statistical grammar against manual definitions in the Duden Dictionary. In: Proceedings of the 10th EURALEX International Congress, 187-197, 2002.
Shimizu, N., Hagiwara, M., Ogawa, Y., Toyama, K. and Nakagawa, H. Metric learning for synonym acquisition. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Shnarch, E., Barak, L. and Dagan, I. Extracting Lexical Reference Rules from Wikipedia. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Smadja, F., McKeown, K. R. and Hatzivassiloglou, V. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1):1-38, 1996.
Snyder, B., Naseem, T. and Barzilay, R. Unsupervised Multilingual Grammar Induction. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Sporleder, C. and Li, L. Unsupervised Recognition of Literal and Non-Literal Use of Idiomatic Expressions. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Stevenson, M., Guo, Y. and Gaizauskas, R. Acquiring sense tagged examples using relevance feedback. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Strik, H., Daelemans, W., Binnenpoorte, D., Sturm, J., de Vriend, F. and Cucchiarini, C. Dutch HLT resources: From BLARK to priority lists. In: Proceedings of ICSLP, Denver, USA, 2002.
Sujay Carlos, C., Choudhury, M. and Dandapat, S. Large-Coverage Root Lexicon Extraction for Hindi. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Thompson, P., Cotter, P., McNaught, J., Ananiadou, S., Montemagni, S., Trabucco, A. and Venturi, G. Building a Bio-Event Annotated Corpus for the Acquisition of Semantic Frames from Biomedical Corpora. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Tiedemann, J. Extraction of translation equivalents from parallel corpora. In: Proceedings of the 11th Nordic Conference on Computational Linguistics NODALIDA-98, 1998.
Tinsley, J., Zhechev, V., Hearne, M. and Way, A. Robust Language Pair-Independent Sub-Tree Alignment. In: Machine Translation Summit XI, pp. 467-474, Copenhagen, 2007.
Todiraşcu, A., Tufiş, D., Heid, U., Gledhill, C., Ştefanescu, D., Weller, M. and Rousselot, F. A Hybrid Approach to Extracting and Classifying Verb+Noun Constructions. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Tohyama, H., Kozawa, S., Uchimoto, K., Matsubara, S., Isahara, H. SHACHI: A Large Scale Metadata Database of Language Resources. In: The First International Conference on Global Interoperability for Language Resources, Hong Kong, 2008.
Tsunakawa, T., Okazaki, N. and Tsujii, J. Building a bilingual lexicon using phrase-based statistical machine translation via a pivot language. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Tufiş, D. A cheap and fast way to build useful translation lexicons. In: Proceedings of the 19th International Conference on Computational Linguistics, 2002.
Tufiş, D., Irimia, E., Ion, R. and Ceauşu, A. Unsupervised Lexical Acquisition for Part of Speech Tagging. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Ushioda, A., Evans, D., Gibson, T. and Waibel, A. The automatic acquisition of frequencies of verb subcategorization frames from tagged corpora. In: B. Boguraev and J. Pustejovsky, eds. SIGLEX ACL Workshop on the Acquisition of Lexical Knowledge from Text. Columbus, Ohio: 95-106, 1993.
Van Durme, B., Qian, T. and Schubert, L. Class-driven attribute extraction. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Varga, I. and Yokoyama, S. iChi: a bilingual dictionary generating tool. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Vaz, P. C., Martins de Matos, D. and Mamede, N. J. Using Lexical Acquisition to Enrich a Predicate Argument Reusable Database. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Veale, T. and Hao, Y. Acquiring Naturalistic Concept Descriptions from the Web. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Vintar, Š. and Fišer, D. Harvesting Multi-Word Expressions from Parallel Corpora. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Vivaldi, J., Joan, A. and Lorente, M. Turning a Term Extractor into a new Domain: first Experiences. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Vogel, S., Ney, H. and Tillmann, C. HMM-based word alignment in statistical translation. In: Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen, 1996.
Vu, T., Ti, A. and Zhang, M. Feature-based Method for Document Alignment in Comparable News Corpora. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Wan, X. and Xiao, J. CollabRank: towards a collaborative approach to single-document keyphrase extraction. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Washtell, J. Co-dispersion: A Windowless Approach to Lexical Association. In: 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009.
Wentland, W., Knopp, J., Silberer, C. and Hartung, M. Building a Multilingual Lexical Resource for Named Entity Disambiguation, Translation and Transliteration. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Wu, D. Stochastic Inversion Transduction Grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377-403, 1997.
Yamada, K. and Knight, K. A Syntax-based Statistical Translation Model. In: Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, 2001.
Yan, Y., Okazaki, N., Matsuo, Y., Yang, Z. and Ishizuka, M. Unsupervised Relation Extraction by Mining Wikipedia Texts Using Information from the Web. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Yang, Y., Lu, Q. and Zhao, T. Chinese Term Extraction Based on Delimiters. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Yang, Y., Lu, Q. and Zhao, T. Chinese term extraction using minimal resources. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Yang, Y., Zhao, T., Lu, Q., Zheng, D. and Yu, H. Chinese Term Extraction Using Different Types of Relevance. In: 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009.
Zesch, T., Müller, C. and Gurevych, I. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Zhang, Z., Iria, J., Brewster, C. and Ciravegna, F. A Comparative Evaluation of Term Recognition Algorithms. In: 6th International Conference on Language Resources and Evaluation, Marrakech, 2008.
Zhechev, V. and Way, A. Automatic generation of parallel treebanks. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
Zwarts, S. and Dras, M. Choosing the right translation: a syntactically informed classification approach. In: 22nd International Conference on Computational Linguistics, Manchester, 2008.
5.3 Electronic Resources (besides all the individual websites of the LR studied)
- 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, 2006: http://eacl06.itc.it/programme/conference_glance.htm
- 12th Conference of the European Chapter of the Association for Computational Linguistics, Athens, 2009: http://www.eacl2009.gr/conference/programmeofconference1
- 22nd International Conference on Computational Linguistics, Manchester, 2008: http://www.coling2008.org.uk/
- 45th Annual Meeting of the Association for Computational Linguistics, Prague, 2007: http://ufal.mff.cuni.cz/acl2007/
- 46th Annual Meeting of the Association for Computational Linguistics, Columbus, 2008: http://www.ling.ohio-state.edu/acl08/
- 47th Annual Meeting of the Association for Computational Linguistics, Singapore, 2009: http://www.acl-ijcnlp-2009.org/
- 5th International Conference on Language Resources and Evaluation, Genoa, 2006: http://www.lrec-conf.org/lrec2006/
- 6th International Conference on Language Resources and Evaluation, Marrakech, 2008: http://www.lrec-conf.org/lrec2008/
- Australian National Data Service: http://ands.org.au
- Acquilex Project: http://www.cl.cam.ac.uk/research/nl/acquilex/
- BLaRK: Basic Language Resource Kit: http://lands.let.ru.nl/~strik/research/BLaRK.html
- BootCat: http://sslmit.unibo.it/~baroni/bootcat.html
- CLARIN Catalogue (Resources): http://www.clarin.eu/view_resources
- CLARIN Catalogue (Tools): http://www.clarin.eu/view_tools
- ELRA Catalogue: http://catalog.elra.info/search.php
- Joint conference of the International Committee on Computational Linguistics and the Association for Computational Linguistics, Sydney, 2006: http://www.aclweb.org/mirror/acl2006/
- Language Grid: http://langrid.nict.go.jp/en/index.html
- MEANING Project: http://www.lsi.upc.es/~nlp/meaning/meaning.html
- New Text workshop: http://www.sics.se/jussi/newtext/
- Project Bamboo: http://projectbamboo.org
- SHACHI (Language Resources Metadata Database): http://facet.shachi.org/?ln=en
- Sparkle Project: http://www.ilc.cnr.it/sparkle/sparkle.html
- Verbmobil Project: http://verbmobil.dfki.de/overview-us.html
APPENDIX I: Some remarks on the current description and standardization of resources
The survey we carried out brought to light several facts about the description and standardization of resources. In what follows we offer a brief summary of the remarks we consider worth mentioning as additional information.
The results of our survey revealed several problems in the current description and standardization of resources that FLaReNet should also address. There is a clear lack of a homogeneous and consistent description of language resources. Initiatives such as CLARIN in Europe, BAMBOO17 in the United States, ANDS18 in Australia, and Language Grid19 and SHACHI in Japan are trying to address these issues and are working on, or have already come up with, useful metadata that appropriately describe language resources; nevertheless, the problem is still to be tackled by the research community. In this sense, FLaReNet should also contribute to the discussion and provide relevant input on the metadata issues currently being discussed. As the survey clearly showed, data such as size (i.e. number of words, minutes, etc.), languages covered, annotation, type of annotation, use of standards, etc. should be standardized and fixed in order to achieve a homogeneous description of all language resources. Such a description would make it easier both to browse the existing catalogues and to locate the resources that researchers need for their work.
Another important issue is that not every language resource produced is easy to find or access and, what is even worse, some resources are lost a few years after their completion because they are no longer maintained. It is often the case that a resource is referenced in a catalogue or in a conference paper but cannot be found on the web, or that its website has disappeared because the project has finished and no one has taken care of it since. If we aim at creating a Language Resources Net, we must ensure that persistence is guaranteed and that resources and their documentation remain available to all potential users in the long run.
Finally, as far as standards are concerned, our survey has revealed a clear lack of documentation on this matter. Language resource providers are either unaware of the existence of standards and therefore do not use them, or they do not reference the standards used in their resources, so that the information on standards is not appropriately documented. While there is a clear tendency to use standard formats such as XML, txt, etc., the standards available for linguistic matters are not used, either because other tagsets and annotation formats are preferred or because the standards are too ambiguous or difficult to apply. The linguistics community should make an effort to ensure that standards such as LMF, LAF, MAF, etc. are known and used appropriately. In our survey, out of 728 resources only 48 (6.59%) use LAF (1), LMF (3) or TEI (44) as annotation formats, and only 22 (3.02%) follow guidelines such as CES (6), XCES (8) or EAGLES (8). This is a clear indicator that special emphasis should be placed on the standardization of language resources and that new resources should be created using the already existing standards.
17 http://projectbamboo.org/.
18 http://ands.org.au/.
19 http://langrid.nict.go.jp/en/index.html.
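As an illustration only, the descriptive fields that this appendix suggests fixing (size, languages covered, annotation, use of standards) could be sketched as a small record type, and the proportions reported above can be re-derived from the survey counts. The class `ResourceRecord`, its field names, the helper `share` and the example resource are hypothetical and do not correspond to any existing catalogue schema; only the counts (728, 48, 22 and their breakdown) come from the survey.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """Hypothetical minimal, homogeneous description of one language resource."""
    name: str
    languages: tuple[str, ...]                # languages covered
    size: int                                 # amount of data
    size_unit: str                            # e.g. "words" or "minutes"
    annotation_types: tuple[str, ...] = ()    # e.g. ("pos", "lemma")
    standards: tuple[str, ...] = ()           # e.g. ("TEI",); empty if none declared


def share(count: int, total: int) -> float:
    """Return `count` as a percentage of `total`, rounded to two decimals."""
    return round(100 * count / total, 2)


# Counts reported in the survey (728 resources in total).
TOTAL_RESOURCES = 728
annotation_formats = {"LAF": 1, "LMF": 3, "TEI": 44}
guidelines = {"CES": 6, "XCES": 8, "EAGLES": 8}

format_share = share(sum(annotation_formats.values()), TOTAL_RESOURCES)
guideline_share = share(sum(guidelines.values()), TOTAL_RESOURCES)

# Invented example record, for illustration only.
example = ResourceRecord(
    name="ExampleCorpus",
    languages=("ca", "es"),
    size=1_000_000,
    size_unit="words",
    annotation_types=("pos",),
    standards=("TEI",),
)

print(f"Annotation formats: {format_share}% of {TOTAL_RESOURCES} resources")
print(f"Guidelines: {guideline_share}% of {TOTAL_RESOURCES} resources")
```

Keeping the standards field as an explicit, possibly empty, tuple is one way to make the absence of any declared standard visible in a catalogue instead of leaving it undocumented.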