CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque

I. Leturia, A. Gurrutxaga (1), I. Alegria, A. Ezeiza (2)

WAC3 – September 15-16, 2007 – Louvain-la-Neuve

(1) Elhuyar R&D, Usurbil, Basque Country
(2) IXA Group, University of the Basque Country, Donostia, Basque Country



Contents

• Motivation
• Problems with Basque language
• Our approach
• CorpEus, a ‘web as corpus’ tool for Basque
• EusBila, a search service for Basque
• Evaluation


Motivation

• No doubt corpora are necessary:
  – for linguistic research
  – for language normalization
  – for developing language technologies

• But many corpora are exclusively used for these purposes

• They are not made publicly available and searchable through the Internet

• For Basque, it is essential to have corpora available for querying:
  – Standardization of Basque started only in 1968
  – Many rules, words and spellings have been changing since; every now and then, new rules are still released by the Academy of the Basque Language
  – It was not taught in schools until the seventies and in universities until the eighties
  – No decision as to the correct word or spelling has yet been taken in many areas or words
  – Even written production abounds with misspellings, errors, uncertainties, etc.

• The Basque-speaking community needs corpora:
  – Teachers
  – Writers
  – Technical text producers
  – Dictionary makers
  – Translators
  – Students
  – Academics in the field of standardization

• Basque is not a language rich in corpora
  – Few, small and not updated

• Only corpora available (III):
  – Zientzia eta teknologiaren corpusa:
    • Elhuyar Foundation and the IXA Group of the University of the Basque Country
    • 7.6 million words
    • Texts on science and technology
    • 1990 - 2002
    • http://www.ztcorpusa.net

• Only corpora available (IV):
  – Klasikoen gordailua:
    • Susa publishing house
    • 10.7 million words
    • Non-tagged
    • Classic texts
    • http://klasikoak.armiarma.com/corpus.htm

• But we do have the Internet
  – Huge repository of texts
  – Constantly updated

• A tool for querying the Internet as if it were a Basque corpus would be very interesting

• There are also disadvantages:
  – The web is not linguistically tagged:
    • There is always some uncertainty
    • Variants and misspellings will not appear when looking for a word
  – It never shows everything, only what appears in the first results returned by search engines
  – The Internet is often considered non-representative
  – The Internet is full of redundancy


• Nevertheless, we thought that the benefits far exceeded the disadvantages

• We embarked on a project to build a ‘web as corpus’ tool for Basque


Problems with Basque language

• Looking for conjugations and inflections
  – Basque is an agglutinative language
    • A given lemma makes many different word forms
      – lan (“work”): lana (“the work”), lanak (“works” or “the works”), lanari (“to the work”), lanei (“to the works”), lanaren (“of the work”), lanen (“of the works”)…
  – Looking only for the exact given word, or the word plus an “s” for the plural, is not enough
  – Wildcards are not an appropriate solution
    • Looking for lan* would also return forms of the words lanabes (“tool”), lanbro (“fog”)…
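The wildcard problem can be illustrated with a minimal sketch. The word list only reuses the examples from the slide; the regular expression is an assumed stand-in for a search engine's `lan*` wildcard, not how any particular engine implements it.

```python
import re

# Words taken from the slide: forms of "lan" plus two unrelated lemmas.
words = ["lan", "lana", "lanak", "lanabes", "lanbro", "lanaren"]

# Assumed stand-in for the wildcard query lan*.
wildcard = re.compile(r"^lan.*$")

# The inflected forms of "lan" listed on the slide.
lan_forms = {"lan", "lana", "lanak", "lanari", "lanei", "lanaren", "lanen"}

wildcard_hits = [w for w in words if wildcard.match(w)]
lemma_hits = [w for w in words if w in lan_forms]

print(wildcard_hits)  # also contains lanabes and lanbro
print(lemma_hits)     # only forms of the lemma "lan"
```

The wildcard matches every word starting with "lan", whereas matching against an explicit list of forms keeps only the inflections of the intended lemma.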

• Language discrimination
  – No search engine offers the possibility of returning only pages in Basque
  – Big problem when looking for:
    • Technical words that exist also in other languages: anorexia, sulfuroso, byte, allegro, sistema, energia…
    • Short words: katu (“cat”), ur (“water”)…
    • Proper nouns: Egipto, Newton, Pluton…
  – Many non-Basque results are returned, often no Basque results at all

• Lack of knowledge about the language
  – Status of the language:
    • Late standardization
    • Still many changes in words and rules
    • Late teaching in schools and universities
    • Many non-standardised areas or words
    • Many misspellings and errors in written production
  – A word might be incorrect but appear often on the web
  – The user might think it is correct, without knowing that a more appropriate word exists


Our approach

• Looking for conjugations and inflections: Morphological query expansion (I)
  – Morphological generator created by the IXA Group of the University of the Basque Country
  – We obtain all the forms of a given lemma
  – We ask the search engine for all of them using an OR operator
  – etxe (“house”) => etxe OR etxea OR etxeak OR etxeari OR etxeek OR …
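As a minimal sketch of the expansion step, assuming the morphological generator has already returned a list of forms (its real interface is not described in the slides), the OR query can be assembled like this. The function name `expand_query` is hypothetical.

```python
def expand_query(forms):
    """Join the word forms of one lemma into a single OR query."""
    # Drop duplicates while keeping the generator's original order.
    unique_forms = list(dict.fromkeys(forms))
    return " OR ".join(unique_forms)


# Forms of "etxe" ("house") quoted on the slide; a real generator returns many more.
etxe_forms = ["etxe", "etxea", "etxeak", "etxeari", "etxeek"]
query = expand_query(etxe_forms)
print(query)
```

The exact OR syntax accepted by each search engine API may differ; the plain `OR` keyword here follows the notation used on the slide.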

• Looking for conjugations and inflections: Morphological query expansion (II)
  – Minor problems:
    • Each search engine API limits the number of words or the length of the search phrase
      – we had to discover the limits by trial and error
    • Due to these limits, a truly lemmatised search is impossible
      – we looked in a corpus for the most frequent cases, numbers, tenses, etc. of the declensions and inflections of words
      – these are the forms of the words sent in the query
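The selection of forms under an API limit could be sketched as follows. The frequencies and the limits are invented for illustration (the slides give neither), and `select_forms` is a hypothetical helper, not the authors' code.

```python
def select_forms(form_freqs, max_terms, max_chars):
    """Pick the most frequent forms that still fit the assumed API limits."""
    # Rank forms from most to least frequent.
    ranked = sorted(form_freqs, key=form_freqs.get, reverse=True)
    chosen = []
    for form in ranked:
        candidate = " OR ".join(chosen + [form])
        # Stop as soon as adding one more form would break either limit.
        if len(chosen) + 1 > max_terms or len(candidate) > max_chars:
            break
        chosen.append(form)
    return chosen


# Made-up corpus frequencies for some forms of "lan" ("work").
lan_freqs = {"lan": 500, "lana": 900, "lanak": 300, "lanari": 40, "lanei": 15}
selected = select_forms(lan_freqs, max_terms=3, max_chars=100)
print(" OR ".join(selected))
```

Rare forms are dropped first, so the query covers as many occurrences as possible within the limit.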

• Language discrimination: Language-filtering words (I)
  – We looked in a corpus for the most frequent words in Basque
  – We include them in the search phrase using an AND operator

• Language discrimination: Language-filtering words (II)
  – Minor problems (I):
    • The most frequent words in Basque exist in other languages too
    • Several language-filtering words had to be used
      – the more of these, the more we gained in precision (fewer non-Basque pages returned) but the more we lost in recall (more Basque pages left out), and vice versa
      – we chose precision and include four filtering words
      – if few results are returned, the user can try again with increased recall
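A minimal sketch of the language-filtering idea follows. The four filter words are common Basque words chosen here for illustration only (the slides do not name the ones actually used), and the parenthesised AND/OR syntax is an assumption about the search engine's query language.

```python
# Illustrative frequent Basque words; not necessarily the ones CorpEus uses.
FILTER_WORDS = ["eta", "da", "ez", "du"]


def build_query(forms, n_filters=4):
    """Combine the OR-expanded forms with n_filters language-filtering words."""
    expansion = " OR ".join(forms)
    filters = FILTER_WORDS[:n_filters]
    # Every filtering word must also appear in the page (AND).
    return " AND ".join([f"({expansion})"] + filters)


strict = build_query(["katu", "katua"])                # favours precision
relaxed = build_query(["katu", "katua"], n_filters=2)  # favours recall
print(strict)
print(relaxed)
```

Lowering `n_filters` mirrors the option of retrying with fewer filtering words when the strict query returns few results.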

• Language discrimination: Language-filtering words (III)
  – Minor problems (II):
    • In bilingual pages, the searched word can be in a piece of text that is not in Basque
      – we use LangId, a free language identifier developed by the IXA Group of the University of the Basque Country
      – it is applied to some context around the word to see whether it is in a piece of text in Basque
      – it does not work well with small contexts, but if the context is too big, pieces in other languages can be included
      – we start with quite a broad context and progressively reduce its length until the minimum length for LangId to work properly is reached
      – if at any time LangId says it is in Basque, we stop and show it
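The shrinking-context check might look like the sketch below. LangId's real interface is not described in the slides, so `toy_is_basque` is a crude hypothetical stand-in (a word-list ratio), and the radius values are arbitrary assumptions.

```python
# Toy stand-in for LangId: a text counts as Basque if at least half of
# its words belong to a tiny hint list. This is NOT how LangId works.
BASQUE_HINTS = {"eta", "da", "ez", "du", "katua"}


def toy_is_basque(text):
    words = text.lower().split()
    hits = sum(w in BASQUE_HINTS for w in words)
    return bool(words) and hits / len(words) >= 0.5


def occurrence_in_basque(tokens, index, is_basque,
                         start_radius=8, min_radius=2, step=2):
    """Test ever smaller contexts around tokens[index]; stop at the first Basque verdict."""
    radius = start_radius
    while radius >= min_radius:
        lo = max(0, index - radius)
        context = " ".join(tokens[lo:index + radius + 1])
        if is_basque(context):
            return True
        radius -= step
    return False


# A bilingual "page": ten English tokens followed by five Basque ones.
page = ["the"] * 10 + ["katua", "da", "eta", "ez", "du"]
print(occurrence_in_basque(page, 10, toy_is_basque))  # occurrence in the Basque part
print(occurrence_in_basque(page, 3, toy_is_basque))   # occurrence in the English part
```

With the broad context the Basque occurrence is swamped by the English text, and it is only accepted once the window has shrunk enough.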

• Lack of knowledge about the language: Variant suggestion (I)
  – EDBL, lexical database created by the IXA Group of the University of the Basque Country
  – Each word is linked to its variants, common errors, old spellings, etc.
  – When a user enters a word, its standard form or variants are suggested

• Lack of knowledge about the language: Variant suggestion (II)
  – Partly alleviates one of the problems of the non-linguistically-tagged nature of the web:
    • in a tagged corpus, variants would be assigned the correct lemma and would appear when looking for the lemma
    • with our approach, the user can obtain the variants too
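Variant suggestion can be pictured as a lookup in both directions between a standard form and its variants. The dictionary below is a one-entry hypothetical stand-in for EDBL (whose schema is not shown in the slides), and `suggest` is an illustrative name.

```python
# Hypothetical miniature stand-in for EDBL: standard form -> known variants.
STANDARD_TO_VARIANTS = {"euskara": ["euskera"]}


def suggest(word):
    """Return the standard form and the other variants of a word, or None if unknown."""
    word = word.lower()
    if word in STANDARD_TO_VARIANTS:
        return {"standard": word, "variants": STANDARD_TO_VARIANTS[word]}
    for standard, variants in STANDARD_TO_VARIANTS.items():
        if word in variants:
            others = [v for v in variants if v != word]
            return {"standard": standard, "variants": others}
    return None


print(suggest("euskera"))  # a variant: the standard form is suggested
print(suggest("euskara"))  # a standard form: its variants are suggested
```

Entering a variant points the user to the standard form, and entering the standard form lists the variants that could also be searched.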


CorpEus

• System architecture:
  – User enters word
  – Query the EDBL for variants
  – Query morphological generator to obtain conjugations and inflections
  – Query APIs of search engines
  – Download pages
  – Find occurrences of the forms of the word
  – Query LangId for the language the occurrences are in
  – Show KWiCs and counts

[Figure: CorpEus architecture diagram. Components: User, CorpEus, EDBL (IXA), morphological generator (IXA), search engines’ APIs, WWW, LangId (IXA). Labelled data flows: word; variants; word and variants; inflections and conjugations; search phrase; URLs; web pages; occurrence contexts; language; occurrence KWiCs and counts.]
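The last two steps of the architecture (finding occurrences and showing KWiCs and counts) can be sketched on an already downloaded text. The sample sentence, the whitespace tokenisation and the `kwic` helper are illustrative assumptions; the real tool works on downloaded pages and the generator's full list of forms.

```python
from collections import Counter


def kwic(tokens, forms, width=2):
    """Return (left context, keyword, right context) rows for every matching token."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() in forms:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


# Toy "page" and a few forms of the lemma "etxe" ("house").
text = "etxea handia da eta etxeak zaharrak dira"
forms = {"etxe", "etxea", "etxeak"}

rows = kwic(text.split(), forms)
counts = Counter(keyword for _, keyword, _ in rows)

for left, keyword, right in rows:
    print(f"{left:>12} [{keyword}] {right}")
print(counts)
```

Keeping each row as a (left, keyword, right) tuple makes it straightforward to count word forms or to sort by the context before or after the keyword.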

• Features (I):
  – Lemma-based search
  – Language-filtered search
  – Variant suggestion

• Features (II):
  – Ambiguous or unrecognised words:
    • The user chooses the analysis upon which to base the morphological generation

• Features (III):
  – Search for more than one word:
    • Lemma-based search performed for all of them
    • Occurrences of any of the words are shown

• Features (IV):
  – Noun phrase or term searching:
    • Enclosing various terms in double quotes
    • Morphological generation applied to last word
    • Thus, proper lemma-based search for whole noun phrases or terms (in Basque, only the last component of the noun phrase or term is inflected)
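Term searching can be sketched by inflecting only the last word of the phrase and quoting each resulting variant. The `toy_generator` below is a crude hypothetical stand-in that just appends two endings; it is not real Basque morphology and not the IXA generator.

```python
def expand_phrase(phrase, generate_forms):
    """Expand a multi-word term by inflecting only its last component."""
    *head, last = phrase.split()
    variants = []
    for form in generate_forms(last):
        # Each variant is quoted so that it is searched as an exact phrase.
        variants.append('"' + " ".join(head + [form]) + '"')
    return " OR ".join(variants)


def toy_generator(word):
    # Hypothetical stand-in: base form plus two common noun endings.
    return [word, word + "a", word + "ak"]


phrase_query = expand_phrase("web orri", toy_generator)
print(phrase_query)
```

Because the earlier components stay fixed, the number of variants grows only with the forms of the last word.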

• Features (V):
  – Different ordering criteria:
    • Order in which pages arrive (default)
    • Form of searched word
    • Context after the word
    • Context before the word
  – Ordered on the fly as they arrive

• Features (VI):
  – Analysis of the words:
    • Possible lemmas and POSs of the forms of the searched word are shown in a floating box
    • Different colours:
      – Light green: correct word, unambiguous
      – Dark green: variant, unambiguous
      – Light yellow: correct word, ambiguous
      – Dark yellow: variant, ambiguous
      – Red: unrecognised word

• Features (VII):
  – Count charts:
    • Word forms
    • Possible lemma or POS
    • Word before or after
    • Lemma of word before or after
    • …

• Features (VIII):
  – Many textual content file types:
    • HTML
    • XML
    • RSS
    • TXT
    • PDF
    • DOC
    • RTF
    • PPT
    • XLS
    • …
  – Parallel downloading of pages to avoid blocking

• Demo: http://www.corpeus.org


EusBila

• Search engines don’t work well for Basque
• We decided to build a search service for Basque based on the principles of CorpEus:
  – API based
  – Lemma-based search
  – Language-filtered search
  – Variant suggestion
• But return URLs and snippets, not KWiCs or charts

• Problem: the APIs limit the number of calls per day
  – Google: 1,000 calls per day
  – Yahoo!: 5,000 calls per day
  – Windows Live Search: 10,000 calls per day
• The limits can be enough for a corpus tool, but not for a general-use search service
• Microsoft recently raised the limit to 25,000 calls per day and also launched an unlimited-use commercial license

• We published a paper at iNEWS07 (Improving Non-English Web Searching), a workshop at SIGIR’07 (July 2007, Amsterdam)
• It aroused interest, as it is a cost-effective web search solution that can be used by other minority languages with few resources

• Launch:
  – By Eleka Ingeniaritza Linguistikoa
  – Under commercial name Elebila
  – October 2007

• Demo: EusBila


Evaluation

• The methodology used in EusBila and CorpEus was evaluated for the iNEWS07 paper on EusBila
• We evaluated:
  – Gain in recall due to morphological query expansion
  – Gain in precision due to language-filtering words
  – Loss in recall due to language-filtering words

• Indicator for precision: percentage of results that were actually in Basque
• Indicator for recall: estimated hit counts returned by the API
• Compared Windows Live Search’s API with EusBila using this same API
• The words for the evaluation were taken from the search logs of a very popular science portal in Basque
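The two indicators can be written down as a small sketch. The function names and the toy inputs are illustrative assumptions; they are not the data used in the evaluation.

```python
def precision_pct(result_languages, target="eu"):
    """Percentage of returned results whose language is the target one."""
    in_target = sum(lang == target for lang in result_languages)
    return 100.0 * in_target / len(result_languages)


def relative_change_pct(baseline_hits, new_hits):
    """Relative change in estimated hit counts, used as a proxy for recall."""
    return 100.0 * (new_hits - baseline_hits) / baseline_hits


# Toy values: three of four results are in Basque ("eu").
print(precision_pct(["eu", "es", "eu", "eu"]))
# Toy values: estimated hits grow from 200 to 300 after query expansion.
print(relative_change_pct(200, 300))
```

Hit counts reported by a search API are estimates, so the second indicator only approximates the change in recall.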

| Evaluation | Language-filtering words | Morphological query expansion | Words | Measured variable | Result |
|---|---|---|---|---|---|
| Gain in recall due to morphological query expansion | Not applied | - | Only Basque | Hit counts | 89.43% increase |
| Gain in precision due to language-filtering words | - | Not applied | Any kind | % of results in Basque | 70.55 points increase, from 27.19% to 97.74% |
| Loss in recall due to language-filtering words | - | Not applied | Only Basque | Hit counts | Decrease of between 6.48% and 57.69%, depending on the number of language-filtering words* |
| Gain in recall due to morphological query expansion | Applied | - | Any kind | Hit counts | 40.19% increase |

* The number of filtering words can optionally be reduced to increase recall when few results are returned
