CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque
I. Leturia, A. Gurrutxaga (1), I. Alegria, A. Ezeiza (2)
WAC3 – September 15-16, 2007 – Louvain-la-Neuve
(1) Elhuyar R&D, Usurbil, Basque Country
(2) IXA Group, University of the Basque Country, Donostia, Basque Country
Contents
• Motivation
• Problems with Basque language
• Our approach
• CorpEus, a ‘web as corpus’ tool for Basque
• EusBila, a search service for Basque
• Evaluation
Motivation
• No doubt corpora are necessary:
  – for linguistic research
  – for language normalization
  – for developing language technologies
• But many corpora are exclusively used for these purposes
• They are not made publicly available and searchable through the Internet
Motivation
• For Basque, it is essential to have corpora available for querying
  – Standardization of Basque started only in 1968
  – Many rules, words and spellings have been changing since; still, every now and then new rules are released by the Academy of Basque Language
  – It was not taught in schools until the seventies and in universities until the eighties
  – No decision as to the correct word or spelling has yet been taken in many areas or words
  – Even written production abounds with misspellings, errors, uncertainties, etc.
Motivation
• Basque speaking community needs corpora
  – Teachers
  – Writers
  – Technical text producers
  – Dictionary makers
  – Translators
  – Students
  – Academics in the field of standardization
• Basque is not a language rich in corpora
  – Few, small and not updated
Motivation
• Only corpora available (I):
  – XX. mendeko euskararen corpusa:
    • Academy of the Basque language
    • 4.6 million words
    • Balanced
    • Literary texts
    • Twentieth century
    • http://www.euskaracorpusa.net/XXmendea/Konts_arrunta_fr.html
Motivation
• Only corpora available (II):
  – Ereduzko prosa gaur:
    • University of the Basque Country
    • 23.8 million words
    • Literary and press texts regarded as “reference”
    • 2000 - 2005
    • http://www.ehu.es/euskara-orria/euskara/ereduzkoa/araka.html
Motivation
• Only corpora available (III):
  – Zientzia eta teknologiaren corpusa:
    • Elhuyar Foundation and the IXA Group of the University of the Basque Country
    • 7.6 million words
    • Texts on science and technology
    • 1990 - 2002
    • http://www.ztcorpusa.net
Motivation
• Only corpora available (IV):
  – Klasikoen gordailua:
    • Susa publishing house
    • 10.7 million words
    • Non-tagged
    • Classic texts
    • http://klasikoak.armiarma.com/corpus.htm
Motivation
• But we do have the Internet
  – Huge repository of texts
  – Constantly updated
• A tool for querying the Internet as if it were a Basque corpus would be very interesting
Motivation
• Also disadvantages:
  – Not linguistically tagged:
    • Always some uncertainty
    • Variants and misspellings will not appear when looking for a word
  – It will never show all, only what there is in the first results returned by search engines
  – The Internet is often considered non-representative
  – The Internet is full of redundancy
Motivation
• Nevertheless, we thought that the benefits far exceeded the disadvantages
• We embarked on a project to build a ‘web as corpus’ tool for Basque
Problems with Basque language
• Similar services exist:
  – WebConc (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi)
  – WebCorp (http://www.webcorp.org.uk/)
  – KWiCFinder (http://www.kwicfinder.com)
• But these rely on search engines
• Search engines don’t work well for Basque
Problems with Basque language
• Looking for conjugations and inflections
  – Basque is an agglutinative language
    • A given lemma makes many different word forms
      – lan (“work”): lana (“the work”), lanak (“works” or “the works”), lanari (“to the work”), lanei (“to the works”), lanaren (“of the work”), lanen (“of the works”)…
  – Looking only for the exact given word, or the word plus an “s” for the plural, is not enough
  – Wildcards are not an appropriate solution
    • Looking for lan* would also return forms of the words lanabes (“tool”), lanbro (“fog”)…
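The wildcard problem can be reproduced with a few lines of Python. This is only a sketch: the word lists are the examples from the slide, and shell-style matching via the standard `fnmatch` module stands in for whatever wildcard syntax a search engine might offer.

```python
from fnmatch import fnmatch

# Inflected forms of "lan" taken from the slide, plus the two unrelated
# words the slide gives as examples of unwanted matches.
forms_of_lan = ["lan", "lana", "lanak", "lanari", "lanei", "lanaren", "lanen"]
unrelated = ["lanabes", "lanbro"]


def wildcard_matches(pattern, words):
    """Return the words that match a shell-style wildcard pattern."""
    return [w for w in words if fnmatch(w, pattern)]


matches = wildcard_matches("lan*", forms_of_lan + unrelated)
false_hits = [w for w in matches if w in unrelated]
print(false_hits)  # ['lanabes', 'lanbro']
```

The pattern does catch every form of lan, but it cannot tell them apart from other lemmas that merely share the prefix.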
Problems with Basque language
• Language discrimination
  – No search engine offers the possibility of returning only pages in Basque
  – Big problem when looking for:
    • Technical words that exist also in other languages: anorexia, sulfuroso, byte, allegro, sistema, energia…
    • Short words: katu (“cat”), ur (“water”)…
    • Proper nouns: Egipto, Newton, Pluton…
  – Many non-Basque results are returned, often no Basque results at all
Problems with Basque language
• Lack of knowledge about the language
  – Status of language:
    • Late standardization
    • Still many changes in words and rules
    • Late teaching in schools and universities
    • Many non-standardised areas or words
    • Many misspellings and errors in written production
  – A word might be incorrect but appear often in the web
  – The user might think it is correct, without knowing that a more appropriate word exists
Our approach
• Looking for conjugations and inflections: Morphological query expansion (I)
  – Morphological generator created by the IXA Group of the University of the Basque Country
  – We obtain all the forms of a given lemma
  – We ask the search engine for all of them using an OR operator
  – etxe (“house”) => etxe OR etxea OR etxeak OR etxeari OR etxeek OR …
Our approach
• Looking for conjugations and inflections: Morphological query expansion (II)
  – Minor problems:
    • Each search engine API limits the number of words or the length of the search phrase
      – we had to discover the limits by trial and error
    • Due to these limits, a truly lemmatised search is impossible
      – we looked in a corpus for the most frequent cases, numbers, tenses, etc. among the declensions and inflections of words
      – only these most frequent forms are sent in the query
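Morphological query expansion under a term limit can be sketched as follows. The function name `build_or_query` and the `max_terms` parameter are hypothetical, and the list of forms is assumed to be already ordered from most to least frequent; the slides do not give the actual limits or the real output of the IXA generator.

```python
def build_or_query(forms, max_terms):
    """Join word forms with OR, keeping at most max_terms of them.

    `forms` is assumed to be ordered from most to least frequent, so
    truncating the list keeps the forms most likely to occur on the web.
    """
    if max_terms < 1:
        raise ValueError("max_terms must be at least 1")
    return " OR ".join(forms[:max_terms])


# Forms of "etxe" ("house") listed on the slide; a real morphological
# generator would return many more.
etxe_forms = ["etxe", "etxea", "etxeak", "etxeari", "etxeek"]

print(build_or_query(etxe_forms, max_terms=3))  # etxe OR etxea OR etxeak
```

Because the limit differs between search engine APIs, `max_terms` would have to be set per engine.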
Our approach
• Language discrimination: Language-filtering words (I)
  – We looked in a corpus for the most frequent words in Basque
  – We include them in the search phrase using an AND operator
Our approach
• Language discrimination: Language-filtering words (II)
  – Minor problems (I):
    • The most frequent words in Basque exist in other languages too
    • Several language-filtering words had to be used
      – the more of these we use, the more we gain in precision (fewer non-Basque pages returned) but the more we lose in recall (more Basque pages left out), and vice versa
      – we favoured precision and include four filtering words
      – if few results are returned, the user can try again with fewer filtering words to increase the recall
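A minimal sketch of language-filtering words, assuming a generic `AND`/`OR` query syntax. The four words in `FILTER_WORDS` are common Basque words chosen for illustration only (the slides do not say which words were used), and both function names are hypothetical.

```python
def add_language_filter(query, filter_words, n_filters=4):
    """AND the first n_filters language-filtering words onto a query."""
    if n_filters < 0:
        raise ValueError("n_filters must not be negative")
    chosen = list(filter_words[:n_filters])
    return " AND ".join([f"({query})"] + chosen)


def relaxed_queries(query, filter_words, start=4):
    """Yield queries with start, start-1, ..., 1 filtering words.

    Each step trades some precision for recall, which is what a user
    would want when the stricter query returns too few results.
    """
    for n in range(min(start, len(filter_words)), 0, -1):
        yield add_language_filter(query, filter_words, n)


# Illustrative filtering words only; not the list used by the authors.
FILTER_WORDS = ["eta", "da", "ez", "ere"]

print(add_language_filter("etxe OR etxea", FILTER_WORDS))
# (etxe OR etxea) AND eta AND da AND ez AND ere
```

Starting with all four words and dropping one at a time mirrors the trade-off described above: fewer filters mean more Basque pages found, at the cost of more non-Basque pages slipping through.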
Our approach
• Language discrimination: Language-filtering words (III)
  – Minor problems (II):
    • In bilingual pages, the searched word can be in a piece of text that is not in Basque
      – we use LangId, a free language identifier developed by the IXA Group of the University of the Basque Country
      – it is applied to some context around the word to see whether it lies in a piece of text in Basque
      – it does not work well with small contexts, but if the context is too big, pieces in other languages can be included
      – we start with quite a broad context and progressively reduce its length until the minimal length for LangId to work properly is reached
      – if at any point LangId says the context is in Basque, we stop and show the occurrence
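The shrinking-context check can be sketched like this. LangId's real interface is not described in the slides, so `identify` is a stand-in for any callable that maps a string to a language code; the window sizes and the step are made-up values, and "eu" is the ISO 639-1 code for Basque.

```python
def occurrence_is_basque(text, start, end, identify,
                         max_window=400, min_window=100, step=100):
    """Decide whether the occurrence text[start:end] sits in Basque text.

    The context starts at max_window characters on each side and shrinks
    by step until min_window is reached. The first context that
    `identify` labels "eu" settles the question.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    window = max_window
    while window >= min_window:
        context = text[max(0, start - window): end + window]
        if identify(context) == "eu":
            return True
        window -= step
    return False
```

Stopping at `min_window` reflects the constraint that a language identifier becomes unreliable on very short inputs, so shrinking further would not give a trustworthy answer.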
Our approach
• Lack of knowledge about the language: Variant suggestion (I)
  – EDBL, lexical database created by the IXA Group of the University of the Basque Country
  – Each word is linked to its variants, common errors, old spellings, etc.
  – When a user enters a word, its standard form or variants are suggested
Our approach
• Lack of knowledge about the language: Variant suggestion (II)
  – Partly mitigates one of the problems of the non-linguistically-tagged nature of the web:
    • in a tagged corpus, variants would be assigned the correct lemma and would appear when looking for the lemma
    • with our approach, the user can obtain the variants too
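Variant suggestion amounts to a two-way lookup between standard forms and their variants. The sketch below uses a tiny dictionary as a stand-in for EDBL; its single entry (haundi, a widespread non-standard spelling of handi, "big") is illustrative and not taken from the database, and the function name `suggest` is hypothetical.

```python
# Toy stand-in for EDBL: maps a non-standard form to its standard form.
VARIANT_TO_STANDARD = {
    "haundi": "handi",
}


def suggest(word, variant_to_standard=VARIANT_TO_STANDARD):
    """Return the forms to suggest alongside `word`.

    For a known variant, the standard form is suggested; for a standard
    form, its known variants are suggested; otherwise nothing.
    """
    word = word.lower()
    if word in variant_to_standard:
        return [variant_to_standard[word]]
    return sorted(v for v, std in variant_to_standard.items() if std == word)


print(suggest("haundi"))  # ['handi']
print(suggest("handi"))   # ['haundi']
```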
CorpEus
• System architecture:
  – User enters word
  – Query the EDBL for variants
  – Query morphological generator to obtain conjugations and inflections
  – Query APIs of search engines
  – Download pages
  – Find occurrences of the forms of the word
  – Query LangId for the language the occurrences are in
  – Show KWiCs and counts
[Diagram: CorpEus system architecture. Components: User, CorpEus, EDBL (IXA), morphological generator (IXA), search engines’ APIs, WWW, LangId (IXA). Labels on the connections: Word / Variants; Word, variants / Inflections, conjugations; Search phrase / URLs; URLs / Web pages; Occurrence contexts / Language; Word / Occurrence KWiCs and counts.]
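The flow through the CorpEus components can be sketched as a single function that receives each component as a callable. Everything here is an assumption made for illustration: the parameter names, the OR query syntax, the 40-character context and the choice to return variants as suggestions rather than search for them. None of it reproduces the actual interfaces of EDBL, the morphological generator, the search APIs or LangId.

```python
import re
from collections import Counter


def corpeus_search(word, get_variants, generate_forms, search, download,
                   identify, context_chars=40):
    """Run one lemma-based, language-filtered search and collect KWiCs."""
    suggestions = get_variants(word)          # stand-in for EDBL
    forms = generate_forms(word)              # stand-in for the generator
    query = " OR ".join(forms)
    kwics = []
    for url in search(query):                 # stand-in for a search API
        text = download(url)
        if not text:
            continue
        for form in forms:
            # Whole-word matches only, so "etxe" does not match "etxea".
            for m in re.finditer(rf"\b{re.escape(form)}\b", text):
                context = text[max(0, m.start() - context_chars):
                               m.end() + context_chars]
                if identify(context) == "eu":  # stand-in for LangId
                    kwics.append((url, form, context))
    counts = Counter(form for _, form, _ in kwics)
    return {"suggestions": suggestions, "kwics": kwics, "counts": counts}
```

Passing the components in as callables keeps the sketch independent of any particular search engine or language identifier, and makes it possible to try the flow with simple stubs.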
CorpEus
• Features (I):
  – Lemma-based search
  – Language-filtered search
  – Variant suggestion
CorpEus
• Features (II):
  – Ambiguous or unrecognised words:
    • The user chooses the analysis upon which to base the morphological generation
CorpEus
• Features (III):
  – Search for more than one word:
    • Lemma-based search performed for all of them
    • Occurrences of any of the words are shown
CorpEus
• Features (IV):
  – Noun phrase or term searching:
    • Enclosing various terms in double quotes
    • Morphological generation applied to last word
    • Thus, proper lemma-based search for whole noun phrases or terms (in Basque, only the last component of the noun phrase or term is inflected)
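Expanding a quoted noun phrase by inflecting only its last word might look like this. The function name `expand_phrase` is hypothetical, and the lambda used as a generator is a toy that appends two suffixes; it stands in for the real morphological generator.

```python
def expand_phrase(phrase, generate_forms):
    """Expand a noun phrase by inflecting only its last word.

    Each resulting phrase is wrapped in double quotes and the
    alternatives are joined with OR.
    """
    words = phrase.split()
    if not words:
        raise ValueError("phrase must contain at least one word")
    *head, last = words
    prefix = " ".join(head)
    variants = [f"{prefix} {form}".strip() for form in generate_forms(last)]
    return " OR ".join(f'"{v}"' for v in variants)


# Toy generator: bare form, singular and plural absolutive only.
toy_forms = lambda w: [w, w + "a", w + "ak"]

print(expand_phrase("etxe handi", toy_forms))
# "etxe handi" OR "etxe handia" OR "etxe handiak"
```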
CorpEus
• Features (V):
  – Different ordering criteria:
    • Order of arrival of the pages (default)
    • Form of searched word
    • Context after the word
    • Context before the word
  – Results are ordered on the fly as they arrive
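Ordering results on the fly can be sketched with sorted insertion from the standard `bisect` module. The class and field names are hypothetical, and sorting the left context word by word outwards from the keyword is an assumption borrowed from common concordancer behaviour, not something the slides state.

```python
import bisect
from typing import NamedTuple


class Kwic(NamedTuple):
    left: str    # context before the word
    form: str    # matched word form
    right: str   # context after the word


SORT_KEYS = {
    "form": lambda k: k.form.lower(),
    "after": lambda k: k.right.lower(),
    # Compare the left context starting from the word nearest the keyword.
    "before": lambda k: k.left.lower().split()[::-1],
}


class OrderedResults:
    """Keeps KWiC lines sorted while they are added one at a time."""

    def __init__(self, criterion=None):
        if criterion is not None and criterion not in SORT_KEYS:
            raise ValueError(f"unknown criterion: {criterion}")
        self._key = SORT_KEYS.get(criterion)
        self._rows = []  # (sort key, arrival number, kwic)

    def add(self, kwic):
        n = len(self._rows)
        # With no criterion the arrival number is the key, so rows keep
        # their arrival order; the number also breaks ties between keys.
        key = self._key(kwic) if self._key else n
        bisect.insort(self._rows, (key, n, kwic))

    def rows(self):
        return [kwic for _, _, kwic in self._rows]
```

Inserting each line at its sorted position means the display is always in order, without re-sorting the whole list whenever a new page finishes downloading.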
CorpEus
• Features (VI):
  – Analysis of the words:
    • Possible lemmas and POSs of the forms of the searched word are shown in a floating box
    • Different colours:
      – Light green: correct word, unambiguous
      – Dark green: variant, unambiguous
      – Light yellow: correct word, ambiguous
      – Dark yellow: variant, ambiguous
      – Red: unrecognised word
CorpEus
• Features (VII):
  – Count charts:
    • Word forms
    • Possible lemma or POS
    • Word before or after
    • Lemma of word before or after
    • …
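The data behind such count charts can be gathered with `collections.Counter`. This sketch covers only the counts that need no linguistic analysis (word forms, word before, word after); the function name and the `(left, form, right)` tuple layout are assumptions, and punctuation attached to neighbouring words is not stripped.

```python
from collections import Counter


def count_tables(kwics):
    """Count word forms and their immediate neighbours.

    `kwics` is an iterable of (left context, form, right context).
    """
    forms, before, after = Counter(), Counter(), Counter()
    for left, form, right in kwics:
        forms[form.lower()] += 1
        left_words = left.split()
        right_words = right.split()
        if left_words:
            before[left_words[-1].lower()] += 1
        if right_words:
            after[right_words[0].lower()] += 1
    return {"form": forms, "before": before, "after": after}
```

Counts by lemma or POS would additionally need a morphological analyser for the neighbouring words, which is outside this sketch.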
CorpEus
• Features (VIII):
  – Many textual content file types:
    • HTML
    • XML
    • RSS
    • TXT
    • PDF
    • DOC
    • RTF
    • PPT
    • XLS
    • …
  – Parallel downloading of pages to avoid blocking
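Parallel downloading can be sketched with a thread pool from the standard library. This is one possible approach, not necessarily the one CorpEus uses; the function names, the worker count and the timeout are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_url(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch_url, max_workers=8):
    """Yield (url, content) pairs as soon as each download finishes.

    A page that fails to download yields (url, None), so one slow or
    broken server does not hold up the others.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                yield url, future.result()
            except Exception:
                # Any fetch error is treated as "page unavailable".
                yield url, None
```

Because results are yielded in completion order, the occurrences from fast pages can be shown while slower pages are still being fetched.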
CorpEus
• Demo: http://www.corpeus.org
EusBila
• Search engines don’t work well for Basque
• We decided to build a search service for Basque based on the principles of CorpEus:
  – API based
  – Lemma-based search
  – Language-filtered search
  – Variant suggestion
• But it returns URLs and snippets, not KWiCs or charts
EusBila
• Problem: limit of calls per day of the APIs
  – Google: 1,000 calls per day
  – Yahoo!: 5,000 calls per day
  – Windows Live Search: 10,000 calls per day
• The limits can be enough for a corpus tool, but not for a general use search service
• Microsoft recently raised the limit to 25,000 calls per day and also launched an unlimited-use commercial license
EusBila
• Published a paper in iNEWS07 (Improving Non-English Web Searching), a workshop in SIGIR’07 (July 2007, Amsterdam)
• It aroused interest, as it is a cost-effective web search solution that can be used by other minority languages with few resources
EusBila
• Launch:
  – By Eleka Ingeniaritza Linguistikoa
  – Under commercial name Elebila
  – October 2007
Evaluation
• The methodology used in EusBila and CorpEus was evaluated for the iNEWS07 paper on EusBila
• We evaluated:
  – Gain in recall due to morphological query expansion
  – Gain in precision due to language-filtering words
  – Loss in recall due to language-filtering words
Evaluation
• Indicator for precision: percentage of results that were actually in Basque
• Indicator for recall: estimated hit counts returned by the API
• Compared Windows Live Search’s API with EusBila using this same API
• The words for the evaluation were taken from the search logs of a very popular science portal in Basque
Evaluation

Evaluation | Condition: language-filtering words | Condition: morphological query expansion | Condition: words | Measured variable | Result
--- | --- | --- | --- | --- | ---
Gain in recall due to morphological query expansion | Not applied | - | Only Basque | Hit counts | 89.43% increase
Gain in precision due to language-filtering words | - | Not applied | Any kind | % of results in Basque | 70.55 points increase, from 27.19% to 97.74%
Loss in recall due to language-filtering words | - | Not applied | Only Basque | Hit counts | Decrease from 6.48% to 57.69%, depending on the number of language-filtering words*
Gain in recall due to morphological query expansion | Applied | - | Any kind | Hit counts | 40.19% increase

* The amount of filtering words can optionally be reduced to increase the recall when few results are returned
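The evaluation reports precision gains in percentage points and recall changes as relative percentages of hit counts. The following sketch shows how each kind of figure is computed; the function names are hypothetical, and only the two precision values (27.19% and 97.74%) come from the slides.

```python
def precision_pct(n_basque, n_total):
    """Percentage of returned results that are actually in Basque."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_basque / n_total


def point_gain(before_pct, after_pct):
    """Difference between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_change_pct(before, after):
    """Relative change between two hit counts, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (after - before) / before


# Precision figures from the evaluation table.
print(round(point_gain(27.19, 97.74), 2))  # 70.55
```

The distinction matters when reading the results: the precision gain is a difference of two percentages, whereas the recall figures express how much the estimated hit count grew or shrank relative to the baseline query.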