Eurovoc does not yet exist for your language? The Hungarian experience.

Tamás Váradi

varadi@nytud.hu

JRC Ispra, 17/09/2004, EUROVOC Indexing Workshop

Overview of the project

• Objectives

• Partners

• Resources

• Methods

• Results

• Conclusions


Project objectives

• Hungarian EUROVOC version

– only a draft version planned at first

– an authoritative full-scale system

• Automatic indexing of documents

– using the technology developed at JRC

– prototype system for one domain


Partners

• Project consortium:

– HAS RIL (coordinator)

– MorphoLogic Kft. (partner)

• Collaborators:

– JRC, Ispra

– Hungarian Parliament

– Ministry of Justice


Resources

• NLP toolset (RIL)

• Digital dictionaries, software technology (MorphoLogic)

• Indexing technology (JRC Ispra)

• Terminology database, translation, supervision expertise (Justice Ministry)

• Coordination funding of Hungarian EUROVOC (Hungarian Parliament)


EUROVOC translation

• Done by the Translation Coordination Unit of the Ministry of Justice

• Team coordinating the massive effort of preparing the Hungarian translation of the Acquis Communautaire

• Maintaining an online Terminological Database


Terminological Database


Translation process

• English, French, German & Spanish EUROVOC versions in XML files

• Automatic lookup in the Terminological Database (approx. 20% coverage)

• Notepad2, an XML-aware editor, used

• micro-thesauri translated first, corresponding descriptors second

• pool of experts consulted when needed
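
The automatic lookup step can be pictured roughly as follows. This is a minimal Python sketch, not the project's actual tooling: the `term_db` dictionary, the `pretranslate` function and the sample entries are hypothetical stand-ins for the Terminological Database and the EUROVOC descriptor list.

```python
def pretranslate(descriptors, term_db):
    """Look up each English descriptor in a bilingual term dictionary.

    Returns (translations, coverage): translations maps every descriptor
    to its Hungarian equivalent or None, and coverage is the fraction of
    descriptors that were found.
    """
    translations = {}
    for term in descriptors:
        # Normalise case and whitespace so trivial differences do not block a match.
        key = " ".join(term.lower().split())
        translations[term] = term_db.get(key)
    found = sum(1 for hu in translations.values() if hu is not None)
    coverage = found / len(descriptors) if descriptors else 0.0
    return translations, coverage


# Hypothetical sample data, for illustration only.
term_db = {"suspension of payments": "kifizetések felfüggesztése"}
descriptors = ["Suspension of payments", "sales promotion"]
translations, coverage = pretranslate(descriptors, term_db)
print(translations, coverage)
```

Descriptors that come back as `None` are the ones left for manual translation in the editor.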


Indexing strategies

• Corpus: Hungarian translation of the Acquis Communautaire

• Two approaches

1. To translate English associate terms (possible short-cut?)

2. To reconstruct the generation of associate terms by running the JRC technology on the Hungarian data


Translation of associate terms

• Hypothesis:

– relation between English associate term and EUROVOC descriptor is language independent

– hence Hungarian equivalent of English term will also serve as appropriate associate term in Hungarian texts


Online dictionary lookup

• MorphoLogic Online English-Hungarian dictionaries applied

• 24.7% direct match

<LIBELLE_EN>suspension of payments</LIBELLE_EN>
<LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE>
<LIBELLE_FR>cessation de paiement</LIBELLE_FR>
<LIBELLE_ES>suspensión de pagos</LIBELLE_ES>
<LIBELLE_HU>kifizetések felfüggesztése</LIBELLE_HU>
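
A record of this shape can be read with Python's standard XML parser. The sketch below is illustrative only: the `<RECORD>` wrapper element and the `labels_by_language` helper are assumptions, since the slide shows just the `LIBELLE_*` elements and not the surrounding file structure.

```python
import xml.etree.ElementTree as ET

# The <RECORD> wrapper is an assumption; the slide shows only the LIBELLE_* elements.
record = """<RECORD>
<LIBELLE_EN>suspension of payments</LIBELLE_EN>
<LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE>
<LIBELLE_FR>cessation de paiement</LIBELLE_FR>
<LIBELLE_ES>suspensión de pagos</LIBELLE_ES>
<LIBELLE_HU>kifizetések felfüggesztése</LIBELLE_HU>
</RECORD>"""


def labels_by_language(xml_text):
    """Map language code -> list of labels found in LIBELLE_xx child elements."""
    labels = {}
    for element in ET.fromstring(xml_text):
        if element.tag.startswith("LIBELLE_"):
            lang = element.tag[len("LIBELLE_"):]
            # A list per language, because a record may carry several HU candidates.
            labels.setdefault(lang, []).append(element.text)
    return labels


labels = labels_by_language(record)
print(labels["EN"], "->", labels["HU"])
```

Keeping a list per language leaves room for records where more than one Hungarian equivalent is proposed.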


Manual check of automatic assignments

• Equivalence cannot be judged on its own merits: the Hungarian equivalent must be the one occurring in the texts

– the Hungarian terms must be looked up in the translation corpus as well

– a parallel corpus aligned at least at the document level must be compiled


Manual check

<LIBELLE_EN>sales promotion</LIBELLE_EN>
<LIBELLE_DE>Absatzförderung</LIBELLE_DE>
<LIBELLE_FR>promotion commerciale</LIBELLE_FR>
<LIBELLE_ES>promoción comercial</LIBELLE_ES>
<LIBELLE_HU>eladásösztönzés</LIBELLE_HU>

• Even frequency lists are useful:

Reklám 149
Promóció 60
Eladásösztönzés 1
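
A frequency check of this kind could be sketched as below. The `rank_candidates` function and the toy corpus are invented for illustration; the counts on the slide come from the translated Acquis, not from this code.

```python
import re
from collections import Counter


def rank_candidates(candidates, corpus_text):
    """Count how often each single-word candidate occurs, most frequent first."""
    # Lowercase both sides; \w is Unicode-aware for str patterns in Python 3.
    tokens = Counter(re.findall(r"\w+", corpus_text.lower()))
    counts = {c: tokens[c.lower()] for c in candidates}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Toy corpus, invented for illustration only.
corpus_text = "A reklám és a promóció szabályai. Reklám, reklám, promóció."
ranking = rank_candidates(["eladásösztönzés", "promóció", "reklám"], corpus_text)
print(ranking)
```

The sketch counts surface forms only. For Hungarian, the counts would be more meaningful on lemmatised text, since inflected forms of a term are otherwise missed.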


Manual check

<LIBELLE_EN>toxic substance</LIBELLE_EN>
<LIBELLE_DE>Giftstoff</LIBELLE_DE>
<LIBELLE_FR>substance toxique</LIBELLE_FR>
<LIBELLE_ES>sustancia tóxica</LIBELLE_ES>
<LIBELLE_HU>toxikus anyagok</LIBELLE_HU>
<LIBELLE_HU>mérgező anyagok</LIBELLE_HU>

• Even frequency lists are useful: the two Hungarian terms are equally frequent


Generation of Hungarian associate-lists

• Tasks

1. Compile corpus of Hungarian translation of Acquis Communautaire

2. Tag and lemmatize words

3. Compile list of stop words

4. Run automatic indexing tools (JRC)


Hungarian Acquis Communautaire corpus

• 8308 files

<!ELEMENT document (title+,text,lemmatised, descriptors,description) >

HUN tokens 21,899,924

EN tokens 20,394,088


English stop-word list

• English stop word list: 1720 items

– function words

– "EUspeak": objective, arrangements, committee

– Some strange multiword strings:

necessary_to_comply_with_this_directive
forward_this_resolution_to_the_commission


Hungarian stop-word list

1. translated English items

2. checked their occurrence in HU CELEX

3. generated unigram, bigram and trigram frequency lists from the HU CELEX corpus

4. checked the first 3000 items on each list and added them to the stop-word list if needed

5. double-checked infrequent items on the English translation list and replaced the translation with synonyms
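
Step 3 can be illustrated with a short sketch. The function names and the toy token sequence are hypothetical. Joining words with `_` mirrors the multiword entries quoted from the English stop-word list, and the `top` parameter corresponds to the 3000-item cut-off in step 4.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count n-grams, joining the words of each n-gram with '_'."""
    grams = ("_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


def stopword_candidates(tokens, top=3000):
    """Return the `top` most frequent unigrams, bigrams and trigrams for manual review."""
    return {n: [gram for gram, _ in ngram_counts(tokens, n).most_common(top)]
            for n in (1, 2, 3)}


# Toy token sequence, for illustration only; the slides name HU CELEX as the real source.
tokens = "forward this resolution to the commission forward this resolution".split()
candidates = stopword_candidates(tokens, top=2)
print(candidates[3])
```

The output is only a candidate list. As the steps above make clear, deciding which items belong on the stop-word list remained a manual check.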


Hungarian stop-word list

single word entries 1265

multi-word entries 752

Total 2017


Automatic indexing run 1

7971 texts (total length 65,702,474 characters) divided into 3 sets:

1. 202 texts: optimisation (evaluation set)

2. 179 texts: final evaluation (test set)

3. 7590 texts: training set
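
A three-way split with these set sizes might be produced as follows. The slides do not say how texts were assigned to the sets, so the seeded random shuffle and the `split_corpus` helper are assumptions made only for this sketch.

```python
import random


def split_corpus(doc_ids, n_optimisation=202, n_test=179, seed=0):
    """Shuffle document ids and split them into optimisation, test and training sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    optimisation = ids[:n_optimisation]
    test = ids[n_optimisation:n_optimisation + n_test]
    training = ids[n_optimisation + n_test:]  # everything that remains
    return optimisation, test, training


# Integer ids stand in for the 7971 text files.
optimisation, test, training = split_corpus(range(7971))
print(len(optimisation), len(test), len(training))
```

With 7971 ids, the remainder left for training is 7590 texts, matching the figures on the slide.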


Precision/recall in terms of number of Eurovoc descriptors

Rank  Precision  Recall  Prec RT  Rec RT  F1-measure
1     80.000     16.286  82.857   17.238  27.063
2     67.143     25.143  77.143   28.571  36.586
3     63.810     32.857  75.714   39.238  43.378
4     59.048     38.286  70.476   43.714  46.453
5     57.762     44.095  70.048   50.190  50.012
6     55.571     47.524  68.333   53.143  51.233
7     52.170     48.476  65.408   54.095  50.255
8     49.976     49.905  62.857   55.524  49.940
9     48.587     51.810  62.143   57.905  50.147
10    46.619     52.286  60.143   58.381  49.290
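
The F1-measure column appears to be the harmonic mean of the Precision and Recall columns, which the sketch below reproduces for the rank 1 row. The helper names and the descriptor-set interface are assumptions for illustration; how the reported figures were averaged over documents is not stated on the slide.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(assigned, reference):
    """Precision, recall and F1 (as fractions) for one document's descriptor sets."""
    assigned, reference = set(assigned), set(reference)
    correct = len(assigned & reference)
    precision = correct / len(assigned) if assigned else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall, f1_score(precision, recall)


# Rank 1 row of the table, in percent: F1 follows from the first two columns.
print(round(f1_score(80.000, 16.286), 3))  # 27.063
```

The same function applied to the other rows gives the remaining F1 values, up to rounding of the printed precision and recall.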


Evaluation in terms of rank


Precision/Recall graph



Conclusions

• First run already yields results comparable to other languages

• scope for fine-tuning/filtering the process

• interesting to compare results gained from the two approaches
