Eurovoc does not yet exist for your language? The Hungarian experience.

Tamás Váradi

varadi@nytud.hu

JRC Ispra, 17/09/2004, EUROVOC Indexing Workshop

Overview of the project

• Objectives

• Partners

• Resources

• Methods

• Results

• Conclusions

Project objectives

• Hungarian EUROVOC version

– only a draft version planned at first

– an authoritative full-scale system

• Automatic indexing of documents

– using the technology developed at JRC

– prototype system for one domain

Partners

• Project consortium:

– HAS RIL (coordinator)

– MorphoLogic Kft. (partner)

• Collaborators:

– JRC, Ispra

– Hungarian Parliament

– Ministry of Justice

Resources

• NLP toolset (RIL)

• Digital dictionaries, software technology (MorphoLogic)

• Indexing technology (JRC Ispra)

• Terminology database, translation, supervision expertise (Justice Ministry)

• Coordination and funding of the Hungarian EUROVOC (Hungarian Parliament)

EUROVOC translation

• Done by the Translation Coordination Unit of the Ministry of Justice

• Team coordinating the massive effort of preparing the Hungarian translation of the Acquis Communautaire

• Maintaining an online Terminological Database

Terminological Database

Translation process

• English, French, German and Spanish EUROVOC versions in XML files

• Automatic lookup in the Terminological Database (approx. 20% coverage)

• Notepad2, an XML-aware editor, used

• Micro-thesauri translated first, corresponding descriptors second

• Pool of experts consulted when needed
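The automatic lookup step can be pictured with a minimal sketch. It assumes the Terminological Database is available as a plain English-to-Hungarian mapping; the function name `pretranslate` and the toy data are illustrative assumptions, not the project's actual tooling.

```python
def pretranslate(descriptors, term_db):
    """Look up each English descriptor in an English-to-Hungarian term table.

    Returns the translations found, the descriptors left for the human
    translators, and the share of descriptors covered automatically.
    """
    translated, missing = {}, []
    for term in descriptors:
        key = term.strip().lower()
        if key in term_db:
            translated[term] = term_db[key]
        else:
            missing.append(term)
    coverage = len(translated) / len(descriptors) if descriptors else 0.0
    return translated, missing, coverage


# Toy data only: three descriptors quoted on the slides, one of them in the table.
term_db = {"suspension of payments": "kifizetések felfüggesztése"}
descriptors = ["suspension of payments", "sales promotion", "toxic substance"]

translated, missing, coverage = pretranslate(descriptors, term_db)
print(f"covered automatically: {coverage:.0%}")
print("left for translators:", missing)
```

Descriptors returned in `missing` are the ones that would go to the translators and, where needed, the pool of experts.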

Indexing strategies

• Corpus: Hungarian translation of the Acquis Communautaire

• Two approaches

1. Translate the English associate terms (possible short-cut?)

2. Reconstruct the generation of associate terms by running the JRC technology on the Hungarian data

Translation of associate terms

• Hypothesis:

– relation between English associate term and EUROVOC descriptor is language independent

– hence Hungarian equivalent of English term will also serve as appropriate associate term in Hungarian texts

Online dictionary lookup

• MorphoLogic Online English-Hungarian dictionaries applied

• 24.7% direct match

<LIBELLE_EN>suspension of payments</LIBELLE_EN>
<LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE>
<LIBELLE_FR>cessation de paiement</LIBELLE_FR>
<LIBELLE_ES>suspensión de pagos</LIBELLE_ES>
<LIBELLE_HU>kifizetések felfüggesztése</LIBELLE_HU>
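A record like the one above can be read into a per-language table with the standard library. This is only a sketch: the slide shows the `LIBELLE_*` elements without an enclosing element, so the `<RECORD>` wrapper and the function name `labels_by_language` are assumptions added to make the fragment parseable.

```python
import xml.etree.ElementTree as ET

# The <RECORD> wrapper is hypothetical; only the LIBELLE_* elements come from the slide.
RECORD = """<RECORD>
<LIBELLE_EN>suspension of payments</LIBELLE_EN>
<LIBELLE_DE>Zahlungseinstellung</LIBELLE_DE>
<LIBELLE_FR>cessation de paiement</LIBELLE_FR>
<LIBELLE_ES>suspensión de pagos</LIBELLE_ES>
<LIBELLE_HU>kifizetések felfüggesztése</LIBELLE_HU>
</RECORD>"""


def labels_by_language(xml_text):
    """Map each language code to the list of labels given for it."""
    root = ET.fromstring(xml_text)
    labels = {}
    for element in root:
        if element.tag.startswith("LIBELLE_"):
            language = element.tag[len("LIBELLE_"):]
            # A list per language, since a record may carry several candidates.
            labels.setdefault(language, []).append((element.text or "").strip())
    return labels


labels = labels_by_language(RECORD)
print(labels["EN"], "->", labels["HU"])
```

Keeping a list per language lets a record hold more than one Hungarian candidate until the manual check settles on one.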

Manual check of automatic assignments

• Equivalence cannot be judged on its own merits: the Hungarian equivalent must be the one occurring in the texts

• Hence the Hungarian terms must be looked up in the translation corpus as well

• Hence a parallel corpus aligned at least at the document level must be compiled
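With document-level alignment, the check described above could look roughly like the following sketch. The function name `attested_equivalents`, the plain substring matching, and the placeholder texts are all assumptions; since Hungarian is heavily inflected, a real check would more plausibly run on lemmatised text.

```python
def attested_equivalents(english_term, candidates, aligned_docs):
    """Count aligned document pairs that support each Hungarian candidate.

    A pair supports a candidate when the English side contains the English
    term and the Hungarian side contains the candidate.
    """
    counts = {candidate: 0 for candidate in candidates}
    for en_text, hu_text in aligned_docs:
        if english_term.lower() not in en_text.lower():
            continue
        hu_lower = hu_text.lower()
        for candidate in candidates:
            if candidate.lower() in hu_lower:
                counts[candidate] += 1
    return counts


# Placeholder document pairs: only the terms themselves are real.
aligned_docs = [
    ("... sales promotion ...", "... reklám ..."),
    ("... sales promotion ...", "... promóció ..."),
    ("... toxic substance ...", "... mérgező anyagok ..."),
]
counts = attested_equivalents(
    "sales promotion", ["reklám", "promóció", "eladásösztönzés"], aligned_docs
)
print(counts)
```

A candidate whose count stays at zero is a dictionary equivalent that the translators of the corpus did not actually use.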

Manual check

<LIBELLE_EN>sales promotion</LIBELLE_EN>
<LIBELLE_DE>Absatzförderung</LIBELLE_DE>
<LIBELLE_FR>promotion commerciale</LIBELLE_FR>
<LIBELLE_ES>promoción comercial</LIBELLE_ES>
<LIBELLE_HU>eladásösztönzés</LIBELLE_HU>

• Even frequency lists are useful:

Reklám 149

Promóció 60

Eladásösztönzés 1
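A frequency list of this kind is enough to rank competing equivalents. The sketch below rebuilds the three counts from the slide as toy data; the function name `rank_candidates` and the idea of counting over a list of lemmas are assumptions.

```python
from collections import Counter


def rank_candidates(candidates, lemma_counts):
    """Order candidate equivalents by how often they occur in the corpus."""
    ranked = [(candidate, lemma_counts[candidate.lower()]) for candidate in candidates]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Toy corpus reproducing the counts shown on the slide.
lemmas = ["reklám"] * 149 + ["promóció"] * 60 + ["eladásösztönzés"]
lemma_counts = Counter(lemmas)

ranking = rank_candidates(["eladásösztönzés", "promóció", "reklám"], lemma_counts)
for term, count in ranking:
    print(term, count)
```

Here the dictionary equivalent comes last, which is the situation the manual check is meant to catch.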

Manual check

<LIBELLE_EN>toxic substance</LIBELLE_EN>
<LIBELLE_DE>Giftstoff</LIBELLE_DE>
<LIBELLE_FR>substance toxique</LIBELLE_FR>
<LIBELLE_ES>sustancia tóxica</LIBELLE_ES>
<LIBELLE_HU>toxikus anyagok</LIBELLE_HU>
<LIBELLE_HU>mérgező anyagok</LIBELLE_HU>

• Even frequency lists are useful:

Equally frequent

Generation of Hungarian associate-lists

• Tasks

1. Compile corpus of the Hungarian translation of the Acquis Communautaire

2. Tag and lemmatize words

3. Compile list of stop words

4. Run automatic indexing tools (JRC)
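To make the idea of an associate list concrete, here is a deliberately simplified stand-in for step 4. It is not the JRC technology: the scoring (relative frequency of a lemma in documents carrying a descriptor, divided by its relative frequency in the whole corpus), the function name `associate_lists`, the descriptor identifiers `D1` and `D2`, and the toy lemmas are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def associate_lists(documents, stop_words, top_n=3):
    """Rank lemmas per descriptor by how over-represented they are.

    documents: list of (lemmas, descriptors) pairs from manually indexed texts.
    """
    corpus_counts = Counter()
    per_descriptor = defaultdict(Counter)
    for lemmas, descriptors in documents:
        content = [lemma for lemma in lemmas if lemma not in stop_words]
        corpus_counts.update(content)
        for descriptor in descriptors:
            per_descriptor[descriptor].update(content)

    total = sum(corpus_counts.values())
    result = {}
    for descriptor, counts in per_descriptor.items():
        subtotal = sum(counts.values())
        scores = {
            lemma: (count / subtotal) / (corpus_counts[lemma] / total)
            for lemma, count in counts.items()
        }
        result[descriptor] = sorted(scores, key=lambda l: (-scores[l], l))[:top_n]
    return result


# Toy input: lemmatised documents with hypothetical descriptor identifiers.
documents = [
    (["halászat", "hal", "rendelet", "bizottság"], ["D1"]),
    (["halászat", "kvóta", "rendelet", "bizottság"], ["D1"]),
    (["vám", "tarifa", "rendelet", "bizottság"], ["D2"]),
]
associates = associate_lists(documents, stop_words={"bizottság"})
print(associates["D1"])
```

In the toy data the stop word is dropped before counting, and the lemma that appears under every descriptor scores lowest, so it falls outside the top three for `D1`.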

Hungarian Acquis Communautaire corpus

• 8308 files

<!ELEMENT document (title+,text,lemmatised, descriptors,description) >

HUN tokens 21,899,924

EN tokens 20,394,088
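The content model in the declaration above (one or more `title` elements, then `text`, `lemmatised`, `descriptors` and `description`, in that order) can be illustrated with a short sketch. The helper names and the placeholder values are assumptions, and so is the choice to store plain text in every child element, since the slide does not show the child declarations.

```python
import re
import xml.etree.ElementTree as ET

# Child order required by the declaration: title+, text, lemmatised, descriptors, description.
CHILD_ORDER = re.compile(r"(title )+text lemmatised descriptors description")


def build_document(titles, text, lemmatised, descriptors, description):
    """Build a <document> element with children in the declared order."""
    doc = ET.Element("document")
    for title in titles:
        ET.SubElement(doc, "title").text = title
    ET.SubElement(doc, "text").text = text
    ET.SubElement(doc, "lemmatised").text = lemmatised
    ET.SubElement(doc, "descriptors").text = descriptors
    ET.SubElement(doc, "description").text = description
    return doc


def follows_content_model(doc):
    """Check the sequence of child tags against the declared content model."""
    tags = " ".join(child.tag for child in doc)
    return doc.tag == "document" and CHILD_ORDER.fullmatch(tags) is not None


# Placeholder values only.
doc = build_document(["(title)"], "(text)", "(lemmas)", "(descriptors)", "(description)")
print(follows_content_model(doc))
print(ET.tostring(doc, encoding="unicode"))
```

A full validator would use the complete DTD; the regular expression only mirrors the single declaration quoted on the slide.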

English stop-word list

• English stop-word list: 1720 items

– function words

– "EUspeak"

• objective, arrangements, committee

– some strange multiword strings

• necessary_to_comply_with_this_directive

• forward_this_resolution_to_the_commission
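A stop-word list that mixes single words with underscore-joined multiword strings can be applied with a longest-match scan. The sketch below is one way to do it under stated assumptions: lower-cased, whitespace-tokenised input, a hypothetical function name `remove_stop_words`, and a toy list in which "the" stands in for the function words.

```python
def remove_stop_words(tokens, stop_words, max_len=8):
    """Drop stop words, trying the longest underscore-joined match first."""
    kept, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if "_".join(tokens[i:i + n]) in stop_words:
                i += n  # skip the whole matched stop entry
                break
        else:
            kept.append(tokens[i])
            i += 1
    return kept


# Toy list: "the" is an assumed function word, the rest are quoted on the slide.
stop_words = {
    "the",
    "objective",
    "arrangements",
    "committee",
    "necessary_to_comply_with_this_directive",
}
tokens = "the measures necessary to comply with this directive".split()
kept = remove_stop_words(tokens, stop_words)
print(kept)
```

Trying the longest match first ensures that a multiword entry is removed as a unit instead of leaving some of its words behind.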

Hungarian stop-word list

1. Translated the English items

2. Checked their occurrence in HU CELEX

3. Generated unigram, bigram and trigram frequency lists from the HU CELEX corpus

4. Checked the first 3000 items on each list and added them to the stop-word list if needed

5. Double-checked infrequent items on the translated English list and replaced the translation with synonyms
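Steps 3 and 4 can be sketched as follows. The function names, the underscore-joined n-gram format and the toy tokens are assumptions; the manual inspection of the top 3000 items is of course not automated here, the code only produces the lists to inspect.

```python
from collections import Counter


def ngram_frequencies(sentences, n):
    """Count underscore-joined n-grams over tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1
    return counts


def stop_word_candidates(sentences, top_k=3000):
    """Return the top_k unigrams, bigrams and trigrams for manual inspection."""
    return {
        n: [gram for gram, _ in ngram_frequencies(sentences, n).most_common(top_k)]
        for n in (1, 2, 3)
    }


# Placeholder tokens standing in for lemmatised Hungarian text.
sentences = [["a", "b", "c"], ["a", "b", "d"]]
bigrams = ngram_frequencies(sentences, 2)
candidates = stop_word_candidates(sentences, top_k=2)
print(bigrams.most_common(1))
print(candidates)
```

Joining n-grams with underscores matches the format of the multiword entries in the English stop-word list, so accepted candidates can be added to the list unchanged.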

Hungarian stop-word list

single word entries 1265

multi-word entries 752

Total 2017

Automatic indexing run 1

7971 texts (total length of 65,702,474 characters) divided into 3 sets:

1. 202 texts: optimisation (evaluation set)

2. 179 texts: final evaluation (test set)

3. 7590 texts: the training set
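A three-way split with these sizes can be sketched as below. The slides do not say how the texts were assigned to the sets, so the seeded random shuffle and the function name `split_corpus` are assumptions.

```python
import random


def split_corpus(doc_ids, n_optimisation=202, n_test=179, seed=0):
    """Shuffle document ids and cut them into optimisation, test and training sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    optimisation = ids[:n_optimisation]
    test = ids[n_optimisation:n_optimisation + n_test]
    training = ids[n_optimisation + n_test:]
    return optimisation, test, training


optimisation, test, training = split_corpus(range(7971))
print(len(optimisation), len(test), len(training))
```

With 7971 input texts the remainder left for training is 7590, which matches the figures on the slide.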

Precision/recall in terms of number of Eurovoc descriptors

Rank  Precision  Recall  Prec RT  Rec RT  F1-measure
1     80.000     16.286  82.857   17.238  27.063
2     67.143     25.143  77.143   28.571  36.586
3     63.810     32.857  75.714   39.238  43.378
4     59.048     38.286  70.476   43.714  46.453
5     57.762     44.095  70.048   50.190  50.012
6     55.571     47.524  68.333   53.143  51.233
7     52.170     48.476  65.408   54.095  50.255
8     49.976     49.905  62.857   55.524  49.940
9     48.587     51.810  62.143   57.905  50.147
10    46.619     52.286  60.143   58.381  49.290
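The F1 column is consistent with the harmonic mean of the Precision and Recall columns, which the sketch below checks for the first row. The per-document function `precision_recall_at_rank`, the descriptor identifiers, and the way per-document scores would be averaged are assumptions; the slides do not describe the evaluation script.

```python
def precision_recall_at_rank(ranked, gold, k):
    """Precision and recall of the top-k assigned descriptors for one document."""
    top = ranked[:k]
    hits = sum(1 for descriptor in top if descriptor in gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical descriptor ids for a single document.
p, r = precision_recall_at_rank(["d1", "d2", "d3"], {"d1", "d3", "d9", "d10"}, k=2)
print(p, r)

# Row for rank 1 of the table: precision 80.000, recall 16.286.
row1_f1 = f1(80.000, 16.286)
print(round(row1_f1, 3))
```

As the rank cut-off grows, precision falls while recall rises, which is why F1 peaks in the middle of the table, at rank 6.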

Evaluation in terms of rank

Precision/Recall graph

Conclusions

• First run already yields results comparable to those for other languages

• Scope for a fine-tuning/filtering process

• Interesting to compare the results gained from the two approaches