Automatic Indexing with the EuroVoc Thesaurus

Enabling Cross-lingual Search

Marie Francine Moens Katholieke Universiteit Leuven, Belgium

Frane Šarić University of Zagreb, FER, Croatia

18-19 November 2010, Luxembourg – Kirchberg

CADIAL project

Computer Aided Document Indexing for Accessing Legislation

A joint Flemish-Croatian project

Partners:

Katholieke Universiteit Leuven (prof. Marie-Francine Moens)

University of Zagreb & Hidra (prof. Bojana Dalbelo Bašić, prof. Marko Tadić)

Goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

CADIAL project (cont.)

1. Manually index 10,000 documents

eCADIS – semi-automatic document indexing

2. Use that data to train automatic indexers

Trained automatic classifiers for every EuroVoc descriptor

3. Provide indexed data to custom search engine

CADIAL search engine

eCADIS

Computer Aided Document Indexing System

Provides useful information that helps indexers index documents more quickly

Counts n-grams

Includes word normalization

Extracts collocations

Suggests appropriate descriptors

Uses automatically trained classifiers
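The n-gram counting step listed above can be illustrated with a minimal sketch. This is not the eCADIS implementation: the function name `count_ngrams`, the tokenization rule, and the sample text are assumptions made for illustration, and no morphological normalization is applied here.

```python
import re
from collections import Counter


def count_ngrams(text, n):
    """Count word n-grams in lower-cased text (hypothetical helper, no normalization)."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Illustrative sample text; n-grams are counted across sentence boundaries.
text = "Zakon o radu. Izmjene zakona o radu."
bigrams = count_ngrams(text, 2)
for ngram, count in bigrams.most_common(3):
    print(" ".join(ngram), count)
```

Note that "zakon" and "zakona" are counted as different tokens here, which is why the slides pair n-gram counting with word normalization.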

eCADIS (cont.)

Morphological normalization

Croatian = morphologically complex language

Inflectional variation

Derivational variation

Morphological normalization

Lexicon-based normalization [Snajder et al. 2009]:

Inflectional and derivational rules

String transformation functions

Higher-order functional representation of Croatian inflectional morphology:

Inflectional rules

Transformations: higher-order functions
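The idea of expressing inflectional rules as higher-order string transformation functions can be sketched as follows. This is only an illustration under stated assumptions, not the representation of [Snajder et al. 2009]: the names `suffix_rule`, `PARADIGM`, `build_lexicon`, and `normalize` are hypothetical, and the paradigm is a heavily simplified set of endings for regular masculine nouns.

```python
def suffix_rule(old, new):
    """Higher-order function: build a transformation that replaces suffix `old` with `new`."""
    def transform(word):
        if not word.endswith(old):
            return None  # the rule does not apply to this word
        return word[:len(word) - len(old)] + new
    return transform


# Hypothetical, simplified paradigm: a few case endings of regular masculine nouns.
PARADIGM = [suffix_rule("", ending) for ending in ("", "a", "u", "om", "e", "i", "ima")]


def build_lexicon(lemmas):
    """Generate inflected forms from lemmas and map each form back to its lemma."""
    lexicon = {}
    for lemma in lemmas:
        for rule in PARADIGM:
            form = rule(lemma)
            if form is not None:
                lexicon[form] = lemma
    return lexicon


lexicon = build_lexicon(["zakon", "ugovor"])


def normalize(word):
    """Return the lemma for a known word form, else the lower-cased word itself."""
    return lexicon.get(word.lower(), word.lower())


print(normalize("zakonima"), normalize("Ugovorom"))
```

Building the form-to-lemma lexicon ahead of time keeps normalization at indexing time to a single dictionary lookup.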

Named entity recognition

Named entity recognition = semantic classification of an entity name (usually a proper name) [Bekavac & Tadic 2009]:

Person, location, organization, date, ...

Use of lists of names and use of finite-state automata
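A combination of name lists and finite-state patterns can be sketched roughly as below. This is not the system of [Bekavac & Tadic 2009]: the gazetteer entries, the English date pattern, and the function `find_entities` are assumptions for illustration, and a regular expression stands in for a hand-built finite-state automaton.

```python
import re

# Hypothetical name lists (gazetteer), chosen only for illustration.
GAZETTEER = {
    "organization": ["University of Zagreb", "Katholieke Universiteit Leuven"],
    "location": ["Zagreb", "Luxembourg", "Croatia"],
}

# A regular expression is a finite-state pattern; this one matches dates such as "18 November 2010".
DATE_PATTERN = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def find_entities(text):
    """Return (surface form, label) pairs, preferring the longest match at each position."""
    spans = [(m.start(), m.end(), "date") for m in DATE_PATTERN.finditer(text)]
    for label, names in GAZETTEER.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                spans.append((m.start(), m.end(), label))
    # Sort by start offset, longest span first, then drop overlapping spans.
    spans.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    entities, last_end = [], 0
    for start, end, label in spans:
        if start >= last_end:
            entities.append((text[start:end], label))
            last_end = end
    return entities


sentence = "The University of Zagreb presented CADIAL in Luxembourg on 18 November 2010."
print(find_entities(sentence))
```

The longest-match rule keeps "University of Zagreb" as one organization instead of also tagging the embedded "Zagreb" as a location.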

Lexical association metrics

Collocation: the meaning of a compound term cannot be inferred from the meanings of its individual terms

Collocations are valuable index terms

Several methods were developed:

Extraction of linked terms from Wikipedia, filtered by acceptable part-of-speech patterns [Bekavac & Tadic 2009]

Terme-X: use of lexical association measures (e.g., chi-square, log-likelihood ratio for a binomial distribution, pointwise mutual information) to build a dictionary of collocations, filtered by acceptable part-of-speech patterns [Delac et al. 2009]

Using a genetic programming algorithm for learning a language-adapted lexical association measure [Snajder et al. 2009]
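One of the association measures named above, pointwise mutual information, can be sketched for adjacent word pairs as follows. This is a minimal illustration, not the Terme-X implementation: the function `pmi_scores`, the `min_count` parameter, and the sample tokens are assumptions, and no part-of-speech filtering is applied.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Pointwise mutual information log2(P(x,y) / (P(x) P(y))) for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # PMI overrates rare pairs, so a frequency cut-off is common
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_unigrams
        p_y = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Illustrative tokens only; a real dictionary of collocations needs a large corpus.
tokens = "europska unija i europska unija te nova unija".split()
for pair, score in sorted(pmi_scores(tokens, min_count=2).items()):
    print(pair, round(score, 3))
```

Candidates that pass the threshold would then be filtered by part-of-speech pattern, as the slide describes.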

Text categorization

= Assignment of terms of the EUROVOC thesaurus

Currently done at the statute level

Problem: large number of features (terms) and often few training examples => feature selection: chi-square, frequent item sets, linear classifier weights, ... [Boiy & Moens 2009]

Use of common classification algorithms: support vector machines, logistic regression, ... [Saric et al. in preparation]

EUROVOC = multilingual => terms can be used in cross-lingual retrieval
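Chi-square feature selection for a single descriptor can be sketched as below. This is an illustration under stated assumptions, not the method of [Boiy & Moens 2009]: documents are represented as sets of terms, and the function names and toy data are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/descriptor contingency table.

    n11: documents with the term and the descriptor, n10: term only,
    n01: descriptor only, n00: neither.
    """
    n = n11 + n10 + n01 + n00
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denominator == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denominator


def select_features(docs, labels, k):
    """Return the k terms most strongly associated (positively or negatively) with the descriptor."""
    vocabulary = set().union(*docs)
    scores = {}
    for term in vocabulary:
        n11 = sum(1 for d, y in zip(docs, labels) if term in d and y)
        n10 = sum(1 for d, y in zip(docs, labels) if term in d and not y)
        n01 = sum(1 for d, y in zip(docs, labels) if term not in d and y)
        n00 = sum(1 for d, y in zip(docs, labels) if term not in d and not y)
        scores[term] = chi_square(n11, n10, n01, n00)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy data: each document is a set of terms; labels say whether the descriptor was assigned.
docs = [{"riba", "ribolov"}, {"ribolov", "more"}, {"porez", "zakon"}, {"porez", "more"}]
labels = [True, True, False, False]
print(select_features(docs, labels, 2))
```

The selected terms would then be the input features of a linear classifier trained per descriptor.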

Text categorization

Core of the CADIAL project

System suggests index terms to the human indexers

High performance of the categorization: e.g., F1 measure in the 80% range

As the number of categorized documents grows, we hope to learn better classification models

Possibility to exploit the hierarchical organization of the thesaurus terms to improve the accuracy of the categorization
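The F1 measure quoted above is the harmonic mean of precision and recall. A minimal sketch for comparing suggested descriptors with those assigned by human indexers follows; the slides do not say whether the reported figure is micro- or macro-averaged, so micro-averaging, the function name `micro_f1`, and the sample descriptor names are assumptions.

```python
def micro_f1(suggested, assigned):
    """Micro-averaged F1 over documents; both arguments are lists of descriptor sets."""
    true_positives = sum(len(s & a) for s, a in zip(suggested, assigned))
    n_suggested = sum(len(s) for s in suggested)
    n_assigned = sum(len(a) for a in assigned)
    precision = true_positives / n_suggested if n_suggested else 0.0
    recall = true_positives / n_assigned if n_assigned else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical descriptors for two documents.
suggested = [{"fishing", "tax"}, {"labour law"}]
assigned = [{"fishing"}, {"labour law", "employment"}]
print(round(micro_f1(suggested, assigned), 3))
```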

[Bennett & Nguyen 2009]

TMT: Object-oriented text classification library

Comparing document classification schemes

Problem: discrepancy between the classification scheme (e.g., EUROVOC thesaurus) and the natural clusters formed by the documents

How to find this discrepancy so that the classification scheme can be adapted? [Silic et al. 2009]

Finding an optimal clustering and comparing it with the clusters formed by the ground-truth categories of the documents

Dimensionality reduction with principal component analysis (PCA): visualization of the clusters
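Comparing a clustering with the grouping induced by ground-truth categories needs some agreement measure. The slides do not name the measure used in [Silic et al. 2009]; the sketch below swaps in the Rand index purely as an illustration, and the function name and sample labels are assumptions.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two partitions agree (same group vs. different group)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        1
        for i, j in pairs
        if (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
    )
    return agreements / len(pairs)


# Hypothetical data: cluster ids from a clustering algorithm vs. category ids from the scheme.
clusters = [0, 0, 1, 1]
categories = [1, 1, 0, 0]
print(rand_index(clusters, categories))
```

A low value would point to a mismatch between the classification scheme and the natural clusters, which is the discrepancy the slide is concerned with.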

CADIAL Search Engine

http://cadial.hidra.hr

Full text search over a collection of 20,000 legal documents

Documents are automatically indexed using EuroVoc descriptors

Hidra ensures that additional metadata is correct:

Regulation status (valid / invalid)

Area of activity

EU accession chapter

The CADIAL search engine

Possibility to search:

Full text

Titles

EUROVOC thesaurus terms

Historical versions

...

Legislation consists of semi-structured documents: the structure can be taken into account when computing the relevance of an article, section, etc.

Successful participation at the INEX competition 2008 [Mijic et al. 2009]
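Taking document structure into account when computing relevance can be sketched as a weighted sum over structural fields. This is not the CADIAL ranking function, which the slides do not describe: the field names, the weights in `FIELD_WEIGHTS`, and the raw term-count scoring are all assumptions for illustration.

```python
# Hypothetical weights: a match in the title counts more than a match in an article body.
FIELD_WEIGHTS = {"title": 3.0, "article": 1.0}


def score(document, query_terms):
    """Sum query-term occurrences per field, weighted by the field's structural role."""
    total = 0.0
    for field, text in document.items():
        tokens = text.lower().split()
        hits = sum(tokens.count(term) for term in query_terms)
        total += FIELD_WEIGHTS.get(field, 1.0) * hits
    return total


document = {"title": "zakon o radu", "article": "ovaj zakon uređuje radne odnose"}
print(score(document, ["zakon"]))
```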

CADIAL Search – live demo

CADIAL Search – document metadata

Towards cross-lingual search
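Because EuroVoc descriptors carry labels in several languages, documents indexed with descriptors can in principle be retrieved with a query in any of those languages. The sketch below illustrates that idea only: the descriptor identifiers `D1` and `D2`, the document identifiers, and the function `search` are hypothetical and are not taken from EuroVoc or from the CADIAL search engine.

```python
# Hypothetical descriptor ids with labels in three languages.
LABELS = {
    "D1": {"en": "fishing", "hr": "ribolov", "fr": "pêche"},
    "D2": {"en": "tax", "hr": "porez", "fr": "impôt"},
}

# Hypothetical index: document id -> descriptors assigned by the indexer.
INDEX = {
    "doc_hr_1": {"D1"},
    "doc_hr_2": {"D2"},
    "doc_fr_1": {"D1", "D2"},
}


def search(query, lang):
    """Map a query label in `lang` to descriptor ids, then return matching documents."""
    wanted = {
        descriptor
        for descriptor, labels in LABELS.items()
        if labels.get(lang, "").lower() == query.lower()
    }
    return sorted(doc for doc, descriptors in INDEX.items() if descriptors & wanted)


# An English query retrieves Croatian and French documents through the shared descriptor.
print(search("fishing", "en"))
```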

Cross-lingual indexing

Classification/indexing of documents = supervised machine learning of the classification patterns based on annotated training examples

When multilingual documents are not linked:

Demands manual annotation in different languages

Can be a considerable manual effort:

Changing collections, taxonomies

Many official languages in the EU

Transfer learning can be a solution

Cross-lingual indexing

Potential of transfer learning techniques [Pan & Yang IEEE TKDE 2010]

Co-training and co-regularization techniques for learning classification patterns from documents in multiple languages [Amini et al. SIGIR 2010]

Conclusions

CADIAL = valuable example of automatic indexing enabling cross-lingual search

EUROVOC thesaurus is a valuable resource

Many different future tracks of research aiming at more flexible and accurate indexing

http://www.cadial.org/

The CADIAL project has received the 2009 Prime Minister Award for special achievements in the field of e-Government in Croatia and the 2009 "Golden Tesla's Egg" Award of the VIDI publishing house for the best innovative solution in ICT for the category Academic Institutions. The project was invited to participate at CeBIT 2010, the world's foremost tradeshow for the digital industry.