
1

Automatic Indexing with the EuroVoc Thesaurus

Enabling Cross-lingual Search

Marie Francine Moens Katholieke Universiteit Leuven, Belgium

Frane Šarić University of Zagreb, FER, Croatia

18-19 November 2010, Luxembourg – Kirchberg

2

CADIAL project

Computer Aided Document Indexing for Accessing Legislation

A joint Flemish-Croatian project

Partners:

Katholieke Universiteit Leuven (prof. Marie-Francine Moens)

University of Zagreb & Hidra (prof. Bojana Dalbelo Bašić, prof. Marko Tadić)

Goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

3

CADIAL project (cont.)

1. Manually index 10,000 documents

eCADIS – semi-automatic document indexing

2. Use that data to train automatic indexers

Trained automatic classifiers for every EuroVoc descriptor

3. Provide indexed data to custom search engine

CADIAL search engine

4

eCADIS

Computer Aided Document Indexing System

Provides useful information that helps indexers index documents more quickly

Counts n-grams

Includes word normalization

Extracts collocations

Suggests appropriate descriptors

Uses automatically trained classifiers
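The n-gram counting listed above can be illustrated with a minimal sketch. The tokenizer and function names are assumptions made for illustration, not the eCADIS implementation, and plain lowercasing stands in for real word normalization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def count_ngrams(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy input, assumed for illustration only.
tokens = tokenize("Zakon o radu. Izmjene zakona o radu.")
bigrams = count_ngrams(tokens, 2)
```

Frequent n-grams such as `("o", "radu")` would then be shown to the indexer as candidate index terms.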

5

eCADIS (cont.)

6

Morphological normalization

Croatian = morphologically complex language

Inflectional variation

Derivational variation

7

Morphological normalization

Lexicon-based normalization [Snajder et al. 2009]:

Inflectional and derivational rules

String transformation functions

Higher-order functional representation of Croatian inflectional morphology:

Inflectional rules

Transformations: higher-order functions
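The idea of expressing inflectional rules as higher-order string transformations can be sketched as follows. This is a toy illustration under stated assumptions, not the formalism of the cited work: the two paradigms, the two-word lexicon, and all function names are hypothetical.

```python
def replace_suffix(old, new):
    """Build a transformation that swaps a trailing `old` for `new`.

    The returned function yields None when the word does not end in `old`.
    """
    def transform(word):
        if not word.endswith(old):
            return None
        return word[:len(word) - len(old)] + new
    return transform


def paradigm(old, endings):
    """Build one inflectional rule set: a list of suffix transformations."""
    return [replace_suffix(old, ending) for ending in endings]


# Toy paradigms and lexicon, assumed for illustration only.
PARADIGMS = {
    "noun_masc": paradigm("", ["", "a", "u", "om", "i", "e", "ima"]),
    "noun_fem_a": paradigm("a", ["a", "e", "i", "u", "om", "ama"]),
}
LEXICON = {"zakon": "noun_masc", "uredba": "noun_fem_a"}


def build_form_index(lexicon, paradigms):
    """Generate every word form in the lexicon and map it back to its lemma."""
    index = {}
    for lemma, name in lexicon.items():
        for rule in paradigms[name]:
            form = rule(lemma)
            if form is not None:
                index.setdefault(form, lemma)
    return index


FORM_INDEX = build_form_index(LEXICON, PARADIGMS)


def normalize(word):
    """Return the lemma of a known word form, or the lowercased word itself."""
    word = word.lower()
    return FORM_INDEX.get(word, word)
```

With this sketch, `normalize("zakonima")` maps the inflected form back to the lemma `zakon`, so all inflected variants of a term are counted together.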

8

Named entity recognition

Named entity recognition = semantic classification of entity names (usually proper names) [Bekavac & Tadic 2009]:

Person, location, organization, date, ...

Use of lists of names and of finite-state automata
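A minimal sketch of combining name lists with a finite-state pattern is given below. The gazetteer entries, the date format, and the function name are assumptions for illustration; the cited system is not reproduced here.

```python
import re

# Toy name lists, assumed for illustration only.
GAZETTEER = {
    ("ivan", "horvat"): "PERSON",
    ("zagreb",): "LOCATION",
    ("university", "of", "zagreb"): "ORGANIZATION",
}
# A regular expression is a compact way to write a finite-state pattern.
DATE_PATTERN = re.compile(r"\d{1,2}\.\d{1,2}\.\d{4}\.?")
PUNCTUATION = ".,;:"


def tag_entities(text):
    """Return (surface form, label) pairs found by list lookup or date pattern."""
    tokens = text.split()
    keys = [token.strip(PUNCTUATION).lower() for token in tokens]
    longest = max(len(name) for name in GAZETTEER)
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so a full organization name
        # wins over a location name contained in it.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(keys[i:i + n]))
            if label is not None:
                surface = " ".join(tokens[i:i + n]).strip(PUNCTUATION)
                entities.append((surface, label))
                i += n
                break
        else:
            # No list entry starts here; test the token against the date pattern.
            candidate = tokens[i].rstrip(",;:")
            if DATE_PATTERN.fullmatch(candidate):
                entities.append((candidate, "DATE"))
            i += 1
    return entities
```

Longest-match lookup is the design choice that matters here: without it, "University of Zagreb" would be tagged only as the location "Zagreb".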

9

Lexical association metrics

Collocation: the meaning of a compound term cannot be inferred from the meaning of its individual terms

Collocations are valuable index terms

Several methods were developed:

Extraction of linked terms in Wikipedia, filtered by acceptable part-of-speech patterns [Bekavac & Tadic 2009]

Terme-X: use of lexical association measures (e.g., chi-square, log-likelihood ratio for a binomial distribution, pointwise mutual information) to build a dictionary of collocations, filtered by acceptable part-of-speech patterns [Delac et al. 2009]

Using a genetic programming algorithm for learning a language-adapted lexical association measure [Snajder et al. 2009]
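One of the association measures named above, pointwise mutual information, can be sketched for bigrams as follows. The function name, the `min_count` threshold, and the toy token sequence are assumptions; part-of-speech filtering is omitted.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bigrams
        p_first = unigrams[first] / n_unigrams
        p_second = unigrams[second] / n_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores


# Toy token sequence, assumed for illustration only.
tokens = "narodne novine objavile su zakon a narodne novine su sluzbeni list".split()
scores = pmi_scores(tokens)
```

A positive score means the two words co-occur more often than independence would predict, which makes the pair a collocation candidate.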

10

Text categorization

= Assignment of terms of the EUROVOC thesaurus

Currently done at the statute level

Problem: large number of features (terms) and often few training examples => feature selection: chi-square, frequent item sets, linear classifier weights, ... [Boiy & Moens 2009]

Use of common classification algorithms: support vector machines, logistic regression, ... [Saric et al. in preparation]

EUROVOC = multilingual => terms can be used in cross-lingual retrieval
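Chi-square feature selection for a single descriptor can be sketched as below. The documents, the descriptor labels, and the function names are hypothetical; the sketch scores each term against one descriptor, matching the one-classifier-per-descriptor setup, and leaves the classifier itself out.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/descriptor contingency table.

    n11: documents with term and descriptor, n10: term only,
    n01: descriptor only, n00: neither.
    """
    total = n11 + n10 + n01 + n00
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denominator == 0:
        return 0.0
    return total * (n11 * n00 - n10 * n01) ** 2 / denominator


def select_features(docs, labels, descriptor, k):
    """Return the k terms most associated with one descriptor.

    docs is a list of term sets, labels a parallel list of descriptor sets.
    """
    vocabulary = set().union(*docs)
    scores = {}
    for term in vocabulary:
        counts = {(True, True): 0, (True, False): 0,
                  (False, True): 0, (False, False): 0}
        for terms, assigned in zip(docs, labels):
            counts[(term in terms, descriptor in assigned)] += 1
        scores[term] = chi_square(counts[(True, True)], counts[(True, False)],
                                  counts[(False, True)], counts[(False, False)])
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(vocabulary, key=lambda t: (-scores[t], t))[:k]


# Toy training data, assumed for illustration only.
docs = [{"ribolov", "more"}, {"ribolov", "kvota"},
        {"porez", "dohodak"}, {"porez", "more"}]
labels = [{"fisheries"}, {"fisheries"}, {"taxation"}, {"taxation"}]
top_terms = select_features(docs, labels, "fisheries", 2)
```

Note that chi-square rewards negative as well as positive association, so a term that never occurs with the descriptor can rank as highly as one that always does.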

12

Text categorization

Core of the CADIAL project

System suggests index terms to the human indexers

High performance of the categorization: e.g., F1 measure in the 80% range

As the number of categorized documents grows, we hope to learn better classification models

Possibility to exploit the hierarchical organization of the thesaurus terms to improve accuracy of the categorization
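For reference, the F1 measure for documents that carry several descriptors each can be computed as in the sketch below. Micro-averaging is an assumption: the slides do not state which averaging was used, and the descriptor names are placeholders.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of descriptor sets."""
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    false_pos = sum(len(p - g) for g, p in zip(gold, predicted))
    false_neg = sum(len(g - p) for g, p in zip(gold, predicted))
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Placeholder descriptors, assumed for illustration only.
gold = [{"a", "b"}, {"c"}]
predicted = [{"a"}, {"c", "d"}]
score = micro_f1(gold, predicted)
```

Here two of three predicted descriptors are correct and two of three gold descriptors are found, so precision, recall, and F1 all equal 2/3.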

14

[Bennett & Nguyen 2009]

15

TMT: Object-oriented text classification library

16

Comparing document classification schemes

Problem: discrepancy between the classification scheme (e.g., EUROVOC thesaurus) and the natural clusters formed by the documents

How to find this discrepancy so that the classification scheme can be adapted? [Silic et al. 2009]

Finding an optimal clustering and comparing it with the clusters formed by the ground truth categories of the documents

Dimensionality reduction with principal component analysis (PCA): visualization of the clusters
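One simple way to quantify how far a clustering departs from the ground truth categories is the Rand index, sketched below. This measure is an assumption chosen for illustration; the slides do not say which comparison measure the cited work uses.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two partitions agree.

    Two partitions agree on a pair when both put the pair in the same
    group or both put it in different groups.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two equally long labelings of at least 2 documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy labelings, assumed for illustration only.
clusters = [0, 0, 1, 1]
categories = ["x", "x", "x", "y"]
agreement = rand_index(clusters, categories)
```

A value of 1.0 means the clustering reproduces the classification scheme exactly; lower values point to categories that the documents themselves do not separate.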

18

CADIAL Search Engine

http://cadial.hidra.hr

Full text search over a collection of 20,000 legal documents

Documents are automatically indexed using EuroVoc descriptors

Hidra ensures that additional metadata is correct:

Regulation status (valid / invalid)

Area of activity

EU accession chapter

19

The CADIAL search engine

Possibility to search:

Full text

Titles

EUROVOC thesaurus terms

Historical versions

...

Legislation = semi-structured documents: possibility to take the structure into account when computing the relevance of an article, section, etc.

Successful participation at the INEX competition 2008 [Mijic et al. 2009]
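Taking document structure into account when computing relevance can be sketched as a weighted sum over structural fields. The field names, the weights, and the term-count scoring are all assumptions for illustration; the ranking function of the CADIAL search engine is not described in these slides.

```python
# Hypothetical weights: a hit in a title counts more than a hit in the body.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def structural_score(element, query_terms, weights=FIELD_WEIGHTS):
    """Sum query-term occurrences per field, scaled by the field weight."""
    total = 0.0
    for field, text in element.items():
        tokens = text.lower().split()
        hits = sum(tokens.count(term) for term in query_terms)
        total += weights.get(field, 1.0) * hits
    return total


# Toy document element, assumed for illustration only.
article = {"title": "Zakon o radu", "body": "Ovaj zakon uređuje radne odnose"}
score = structural_score(article, ["zakon"])
```

The same function can be applied to whole documents, sections, or single articles, which is what makes element-level retrieval possible.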

20

CADIAL Search – live demo

21

CADIAL Search – document metadata

22

Towards cross-lingual search
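Because every thesaurus descriptor has a label in each language, a query term in one language can retrieve documents indexed in another. The sketch below illustrates this with hypothetical descriptor identifiers, labels, and document names; none of them are real EuroVoc identifiers.

```python
# Hypothetical descriptor IDs with labels per language.
LABELS = {
    "D1": {"en": "fisheries", "hr": "ribarstvo"},
    "D2": {"en": "tax", "hr": "porez"},
}
# Hypothetical index: Croatian documents and their assigned descriptors.
DOC_INDEX = {
    "doc-hr-1": {"D1"},
    "doc-hr-2": {"D2"},
    "doc-hr-3": {"D1", "D2"},
}


def descriptors_for(query, lang):
    """Map a query term in the given language to matching descriptor IDs."""
    query = query.lower()
    return {d for d, names in LABELS.items() if names.get(lang) == query}


def search(query, lang):
    """Return documents indexed with a descriptor whose label matches the query."""
    wanted = descriptors_for(query, lang)
    return sorted(doc for doc, assigned in DOC_INDEX.items() if assigned & wanted)
```

An English query such as `search("fisheries", "en")` then returns Croatian documents without any translation of the document text.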

23

Cross-lingual indexing

Classification/indexing of documents = supervised machine learning of the classification patterns based on annotated training examples

When multilingual documents are not linked:

Demands manual annotation in different languages

Can be a substantial manual effort:

Changing collections, taxonomies

Many official languages in the EU

Transfer learning can be a solution

24

Cross-lingual indexing

Potential of transfer learning techniques [Pan & Yang IEEE TKDE 2010]

Co-training and co-regularization techniques for learning classification patterns from documents in multiple languages [Amini et al. SIGIR 2010]

25

Conclusions

CADIAL = valuable example of automatic indexing enabling cross-lingual search

EUROVOC thesaurus is a valuable resource

Many different future tracks of research aiming at more flexible and accurate indexing

http://www.cadial.org/

26

The CADIAL project has received the 2009 Prime Minister Award for special achievements in the field of e-Government in Croatia and the 2009 "Golden Tesla's Egg" Award of the VIDI publishing house for the best innovative solution in ICT for the category Academic Institutions. The project was invited to participate at CeBIT 2010, the world's foremost tradeshow for the digital industry.
