1
Automatic Indexing with the EuroVoc Thesaurus
Enabling Cross-lingual Search
Marie-Francine Moens, Katholieke Universiteit Leuven, Belgium
Frane Šarić, University of Zagreb, FER, Croatia
18-19 November 2010, Luxembourg – Kirchberg
2
CADIAL project
Computer Aided Document Indexing for Accessing Legislation
A joint Flemish-Croatian project
Partners:
Katholieke Universiteit Leuven (prof. Marie-Francine Moens)
University of Zagreb & Hidra (prof. Bojana Dalbelo Bašić, prof. Marko Tadić)
Goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia
3
CADIAL project (cont.)
1. Manually index 10,000 documents
eCADIS – semi-automatic document indexing
2. Use that data to train automatic indexers
Trained automatic classifiers for every EuroVoc descriptor
3. Provide indexed data to custom search engine
CADIAL search engine
4
eCADIS
Computer Aided Document Indexing System
Provides useful information that helps indexers index documents more quickly
Counts n-grams
Includes word normalization
Extracts collocations
Suggests appropriate descriptors
Uses automatically trained classifiers
5
eCADIS (cont.)
6
Morphological normalization
Croatian = morphologically complex language
Inflectional variation
Derivational variation
7
Morphological normalization
Lexicon-based normalization [Snajder et al. 2009]:
Inflectional and derivational rules
String transformation functions
Higher-order functional representation of Croatian inflectional morphology:
Inflectional rules
Transformations: higher-order functions
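A minimal sketch of the idea of expressing inflectional rules as higher-order string transformation functions and using them to build a normalization lexicon. This is not the representation of [Snajder et al. 2009]; the helper names (`sfx`, `drop`, `compose`), the two toy paradigms, and the three lemmas are assumptions chosen for illustration only.

```python
from typing import Callable, Dict, List

Transform = Callable[[str], str]

def sfx(suffix: str) -> Transform:
    """Return a transformation that appends a suffix (hypothetical helper)."""
    return lambda s: s + suffix

def drop(n: int) -> Transform:
    """Return a transformation that removes the last n characters."""
    return lambda s: s[:-n] if n else s

def compose(*fs: Transform) -> Transform:
    """Chain transformations left to right into a single transformation."""
    def apply(s: str) -> str:
        for f in fs:
            s = f(s)
        return s
    return apply

# Toy paradigms (assumed, far from complete): a few case forms only.
MASC_NOUN: List[Transform] = [sfx(""), sfx("a"), sfx("u"), sfx("om"),
                              sfx("i"), sfx("e"), sfx("ima")]
FEM_NOUN_A: List[Transform] = [sfx("")] + [compose(drop(1), sfx(e))
                                           for e in ("e", "i", "u", "om")]

def build_lexicon(entries: Dict[str, List[Transform]]) -> Dict[str, str]:
    """Map every generated word form back to its lemma."""
    lexicon: Dict[str, str] = {}
    for lemma, paradigm in entries.items():
        for rule in paradigm:
            lexicon[rule(lemma)] = lemma
    return lexicon

LEXICON = build_lexicon({"zakon": MASC_NOUN, "ugovor": MASC_NOUN,
                         "uredba": FEM_NOUN_A})

def normalize(token: str) -> str:
    """Return the lemma if the form is in the lexicon, else the lowercased token."""
    key = token.lower()
    return LEXICON.get(key, key)
```

With these toy entries, `normalize("zakonima")` maps to `zakon` and `normalize("Uredbom")` to `uredba`; forms outside the lexicon pass through unchanged.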
8
Named entity recognition
Named entity recognition = semantic classification of an entity name (usually a proper name) [Bekavac & Tadic 2009]:
Person, location, organization, date, ...
Use of lists of names and of finite-state automata
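A minimal sketch of combining name lists with pattern matching. Python regular expressions stand in here for the finite-state automata mentioned above; the name lists, the date pattern, and the `tag_entities` function are illustrative assumptions, not the system of [Bekavac & Tadic 2009].

```python
import re

# Tiny illustrative name lists (assumed); a real system would use large lists.
ORGANIZATIONS = {"University of Zagreb", "Katholieke Universiteit Leuven"}
LOCATIONS = {"Luxembourg", "Croatia", "Belgium"}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches dates such as "19 November 2010" or the range "18-19 November 2010".
DATE = re.compile(r"\b\d{1,2}(?:-\d{1,2})? (?:" + MONTHS + r") \d{4}\b")

def tag_entities(text):
    """Return (entity, label) pairs in order of appearance in the text."""
    found = []
    for label, names in (("ORG", ORGANIZATIONS), ("LOC", LOCATIONS)):
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append((m.start(), m.group(), label))
    for m in DATE.finditer(text):
        found.append((m.start(), m.group(), "DATE"))
    return [(entity, label) for _, entity, label in sorted(found)]
```

A list lookup alone cannot recognize unseen names, which is why list-based and pattern-based recognition are usually combined.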
9
Lexical association metrics
Collocation: the meaning of a compound term cannot be inferred from the meanings of its individual terms
Collocations are valuable index terms
Several methods were developed:
Extraction of linked terms from Wikipedia, filtered by acceptable part-of-speech patterns [Bekavac & Tadic 2009]
Terme-X: use of lexical association measures (e.g., chi-square, log-likelihood ratio for a binomial distribution, pointwise mutual information) to build a dictionary of collocations, filtered by acceptable part-of-speech patterns [Delac et al. 2009]
A genetic programming algorithm for learning a language-adapted lexical association measure [Snajder et al. 2009]
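As an illustration of one of the lexical association measures named above, here is a minimal sketch of pointwise mutual information (PMI) over adjacent word pairs. The function name `pmi_scores` and the `min_count` threshold are assumptions; the part-of-speech filtering used in the cited work is omitted.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by PMI = log2( P(x, y) / (P(x) * P(y)) ).

    Pairs seen fewer than min_count times are skipped, because PMI is
    unreliable for rare events.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = unigrams[w1] / n_unigrams
        p_y = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores
```

Pairs whose words co-occur more often than their individual frequencies predict receive a high score and become collocation candidates.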
10
Text categorization
= Assignment of terms of the EUROVOC thesaurus
Currently done at the statute level
Problem: large number of features (terms) and often few training examples => feature selection: chi-square, frequent item sets, linear classifier weights, ... [Boiy & Moens 2009]
Use of common classification algorithms: support vector machines, logistic regression, ... [Saric et al. in preparation]
EUROVOC = multilingual => terms can be used in cross-lingual retrieval
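A minimal sketch of chi-square feature selection for a single descriptor, assuming documents are represented as sets of terms and labels as sets of descriptors. The function names and the data layout are assumptions for illustration, not the CADIAL implementation.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/descriptor contingency table.

    n11: term present, descriptor assigned    n10: term present, not assigned
    n01: term absent, descriptor assigned     n00: term absent, not assigned
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom

def select_features(docs, labels, descriptor, k):
    """Return the k terms most strongly associated with the descriptor."""
    vocabulary = set().union(*docs)
    scores = {}
    for term in vocabulary:
        n11 = n10 = n01 = n00 = 0
        for terms, descriptors in zip(docs, labels):
            has_term = term in terms
            has_desc = descriptor in descriptors
            if has_term and has_desc:
                n11 += 1
            elif has_term:
                n10 += 1
            elif has_desc:
                n01 += 1
            else:
                n00 += 1
        scores[term] = chi_square(n11, n10, n01, n00)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Note that chi-square is symmetric: a term that reliably signals the absence of a descriptor scores as high as one that signals its presence, and both are informative to a classifier.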
11
12
Text categorization
Core of the CADIAL project
System suggests index terms to the human indexers
High performance of the categorization: e.g., F1 measure in the 80% range
As the number of categorized documents grows, we hope to learn better classification models
Possibility to exploit the hierarchical organization of the thesaurus terms to improve the accuracy of the categorization
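For reference, a minimal sketch of how an F1 measure can be computed when each document receives a set of descriptors. Micro-averaging is an assumption here; the slides do not state which averaging was used, and the function name `micro_f1` is illustrative.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-document sets of descriptors.

    gold, predicted: equally long lists of sets, one set per document.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # descriptors correctly suggested
        fp += len(p - g)   # descriptors suggested but not in the gold set
        fn += len(g - p)   # gold descriptors that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

F1 is the harmonic mean of precision and recall, so it rewards a system only when the suggested descriptors are both mostly correct and mostly complete.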
13
14
[Bennett & Nguyen 2009]
15
TMT: Object-oriented text classification library
16
Comparing document classification schemes
Problem: discrepancy between the classification scheme (e.g., EUROVOC thesaurus) and the natural clusters formed by the documents
How to find this discrepancy so that the classification scheme can be adapted? [Silic et al. 2009]
Finding an optimal clustering and comparing it with the clusters formed by the ground-truth categories of the documents
Dimensionality reduction with principal component analysis (PCA): visualization of the clusters
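One simple way to quantify how well a clustering agrees with ground-truth categories is cluster purity. This is a swapped-in illustrative measure, not necessarily the comparison used in [Silic et al. 2009]; the function name and input layout are assumptions.

```python
from collections import Counter, defaultdict

def purity(cluster_ids, category_ids):
    """Fraction of documents that fall in the majority category of their cluster.

    cluster_ids[i] is the cluster of document i, category_ids[i] its
    ground-truth category. 1.0 means every cluster holds a single category.
    """
    if not cluster_ids:
        return 0.0
    by_cluster = defaultdict(Counter)
    for cluster, category in zip(cluster_ids, category_ids):
        by_cluster[cluster][category] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(cluster_ids)
```

A low purity for documents sharing one descriptor suggests that the descriptor groups documents the clustering considers dissimilar, which is the kind of discrepancy described above.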
17
18
CADIAL Search Engine
http://cadial.hidra.hr
Full text search over a collection of 20,000 legal documents
Documents are automatically indexed using EuroVoc descriptors
Hidra ensures that additional metadata is correct:
Regulation status (valid / invalid)
Area of activity
EU accession chapter
19
The CADIAL search engine
Possibility to search:
Full text
Titles
EUROVOC thesaurus terms
Historical versions
...
Legislation = semi-structured documents: possibility to take the structure into account when computing the relevance of an article, section, etc.
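A minimal sketch of taking document structure into account when scoring relevance: matches in some structural elements count more than matches in others. The field names, the weights, and the plain term-count scoring are assumptions for illustration; the actual CADIAL ranking function is not described in these slides.

```python
def structured_score(document, query_terms, weights):
    """Sum query-term matches per structural element, scaled by element weight.

    document: dict mapping an element name (e.g. "title") to its text.
    weights: dict mapping element names to weights; unlisted elements get 1.0.
    """
    total = 0.0
    for element, text in document.items():
        tokens = text.lower().split()
        hits = sum(tokens.count(term.lower()) for term in query_terms)
        total += weights.get(element, 1.0) * hits
    return total
```

With a weight of 3.0 for titles and 1.0 for articles, one title match and two article matches give a score of 5.0, so a statute whose title mentions the query term outranks one that mentions it only once in its body.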
Successful participation at the INEX competition 2008 [Mijic et al. 2009]
20
CADIAL Search – live demo
21
CADIAL Search – document metadata
22
Towards cross-lingual search
23
Cross-lingual indexing
Classification/indexing of documents = supervised machine learning of the classification patterns based on annotated training examples
When multilingual documents are not linked:
Demands manual annotation in different languages
Can be a substantial manual effort:
Changing collections, taxonomies
Many official languages in the EU
Transfer learning can be a solution
24
Cross-lingual indexing
Potential of transfer learning techniques [Pan & Yang IEEE TKDE 2010]
Co-training and co-regularization techniques for learning classification patterns from documents in multiple languages [Amini et al. SIGIR 2010]
25
Conclusions
CADIAL = valuable example of automatic indexing enabling cross-lingual search
EUROVOC thesaurus is a valuable resource
Many different future tracks of research aiming at more flexible and accurate indexing
http://www.cadial.org/
26
The CADIAL project has received the 2009 Prime Minister Award for special achievements in the field of e-Government in Croatia and the 2009 "Golden Tesla's Egg" Award of the VIDI publishing house for the best innovative solution in ICT for the category Academic Institutions. The project was invited to participate at CeBIT 2010, the world's foremost tradeshow for the digital industry.