Page 1: Automatic Eurovoc Indexing Bruno Pouliquen Joint Research Centre, European Commission Ispra-Italy  Addressing the Language

Automatic Eurovoc Indexing

Bruno Pouliquen

Joint Research Centre, European Commission

Ispra-Italy

http://www.jrc.cec.eu.int/langtech

Addressing the Language Barrier Problem in the Enlarged EU

Automating Eurovoc Descriptor Assignment

Page 2

Contents

• Overview of the process
• Background
  – Eurovoc Thesaurus
  – Corpus of texts
  – Approaches to thesaurus indexing
  – Vector space
• Training
  – Pre-processing the texts
  – Building Eurovoc profiles
  – Tuning various parameters
• Assignment
  – “guessing” the descriptors for a new text
  – …results

Page 3

Overview: starting point

• Set of texts, manually indexed

[Figure: six example texts shown as bags of words with their manually assigned descriptors. Indexed with Poland: "Poland, programme, Polish"; "Europe, Poland"; "Poland, Culture, programme, Artistic, Polish" (also indexed with Cultural policy). Indexed with Cultural policy: "Culture, programme, Europe"; "Europe, Cultural, programme"; "revival, artistic, Culture, programme".]

Page 4

Overview: learning process for automatic assignment

• Produce descriptor profiles

[Figure: the manually indexed texts are combined into one weighted word list (profile) per descriptor. Poland: Poland 23, Polish 20, …, Producers 9, … Cultural policy: Culture 41, Cultural 32, …, Artistic 21, Revival 10, …]

Page 5

Overview: assignment

• A new document is compared with descriptor profiles

...

Page 6

Overview: assignment results

DECISION No 3/96 OF THE ASSOCIATION COUNCIL between the European Communities and their Member States, of the one part, and the Republic of Poland, of the other part of 16 July 1996 settling the dispute between the European Communities and the Republic of Poland concerning skins and hides in accordance with Article 105 (1) and (2) of the Europe Agreement between the European Communities and their Member States, of the one part, and the Republic of Poland, of the other part (96/496/Euratom, ECSC, EC)

THE ASSOCIATION COUNCIL,
Having regard to the Europe Agreement establishing an Association between the European Communities and their Member States, of the one part, and the Republic of Poland, of the other part (hereinafter 'the Europe Agreement'), and in particular Article 105 thereof,
Whereas it is laid down in Article 105 (1) and (2) of the Europe Agreement that the Association Council may settle by means of a decision any dispute relating to the application or interpretation of the Europe Agreement;
Considering that in view of a critical shortage of raw material in the form of skins and hides the Republic of Poland introduced on 1 January 1994 a quota for the export of skins and hides set at 1 400 tonnes for 1994 and 1995 and 3 000 tonnes for 1996, invoking Article 31 of the Europe Agreement;
Recognizing that, at the first meeting of the Association Council held in Warsaw on 23 and 24 June 1994, the Community requested Poland to increase the quota to 15 000 tonnes for 1994 and 20 000 tonnes for 1995 in order to maintain a balance, in accordance with the Europe Agreement, between the measures taken by Poland and the real shortage of the raw material existing;
Considering that Poland informed the Community that the restriction had been introduced temporarily, was the result of the existing shortage and would be withdrawn as soon as the causes of its implementation disappeared;
Considering that both sides have not reached a common understanding;
Recognizing that in its letter of 28 July the Community referred the matter to the Association Council in accordance with Article 105 (1) of the Europe Agreement in order that it might settle the dispute;
Considering that at the second meeting of the Association Council held in Brussels on 17 July 1995 the Community proposed that the quota for 1995 be raised to 13 500 tonnes;
Recognizing that, as Poland could not accept the Community's proposal and as various proposals by Poland to increase the quota had not been accepted by the Community, both sides agreed to the application of Article 105 (4) of the Europe Agreement;
Considering that the Republic of Poland and the Community have both notified their arbiters;
Considering that, in the meantime the Republic of Poland in its letter of 18 March 1996 submitted a compromise proposal concerning the establishment of a timetable for liberalization of the export of skins and hides which envisages the final withdrawal of restrictions on 1 January 1999 at the latest and provides for another investigation of the matter in 1997 in order to hasten the process of full liberalization by one year;
Recognizing that in such circumstances both sides have decided to stop the arbitration procedure provided for in Article 105 (4) and finish it according to Article 105 (2) of the Europe Agreement,
HAS DECIDED AS FOLLOWS:

Article 1
The amount of the annual quota for exports from Poland of skins and hides, set by Poland at 3 000 tonnes for 1996 shall be increased for the same products to 10 000 tonnes for 1996, 12 000 tonnes for 1997 and 15 000 tonnes for 1998. The Republic of Poland will eliminate the restriction in export of skins and hides with effect from 1 January 1999.

TITLE: Association council between the European Communities and the Republic of Poland

Descriptor                  Score
Poland                      28
EC association agreement    16
Eastern Europe              14
Cultural Policy             0.05

Page 7

Visual overview of the process

[Diagram: Training corpus → preprocessing → training → Descriptor profiles. New text → preprocessing → assignment (using the descriptor profiles) → Descriptors.]

Page 9

Background: Eurovoc thesaurus

• Created for indexing European texts
• Created for human use
• Contains some abstract concepts (cultural policy)
• Hierarchically organised (BT/NT: broader term / narrower term)

Page 10

Background: Corpus

• Corpus: set of (homogeneous) texts
• Collected:
  – Parliamentary questions
  – Debate
  – Resolution
  – Protocol
  – Council regulation
  – Council decision
  – Council proposition
  – Agreements and contracts
  – ...

Page 11

Background: Corpus

• Our corpus:
  – About 75000 texts in de, en, fr, it…
  – About 30000 in fi, sv...
  – And...
    • 22000 in Lithuanian
    • 8000 in Hungarian
    • Hopefully 8000, or more, in the new EU languages
    • ...

Page 12

Main approaches for thesaurus indexing

• Look for the descriptor text in documents
• Linguistic, rule-based approach
• Machine learning, statistical approach (JRC)

Page 13

Look for the descriptor text in documents

• Most intuitive attempt for Eurovoc indexing
  – Try to find a descriptor text explicitly in documents
• Many texts (69%) do not include descriptor text explicitly

Example text:

Article 1
Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision.

Article 2
This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20

Manually assigned descriptors: Poland; Cultural policy; Community programme; Financing of the community project

Page 14

Look for the descriptor text in documents (2)

• Many documents (90%) do contain some descriptor text without being indexed with it

Example text:

Article 1
Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision.

Article 2
This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20

Descriptor texts found in the document although it is not indexed with them: Culture; Form; Decision
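Both problems with literal matching can be illustrated with a minimal Python sketch. The helper name `literal_matches` and the two descriptor lists are assumptions made for illustration; the text is the first article of the example shown on the slides.

```python
import re

def literal_matches(text, descriptors):
    """Return the descriptors whose label occurs verbatim (case-insensitive) in text."""
    found = []
    for label in descriptors:
        # Whole-word match, so that "Form" does not match inside "information".
        if re.search(r"\b" + re.escape(label) + r"\b", text, flags=re.IGNORECASE):
            found.append(label)
    return found

text = ("Poland shall participate in the Culture 2000 programme according to "
        "the terms and conditions set out in Annexes I and II which shall form "
        "an integral part of this Decision.")

manually_assigned = ["Poland", "Cultural policy", "Community programme"]
not_assigned = ["Culture", "Form", "Decision"]

print(literal_matches(text, manually_assigned))  # ['Poland']
print(literal_matches(text, not_assigned))       # ['Culture', 'Form', 'Decision']
```

Two of the three manually assigned descriptors are missed, while three descriptors that were not assigned are matched.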

Page 15

Rule-based approach for Eurovoc indexing

• Manually write a set of rules for each descriptor
  Example: [Hlava & Hainebach 1996]
• 40000 rules for English
  – Words and word combinations
    » Fishery AND management
  – Word locations in the text
    » Proximity (words in the same sentence…)
    » Location (title/text/beginning of sentence…)
    » Format (capital letters)
  – Exploit legal references
    » E.g. “Council directive 79/112/EEC”
• Too expensive, difficult to update
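To give a feel for what one such hand-written rule involves, here is a hypothetical sketch of a "word combination plus proximity" rule. It is not taken from the system of Hlava & Hainebach; the function name `same_sentence_rule` and the sentence splitter are assumptions made for illustration.

```python
import re

def same_sentence_rule(text, required_words):
    """True if at least one sentence contains every word in required_words."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if all(word in tokens for word in required_words):
            return True
    return False

# Hypothetical rule: "fishery AND management" within the same sentence.
rule = ["fishery", "management"]
print(same_sentence_rule("The fishery sector needs better management.", rule))      # True
print(same_sentence_rule("The fishery is closed. Management was informed.", rule))  # False
```

Even this single rule has to be written, tested and maintained by hand, which suggests why tens of thousands of them are expensive to keep up to date.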

Page 16

Machine learning

• Inductive process
  – Tries to “learn” from manually indexed examples
  – Tries to “reproduce” indexing on new document
• Advantages
  – Fast and cheap
  – Easy to adapt to new languages
  – Easier to update
    • Possibility to re-index old documents with new descriptors
  – More consistent than manual indexing
  – Ranked assignment (better relevance ranking in search)

Page 17

Machine learning: our approach

– Trained on manually Eurovoc-indexed documents
– Basic and fast representation of texts: “bag of words”
  • No consideration of relationships between words (syntax, semantics, discourse…)
– Build a representation of each descriptor
  • The profile: (weighted) list of the most representative words

Page 18

Vector space representation

• A text is represented as a vector
• The dimensions are the words
• A Eurovoc descriptor profile is also a vector

[Figure labels: "Nuclear Material"; "Agreement in the form of an exchange of letters (…) relating to the amendment of the Convention of 20 May 1987 on a common transit procedure"]

See “Vector space” annex for more information

Page 20

Training Eurovoc assignment: text pre-processing

• Why? Find a better representation in the “bag-of-words”
  – Some common words are not semantically relevant
    • ‘a’, ‘the’, ‘very’…
  – Unify word forms
    • ‘culture’, ‘cultures’, ‘cultural’…
  – Word sense disambiguation
    • ‘pilot’, ‘order’…

Page 21

Training Eurovoc assignment: text pre-processing

• How?
  – “Stop word” list: avoid the use of non-meaningful words
  – Lemmatisation: replace inflected word forms by their base form (lemma)
    • ‘towns’ => ‘town’
    • ‘difficulties’ => ‘difficulty’
    • ‘mice’ => ‘mouse’
  – Multi-word units: reduce polysemy
    • ‘European_Union’
    • ‘in_order_to’
    • ‘sustainable_development’
    • ‘pilot_project’
  – Remove annexes and signatures

Page 22

Text pre-processing example

Original text:

Article 1
Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision.

Article 2
This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20 …

After pre-processing:

Article 1 Poland shall participate_in the Culture 2000 programme accord_to the term_and_condition set_out in annex_i and II which shall form an integral_part of this Decision .

Article 2 This Decision shall enter_into_force on the day of its adoption . It shall apply_for the duration of the Culture 2000 programme , start from 1 January 2001 .
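The pre-processing steps (multi-word mark-up, lemmatisation, stop-word removal) can be sketched as follows. This is a minimal illustration, not the JRC tool chain: the tiny stop-word list, lemma dictionary and multi-word list are assumptions, and unlike the slide example this sketch also lower-cases the text and drops stop words immediately.

```python
import re

# Illustrative resources only; real lists would be far larger.
STOP_WORDS = {"a", "an", "the", "of", "this", "in", "shall", "its", "on",
              "it", "which", "and", "to"}
LEMMAS = {"terms": "term", "conditions": "condition",
          "annexes": "annex", "starting": "start"}
MULTI_WORD_UNITS = ["enter into force", "integral part"]

def preprocess(text):
    text = text.lower()
    # Mark up multi-word units first so they survive tokenisation as one token.
    for unit in MULTI_WORD_UNITS:
        text = text.replace(unit, unit.replace(" ", "_"))
    tokens = re.findall(r"[a-z_]+|\d+", text)
    # Lemmatise by dictionary lookup, then drop stop words.
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]

print(preprocess("This Decision shall enter into force on the day of its adoption."))
# ['decision', 'enter_into_force', 'day', 'adoption']
```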

Page 23

Machine learning: training

• “Learn” from examples
• Based on human assignment
• Take every text indexed with a given descriptor as the “training sample” of this descriptor

[Figure: the manually indexed texts are grouped by descriptor. Training sample for Poland: "Poland, producers, Polish"; "Europe, Poland, Polish"; "Poland, Culture, programme, Artistic, Polish". Training sample for Cultural policy: "Poland, Culture, programme, Artistic, Polish"; "Culture, programme, europe"; "Europe, Cultural, programme"; "revival, artistic, Culture, programme".]

Page 24

Training: representation of texts

• Bag-of-Words representation of text

Article 1
Poland shall participate in the Culture 2000 programme according to the terms and conditions set out in Annexes I and II which shall form an integral part of this Decision.

Article 2
This Decision shall enter into force on the day of its adoption. It shall apply for the duration of the Culture 2000 programme, starting from 1 January 20

Bag of words: Poland, Culture, Programme, Decision, starting, Terms, …
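A bag of words keeps only which words occur and how often, discarding word order. A minimal sketch, assuming a simple regular-expression tokeniser (the function name `bag_of_words` is illustrative):

```python
from collections import Counter
import re

def bag_of_words(text):
    """Map each lower-cased word to its number of occurrences; order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

bag = bag_of_words("Poland shall participate in the Culture 2000 programme. "
                   "It shall apply for the duration of the Culture 2000 programme.")
print(bag["culture"], bag["programme"], bag["poland"])  # 2 2 1

# Two texts with the same words in a different order get the same representation.
print(bag_of_words("culture programme Poland") == bag_of_words("Poland programme culture"))  # True
```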

Page 25

Training: identifying most representative words

Texts indexed with RADIOACTIVE MATERIALS:

Text 1: radioactive, ukraine, resolution, plutonium, deuterium, parliament, nuclear, blottnitz, ...
Text 2: plutonium, deuterium, assembly, nuclear, schmidt, radioactive, korea, iaea, ...
Text 3: Illegal_traffic, chernobyl, radioactive, ukrainian, plutonium, lithium, dangerous, mox, ...

Text 1 + Text 2 + Text 3 = number of texts containing each word:

radioactive (3), plutonium (3), nuclear (2), deuterium (2), Illegal_traffic (1), chernobyl (1), ...
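The counts on this slide are the number of training texts in which each word occurs. A short sketch that reproduces them from the three word lists shown (the function name `text_counts` is illustrative):

```python
from collections import Counter

texts = [
    ["radioactive", "ukraine", "resolution", "plutonium", "deuterium",
     "parliament", "nuclear", "blottnitz"],
    ["plutonium", "deuterium", "assembly", "nuclear", "schmidt",
     "radioactive", "korea", "iaea"],
    ["illegal_traffic", "chernobyl", "radioactive", "ukrainian", "plutonium",
     "lithium", "dangerous", "mox"],
]

def text_counts(texts):
    """Count, for each word, the number of texts it appears in."""
    counts = Counter()
    for words in texts:
        counts.update(set(words))  # a text counts each word at most once
    return counts

counts = text_counts(texts)
print(counts["radioactive"], counts["plutonium"], counts["nuclear"],
      counts["deuterium"], counts["chernobyl"])  # 3 3 2 2 1
```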

Page 26

Training: identifying most representative words

[Figure: training builds each profile by accumulating word counts over the texts indexed with the descriptor. For Poland the counts grow text by text ("Poland, Polish, producers"; then "Poland 2, Polish 2, producers, europe"; then "Poland 3, Polish 2, producers, Europe, programme") up to the final profile: Poland 23, Polish 20, …, Producers 9, … The profile for Cultural policy is: Culture 41, Cultural 32, …, Artistic 21, Revival 10, …]

Page 27

Building Eurovoc descriptor profiles

For each Eurovoc descriptor:
– Find the texts it appears in
– Find the words appearing in those texts
– Combine various weights to compute the weight of each word for this descriptor
– Various normalisations used:
  • A very common word has less impact than a rare word
    – The word ‘contradiction’ (400 times) is less meaningful than ‘cloud’ (40 times)
  • A word used with only one descriptor has higher impact
    – ‘Chernobyl’ does not appear with many descriptors
    – ‘redistribute’ (same frequency in texts) appears with various descriptors
  • Texts indexed with one descriptor have more impact than those indexed with 20

Page 28

Descriptor profile: weight of a word

Weight of a word in a descriptor profile
– Based on the frequency of the word
  • Number of texts it appears in
  • Each text being indexed with Nd descriptors, word contribution is 1/Nd
– Normalised by the number of other descriptors it appears in

$$Weight_{l,d} = TF_{l,d} \cdot IDF_l$$

$$TF_{l,d} = \sum_{t \in T_{l,d}} \frac{1}{Nd_t}$$

$$IDF_l = \log_2\left(\frac{Max(DF_l)}{DF_l}\right) + 1$$

$$Weight_{l,d} = \left(\sum_{t \in T_{l,d}} \frac{1}{Nd_t}\right) \cdot \left(\log_2\left(\frac{Max(DF_l)}{DF_l}\right) + 1\right)$$

where $T_{l,d}$ is the set of texts indexed with descriptor $d$ that contain lemma $l$, $Nd_t$ is the number of descriptors assigned to text $t$, and $DF_l$ is the number of descriptors for which $l$ is an associate.
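The weighting can be sketched in Python under one reading of the slide: each text indexed with Nd descriptors contributes 1/Nd to a lemma's frequency for each of its descriptors, and the result is multiplied by log2(Max(DF)/DF) + 1, where DF is the number of descriptors the lemma occurs with. The function name `build_profiles` and the toy corpus are assumptions for illustration.

```python
import math
from collections import defaultdict

def build_profiles(corpus):
    """corpus: list of (set_of_lemmas, list_of_descriptors) pairs.
    Returns {descriptor: {lemma: weight}}."""
    tf = defaultdict(lambda: defaultdict(float))
    for lemmas, descriptors in corpus:
        nd = len(descriptors)
        for d in descriptors:
            for l in lemmas:
                tf[d][l] += 1.0 / nd  # a text with Nd descriptors contributes 1/Nd
    # DF of a lemma: number of descriptors whose training texts contain it.
    df = defaultdict(int)
    for d in tf:
        for l in tf[d]:
            df[l] += 1
    max_df = max(df.values())
    return {d: {l: w * (math.log2(max_df / df[l]) + 1) for l, w in words.items()}
            for d, words in tf.items()}

# Toy corpus (assumed): three texts with their manually assigned descriptors.
corpus = [
    ({"poland", "polish", "producers"}, ["Poland"]),
    ({"poland", "culture", "programme"}, ["Poland", "Cultural policy"]),
    ({"culture", "programme", "europe"}, ["Cultural policy"]),
]
profiles = build_profiles(corpus)
print(profiles["Poland"]["polish"])   # 2.0: seen with one descriptor only
print(profiles["Poland"]["culture"])  # 0.5: shared text, seen with both descriptors
```

In this toy corpus 'polish' outweighs 'culture' in the Poland profile because it occurs with only one descriptor and in a text that has a single descriptor.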

Page 29

Eurovoc descriptor profile

• List of weighted words (associates)

TOWN PLANNING

Associate              Weight
urban_policy           3.98
urban                  3.40
urban_renewal          3.18
derelict               3.02
urban_development      2.90
run_down               2.79
city                   2.77
urban_plan             2.70
revitalize             2.60
town                   2.38
esdp                   2.20
conurbations           2.20
heritage               1.88
spatial_development    1.87
interregional          1.87
quality_of_life        1.82
regeneration           1.80
urban_environment      1.77
spatial                1.76
planning               1.75
urban_area             1.74
regional_plan          1.74

Page 30

Associate List: RADIOACTIVE MATERIALS

Page 31

Associate List: FISHERY MANAGEMENT

[Figure: associate list with annotations marking the fishery-related and the management-related associates.]

Page 32

Tuning: various parameters

– Minimum size and number of training texts available for each descriptor. We chose to require at least 5 texts (with at least 2000 characters each).
– Select words in texts: log-likelihood formula (p-value set to the low value of 0.15 to produce long associate lists)
– Reference corpus for the log-likelihood formula: a general corpus vs. the training corpus
– Meta-text vs. individual texts
– Minimum number of texts per descriptor for which the word is an associate (a word has to appear in at least 2 texts to be an associate)
– Use: number of texts / cumulated frequency of word / log-likelihood value
– Impact of all associates occurring at least 10% as often as the most common associate word
– We do not consider the length of each training text
– Minimum weight threshold for each associate
– Minimum length of the associate list
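The slides do not spell out the log-likelihood formula. The sketch below assumes a common two-term form of the log-likelihood ratio (G2) that compares a word's frequency in a descriptor's training sample with its frequency in a reference corpus, and converts it to a p-value with the chi-square distribution (1 degree of freedom). The function names and the 0.15 default are illustrative.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 for a word occurring a times in a sample of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the sample
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2

def p_value(g2):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(g2, 0.0) / 2.0))

def is_candidate(a, b, c, d, threshold=0.15):
    """Keep words that are over-represented in the sample with p below the threshold."""
    over_represented = a / c > b / d
    return over_represented and p_value(log_likelihood(a, b, c, d)) < threshold

# A word 100 times more frequent in the sample than in the reference corpus is kept;
# a word with the same relative frequency in both is not.
print(is_candidate(10, 10, 1000, 100000))   # True
print(is_candidate(10, 100, 1000, 10000))   # False
```

A lenient threshold such as 0.15 lets more words through than a conventional 0.05, which matches the stated goal of producing long associate lists.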

Page 34

Assignment Phase

• Normalise new document (lemmatise, multi-word mark-up)
• Produce word frequency list (excluding stop words)
• Calculate similarity between word frequency list and descriptor associate lists, using statistical formulae

...
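The assignment phase can be sketched end to end as follows. This is an illustration under stated assumptions: the input is already lemmatised, the stop-word list is a toy one, the ETHIOPIA and HUMAN RIGHTS weights are invented, and similarity is a plain sum of frequency times associate weight, which is only one of the possible formulae.

```python
from collections import Counter

STOP_WORDS = {"the", "of", "in", "on", "and"}  # illustrative only

def frequency_list(lemmas):
    """Word frequency list of a normalised document, excluding stop words."""
    return Counter(l for l in lemmas if l not in STOP_WORDS)

def rank_descriptors(lemmas, profiles, top_n=3):
    """Score every descriptor profile against the document and return the best ones."""
    freqs = frequency_list(lemmas)
    scores = {}
    for descriptor, associates in profiles.items():
        # Sum of (frequency in document) x (associate weight in profile).
        scores[descriptor] = sum(freqs[l] * w for l, w in associates.items())
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

# Hypothetical profiles; only the TOWN PLANNING weights come from the slides.
profiles = {
    "ETHIOPIA": {"ethiopia": 3.0, "ethiopian": 2.5},
    "HUMAN RIGHTS": {"human_rights": 2.0, "condemn": 1.0},
    "TOWN PLANNING": {"urban": 3.40, "city": 2.77},
}
doc = ["resolution", "on", "human_rights", "in", "ethiopia",
       "condemn", "the", "ethiopian", "ethiopia"]
for descriptor, score in rank_descriptors(doc, profiles):
    print(descriptor, score)
```

Because the output is a ranked list with scores, a cut-off (number of descriptors or minimum score) can be chosen afterwards.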

Page 35

Eurovoc assignment: example

Document: Resolution on human rights in Ethiopia

Descriptor                  Score   Matching associates
ETHIOPIA                    30%     Ethiopia, ethiopian
HUMAN RIGHTS                25%     human_rights, (condemn, respect…)
POLITICAL VIOLENCE          19%     human_rights, (condemn, killing…)
REPRESSION                  18%     human_rights, (condemn, repression…)
DEMOCRATIZATION             18%     human_rights, (condemn, democratic…)
…
EXTREMISM                   10%     human_rights, (condemn, call_on…)
DEATH PENALTY               10%     human_rights, (condemn, call_on…)
…
TREATY ON EUROPEAN UNION    6%      human_rights, (citizen, respect…)

Page 36

(Eurovoc assignment in vector space)

[Figure: Document, Descriptor 1 and Descriptor 2 plotted as vectors in a space with axes Keyword 1, Keyword 2 and Keyword 3.]

Page 37

Formulae tested for descriptor assignment

TF.IDF (Term Frequency, Inverse Document Frequency): considers the occurrence frequency of lemma (l) in the meta-text (TF_{l,t}) and the number of descriptors (d) for which the lemma is an associate (DF_l).

$$TFIDF_{l,d} = TF_{l,d} \cdot \left(\log_2\left(\frac{N}{DF_l}\right) + 1\right)$$

Cosine: uses TF.IDF; computes the angle between two multi-dimensional vectors (that of the document (t) and that of the descriptor associate list (d)).

$$COSINE(d,t) = \frac{\sum_{l \in d \cap t} TFIDF_{l,d} \cdot TFIDF_{l,t}}{\sqrt{\left(\sum_{l \in d} TFIDF_{l,d}^2\right) \cdot \left(\sum_{l \in t} TFIDF_{l,t}^2\right)}}$$

Okapi: considers the occurrence frequency of the lemma as an associate (DF_l); the number of associates in the associate list (size, |d|); the average size of descriptor associate lists (M); the total number of descriptors used (N).

$$Okapi_{d,t} = \sum_{l \in d \cap t} \log\left(\frac{N - DF_l}{DF_l}\right) \cdot \frac{TF_{l,d}}{TF_{l,d} + \frac{|d|}{M}}$$

‘Scalar Product’: adds the products of the TF.IDF values of associates and text lemmas.

$$Sproduct(d,t) = \sum_{l \in d \cap t} TFIDF_{l,d} \cdot TFIDF_{l,t}$$

‘622’ mixed formula: uses all of the above.

$$Mixed_{622} = 0.61 \cdot \frac{COSINE}{\max(COSINE)} + 0.21 \cdot \frac{Okapi}{\max(Okapi)} + 0.18 \cdot \frac{Sproduct}{\max(Sproduct)}$$
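The four formulae can be sketched over sparse vectors stored as dictionaries. The exact forms used here are a reading of the slide and should be treated as assumptions: cosine and scalar product over TF.IDF values, an Okapi variant of the form log((N - DF)/DF) · TF/(TF + |d|/M), and a ‘622’ mix weighting the max-normalised scores by 0.61, 0.21 and 0.18. All function names are illustrative.

```python
import math

def sproduct(d, t):
    """Scalar product of two sparse vectors {lemma: TF.IDF value}."""
    return sum(w * t[l] for l, w in d.items() if l in t)

def cosine(d, t):
    """Cosine of the angle between descriptor vector d and text vector t."""
    norm = math.sqrt(sum(w * w for w in d.values()) * sum(w * w for w in t.values()))
    return sproduct(d, t) / norm if norm else 0.0

def okapi(tf_d, t, df, n, m):
    """tf_d: {lemma: TF in descriptor d}; t: lemmas of the text; df: {lemma: DF};
    n: total number of descriptors; m: average associate-list size."""
    size = len(tf_d)  # |d|, number of associates of this descriptor
    return sum(math.log((n - df[l]) / df[l]) * tf / (tf + size / m)
               for l, tf in tf_d.items() if l in t)

def mixed_622(cos, oka, spr):
    """Each argument maps descriptor -> score; scores are divided by their maximum."""
    def normalise(scores):
        top = max(scores.values())
        return {k: (v / top if top else 0.0) for k, v in scores.items()}
    c, o, s = normalise(cos), normalise(oka), normalise(spr)
    return {k: 0.61 * c[k] + 0.21 * o[k] + 0.18 * s[k] for k in cos}

# Toy vectors (assumed values).
d = {"poland": 3.0, "polish": 4.0}
t = {"poland": 2.0, "culture": 1.0}
print(sproduct(d, t))  # 6.0
print(cosine(d, t))
```

The Okapi term is undefined when a lemma is an associate of every descriptor (DF = N), so a real implementation would need to guard against that case.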

Page 38

Sample Assignment Result

Title: Legislative resolution embodying Parliament's opinion on the proposal for a Council Regulation amending Regulation N. 2847/93 establishing a control system applicable to the common fisheries policy (COM(95)0256 - C4-0272/95 - 95/0146) (Consultation procedure)

Page 39

Results: …starting from a plain text…

Resolution on human rights in Ethiopia

The European Parliament,
- having regard to its resolution of 18 July 1996 on human rights in Ethiopia ((OJ C 261, 9.9.1996, p. 166.)),
A. whereas respect for human rights, democratic principles and the rule of law constitute essential elements of the revised Lomé IV Convention and whereas the Ethiopian constitution also includes respect for human rights,
B. having regard to the continuing process of democratic and institutional change in Ethiopia,
C. concerned by the repression of civil society associations which recently forced into exile leaders like Mr Kefale Mamo and Mr Mulugeta Lule (president and vice-president respectively of the Ethiopian Free Press Journalist Association), Mr Gemorav Kassa (General Secretary of the Ethiopian Teachers Association), and Mr Dawi Ibrahim (president of the Confederation of Ethiopian Trade Unions),
D. deeply concerned by the killing on 11 June 1997 of Assefa Maru, an executive board member of both the Ethiopian Teachers Association and of the Ethiopian Human Rights Council,
1. Condemns the killing of Assefa Maru;
2. Condemns all human rights violations committed by the government and military forces;
3. Calls on the Ethiopian authorities to guarantee the fundamental rights of all Ethiopian citizens and to put an end to politically motivated persecutions and to abuses such as extrajudicial disappearances, torture, detention, rapes and arrests, in accordance with the Ethiopian constitution;
4. Calls on the Ethiopian authorities fully to respect freedom of the press, independence of unions and the right of association of citizens;
5. Urges the Ethiopian Government to release all prisoners of conscience and to provide corrective procedures to the judiciary system whereby people can be charged and tried in a fair way;
6. Calls on the Council, the Commission and the Member States to monitor closely human rights in Ethiopia and use all means to improve the situation;
7. Instructs its President to forward this resolution to the Council, the Commission, the Government of Ethiopia, the Secretary-General of the United Nations and the UN High Commissioner for Human Rights.

[Figure label: New text]

Page 40

Pre-processing and keyword extraction

[Diagram: New text → preprocessing]

Page 41

[Diagram: assignment → Descriptors]

Page 42

What next?

• New languages
• Experiments, using output of current assignment
  – SVM
  – Other categorization techniques
• Filter output
  – Geographic descriptors

Page 43

Automatic Eurovoc Indexing: vector model

Page 44

Annex: Vector space representation

[Figure: a Document plotted as a vector in a three-dimensional space with axes Word 1, Word 2 and Word 3.]

Page 45

Vector space representation

[Figure: the document "EU-Poland culture 2000" plotted as a vector in a space with axes Poland, Culture and Programme.]

Example: document in a three dimensional space

Page 46

Vector space representation

[Figure: the descriptor "Cultural policy" plotted as a vector in the space with axes Poland, Culture and Programme.]

A Eurovoc descriptor in the same three dimensional space

Page 47

Vector space representation

[Figure: a Eurovoc descriptor and a Document plotted as vectors in a space with axes Keyword 1, Keyword 2 and Keyword 3.]

Eurovoc descriptor and documents comparison

Page 48

Vector space representation

[Figure: the descriptor "Cultural policy" and the document "EU-Poland culture 2000" plotted as vectors in the space with axes Poland, Culture and Programme.]

Eurovoc descriptor and documents comparison

Page 49

Eurovoc assignment in vector space

• A text is “compared” to each Eurovoc descriptor profile

[Figure: Document, Descriptor 1 and Descriptor 2 plotted as vectors in a space with axes Keyword 1, Keyword 2 and Keyword 3.]

$$sim(t,d) = \sum_{l \in t \cap d} Weight_{l,d} \cdot Weight_{l,t}$$